Activation Functions
Definition
An activation function applied element-wise after a linear transformation introduces non-linearity into a neural network. Without activation functions, a deep network collapses to a single linear map regardless of depth.
Intuition
A linear layer computes a linear combination of inputs. Stacking many linear layers does nothing that one linear layer couldn’t do. Activation functions break linearity, enabling the network to approximate arbitrary functions (universal approximation theorem). The choice of activation affects gradient flow, training speed, and expressiveness.
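The collapse of stacked linear layers can be checked numerically. A minimal sketch in pure Python (2×2 weights without biases, values chosen arbitrarily for illustration):

```python
# Two linear layers with no activation between them collapse into a single
# linear layer whose weight matrix is the product W2 @ W1.

def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

W1 = [[1.0, 2.0], [3.0, 4.0]]   # first linear layer (arbitrary example weights)
W2 = [[0.5, -1.0], [2.0, 0.0]]  # second linear layer

x = [[1.0], [-2.0]]             # a column-vector input

# Applying the two layers in sequence...
two_layers = matmul(W2, matmul(W1, x))
# ...matches the single fused layer with weight W2 @ W1.
one_layer = matmul(matmul(W2, W1), x)

print(two_layers == one_layer)  # True: depth added no expressive power
```

Inserting any non-linearity between `W1` and `W2` breaks this factorisation, which is exactly what the definition above requires.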
Formal Description
Sigmoid: $\sigma(x) = \dfrac{1}{1 + e^{-x}}$
Output bounded in $(0, 1)$; historically used for binary classification outputs and LSTM gates. Saturates for large $|x|$, causing vanishing gradients in deep networks.
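Saturation is easy to see from the derivative $\sigma'(x) = \sigma(x)\,(1 - \sigma(x))$, which peaks at $0.25$ and decays exponentially in $|x|$. A minimal sketch (function names are my own):

```python
import math

def sigmoid(x):
    """sigma(x) = 1 / (1 + e^-x), bounded in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    """Derivative sigma'(x) = sigma(x) * (1 - sigma(x)); maximum 0.25 at x = 0."""
    s = sigmoid(x)
    return s * (1.0 - s)

print(sigmoid(0.0))        # 0.5
print(sigmoid_grad(0.0))   # 0.25 -- the largest the gradient ever gets
print(sigmoid_grad(10.0))  # ~4.5e-05 -- saturated: almost no gradient flows
```

Because every layer multiplies the upstream gradient by at most $0.25$, deep stacks of sigmoids shrink the signal geometrically, which is the vanishing-gradient problem in miniature.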
Tanh: $\tanh(x) = \dfrac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$
Zero-centred output in $(-1, 1)$ (better than sigmoid for hidden units); still saturates. Note that $\tanh(x) = 2\sigma(2x) - 1$.
ReLU (Rectified Linear Unit): $\mathrm{ReLU}(x) = \max(0, x)$
Currently the most widely used. Advantages: sparse activations, no saturation for $x > 0$, cheap to compute. Problem: dying ReLU; if the pre-activation $z$ is always negative (e.g., due to a large negative bias), the gradient is always zero and the neuron never updates. Can be mitigated by careful initialization.
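The dying-ReLU failure mode follows directly from the piecewise-linear form: the (sub)gradient is zero everywhere on the negative side. A minimal sketch (names and the sample pre-activations are my own):

```python
def relu(x):
    """max(0, x): identity for positive inputs, zero otherwise."""
    return x if x > 0.0 else 0.0

def relu_grad(x):
    """Subgradient: 1 for x > 0, 0 for x < 0 (undefined at 0; 0 used here)."""
    return 1.0 if x > 0.0 else 0.0

# Dying ReLU: if the pre-activation is always negative, the gradient is
# always zero, so gradient descent can never move this neuron's weights.
pre_activations = [-3.2, -0.7, -5.1, -1.4]   # e.g. from a large negative bias
grads = [relu_grad(z) for z in pre_activations]
print(grads)  # [0.0, 0.0, 0.0, 0.0] -- no update signal reaches the neuron
```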
Leaky ReLU: $f(x) = \max(\alpha x, x)$ with a small fixed slope $\alpha$ (typically $0.01$)
Allows a small gradient for $x < 0$, addressing dying ReLU. Parametric ReLU (PReLU) learns $\alpha$ instead of fixing it.
ELU (Exponential Linear Unit): $f(x) = x$ for $x > 0$, $f(x) = \alpha(e^{x} - 1)$ for $x \le 0$
Smooth; negative-side saturation at $-\alpha$ pushes mean activations towards zero, speeding up learning.
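The two branches and the saturation at $-\alpha$ can be sketched in a few lines (function name is my own):

```python
import math

def elu(x, alpha=1.0):
    """x for x > 0; alpha * (e^x - 1) for x <= 0, saturating at -alpha."""
    return x if x > 0.0 else alpha * (math.exp(x) - 1.0)

print(elu(2.0))     # 2.0 -- identity on the positive side
print(elu(-1.0))    # ~-0.632
print(elu(-100.0))  # ~-1.0 -- negative side saturates at -alpha
```

Unlike Leaky ReLU, the negative branch is bounded, so large negative pre-activations cannot blow up the output.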
GELU (Gaussian Error Linear Unit): $\mathrm{GELU}(x) = x \, \Phi(x)$
where $\Phi$ is the standard Gaussian CDF. Smooth, non-monotone; used in BERT, GPT, and most modern transformers. Outperforms ReLU on many benchmarks.
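The exact form can be written via the error function, since $\Phi(x) = \tfrac{1}{2}\bigl(1 + \operatorname{erf}(x/\sqrt{2})\bigr)$; many implementations instead use the standard tanh approximation with the $0.044715$ constant. A sketch comparing the two (function names are my own):

```python
import math

def gelu(x):
    """Exact GELU: x * Phi(x), with Phi the standard normal CDF via erf."""
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_tanh(x):
    """The common tanh approximation of GELU."""
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi)
                                      * (x + 0.044715 * x ** 3)))

# The approximation tracks the exact form closely across typical inputs.
for x in (-2.0, -0.5, 0.0, 0.5, 2.0):
    print(x, round(gelu(x), 4), round(gelu_tanh(x), 4))
```

Note the non-monotonicity: for moderately negative $x$, the output dips below zero before returning towards it, unlike ReLU's hard cutoff.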
Swish / SiLU: $f(x) = x \, \sigma(\beta x)$; SiLU is the $\beta = 1$ case
Self-gated, smooth, unbounded above; used in EfficientNet and some LLMs. Very close to GELU empirically.
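The self-gating is literal: the input multiplies a sigmoid of itself. A sketch of SiLU alongside exact GELU to show how similar the curves are (function names are my own):

```python
import math

def silu(x):
    """SiLU / Swish with beta = 1: x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))

def gelu(x):
    """Exact GELU for comparison: x * Phi(x)."""
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

# Both are smooth, non-monotone, and unbounded above; their values stay
# close over the range where most activations live.
for x in (-2.0, 0.0, 1.0, 3.0):
    print(x, round(silu(x), 3), round(gelu(x), 3))
```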
Universal Approximation Theorem: a feed-forward network with one hidden layer of sufficient width and a non-polynomial continuous activation function can approximate any continuous function on a compact domain to arbitrary accuracy. Depth analogues (depth-separation theorems) show exponential advantages for deep networks over shallow ones.
Softmax (output layer, multi-class): $\mathrm{softmax}(z)_i = \dfrac{e^{z_i}}{\sum_j e^{z_j}}$
Converts logits to a probability simplex. For numerical stability, compute $\mathrm{softmax}(z - \max_j z_j)$, which gives the same result because softmax is shift-invariant.
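The max-subtraction trick can be sketched directly; without it, exponentiating large logits overflows (function name is my own):

```python
import math

def softmax(logits):
    """Numerically stable softmax: subtract the max logit before exponentiating."""
    m = max(logits)                           # softmax is shift-invariant
    exps = [math.exp(z - m) for z in logits]  # largest exponent is exp(0) = 1
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([1000.0, 1001.0, 1002.0])  # naive exp(1000.0) would overflow
print(probs)      # a valid distribution, dominated by the largest logit
print(sum(probs)) # 1.0 up to floating point
```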
Applications
| Use case | Recommended activation |
|---|---|
| Hidden layers (general) | ReLU or GELU |
| Transformers (FFN layers) | GELU, SwiGLU |
| Recurrent networks (LSTM gates) | Sigmoid, tanh |
| Binary classification output | Sigmoid |
| Multi-class output | Softmax |
| Regression output | Linear (none) |
Trade-offs
| Activation | Pros | Cons |
|---|---|---|
| Sigmoid | Smooth, bounded | Saturates, not zero-centred |
| ReLU | Fast, sparse | Dying ReLU, not smooth |
| GELU | Smooth, strong empirical performance | Slower to compute than ReLU |
| Softmax | Valid probability output | Can saturate (peaked distributions) |
Links
- Weight Initialization — initialization must account for activation gain (Kaiming for ReLU, Xavier for sigmoid/tanh)
- Backpropagation — gradient of activation must be non-zero for signal to propagate
- Batch Normalization — often placed before or after activation; order matters
- Gradient Descent — activation choice directly affects gradient flow