Activation Functions
Definition
An activation function applied element-wise after a linear transformation introduces non-linearity into a neural network. Without activation functions, a deep network collapses to a single linear map regardless of depth.
Intuition
A linear layer computes a linear combination of inputs. Stacking many linear layers does nothing that one linear layer couldn’t do. Activation functions break linearity, enabling the network to approximate arbitrary functions (universal approximation theorem). The choice of activation affects gradient flow, training speed, and expressiveness.
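The collapse of stacked linear layers can be checked numerically. A minimal sketch in pure Python (2×2 weights without biases, values chosen arbitrarily for illustration):

```python
# Two linear layers with no activation between them collapse into a single
# linear layer whose weight matrix is the product W2 @ W1.

def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

W1 = [[1.0, 2.0], [3.0, 4.0]]   # first linear layer (arbitrary example weights)
W2 = [[0.5, -1.0], [2.0, 0.0]]  # second linear layer

x = [[1.0], [-2.0]]             # a column-vector input

# Applying the two layers in sequence...
two_layers = matmul(W2, matmul(W1, x))
# ...matches the single fused layer with weight W2 @ W1.
one_layer = matmul(matmul(W2, W1), x)

print(two_layers == one_layer)  # True: depth added no expressive power
```

Inserting any non-linearity between `W1` and `W2` breaks this factorisation, which is exactly what the definition above requires.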
Formal Description
Sigmoid: $\sigma(x) = \dfrac{1}{1 + e^{-x}}$
Output bounded in $(0, 1)$; historically used for binary classification outputs and LSTM gates. Saturates for large $|x|$, causing vanishing gradients in deep networks.
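Saturation is easy to see from the derivative $\sigma'(x) = \sigma(x)\,(1 - \sigma(x))$, which peaks at $0.25$ and decays exponentially in $|x|$. A minimal sketch (function names are my own):

```python
import math

def sigmoid(x):
    """sigma(x) = 1 / (1 + e^-x), bounded in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    """Derivative sigma'(x) = sigma(x) * (1 - sigma(x)); maximum 0.25 at x = 0."""
    s = sigmoid(x)
    return s * (1.0 - s)

print(sigmoid(0.0))        # 0.5
print(sigmoid_grad(0.0))   # 0.25 -- the largest the gradient ever gets
print(sigmoid_grad(10.0))  # ~4.5e-05 -- saturated: almost no gradient flows
```

Because every layer multiplies the upstream gradient by at most $0.25$, deep stacks of sigmoids shrink the signal geometrically, which is the vanishing-gradient problem in miniature.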
Tanh: $\tanh(x) = \dfrac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$
Zero-centred output in $(-1, 1)$ (better than sigmoid for hidden units); still saturates. Note that $\tanh(x) = 2\sigma(2x) - 1$.
ReLU (Rectified Linear Unit): $\mathrm{ReLU}(x) = \max(0, x)$
Currently the most widely used. Advantages: sparse activations, no saturation for $x > 0$, cheap to compute. Problem: dying ReLU; if the pre-activation $z$ is always negative (e.g., due to a large negative bias), the gradient is always zero and the neuron never updates. Can be mitigated by careful initialization.
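The dying-ReLU failure mode follows directly from the piecewise-linear form: the (sub)gradient is zero everywhere on the negative side. A minimal sketch (names and the sample pre-activations are my own):

```python
def relu(x):
    """max(0, x): identity for positive inputs, zero otherwise."""
    return x if x > 0.0 else 0.0

def relu_grad(x):
    """Subgradient: 1 for x > 0, 0 for x < 0 (undefined at 0; 0 used here)."""
    return 1.0 if x > 0.0 else 0.0

# Dying ReLU: if the pre-activation is always negative, the gradient is
# always zero, so gradient descent can never move this neuron's weights.
pre_activations = [-3.2, -0.7, -5.1, -1.4]   # e.g. from a large negative bias
grads = [relu_grad(z) for z in pre_activations]
print(grads)  # [0.0, 0.0, 0.0, 0.0] -- no update signal reaches the neuron
```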
Leaky ReLU: $f(x) = \max(\alpha x, x)$ with a small fixed slope $\alpha$ (typically $0.01$)
Allows a small gradient for $x < 0$, addressing dying ReLU. Parametric ReLU (PReLU) learns $\alpha$ instead of fixing it.
ELU (Exponential Linear Unit): $f(x) = x$ for $x > 0$, $f(x) = \alpha(e^{x} - 1)$ for $x \le 0$
Smooth; negative-side saturation at $-\alpha$ pushes mean activations towards zero, speeding up learning.
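The two branches and the saturation at $-\alpha$ can be sketched in a few lines (function name is my own):

```python
import math

def elu(x, alpha=1.0):
    """x for x > 0; alpha * (e^x - 1) for x <= 0, saturating at -alpha."""
    return x if x > 0.0 else alpha * (math.exp(x) - 1.0)

print(elu(2.0))     # 2.0 -- identity on the positive side
print(elu(-1.0))    # ~-0.632
print(elu(-100.0))  # ~-1.0 -- negative side saturates at -alpha
```

Unlike Leaky ReLU, the negative branch is bounded, so large negative pre-activations cannot blow up the output.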
GELU (Gaussian Error Linear Unit): $\mathrm{GELU}(x) = x \, \Phi(x)$
where $\Phi$ is the standard Gaussian CDF. Smooth, non-monotone; used in BERT, GPT, and most modern transformers. Outperforms ReLU on many benchmarks.
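The exact form can be written via the error function, since $\Phi(x) = \tfrac{1}{2}\bigl(1 + \operatorname{erf}(x/\sqrt{2})\bigr)$; many implementations instead use the standard tanh approximation with the $0.044715$ constant. A sketch comparing the two (function names are my own):

```python
import math

def gelu(x):
    """Exact GELU: x * Phi(x), with Phi the standard normal CDF via erf."""
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_tanh(x):
    """The common tanh approximation of GELU."""
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi)
                                      * (x + 0.044715 * x ** 3)))

# The approximation tracks the exact form closely across typical inputs.
for x in (-2.0, -0.5, 0.0, 0.5, 2.0):
    print(x, round(gelu(x), 4), round(gelu_tanh(x), 4))
```

Note the non-monotonicity: for moderately negative $x$, the output dips below zero before returning towards it, unlike ReLU's hard cutoff.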
Swish / SiLU: $f(x) = x \, \sigma(\beta x)$; SiLU is the $\beta = 1$ case
Self-gated, smooth, unbounded above; used in EfficientNet and some LLMs. Very close to GELU empirically.
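The self-gating is literal: the input multiplies a sigmoid of itself. A sketch of SiLU alongside exact GELU to show how similar the curves are (function names are my own):

```python
import math

def silu(x):
    """SiLU / Swish with beta = 1: x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))

def gelu(x):
    """Exact GELU for comparison: x * Phi(x)."""
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

# Both are smooth, non-monotone, and unbounded above; their values stay
# close over the range where most activations live.
for x in (-2.0, 0.0, 1.0, 3.0):
    print(x, round(silu(x), 3), round(gelu(x), 3))
```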
Universal Approximation Theorem: a feed-forward network with one hidden layer of sufficient width and a non-polynomial continuous activation function can approximate any continuous function on a compact domain to arbitrary accuracy. Depth analogues (depth-separation theorems) show exponential advantages for deep networks over shallow ones.
Softmax (output layer, multi-class): $\mathrm{softmax}(z)_i = \dfrac{e^{z_i}}{\sum_j e^{z_j}}$
Converts logits to a probability simplex. For numerical stability, compute $\mathrm{softmax}(z - \max_j z_j)$, which gives the same result because softmax is shift-invariant.
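The max-subtraction trick can be sketched directly; without it, exponentiating large logits overflows (function name is my own):

```python
import math

def softmax(logits):
    """Numerically stable softmax: subtract the max logit before exponentiating."""
    m = max(logits)                           # softmax is shift-invariant
    exps = [math.exp(z - m) for z in logits]  # largest exponent is exp(0) = 1
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([1000.0, 1001.0, 1002.0])  # naive exp(1000.0) would overflow
print(probs)      # a valid distribution, dominated by the largest logit
print(sum(probs)) # 1.0 up to floating point
```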
Applications
| Use case | Recommended activation |
|---|---|
| Hidden layers (general) | ReLU or GELU |
| Transformers (FFN layers) | GELU, SwiGLU |
| Recurrent networks (LSTM gates) | Sigmoid, tanh |
| Binary classification output | Sigmoid |
| Multi-class output | Softmax |
| Regression output | Linear (none) |
Trade-offs
| Activation | Pros | Cons |
|---|---|---|
| Sigmoid | Smooth, bounded | Saturates, not zero-centred |
| ReLU | Fast, sparse | Dying ReLU, not smooth |
| GELU | Smooth, strong empirical performance | Slower to compute than ReLU |
| Softmax | Valid probability output | Can saturate (peaked distributions) |
Links
- Weight Initialization — initialization must account for activation gain (Kaiming for ReLU, Xavier for sigmoid/tanh)
- Backpropagation — gradient of activation must be non-zero for signal to propagate
- Batch Normalization — often placed before or after activation; order matters
- Gradient Descent — activation choice directly affects gradient flow