Activation Functions

Definition

An activation function applied element-wise after a linear transformation introduces non-linearity into a neural network. Without activation functions, a deep network collapses to a single linear map regardless of depth.

Intuition

A linear layer computes a linear combination of inputs. Stacking many linear layers does nothing that one linear layer couldn’t do. Activation functions break linearity, enabling the network to approximate arbitrary functions (universal approximation theorem). The choice of activation affects gradient flow, training speed, and expressiveness.
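The collapse of stacked linear layers can be checked directly in a tiny sketch (scalar "layers" with arbitrarily chosen weights, for illustration only):

```python
# Two stacked linear maps compose into a single linear map:
# w2*(w1*x + b1) + b2 == (w2*w1)*x + (w2*b1 + b2).
w1, b1 = 2.0, 1.0   # first linear "layer" (hypothetical weights)
w2, b2 = -3.0, 0.5  # second linear "layer"

def layer1(x): return w1 * x + b1
def layer2(h): return w2 * h + b2

# The composed two-layer network equals ONE linear layer with these parameters:
w, b = w2 * w1, w2 * b1 + b2

x = 4.0
assert layer2(layer1(x)) == w * x + b  # depth added no expressive power
```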

Formal Description

Sigmoid:

$\sigma(x) = \dfrac{1}{1 + e^{-x}}$

Output bounded in $(0, 1)$ — historically used for binary classification outputs and LSTM gates. Saturates for $|x| \gg 0$, causing vanishing gradients in deep networks.
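A minimal sketch of the sigmoid; the piecewise form is a standard trick to avoid overflow in `exp` for large negative inputs:

```python
import math

def sigmoid(x):
    # Numerically stable: never evaluates exp on a large positive argument.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

# Bounded in (0, 1), with sigmoid(0) = 0.5; saturates for large |x|.
assert sigmoid(0.0) == 0.5
```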

Tanh:

$\tanh(x) = \dfrac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$

Zero-centred (better than sigmoid for hidden units); still saturates. $\tanh(x) = 2\sigma(2x) - 1$.
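The identity relating tanh to the sigmoid can be verified numerically against the library implementation:

```python
import math

def tanh_via_sigmoid(x):
    # tanh(x) = 2*sigmoid(2x) - 1, written out with the sigmoid inlined.
    return 2.0 / (1.0 + math.exp(-2.0 * x)) - 1.0

# Agrees with math.tanh to floating-point precision.
assert abs(tanh_via_sigmoid(1.0) - math.tanh(1.0)) < 1e-12
```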

ReLU (Rectified Linear Unit):

$\mathrm{ReLU}(x) = \max(0, x)$

Currently the most widely used. Advantages: sparse activations, no saturation for $x > 0$, cheap to compute. Problem: dying ReLU — if the pre-activation $x$ is always negative (e.g., due to a large negative bias), the gradient is always zero and the neuron never updates. Can be mitigated by careful initialization.
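A sketch of ReLU and its (sub)gradient, which makes the dying-ReLU failure mode concrete:

```python
def relu(x):
    return max(0.0, x)

def relu_grad(x):
    # Subgradient: zero for x < 0 — a neuron whose pre-activation is always
    # negative receives no gradient and never updates ("dying ReLU").
    return 1.0 if x > 0 else 0.0

assert relu(-3.0) == 0.0 and relu(2.5) == 2.5
assert relu_grad(-3.0) == 0.0  # no learning signal on the negative side
```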

Leaky ReLU:

$f(x) = \begin{cases} x, & x > 0 \\ \alpha x, & x \le 0 \end{cases}$ with a small fixed slope (commonly $\alpha = 0.01$).

Allows a small gradient for $x < 0$, addressing dying ReLU. Parametric ReLU (PReLU) learns $\alpha$.
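A sketch, using the conventional default slope (PReLU would instead treat `alpha` as a learned parameter):

```python
def leaky_relu(x, alpha=0.01):
    # alpha = 0.01 is the common default; any small positive slope works.
    return x if x > 0 else alpha * x

assert leaky_relu(5.0) == 5.0
# Negative inputs keep a small, nonzero output (and gradient alpha).
assert abs(leaky_relu(-10.0) + 0.1) < 1e-12
```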

ELU (Exponential Linear Unit):

$f(x) = \begin{cases} x, & x > 0 \\ \alpha(e^{x} - 1), & x \le 0 \end{cases}$

Smooth; negative-side saturation pushes mean activations towards zero, speeding up learning.
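A sketch with the usual default $\alpha = 1$, showing the smooth saturation towards $-\alpha$ on the negative side:

```python
import math

def elu(x, alpha=1.0):
    # Identity for x > 0; smoothly approaches -alpha as x -> -inf.
    return x if x > 0 else alpha * (math.exp(x) - 1.0)

assert elu(2.0) == 2.0
assert -1.0 < elu(-10.0) < -0.999  # saturating near -alpha
```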

GELU (Gaussian Error Linear Unit):

$\mathrm{GELU}(x) = x \, \Phi(x)$

where $\Phi$ is the Gaussian CDF. Smooth, non-monotone; used in BERT, GPT, and most modern transformers. Outperforms ReLU on many benchmarks.
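A sketch of the exact GELU via the error function, alongside the tanh-based approximation commonly found in implementations (constants from the original GELU formulation):

```python
import math

def gelu(x):
    # Exact GELU: x * Phi(x), with Phi the standard normal CDF via erf.
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_tanh(x):
    # Widely used tanh approximation of the same function.
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi)
                                      * (x + 0.044715 * x ** 3)))

assert gelu(0.0) == 0.0
# The two forms agree closely over typical activation ranges.
assert abs(gelu(1.0) - gelu_tanh(1.0)) < 5e-3
```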

Swish / SiLU:

$f(x) = x \, \sigma(x)$

Self-gated, smooth, unbounded above; used in EfficientNet and some LLMs. Very close to GELU empirically.
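A sketch of the self-gating: the input multiplied by its own sigmoid, which approaches the identity for large positive inputs:

```python
import math

def silu(x):
    # Swish/SiLU: the input gated by its own sigmoid.
    return x / (1.0 + math.exp(-x))

assert silu(0.0) == 0.0
assert abs(silu(20.0) - 20.0) < 1e-6  # gate saturates to 1 for large x
```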

Universal Approximation Theorem: a feed-forward network with one hidden layer of sufficient width and a non-polynomial continuous activation function can approximate any continuous function on a compact domain to arbitrary accuracy. Depth analogues (depth-separation theorems) show exponential advantages for deep networks over shallow ones.

Softmax (output layer, multi-class):

$\mathrm{softmax}(z)_i = \dfrac{e^{z_i}}{\sum_j e^{z_j}}$

Converts logits to a point on the probability simplex. For numerical stability, compute $\mathrm{softmax}(z - \max_j z_j)$; subtracting a constant leaves the output unchanged but keeps the exponentials bounded.
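A sketch of the stability trick; with logits around 1000, the naive form would overflow in `exp`, while the max-shifted form is exact and safe:

```python
import math

def softmax(logits):
    # Subtract the max logit before exponentiating: shifts every exponent
    # into a safe range without changing the resulting probabilities.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([1000.0, 1001.0, 1002.0])  # naive exp(1000.0) would overflow
assert abs(sum(probs) - 1.0) < 1e-12       # a valid probability distribution
```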

Applications

Use case                           Recommended activation
Hidden layers (general)            ReLU or GELU
Transformers (FFN layers)          GELU, SwiGLU
Recurrent networks (LSTM gates)    Sigmoid, tanh
Binary classification output       Sigmoid
Multi-class output                 Softmax
Regression output                  Linear (none)

Trade-offs

Activation    Pros                                   Cons
Sigmoid       Smooth, bounded                        Saturates, not zero-centred
ReLU          Fast, sparse                           Dying ReLU, not smooth
GELU          Smooth, strong empirical performance   Slower to compute than ReLU
Softmax       Valid probability output               Can saturate (peaked distributions)