Layer Normalization

Definition

Layer Normalization (LN) normalises each sample’s activations across the feature dimension. For an activation vector x ∈ ℝᵈ within a single sample:

    LN(x) = γ ⊙ (x − μ) / √(σ² + ε) + β,  with μ = (1/d) Σᵢ xᵢ and σ² = (1/d) Σᵢ (xᵢ − μ)²

where μ is the mean, σ² is the variance, γ and β are learned scale and shift parameters (each of dimension d), and ε is a small constant for numerical stability.
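The definition maps directly onto a few lines of NumPy. This is a minimal sketch; the shapes and the ε value are illustrative:

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    # Normalise over the last (feature) axis, independently per sample.
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta

x = np.random.randn(2, 8)                      # (batch, features)
y = layer_norm(x, np.ones(8), np.zeros(8))
# Each row of y now has approximately zero mean and unit variance.
```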

Intuition

LN stabilises training by ensuring each sample’s feature vector has approximately zero mean and unit variance before entering the next layer. Unlike Batch Normalization, LN normalises within a single sample over features — so it is unaffected by batch size and can be applied to sequences of varying length.

This makes LN the natural choice for transformers and RNNs, where batch statistics are unreliable (small batches, variable-length sequences, autoregressive generation with batch size 1).

Formal Description

Comparison with Batch Normalization:

| Property | Batch Norm | Layer Norm |
| --- | --- | --- |
| Normalisation axis | Batch dimension (per feature) | Feature dimension (per sample) |
| Batch size dependence | Yes (breaks with small batches) | No |
| Sequence/variable length | Problematic | Works naturally |
| Primary use | CNNs, MLPs, ResNets | Transformers, RNNs |
| Train/test discrepancy | Yes (uses running stats at test) | No |
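The axis difference in the comparison above can be made concrete. A sketch (BN shown in training mode, without running statistics):

```python
import numpy as np

x = np.random.randn(4, 8)  # (batch, features)

# Batch Norm: statistics over the batch axis -> one (mu, sigma) per feature.
bn = (x - x.mean(axis=0)) / np.sqrt(x.var(axis=0) + 1e-5)

# Layer Norm: statistics over the feature axis -> one (mu, sigma) per sample.
ln = (x - x.mean(axis=1, keepdims=True)) / np.sqrt(x.var(axis=1, keepdims=True) + 1e-5)

# A BN output depends on the other samples in the batch; an LN output does not.
```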

Pre-LN vs Post-LN:

  • Post-LN (original Transformer, “Add & Norm”): y = LN(x + Sublayer(x)). Residual is added, then normalised. Training can be unstable without warmup.
  • Pre-LN: y = x + Sublayer(LN(x)). LN is applied before the sub-layer. Empirically more stable; used in GPT-2 onward.

Modern large language models (GPT-3, LLaMA, Mistral) typically use Pre-LN with RMSNorm.

RMSNorm (Root Mean Square Layer Normalization): simplifies LN by removing the mean-centering step:

    RMSNorm(x) = γ ⊙ x / RMS(x),  with RMS(x) = √((1/d) Σᵢ xᵢ² + ε)

Computationally cheaper (no mean computation), and empirically matches LN on most benchmarks. Used in LLaMA, Gemma, Mistral.
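A minimal RMSNorm sketch. Hedged: the placement of ε inside the root follows common implementations, and the ε value is illustrative:

```python
import numpy as np

def rms_norm(x, gamma, eps=1e-6):
    # No mean subtraction: scale each sample by its root mean square only.
    rms = np.sqrt((x ** 2).mean(axis=-1, keepdims=True) + eps)
    return gamma * x / rms

x = np.random.randn(2, 8)
y = rms_norm(x, np.ones(8))
# Rows of y have RMS close to 1, but their means are not forced to zero.
```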

Transformer block with Pre-LN:

    y = x + Attention(LN(x))
    z = y + FFN(LN(y))

Each sub-layer receives normalised input, improving gradient flow through deep stacks.
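The two lines above translate directly into code. A sketch in which `attention` and `ffn` are stand-in placeholders for the real sub-layers:

```python
import numpy as np

def ln(x, eps=1e-5):
    # Plain LN without learnable gamma/beta, over the feature axis.
    mu = x.mean(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(x.var(axis=-1, keepdims=True) + eps)

# Placeholder sub-layers standing in for self-attention and the FFN.
def attention(x):
    return x @ np.eye(x.shape[-1])

def ffn(x):
    return np.tanh(x)

def pre_ln_block(x):
    y = x + attention(ln(x))  # y = x + Attention(LN(x))
    z = y + ffn(ln(y))        # z = y + FFN(LN(y))
    return z

x = np.random.randn(3, 16)
out = pre_ln_block(x)
```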

Gradient flow argument: without normalisation, activations can grow or shrink exponentially with depth. LN pins each layer’s per-feature variance near 1 (so activation norms stay near √d), keeping gradients well-scaled throughout.
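The argument can be illustrated numerically. A toy depth-20 stack of random linear maps, with the weight scale deliberately chosen slightly too large so the unnormalised path grows:

```python
import numpy as np

rng = np.random.default_rng(0)
d, depth = 64, 20
x_plain = x_ln = rng.standard_normal(d)

def ln(v, eps=1e-5):
    return (v - v.mean()) / np.sqrt(v.var() + eps)

for _ in range(depth):
    # Gain of ~1.5 per layer: mildly mis-scaled initialisation.
    W = rng.standard_normal((d, d)) * (1.5 / np.sqrt(d))
    x_plain = W @ x_plain   # norm grows ~1.5x per layer -> exponential blow-up
    x_ln = ln(W @ x_ln)     # norm pinned near sqrt(d) by LN

# x_plain's norm explodes exponentially with depth; x_ln's stays bounded.
```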

Learnable parameters: each LN layer has two trainable parameter vectors (γ and β, each of dimension d). At initialisation: γ = 1, β = 0 (identity map).

Applications

| Model family | Normalization used |
| --- | --- |
| Original Transformer | Post-LN |
| GPT-2, GPT-3 | Pre-LN |
| LLaMA, Mistral, Gemma | Pre-RMSNorm |
| BERT | Post-LN |
| Vision Transformer (ViT) | Pre-LN |
| LSTM (modern) | LN on gates |

Trade-offs

  • LN has higher per-step cost than BN (cannot cache running statistics) but eliminates batch size constraints entirely.
  • Pre-LN is more stable than Post-LN for very deep models but slightly changes the representation (residual path has no normalisation).
  • For very large hidden dimensions d (e.g., in LLaMA), RMSNorm is a meaningful speed improvement over full LN.