Layer Normalization

Definition

Layer Normalization (LN) normalises each sample’s activations across the feature dimension. For an activation vector x ∈ ℝᵈ within a single sample:

    LN(x) = γ ⊙ (x − μ) / √(σ² + ε) + β,  with μ = (1/d) Σᵢ xᵢ and σ² = (1/d) Σᵢ (xᵢ − μ)²

where μ is the mean, σ² is the variance, γ and β are learned scale and shift parameters (each of dimension d), and ε is a small constant for numerical stability.
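The definition maps directly onto a few lines of NumPy. This is a minimal sketch; the shapes and the ε value are illustrative:

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    # Normalise over the last (feature) axis, independently per sample.
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta

x = np.random.randn(2, 8)                      # (batch, features)
y = layer_norm(x, np.ones(8), np.zeros(8))
# Each row of y now has approximately zero mean and unit variance.
```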

Intuition

LN stabilises training by ensuring each sample’s feature vector has approximately zero mean and unit variance before entering the next layer. Unlike Batch Normalization, LN normalises within a single sample over features — so it is unaffected by batch size and can be applied to sequences of varying length.

This makes LN the natural choice for transformers and RNNs, where batch statistics are unreliable (small batches, variable-length sequences, autoregressive generation with batch size 1).

Formal Description

Comparison with Batch Normalization:

| Property | Batch Norm | Layer Norm |
| --- | --- | --- |
| Normalisation axis | Batch dimension (per feature) | Feature dimension (per sample) |
| Batch size dependence | Yes (breaks with small batches) | No |
| Sequence/variable length | Problematic | Works naturally |
| Primary use | CNNs, MLPs, ResNets | Transformers, RNNs |
| Train/test discrepancy | Yes (uses running stats at test) | No |
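The axis difference in the comparison above can be made concrete. A sketch (BN shown in training mode, without running statistics):

```python
import numpy as np

x = np.random.randn(4, 8)  # (batch, features)

# Batch Norm: statistics over the batch axis -> one (mu, sigma) per feature.
bn = (x - x.mean(axis=0)) / np.sqrt(x.var(axis=0) + 1e-5)

# Layer Norm: statistics over the feature axis -> one (mu, sigma) per sample.
ln = (x - x.mean(axis=1, keepdims=True)) / np.sqrt(x.var(axis=1, keepdims=True) + 1e-5)

# A BN output depends on the other samples in the batch; an LN output does not.
```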

Pre-LN vs Post-LN:

  • Post-LN (original Transformer, “Add & Norm”): y = LN(x + Sublayer(x)). Residual is added, then normalised. Training can be unstable without warmup.
  • Pre-LN: y = x + Sublayer(LN(x)). LN is applied before the sub-layer. Empirically more stable; used in GPT-2 onward.

Modern large language models (GPT-3, LLaMA, Mistral) typically use Pre-LN with RMSNorm.

RMSNorm (Root Mean Square Layer Normalization): simplifies LN by removing the mean-centering step:

    RMSNorm(x) = γ ⊙ x / RMS(x),  with RMS(x) = √((1/d) Σᵢ xᵢ² + ε)

Computationally cheaper (no mean computation), and empirically matches LN on most benchmarks. Used in LLaMA, Gemma, Mistral.
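A minimal RMSNorm sketch. Hedged: the placement of ε inside the root follows common implementations, and the ε value is illustrative:

```python
import numpy as np

def rms_norm(x, gamma, eps=1e-6):
    # No mean subtraction: scale each sample by its root mean square only.
    rms = np.sqrt((x ** 2).mean(axis=-1, keepdims=True) + eps)
    return gamma * x / rms

x = np.random.randn(2, 8)
y = rms_norm(x, np.ones(8))
# Rows of y have RMS close to 1, but their means are not forced to zero.
```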

Transformer block with Pre-LN:

    y = x + Attention(LN(x))
    z = y + FFN(LN(y))

Each sub-layer receives normalised input, improving gradient flow through deep stacks.
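The two lines above translate directly into code. A sketch in which `attention` and `ffn` are stand-in placeholders for the real sub-layers:

```python
import numpy as np

def ln(x, eps=1e-5):
    # Plain LN without learnable gamma/beta, over the feature axis.
    mu = x.mean(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(x.var(axis=-1, keepdims=True) + eps)

# Placeholder sub-layers standing in for self-attention and the FFN.
def attention(x):
    return x @ np.eye(x.shape[-1])

def ffn(x):
    return np.tanh(x)

def pre_ln_block(x):
    y = x + attention(ln(x))  # y = x + Attention(LN(x))
    z = y + ffn(ln(y))        # z = y + FFN(LN(y))
    return z

x = np.random.randn(3, 16)
out = pre_ln_block(x)
```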

Gradient flow argument: without normalisation, activations can grow or shrink exponentially with depth. LN pins each layer’s per-feature variance near 1 (so activation norms stay near √d), keeping gradients well-scaled throughout.
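The argument can be illustrated numerically. A toy depth-20 stack of random linear maps, with the weight scale deliberately chosen slightly too large so the unnormalised path grows:

```python
import numpy as np

rng = np.random.default_rng(0)
d, depth = 64, 20
x_plain = x_ln = rng.standard_normal(d)

def ln(v, eps=1e-5):
    return (v - v.mean()) / np.sqrt(v.var() + eps)

for _ in range(depth):
    # Gain of ~1.5 per layer: mildly mis-scaled initialisation.
    W = rng.standard_normal((d, d)) * (1.5 / np.sqrt(d))
    x_plain = W @ x_plain   # norm grows ~1.5x per layer -> exponential blow-up
    x_ln = ln(W @ x_ln)     # norm pinned near sqrt(d) by LN

# x_plain's norm explodes exponentially with depth; x_ln's stays bounded.
```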

Learnable parameters: each LN layer has two trainable parameter vectors (γ and β, each of dimension d). At initialisation: γ = 1, β = 0 (identity map).

Applications

| Model family | Normalization used |
| --- | --- |
| Original Transformer | Post-LN |
| GPT-2, GPT-3 | Pre-LN |
| LLaMA, Mistral, Gemma | Pre-RMSNorm |
| BERT | Post-LN |
| Vision Transformer (ViT) | Pre-LN |
| LSTM (modern) | LN on gates |

Trade-offs

  • LN has higher per-step cost than BN (cannot cache running statistics) but eliminates batch size constraints entirely.
  • Pre-LN is more stable than Post-LN for very deep models but slightly changes the representation (residual path has no normalisation).
  • For very large hidden dimensions d (e.g., in LLaMA), RMSNorm is a meaningful speed improvement over full LN.