Weight Initialization

Definition

The strategy for setting initial parameter values before training begins. Critical for two reasons: breaking symmetry (so neurons learn different features) and controlling gradient magnitudes (so gradients neither vanish nor explode during backprop).

Intuition

All-zero initialization causes symmetry — every neuron in a layer receives identical gradients and learns the same thing forever. Too-large initialization causes exploding gradients as the signal amplifies through layers. Too-small initialization causes vanishing gradients as the signal attenuates. Good initialization keeps activations and gradients at a reasonable scale throughout the entire network depth.
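
The scale effect is easy to see empirically. A minimal NumPy sketch (illustration only; the widths, depth, and scales below are arbitrary choices) pushes one input through a deep stack of random ReLU layers and measures how the activation scale evolves:

```python
import numpy as np

rng = np.random.default_rng(0)
depth, width = 50, 256
x = rng.standard_normal(width)

def forward_std(scale):
    """Push a random input through `depth` ReLU layers whose weights
    are drawn from N(0, scale^2); return the final activation std."""
    h = x.copy()
    for _ in range(depth):
        W = rng.standard_normal((width, width)) * scale
        h = np.maximum(W @ h, 0.0)  # ReLU
    return h.std()

too_small = forward_std(0.01)                # signal shrinks toward zero
too_large = forward_std(0.5)                 # signal blows up
well_scaled = forward_std(np.sqrt(2 / width))  # He scaling: std stays O(1)
print(too_small, too_large, well_scaled)
```

Only the last choice, which ties the weight variance to the layer width, keeps the signal at a usable magnitude after 50 layers.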

Formal Description

Symmetry breaking: If W is initialized to a constant (all zeros being the usual example), every neuron in a layer computes the same output and receives an identical gradient ∂L/∂W → they never differentiate. Must use random initialization.

Vanishing/exploding gradients: Backpropagated gradients scale as the product of layer Jacobians, roughly ∏_l W_l^T; if the weights' typical singular values are below 1, gradients vanish exponentially with depth; if above 1, they explode exponentially.
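
The same product structure can be checked directly on the backward pass. This sketch (linear layers only, arbitrary width/depth) propagates a unit-norm output gradient through the transposed weight matrices:

```python
import numpy as np

rng = np.random.default_rng(5)
width, depth = 128, 40

def grad_norm(scale):
    """Backprop an output gradient of norm 1 through `depth` linear
    layers with weights ~ N(0, scale^2); return the input-gradient norm."""
    g = rng.standard_normal(width)
    g /= np.linalg.norm(g)
    for _ in range(depth):
        W = rng.standard_normal((width, width)) * scale
        g = W.T @ g  # gradient flows through the transposed weights
    return np.linalg.norm(g)

print(grad_norm(0.01))                  # vanishes: effectively zero
print(grad_norm(0.3))                   # explodes: astronomically large
print(grad_norm(np.sqrt(1.0 / width)))  # fan-in scaling: norm stays O(1)
```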

Xavier/Glorot init: W ~ N(0, 2/(n_in + n_out)) — designed for linear/tanh activations; keeps the variance of activations roughly constant across layers.
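
A sketch of the Glorot formula in NumPy (square layers for simplicity; the depth and width are arbitrary) shows tanh activations staying at a workable scale rather than saturating or collapsing:

```python
import numpy as np

rng = np.random.default_rng(1)
n_in = n_out = 512

def xavier(n_in, n_out):
    # Glorot & Bengio (2010): Var(W) = 2 / (n_in + n_out)
    return rng.standard_normal((n_out, n_in)) * np.sqrt(2.0 / (n_in + n_out))

h = rng.standard_normal(n_in)
for _ in range(30):
    h = np.tanh(xavier(n_in, n_out) @ h)

# tanh compresses variance a little per layer, but activations stay
# well away from both zero and saturation after 30 layers.
print(h.std())
```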

He init: W ~ N(0, 2/n_in) — designed for ReLU activations; the factor of 2 accounts for the dead half of ReLU (which zeroes out half the activations in expectation).
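
The factor of 2 can be verified in one layer: with Var(W) = 2/n_in, the doubling of the pre-activation variance exactly offsets ReLU discarding half the signal, so the second moment of the activations is preserved. A NumPy sketch (arbitrary width):

```python
import numpy as np

rng = np.random.default_rng(2)

def he_init(n_in, n_out):
    # He et al. (2015): Var(W) = 2 / n_in, compensating for ReLU
    # zeroing out half the pre-activations in expectation.
    return rng.standard_normal((n_out, n_in)) * np.sqrt(2.0 / n_in)

n = 1024
x = rng.standard_normal(n)
h = np.maximum(he_init(n, n) @ x, 0.0)

# E[h^2] ~= E[x^2]: the factor of 2 cancels the halving by ReLU.
print(np.mean(x**2), np.mean(h**2))
```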

Biases: Initialized to zero (no symmetry issue for biases since weights already break symmetry).

Applications

Initialization is applied once before training. He init is the standard choice for ReLU networks. Xavier/Glorot is preferred for tanh/sigmoid networks. Residual networks are less sensitive to initialization due to skip connections providing identity paths for gradients.
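
These conventions can be bundled into a small helper. This is a hand-rolled sketch, not any particular framework's API (the function and parameter names here are our own):

```python
import numpy as np

rng = np.random.default_rng(3)

def init_layer(n_in, n_out, activation):
    """Draw weights with the conventional scale for the given activation
    and return zero biases (a common heuristic, not a library API)."""
    if activation == "relu":
        std = np.sqrt(2.0 / n_in)             # He
    elif activation in ("tanh", "sigmoid"):
        std = np.sqrt(2.0 / (n_in + n_out))   # Xavier/Glorot
    else:
        std = np.sqrt(1.0 / n_in)             # plain fan-in scaling
    W = rng.standard_normal((n_out, n_in)) * std
    b = np.zeros(n_out)                       # biases start at zero
    return W, b

W, b = init_layer(784, 256, "relu")
print(W.std(), b.sum())
```

Framework initializers (e.g. in PyTorch or JAX libraries) implement the same rules, usually with an extra "gain" knob per activation.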

Trade-offs

Poor initialization can make training practically impossible or extremely slow to recover from. Good initialization does not replace batch normalization for very deep networks — both are typically needed. Orthogonal initialization can help RNNs maintain gradient flow over long sequences.
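
Orthogonal initialization can be sketched with a QR decomposition of a Gaussian matrix (the standard recipe, following Saxe et al., 2013). Because an orthogonal matrix preserves vector norms exactly, repeated multiplication by it — as in an unrolled recurrent step — neither shrinks nor grows the signal:

```python
import numpy as np

rng = np.random.default_rng(4)

def orthogonal_init(n):
    """Square orthogonal matrix via QR of a Gaussian."""
    A = rng.standard_normal((n, n))
    Q, R = np.linalg.qr(A)
    # Sign fix so the result is uniformly distributed over
    # orthogonal matrices (scales each column by sign of R's diagonal).
    return Q * np.sign(np.diag(R))

W = orthogonal_init(128)
x = rng.standard_normal(128)
h = x
for _ in range(100):
    h = W @ h  # 100 recurrent-style steps

print(np.linalg.norm(x), np.linalg.norm(h))  # norms match
```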