Weight Initialization
Definition
The strategy for setting initial parameter values before training begins. Critical for two reasons: breaking symmetry (so neurons learn different features) and controlling gradient magnitudes (so gradients neither vanish nor explode during backprop).
Intuition
All-zero initialization never breaks symmetry — every neuron in a layer receives identical gradients and learns the same thing forever. Too-large initialization causes exploding gradients as the signal amplifies through layers. Too-small initialization causes vanishing gradients as the signal attenuates. Good initialization keeps activations and gradients at a reasonable scale throughout the entire network depth.
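To make this concrete, the following minimal NumPy sketch (the layer width, depth, and weight scales are illustrative choices, not from the original text) pushes a random input through a deep stack of tanh layers and prints how the activation scale behaves for a too-small, a roughly right, and a too-large weight standard deviation:

```python
import numpy as np

rng = np.random.default_rng(0)
n, depth = 512, 50                      # width and depth are arbitrary choices
x0 = rng.standard_normal(n)

for std in (0.01, 1.0 / np.sqrt(n), 0.2):
    x = x0.copy()
    for _ in range(depth):
        W = rng.standard_normal((n, n)) * std
        x = np.tanh(W @ x)
    print(f"weight std {std:.4f} -> activation std after {depth} layers: {x.std():.2e}")

# std = 0.01      : the signal shrinks layer by layer toward zero (vanishing).
# std = 1/sqrt(n) : roughly the Xavier scale for tanh; the signal survives the full depth.
# std = 0.2       : pre-activations blow up and tanh saturates near +/-1, which also
#                   kills gradients (a purely linear stack would explode instead).
```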
Formal Description
Symmetry breaking: If $W$ is all-zeros, the gradient $\partial \mathcal{L} / \partial W$ is identical for all neurons → they never differentiate. Must use random initialization.
Vanishing/exploding gradients: Gradients scale as $\prod_{l} W_l^\top \operatorname{diag}\!\big(f'(z_l)\big)$; if weights are small enough that $n\,\mathrm{Var}(W) < 1$, gradients vanish exponentially with depth; if $n\,\mathrm{Var}(W) > 1$, they explode exponentially.
Xavier/Glorot init: $W_{ij} \sim \mathcal{N}\!\left(0, \tfrac{2}{n_{\text{in}} + n_{\text{out}}}\right)$ — designed for linear/tanh activations; keeps variance of activations constant across layers.
He init: $W_{ij} \sim \mathcal{N}\!\left(0, \tfrac{2}{n_{\text{in}}}\right)$ — designed for ReLU activations; the factor of 2 accounts for the dead half of ReLU (which zeroes out half the activations in expectation). Both rules are sketched in code after this list.
Biases: Initialized to zero (no symmetry issue for biases since weights already break symmetry).
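A minimal NumPy sketch of the two rules (Gaussian variants; the helper names and the `(n_out, n_in)` weight-shape convention are my own choices):

```python
import numpy as np

rng = np.random.default_rng(0)

def xavier_init(n_in, n_out):
    # Glorot/Xavier: Var(W) = 2 / (n_in + n_out), for linear/tanh layers.
    std = np.sqrt(2.0 / (n_in + n_out))
    return rng.normal(0.0, std, size=(n_out, n_in))

def he_init(n_in, n_out):
    # He: Var(W) = 2 / n_in; the extra factor of 2 compensates for ReLU
    # zeroing out half of the activations in expectation.
    std = np.sqrt(2.0 / n_in)
    return rng.normal(0.0, std, size=(n_out, n_in))

def init_layer(n_in, n_out, activation="relu"):
    W = he_init(n_in, n_out) if activation == "relu" else xavier_init(n_in, n_out)
    b = np.zeros(n_out)   # zero biases: the random W already breaks symmetry
    return W, b
```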
Applications
Initialization is applied once before training. He init is the standard choice for ReLU networks. Xavier/Glorot is preferred for tanh/sigmoid networks. Residual networks are less sensitive to initialization due to skip connections providing identity paths for gradients.
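In a PyTorch workflow, for instance, this usually amounts to a one-off pass over the modules before the training loop (the architecture and layer sizes below are placeholders):

```python
import torch.nn as nn

def init_weights(module):
    # He init for ReLU layers; use nn.init.xavier_normal_ instead for tanh/sigmoid nets.
    if isinstance(module, nn.Linear):
        nn.init.kaiming_normal_(module.weight, nonlinearity="relu")
        nn.init.zeros_(module.bias)

model = nn.Sequential(
    nn.Linear(784, 256), nn.ReLU(),
    nn.Linear(256, 256), nn.ReLU(),
    nn.Linear(256, 10),
)
model.apply(init_weights)   # applied once, before training begins
```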
Trade-offs
Poor initialization can stall training entirely or leave the network in a regime it takes many updates to recover from. Good initialization does not replace batch normalization for very deep networks — both are typically needed. Orthogonal initialization can help RNNs maintain gradient flow over long sequences.
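A common recipe for the orthogonal option is to take the Q factor of a QR decomposition of a random Gaussian matrix, so every singular value equals 1 and repeated application in the recurrent loop neither shrinks nor amplifies the hidden state. A NumPy sketch under that assumption (the size and gain are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)

def orthogonal_init(n, gain=1.0):
    # Q from the QR decomposition of a random Gaussian matrix is orthogonal:
    # Q @ Q.T == I, so all singular values are exactly 1.
    A = rng.standard_normal((n, n))
    Q, R = np.linalg.qr(A)
    Q *= np.sign(np.diag(R))   # sign fix so the distribution over Q is uniform
    return gain * Q

W_hh = orthogonal_init(256)                       # hidden-to-hidden weights of an RNN
print(np.allclose(W_hh @ W_hh.T, np.eye(256)))    # True
```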