Backpropagation Through Time
Definition
Backpropagation applied to unrolled recurrent networks; gradients flow backward through both layers and time steps.
Intuition
An RNN unrolled over $T$ time steps is a deep network of depth $T$ in which all time steps share the same weights. Gradients must traverse the entire time dimension, passing through the same weight matrix at every step; this repeated multiplication is what causes both vanishing and exploding gradients.
Formal Description
Unrolled RNN: at each step $t$:

$h_t = \tanh(W_{hh} h_{t-1} + W_{xh} x_t + b), \qquad \hat{y}_t = W_{hy} h_t$

Gradient of the loss at step $t$ with respect to an earlier hidden state $h_k$ involves products of Jacobians across all intervening time steps:

$\frac{\partial L_t}{\partial h_k} = \frac{\partial L_t}{\partial h_t} \prod_{i=k+1}^{t} \frac{\partial h_i}{\partial h_{i-1}}, \qquad \frac{\partial h_i}{\partial h_{i-1}} = \mathrm{diag}\big(1 - h_i^2\big)\, W_{hh}$
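A minimal numerical sketch of the Jacobian product above (illustrative, not from the note): for a small tanh RNN with modestly scaled random weights, the spectral norm of the running product $\prod_i \mathrm{diag}(1-h_i^2)\,W_{hh}$ collapses toward zero as more steps are chained.

```python
import numpy as np

# Sketch: track the norm of the running product of per-step Jacobians
# dh_i/dh_{i-1} = diag(1 - h_i^2) @ W_hh for a randomly initialized tanh RNN.
rng = np.random.default_rng(0)
H, T = 16, 50
W_hh = rng.normal(scale=0.3 / np.sqrt(H), size=(H, H))  # small-scale init

h = np.zeros(H)
J = np.eye(H)          # running product of Jacobians
norms = []
for t in range(T):
    x = rng.normal(size=H)
    h = np.tanh(W_hh @ h + x)
    J = (np.diag(1 - h**2) @ W_hh) @ J   # chain one more Jacobian
    norms.append(np.linalg.norm(J, 2))   # spectral norm of the product

# The norm decays roughly geometrically: gradients reaching early steps vanish.
print(norms[0], norms[-1])
```

With this weight scale each Jacobian has norm below 1, so the product shrinks exponentially; scaling `W_hh` up instead makes the same product blow up.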
Vanishing gradient: when $\left\|\partial h_i / \partial h_{i-1}\right\| < 1$ repeatedly, the product shrinks exponentially → early time steps receive near-zero gradient → model cannot learn long-range dependencies.
Exploding gradient: when $\left\|\partial h_i / \partial h_{i-1}\right\| > 1$ repeatedly, the product grows exponentially → numerical instability → remedy: gradient clipping by norm.
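The clipping remedy can be sketched as follows (a hypothetical helper, assuming clipping by global L2 norm across all parameter gradients):

```python
import numpy as np

def clip_by_global_norm(grads, max_norm=5.0):
    """Rescale a list of gradient arrays so their combined L2 norm
    does not exceed max_norm; direction is preserved, only length shrinks."""
    total = np.sqrt(sum(np.sum(g**2) for g in grads))
    scale = min(1.0, max_norm / (total + 1e-12))
    return [g * scale for g in grads], total

# Example: exploded gradients with a huge global norm get rescaled to max_norm.
grads = [np.full((3, 3), 100.0), np.full(3, 100.0)]
clipped, norm_before = clip_by_global_norm(grads, max_norm=5.0)
norm_after = np.sqrt(sum(np.sum(g**2) for g in clipped))
print(norm_before, norm_after)
```

Clipping does not fix vanishing gradients; it only bounds the update size when the product explodes.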
Truncated BPTT: backpropagate only $k$ steps back in time to reduce memory usage and gradient path length. This introduces bias in the gradient estimates but is often sufficient in practice.
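A minimal sketch of truncated BPTT, assuming a tanh RNN trained with a squared-error loss (all names and hyperparameters here are illustrative): the sequence is processed in chunks of $k$ steps, the hidden state is carried across chunks, but the backward pass stops at each chunk boundary, so gradient paths and stored activations are $O(k)$ instead of $O(T)$.

```python
import numpy as np

rng = np.random.default_rng(0)
H, T, k = 8, 40, 10
W_hh = rng.normal(scale=0.2, size=(H, H))
W_xh = rng.normal(scale=0.2, size=(H, H))
xs = rng.normal(size=(T, H))
targets = rng.normal(size=(T, H))

h = np.zeros(H)
for start in range(0, T, k):                 # one chunk of k steps at a time
    chunk_hs, pre = [h], []
    for t in range(start, min(start + k, T)):  # forward pass over the chunk
        z = W_hh @ chunk_hs[-1] + W_xh @ xs[t]
        pre.append(z)
        chunk_hs.append(np.tanh(z))
    # Backward pass: gradient flows at most k steps, stopping at chunk start.
    dW = np.zeros_like(W_hh)
    dh = np.zeros(H)
    for i in range(len(pre) - 1, -1, -1):
        dh = dh + 2 * (chunk_hs[i + 1] - targets[start + i])  # dL/dh at step i
        dz = dh * (1 - np.tanh(pre[i]) ** 2)                  # through tanh
        dW += np.outer(dz, chunk_hs[i])                       # accumulate dL/dW_hh
        dh = W_hh.T @ dz                      # propagate to previous step in chunk
    W_hh -= 1e-3 * dW                         # SGD update once per chunk
    h = chunk_hs[-1]                          # carry state forward, detached
```

In framework terms this is the usual pattern of detaching the hidden state between chunks (e.g. `h = h.detach()` in PyTorch) so autograd never builds a graph longer than $k$ steps.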
Applications
- Training vanilla RNNs, LSTMs, and GRUs on sequence tasks (language modeling, time-series, speech)
- Any model where dependencies span multiple discrete time steps
Trade-offs
- Gradient vanishing/explosion makes training long sequences with vanilla RNNs difficult
- LSTMs and GRUs mitigate vanishing gradients via gating mechanisms (additive updates through the cell state)
- Transformers avoid BPTT entirely via self-attention, enabling direct gradient flow between any two positions
- Truncated BPTT reduces memory and stabilizes training but introduces gradient bias and may prevent learning very long-range dependencies
Links
- backpropagation
- weight_initialization
- recurrent_networks (in 02_modeling/sequence_models)