Backpropagation Through Time

Definition

Backpropagation applied to an unrolled recurrent network: gradients flow backward through both the layers and the time steps.

Intuition

An RNN unrolled over $T$ steps is a deep network of depth $T$, where all time steps share the same weights. Gradients must traverse the entire time dimension, passing through the same weight matrix repeatedly; this repeated multiplication is what causes both vanishing and exploding gradients.

Formal Description

Unrolled RNN: at each step $t$ (with weights $W_{hh}$, $W_{xh}$, $W_{hy}$ shared across all steps):

$$h_t = \tanh(W_{hh} h_{t-1} + W_{xh} x_t + b_h), \qquad \hat{y}_t = W_{hy} h_t + b_y$$
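A minimal NumPy sketch of the unrolled forward pass, assuming a tanh nonlinearity; the dimensions and random initialization are illustrative, not prescribed by the notes:

```python
import numpy as np

rng = np.random.default_rng(0)
d_x, d_h = 3, 4          # input and hidden sizes (illustrative)
T = 5                    # number of unrolled time steps

# One set of weights, reused at every time step
W_hh = rng.normal(scale=0.5, size=(d_h, d_h))
W_xh = rng.normal(scale=0.5, size=(d_h, d_x))
b_h = np.zeros(d_h)

xs = rng.normal(size=(T, d_x))
h = np.zeros(d_h)        # h_0
hs = []
for x_t in xs:
    # h_t = tanh(W_hh h_{t-1} + W_xh x_t + b_h): same W_hh every step
    h = np.tanh(W_hh @ h + W_xh @ x_t + b_h)
    hs.append(h)
```

BPTT is ordinary backpropagation applied to this loop: the same `W_hh` appears `T` times on the gradient path, so its gradient accumulates one contribution per step.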

Gradient of the loss at step $t$ with respect to $W_{hh}$ involves products of Jacobians across all intervening time steps:

$$\frac{\partial L_t}{\partial h_k} = \frac{\partial L_t}{\partial h_t} \prod_{i=k+1}^{t} \frac{\partial h_i}{\partial h_{i-1}}, \qquad \frac{\partial h_i}{\partial h_{i-1}} = \mathrm{diag}\!\left(1 - h_i^2\right) W_{hh}$$

Vanishing gradient: when $\left\|\partial h_i / \partial h_{i-1}\right\| < 1$ repeatedly (e.g. the largest singular value of $W_{hh}$ stays below 1), the product shrinks exponentially → early time steps receive near-zero gradient → the model cannot learn long-range dependencies.
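The exponential shrinkage can be seen directly by multiplying the same Jacobian factor repeatedly; this sketch rescales a random matrix to spectral norm 0.5 (the tanh derivative term only shrinks the product further):

```python
import numpy as np

rng = np.random.default_rng(1)
d = 4
W = rng.normal(size=(d, d))
# Rescale so the largest singular value is 0.5 (< 1 → vanishing regime)
W *= 0.5 / np.linalg.norm(W, 2)

J = np.eye(d)
norms = []
for _ in range(30):
    J = W @ J                         # one more time step on the gradient path
    norms.append(np.linalg.norm(J, 2))
# norms decays roughly like 0.5**t: after 30 steps the signal is ~1e-9
```

With spectral norm above 1 the same loop grows exponentially instead, which is the exploding-gradient case below.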

Exploding gradient: when $\left\|\partial h_i / \partial h_{i-1}\right\| > 1$ repeatedly, the product grows exponentially → numerical instability → remedy: gradient clipping by norm.

Truncated BPTT: backpropagate only $k$ steps back in time to reduce memory usage and gradient path length. Introduces bias in the gradient estimate but is often sufficient in practice.
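A manual sketch of truncation, assuming (for brevity) a linear RNN $h_t = W h_{t-1} + x_t$ and the toy loss $L = \tfrac{1}{2}\lVert h_T \rVert^2$; the backward loop simply stops after $k$ steps instead of running all the way to $t = 1$:

```python
import numpy as np

def tbptt_grad(W, xs, h0, k):
    """dL/dW for L = 0.5*||h_T||^2 on a linear RNN h_t = W h_{t-1} + x_t,
    backpropagating at most k steps from the end (truncated BPTT)."""
    T = len(xs)
    hs = [h0]
    for x in xs:                            # full forward pass
        hs.append(W @ hs[-1] + x)
    dh = hs[-1].copy()                      # dL/dh_T
    dW = np.zeros_like(W)
    for t in range(T, max(T - k, 0), -1):   # only the last k steps
        dW += np.outer(dh, hs[t - 1])       # per-step contribution to dL/dW
        dh = W.T @ dh                       # push gradient one step back
    return dW

rng = np.random.default_rng(2)
W = 0.9 * np.eye(3)
xs = [rng.normal(size=3) for _ in range(10)]
g_full = tbptt_grad(W, xs, np.zeros(3), k=10)   # full BPTT
g_trunc = tbptt_grad(W, xs, np.zeros(3), k=3)   # truncated: biased estimate
```

With `k` equal to the sequence length this is exact BPTT; smaller `k` drops the contributions of earlier steps, which is exactly the bias the notes mention.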

Applications

  • Training vanilla RNNs, LSTMs, and GRUs on sequence tasks (language modeling, time-series, speech)
  • Any model where dependencies span multiple discrete time steps

Trade-offs

  • Gradient vanishing/explosion makes training long sequences with vanilla RNNs difficult
  • LSTMs and GRUs mitigate vanishing gradients via gating mechanisms (additive updates through the cell state)
  • Transformers avoid BPTT entirely via self-attention, enabling direct gradient flow between any two positions
  • Truncated BPTT reduces memory and stabilizes training but introduces gradient bias and may prevent learning very long-range dependencies