Backpropagation

Definition

Reverse-mode automatic differentiation of the loss over a computation graph; efficiently computes all parameter gradients in one backward pass.

Intuition

The computation graph records every operation during the forward pass. Gradients then flow backward through this graph via the chain rule: each node multiplies the upstream gradient by its local derivative. Because many paths share intermediate nodes, reverse-mode AD reuses computations and computes all gradients in roughly one forward-pass worth of work.

Formal Description

Computation graph: a DAG where nodes are operations or values. If $u = g (v)$ , then:

$\frac{\partial J}{\partial v} = \frac{\partial J}{\partial u} \cdot g^{'} (v) (upstream gradient \times local gradient)$

For nodes with multiple outputs, contributions from all downstream paths are summed.

Forward pass for layer $l$ :

$z^{[l]} = W^{[l]} a^{[l - 1]} + b^{[l]}, a^{[l]} = g^{[l]} (z^{[l]})$

Backward pass for layer $l$ :

$d z^{[l]} = d a^{[l]} ⊙ g^{' [l]} (z^{[l]})$

$d W^{[l]} = \frac{1}{m} d z^{[l]} (a^{[l - 1]})^{⊤}$

$d b^{[l]} = \frac{1}{m} \sum d z^{[l]}$

$d a^{[l - 1]} = (W^{[l]})^{⊤} d z^{[l]}$

Computational complexity: one backward pass costs approximately the same as one forward pass. Forward-mode AD would require one pass per input dimension — infeasible for models with millions of parameters.

Memory cost: all intermediate activations $a^{[l]}$ must be retained from the forward pass to compute $d W^{[l]}$ during backprop. This is the dominant memory cost during training.

Applications

Training all gradient-based models (neural networks, linear models, etc.)
Automatic differentiation frameworks (PyTorch autograd, JAX, TensorFlow) implement this automatically by tracing the computation graph

Trade-offs

Requires storing all intermediate activations — memory cost scales with depth and batch size; gradient checkpointing trades recomputation for memory
Numerical stability degrades in very deep networks due to vanishing/exploding gradients → use gradient clipping, normalization, and careful initialization
Bugs most commonly manifest as shape mismatches or sign errors; use gradient_checking to verify implementations

Notes

Explorer

backpropagation

Backpropagation

Definition

Intuition

Formal Description

Applications

Trade-offs

Links

Graph View

Table of Contents

Backlinks