Gradient Checking

Definition

Numerically approximates partial derivatives using the centered difference formula and compares them against analytical gradients produced by backpropagation. Used exclusively as a debugging tool to verify a backprop implementation is correct.

Intuition

Backpropagation is easy to implement incorrectly — sign errors, missing terms, wrong tensor shapes, and off-by-one errors all produce gradients that look plausible but are subtly wrong. Gradient checking acts as a unit test: if the numerical and analytical gradients agree to high precision, the implementation is almost certainly correct.

Formal Description

Centered difference approximation for parameter $\theta_i$:

$$\frac{\partial J}{\partial \theta_i} \approx \frac{J(\theta + \varepsilon e_i) - J(\theta - \varepsilon e_i)}{2\varepsilon}$$

where $e_i$ is the $i$-th standard basis vector and $\varepsilon > 0$ is a small constant.

The centered difference has error $O(\varepsilon^2)$ vs. $O(\varepsilon)$ for the one-sided difference $\frac{J(\theta + \varepsilon e_i) - J(\theta)}{\varepsilon}$, making it significantly more accurate.
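The accuracy gap is easy to verify numerically. This sketch (plain Python, using $\sin$ as a toy objective whose exact derivative is $\cos$) compares both approximations against the true derivative:

```python
import math

def forward_diff(f, x, eps):
    # One-sided difference: truncation error O(eps)
    return (f(x + eps) - f(x)) / eps

def centered_diff(f, x, eps):
    # Centered difference: truncation error O(eps^2)
    return (f(x + eps) - f(x - eps)) / (2 * eps)

x, eps = 1.0, 1e-4
true_grad = math.cos(x)  # d/dx sin(x) = cos(x)
err_forward = abs(forward_diff(math.sin, x, eps) - true_grad)
err_centered = abs(centered_diff(math.sin, x, eps) - true_grad)
print(err_forward, err_centered)  # centered is orders of magnitude closer
```

With $\varepsilon = 10^{-4}$, the one-sided error is on the order of $10^{-5}$ while the centered error is on the order of $10^{-9}$.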

Relative error metric:

$$\text{relative error} = \frac{\lVert g_{\text{approx}} - g_{\text{backprop}} \rVert_2}{\lVert g_{\text{approx}} \rVert_2 + \lVert g_{\text{backprop}} \rVert_2}$$

Interpretation:

  Relative error        Assessment
  $< 10^{-7}$           Great — implementation very likely correct
  $10^{-7}$ to $10^{-3}$  Acceptable, borderline
  $> 10^{-3}$           Likely a bug in backprop
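A minimal check can be sketched as follows. The `grad_check` helper and its norm-based relative error are illustrative, not a fixed API; it perturbs one parameter at a time and compares against a supplied analytical gradient:

```python
import numpy as np

def grad_check(J, grad_J, theta, eps=1e-7):
    """Compare grad_J(theta) against a centered-difference approximation
    of J around theta; return the norm-based relative error."""
    analytic = grad_J(theta)
    numeric = np.zeros_like(theta)
    for i in range(theta.size):
        e = np.zeros_like(theta)
        e[i] = eps  # perturb only the i-th parameter
        numeric[i] = (J(theta + e) - J(theta - e)) / (2 * eps)
    return np.linalg.norm(numeric - analytic) / (
        np.linalg.norm(numeric) + np.linalg.norm(analytic))

# Toy objective J(theta) = sum(theta^2), whose gradient is 2*theta
theta = np.array([0.5, -1.2, 3.0])
err = grad_check(lambda t: np.sum(t**2), lambda t: 2 * t, theta)
print(err)  # a correct gradient should land well below 1e-7
```

Swapping the analytical gradient for a buggy one (e.g. `lambda t: t` instead of `2 * t`) pushes the relative error above $10^{-3}$, flagging the bug.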

Applications

  • Debugging new backpropagation implementations before training
  • Catching sign errors, missing regularization terms, wrong matrix transpositions
  • Verifying custom layers or loss functions in deep learning frameworks

Trade-offs

  • Computational cost: requires $2n$ forward passes, where $n$ is the number of parameters — completely impractical for training; use only on small models or a parameter subset
  • Incompatible with dropout: stochastic layers produce different values on each forward pass, making the numerical approximation meaningless; disable dropout during checks
  • Must include regularization: $J(\theta)$ in the numerical approximation must match the objective used in backprop exactly, including any $L_2$ penalty
  • Numerical precision: very small $\varepsilon$ causes floating-point cancellation; very large $\varepsilon$ increases approximation error; $\varepsilon \approx 10^{-7}$ is a standard choice
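The precision trade-off is visible in a short sweep. This sketch (again using $\sin$ as a toy objective) evaluates the centered difference at three values of $\varepsilon$: a large one dominated by truncation error, a tiny one dominated by floating-point cancellation, and a moderate one that balances both:

```python
import math

def centered_diff(f, x, eps):
    return (f(x + eps) - f(x - eps)) / (2 * eps)

true_grad = math.cos(1.0)
errors = {eps: abs(centered_diff(math.sin, 1.0, eps) - true_grad)
          for eps in (1e-1, 1e-5, 1e-13)}
# 1e-1 suffers O(eps^2) truncation error; 1e-13 suffers cancellation
# (the two function values agree in nearly all their digits);
# the moderate 1e-5 lands closest to the true derivative.
print(errors)
```

The same U-shaped error curve applies to gradient checking: both extremes of $\varepsilon$ inflate the relative error even when backprop is correct.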