Gradient Descent and Variants
Definition
Gradient descent is a first-order iterative optimization algorithm that minimizes an objective function by repeatedly moving in the direction of steepest descent (negative gradient).
Intuition
Imagine standing on a hilly terrain in fog. You can only feel the slope under your feet. Gradient descent says: always step in the direction that goes most steeply downhill. The step size (learning rate) controls how far you stride each step — too large and you overshoot; too small and you take forever to converge.
Formal Description
Batch Gradient Descent
Given an objective $f(\theta) = \frac{1}{n} \sum_{i=1}^{n} f_i(\theta)$, update:

$$\theta_{t+1} = \theta_t - \eta \, \nabla f(\theta_t)$$

$\eta$ is the learning rate. The gradient is computed over the full dataset, which is expensive for large $n$.
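As a minimal NumPy sketch of the full-batch update on a least-squares objective (the problem setup, variable names, and step count here are illustrative choices, not from the source):

```python
import numpy as np

# Least-squares objective f(theta) = (1/2n) ||X theta - y||^2,
# whose full-batch gradient is (1/n) X^T (X theta - y).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
true_theta = np.array([1.0, -2.0, 0.5])
y = X @ true_theta

def batch_gradient_descent(X, y, eta=0.1, steps=500):
    n, d = X.shape
    theta = np.zeros(d)
    for _ in range(steps):
        grad = X.T @ (X @ theta - y) / n  # gradient over the FULL dataset
        theta -= eta * grad               # step against the gradient
    return theta

theta = batch_gradient_descent(X, y)
```

Each iteration touches all `n` rows of `X`, which is exactly the per-step cost that motivates the stochastic variants below.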
Stochastic Gradient Descent (SGD)
In the classical definition, uses one randomly sampled example $i$ per update:

$$\theta_{t+1} = \theta_t - \eta \, \nabla f_i(\theta_t)$$
This is an unbiased but noisy estimate of the full gradient; the noise can help escape saddle points and some shallow minima.
Complexity per iteration: $O(d)$ for single-example SGD vs $O(nd)$ for batch gradient descent, where $d$ is the parameter dimension and $n$ is the dataset size.
Note on terminology: In modern practice, “SGD” often refers to mini-batch SGD (see below). True single-example SGD is rarely used because it cannot leverage GPU parallelism effectively.
Mini-batch SGD (standard in deep learning): average the gradient over a batch $B$ of examples:

$$\theta_{t+1} = \theta_t - \frac{\eta}{|B|} \sum_{i \in B} \nabla f_i(\theta_t)$$
Typical batch size: 32–512. GPU parallelism makes mini-batch efficient.
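A sketch of the mini-batch loop on the same kind of least-squares problem (batch size, learning rate, and epoch count are illustrative, not prescriptions from the source):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(256, 3))
true_theta = np.array([2.0, -1.0, 0.5])
y = X @ true_theta

def minibatch_sgd(X, y, eta=0.05, batch_size=32, epochs=100):
    n, d = X.shape
    theta = np.zeros(d)
    for _ in range(epochs):
        perm = rng.permutation(n)              # reshuffle each epoch
        for start in range(0, n, batch_size):
            idx = perm[start:start + batch_size]
            Xb, yb = X[idx], y[idx]
            # average gradient over the mini-batch B
            grad = Xb.T @ (Xb @ theta - yb) / len(idx)
            theta -= eta * grad
    return theta

theta = minibatch_sgd(X, y)
```

Each update costs $O(|B| \cdot d)$ and is a noisy but unbiased estimate of the full gradient, matching the description above.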
Convergence Rate
For an $L$-smooth, $\mu$-strongly convex objective with fixed $\eta = 1/L$:

$$f(\theta_t) - f(\theta^*) \le \left(1 - \frac{\mu}{L}\right)^t \left(f(\theta_0) - f(\theta^*)\right)$$

Linear convergence; the condition number $\kappa = L/\mu$ governs speed.
For convex (not strongly convex) objectives: $O(1/t)$ convergence.
Momentum
Accumulates an exponential moving average of gradients to dampen oscillations:

$$v_{t+1} = \beta v_t + \nabla f(\theta_t), \qquad \theta_{t+1} = \theta_t - \eta \, v_{t+1}$$

Typical $\beta = 0.9$. Nesterov momentum evaluates the gradient at the "lookahead" point $\theta_t - \eta \beta v_t$, improving the convergence rate to $O(1/t^2)$ for convex problems.
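Both variants can be sketched in a few lines; the ill-conditioned quadratic test problem and all hyperparameter values here are illustrative choices of mine:

```python
import numpy as np

# Quadratic f(theta) = 0.5 theta^T A theta, gradient A theta.
A = np.diag([1.0, 50.0])  # condition number kappa = 50

def momentum_descent(grad, theta0, eta=0.02, beta=0.9,
                     steps=300, nesterov=False):
    theta, v = theta0.copy(), np.zeros_like(theta0)
    for _ in range(steps):
        # Nesterov evaluates the gradient at the lookahead point
        g = grad(theta - eta * beta * v) if nesterov else grad(theta)
        v = beta * v + g          # accumulate velocity
        theta = theta - eta * v   # step along the velocity
    return theta

theta_hb = momentum_descent(lambda t: A @ t, np.array([1.0, 1.0]))
theta_nag = momentum_descent(lambda t: A @ t, np.array([1.0, 1.0]),
                             nesterov=True)
```

On such ill-conditioned quadratics, plain gradient descent with the same $\eta$ oscillates along the steep axis; the velocity term averages those oscillations out.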
RMSProp
Divides the learning rate by a running average of squared gradients, adapting per-parameter (squaring is elementwise):

$$s_{t+1} = \rho s_t + (1 - \rho)\,(\nabla f(\theta_t))^2, \qquad \theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{s_{t+1}} + \epsilon}\,\nabla f(\theta_t)$$
Effective in non-stationary settings; shrinks the effective step for parameters with consistently large gradients. Typical $\rho = 0.9$.
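A direct transcription of the update, again on an illustrative quadratic with very different curvature per coordinate (the test problem and constants are mine):

```python
import numpy as np

def rmsprop(grad, theta0, eta=0.01, rho=0.9, eps=1e-8, steps=2000):
    theta = theta0.copy()
    s = np.zeros_like(theta0)  # running average of squared gradients
    for _ in range(steps):
        g = grad(theta)
        s = rho * s + (1 - rho) * g**2
        # per-parameter step: large recent g^2 -> smaller effective LR
        theta = theta - eta * g / (np.sqrt(s) + eps)
    return theta

# Curvatures differ by four orders of magnitude across coordinates
A = np.diag([0.01, 100.0])
theta = rmsprop(lambda t: A @ t, np.array([1.0, 1.0]))
```

Note that the normalisation makes progress along the flat coordinate roughly as fast as along the steep one, which is the point of the per-parameter adaptation; near the optimum the iterates hover within a band of width on the order of $\eta$ rather than converging exactly.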
Adam (Adaptive Moment Estimation)
Combines momentum (first moment) and RMSProp (second moment) with bias correction. With $g_t = \nabla f(\theta_t)$:

$$m_t = \beta_1 m_{t-1} + (1 - \beta_1)\, g_t, \qquad v_t = \beta_2 v_{t-1} + (1 - \beta_2)\, g_t^2$$

$$\hat m_t = \frac{m_t}{1 - \beta_1^t}, \qquad \hat v_t = \frac{v_t}{1 - \beta_2^t}, \qquad \theta_{t+1} = \theta_t - \frac{\eta\, \hat m_t}{\sqrt{\hat v_t} + \epsilon}$$

Defaults: $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$, $\eta = 10^{-3}$.
AdamW: decouples weight decay $\lambda$ from the gradient update (correct form):

$$\theta_{t+1} = \theta_t - \eta \left( \frac{\hat m_t}{\sqrt{\hat v_t} + \epsilon} + \lambda \theta_t \right)$$
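The two updates differ only in where the decay term enters, so one sketch covers both; with `weight_decay=0` this is plain Adam, otherwise the decoupled AdamW form (the quadratic test problem is an illustrative choice):

```python
import numpy as np

def adam(grad, theta0, eta=1e-3, beta1=0.9, beta2=0.999, eps=1e-8,
         weight_decay=0.0, steps=5000):
    theta = theta0.copy()
    m = np.zeros_like(theta0)  # first moment (EMA of gradients)
    v = np.zeros_like(theta0)  # second moment (EMA of squared gradients)
    for t in range(1, steps + 1):
        g = grad(theta)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g**2
        m_hat = m / (1 - beta1**t)  # bias correction: moments start at 0
        v_hat = v / (1 - beta2**t)
        # decoupled weight decay acts on theta directly, not through g
        theta = theta - eta * (m_hat / (np.sqrt(v_hat) + eps)
                               + weight_decay * theta)
    return theta

A = np.diag([1.0, 10.0])
theta = adam(lambda t: A @ t, np.array([1.0, 1.0]))
```

Folding the decay into `g` instead (classic "L2 in Adam") would divide it by $\sqrt{\hat v_t}$, which is exactly the coupling AdamW removes.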
Learning Rate Schedules
| Schedule | Description |
|---|---|
| Step decay | Multiply $\eta$ by $\gamma < 1$ every $k$ epochs |
| Cosine annealing | $\eta_t = \eta_{\min} + \frac{1}{2}(\eta_{\max} - \eta_{\min})\left(1 + \cos\frac{\pi t}{T}\right)$ |
| Warmup | Linear ramp from 0 to $\eta$ over the first $k$ steps |
| Cyclic LR (CLR) | Oscillate between $\eta_{\min}$ and $\eta_{\max}$ |
Warmup + cosine decay is standard for transformer training.
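The standard warmup-plus-cosine combination is easy to write as a pure function of the step index (function name and parameter names here are mine):

```python
import math

def warmup_cosine_lr(step, total_steps, warmup_steps, eta_max, eta_min=0.0):
    """Linear warmup from 0 to eta_max, then cosine decay to eta_min."""
    if step < warmup_steps:
        return eta_max * step / warmup_steps  # linear ramp
    # progress through the decay phase, in [0, 1]
    p = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return eta_min + 0.5 * (eta_max - eta_min) * (1 + math.cos(math.pi * p))

lrs = [warmup_cosine_lr(s, total_steps=1000, warmup_steps=100, eta_max=3e-4)
       for s in range(1000)]
```

Training loops typically call such a function once per optimizer step and pass the result as the current learning rate.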
Challenges in Non-Convex Optimization
- Saddle points: gradient is zero but not a minimum; second-order methods or noise escape these.
- Vanishing/exploding gradients: deep networks can have exponentially small/large gradients; mitigated by normalisation, residual connections, gradient clipping.
- Learning rate sensitivity: too large an $\eta$ diverges, too small converges slowly; use learning rate finders or schedules.
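The gradient-clipping mitigation mentioned above is usually applied by global norm across all parameter tensors; a sketch (the helper name and example gradients are mine):

```python
import numpy as np

def clip_by_global_norm(grads, max_norm=1.0):
    """Rescale a list of gradient arrays so their combined L2 norm
    is at most max_norm (mitigates exploding gradients)."""
    total_norm = np.sqrt(sum(float(np.sum(g * g)) for g in grads))
    if total_norm > max_norm:
        scale = max_norm / total_norm  # shrink all tensors uniformly
        grads = [g * scale for g in grads]
    return grads, total_norm

grads = [np.array([3.0, 4.0]), np.array([12.0])]  # global norm = 13
clipped, norm = clip_by_global_norm(grads, max_norm=1.0)
```

Rescaling uniformly preserves the gradient's direction, unlike clipping each component independently.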
Applications
- Training any differentiable model: linear regression, logistic regression, neural networks
- Specific optimizer choice matters: Adam is default for deep learning; SGD with momentum can generalise better for image models (via implicit regularisation from noise)
Trade-offs
| Optimizer | Pros | Cons |
|---|---|---|
| SGD (+ momentum) | Good generalisation, well-studied theory | Sensitive to LR, slow on ill-conditioned problems |
| Adam | Fast convergence, robust to LR | Can generalise worse than SGD for image models; weight decay must use AdamW |
| RMSProp | Good for RNNs | No bias correction |