Gradient Descent and Variants

Definition

Gradient descent is a first-order iterative optimization algorithm that minimizes an objective function by repeatedly moving in the direction of steepest descent (negative gradient).

Intuition

Imagine standing on a hilly terrain in fog. You can only feel the slope under your feet. Gradient descent says: always step in the direction that goes most steeply downhill. The step size (learning rate) controls how far you stride each step — too large and you overshoot; too small and you take forever to converge.

Formal Description

Batch Gradient Descent

Given an objective $J(\theta) = \frac{1}{N} \sum_{i=1}^{N} \ell(\theta; x_i, y_i)$, update:

$$\theta_{t+1} = \theta_t - \eta \, \nabla_\theta J(\theta_t)$$

$\eta > 0$ is the learning rate. The gradient is computed over the full dataset, which is expensive for large $N$.
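As a minimal sketch of the update rule above (the quadratic objective and the `batch_gd` helper are illustrative, not a library API):

```python
import numpy as np

def batch_gd(grad, theta0, lr=0.1, steps=200):
    """Plain batch gradient descent: theta <- theta - lr * grad(theta)."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(steps):
        theta = theta - lr * grad(theta)
    return theta

# Minimise J(theta) = ||theta - 3||^2; its gradient is 2 * (theta - 3).
theta = batch_gd(lambda th: 2.0 * (th - 3.0), theta0=[0.0, 0.0])
# theta is now very close to [3, 3]
```

With $\eta = 0.1$ each step contracts the error by a factor $0.8$, so 200 steps are far more than enough here.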

Stochastic Gradient Descent (SGD)

In the classical definition, each update uses one randomly sampled example:

$$\theta_{t+1} = \theta_t - \eta \, \nabla_\theta \ell(\theta_t; x_{i_t}, y_{i_t}), \qquad i_t \sim \mathrm{Uniform}\{1, \dots, N\}$$

This is an unbiased but noisy estimate of the full gradient; the noise can help escape saddle points and some shallow minima.

Complexity per iteration: $O(d)$ for single-example SGD vs. $O(Nd)$ for batch gradient descent, where $d$ is the parameter dimension and $N$ is the dataset size.

Note on terminology: In modern practice, “SGD” often refers to mini-batch SGD (see below). True single-example SGD is rarely used because it cannot leverage GPU parallelism effectively.

Mini-batch SGD (standard in deep learning): average the gradient over a batch $B$ of examples:

$$\theta_{t+1} = \theta_t - \frac{\eta}{|B|} \sum_{i \in B} \nabla_\theta \ell(\theta_t; x_i, y_i)$$

Typical batch size: 32–512. GPU parallelism makes mini-batch updates efficient.
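A self-contained NumPy sketch of the mini-batch loop on synthetic linear-regression data (names such as `w_true` are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic linear-regression data: y = X @ w_true + noise
N, d = 1000, 5
w_true = rng.normal(size=d)
X = rng.normal(size=(N, d))
y = X @ w_true + 0.01 * rng.normal(size=N)

w = np.zeros(d)
lr, batch_size = 0.1, 64
for epoch in range(50):
    perm = rng.permutation(N)          # reshuffle each epoch
    for start in range(0, N, batch_size):
        idx = perm[start:start + batch_size]
        Xb, yb = X[idx], y[idx]
        # Gradient of the mean squared error over this mini-batch
        grad = 2.0 / len(idx) * Xb.T @ (Xb @ w - yb)
        w -= lr * grad
# w ends up close to w_true
```

Note the reshuffle at the start of each epoch: sampling without replacement per epoch is the common practical variant rather than fully independent draws.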

Convergence Rate

For an $L$-smooth, $\mu$-strongly convex objective with fixed step size $\eta = 1/L$:

$$J(\theta_t) - J^* \le \left(1 - \frac{\mu}{L}\right)^t \bigl(J(\theta_0) - J^*\bigr)$$

Linear convergence; the condition number $\kappa = L/\mu$ governs the speed.

For convex (but not strongly convex) objectives: $O(1/t)$ convergence.

Momentum

Accumulates an exponential moving average of gradients to dampen oscillations:

$$v_{t+1} = \beta v_t + \nabla_\theta J(\theta_t), \qquad \theta_{t+1} = \theta_t - \eta \, v_{t+1}$$

Typical $\beta = 0.9$. Nesterov momentum evaluates the gradient at the "lookahead" point $\theta_t - \eta \beta v_t$, improving the convergence rate to $O(1/t^2)$ for convex problems.
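A sketch of the heavy-ball momentum update on a deliberately ill-conditioned quadratic, where plain gradient descent would oscillate along the steep axis (all names are illustrative):

```python
import numpy as np

def sgd_momentum(grad, theta0, lr=0.01, beta=0.9, steps=500):
    """Heavy-ball momentum: v <- beta * v + grad; theta <- theta - lr * v."""
    theta = np.asarray(theta0, dtype=float)
    v = np.zeros_like(theta)
    for _ in range(steps):
        v = beta * v + grad(theta)
        theta = theta - lr * v
    return theta

# Ill-conditioned quadratic J(x, y) = 0.5 * (100 * x**2 + y**2)
grad = lambda th: np.array([100.0 * th[0], th[1]])
theta = sgd_momentum(grad, [1.0, 1.0])
# theta converges to the minimum at [0, 0]
```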

RMSProp

Divides the learning rate by a running root-mean-square of gradients, adapting it per parameter:

$$s_{t+1} = \rho s_t + (1 - \rho) \, g_t^2, \qquad \theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{s_{t+1}} + \epsilon} \, g_t$$

where $g_t = \nabla_\theta J(\theta_t)$ and typically $\rho = 0.9$. Effective in non-stationary settings; it shrinks the effective step for parameters whose recent gradients are large.
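A sketch of the per-parameter scaling on a badly scaled problem, where the two coordinates have gradients of very different magnitude (helper names are illustrative):

```python
import numpy as np

def rmsprop(grad, theta0, lr=0.01, rho=0.9, eps=1e-8, steps=2000):
    """RMSProp: divide the learning rate by a running RMS of each gradient."""
    theta = np.asarray(theta0, dtype=float)
    s = np.zeros_like(theta)
    for _ in range(steps):
        g = grad(theta)
        s = rho * s + (1.0 - rho) * g * g    # running average of squared grads
        theta = theta - lr * g / (np.sqrt(s) + eps)
    return theta

# Gradients differ by four orders of magnitude across the two coordinates,
# yet the normalised update makes similar progress on both.
grad = lambda th: np.array([1000.0 * th[0], 0.1 * th[1]])
theta = rmsprop(grad, [1.0, 1.0])
```

Because the update is roughly sign-like near the optimum, RMSProp settles into a small oscillation of order the learning rate rather than converging exactly.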

Adam (Adaptive Moment Estimation)

Combines momentum (first moment) and RMSProp (second moment) with bias correction:

$$m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t, \qquad v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2$$

$$\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}, \qquad \theta_{t+1} = \theta_t - \eta \, \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}$$

Defaults: $\eta = 10^{-3}$, $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$.
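The two moment estimates and the bias correction can be sketched directly (a toy quadratic stands in for a real loss; the larger-than-default learning rate is only to speed up this small example):

```python
import numpy as np

def adam(grad, theta0, lr=0.01, beta1=0.9, beta2=0.999, eps=1e-8, steps=5000):
    """Adam: bias-corrected first/second moment estimates, per-parameter step."""
    theta = np.asarray(theta0, dtype=float)
    m = np.zeros_like(theta)
    v = np.zeros_like(theta)
    for t in range(1, steps + 1):
        g = grad(theta)
        m = beta1 * m + (1 - beta1) * g          # first moment (momentum)
        v = beta2 * v + (1 - beta2) * g * g      # second moment (RMSProp)
        m_hat = m / (1 - beta1 ** t)             # bias correction: both moments
        v_hat = v / (1 - beta2 ** t)             # start at zero
        theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta

# Minimise ||theta - 3||^2 from the origin
theta = adam(lambda th: 2.0 * (th - 3.0), np.zeros(2))
```

The bias correction matters most early on: without it, $m_1 = (1 - \beta_1) g_1$ would be a factor of ten too small at the default $\beta_1 = 0.9$.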

AdamW: decouples weight decay from the gradient update (the correct form):

$$\theta_{t+1} = \theta_t - \eta \left( \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon} + \lambda \theta_t \right)$$

Learning Rate Schedules

| Schedule | Description |
|---|---|
| Step decay | Multiply $\eta$ by a factor $\gamma$ (e.g. 0.1) every $k$ epochs |
| Cosine annealing | $\eta_t = \eta_{\min} + \tfrac{1}{2}(\eta_{\max} - \eta_{\min})\bigl(1 + \cos(\pi t / T)\bigr)$ |
| Warmup | Linear ramp from 0 to $\eta_{\max}$ over the first $k$ steps |
| Cyclic LR (CLR) | Oscillate between $\eta_{\min}$ and $\eta_{\max}$ |

Warmup + cosine decay is standard for transformer training.
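The standard warmup-plus-cosine schedule can be sketched as a pure function of the step index (the function name and argument names are illustrative):

```python
import math

def warmup_cosine_lr(step, total_steps, warmup_steps, peak_lr, min_lr=0.0):
    """Linear warmup from 0 to peak_lr, then cosine decay down to min_lr."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

# lr ramps 0 -> peak over the first 10 steps, then decays smoothly to 0
lrs = [warmup_cosine_lr(s, total_steps=100, warmup_steps=10, peak_lr=3e-4)
       for s in range(101)]
```

Keeping the schedule a pure function of the step makes it trivial to resume training from a checkpoint at the correct learning rate.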

Challenges in Non-Convex Optimization

  • Saddle points: gradient is zero but not a minimum; second-order methods or noise escape these.
  • Vanishing/exploding gradients: deep networks can have exponentially small/large gradients; mitigated by normalisation, residual connections, gradient clipping.
  • Learning rate sensitivity: too large a rate diverges, too small converges slowly; learning rate finders or schedules help pick a value.
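Gradient clipping, mentioned above as a mitigation for exploding gradients, is usually done by global norm: if the combined L2 norm of all gradients exceeds a threshold, scale them all down together. A sketch (the helper name is illustrative; frameworks ship their own versions):

```python
import math
import numpy as np

def clip_by_global_norm(grads, max_norm):
    """Rescale a list of gradient arrays so their global L2 norm <= max_norm."""
    total = math.sqrt(sum(float(np.sum(g * g)) for g in grads))
    scale = max_norm / max(total, max_norm)   # <= 1; a no-op under the threshold
    return [g * scale for g in grads], total

grads = [np.array([3.0, 4.0]), np.array([12.0])]   # global norm = 13
clipped, norm = clip_by_global_norm(grads, max_norm=1.0)
```

Clipping by global norm preserves the gradient direction, unlike clipping each element independently.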

Applications

  • Training any differentiable model: linear regression, logistic regression, neural networks
  • Specific optimizer choice matters: Adam is default for deep learning; SGD with momentum can generalise better for image models (via implicit regularisation from noise)

Trade-offs

| Optimizer | Pros | Cons |
|---|---|---|
| SGD (+ momentum) | Good generalisation, well-studied theory | Sensitive to LR, slow on ill-conditioned problems |
| Adam | Fast convergence, robust to LR | Can generalise worse than SGD for image models; weight decay must use AdamW |
| RMSProp | Good for RNNs | No bias correction |