Gradient Descent and Variants
Definition
Gradient descent is a first-order iterative optimization algorithm that minimizes an objective function by repeatedly moving in the direction of steepest descent (negative gradient).
Intuition
Imagine standing on a hilly terrain in fog. You can only feel the slope under your feet. Gradient descent says: always step in the direction that goes most steeply downhill. The step size (learning rate) controls how far you stride each step — too large and you overshoot; too small and you take forever to converge.
Formal Description
Batch Gradient Descent
Given an objective $f(\theta) = \frac{1}{n} \sum_{i=1}^{n} f_i(\theta)$, update:

$$\theta_{t+1} = \theta_t - \eta \, \nabla f(\theta_t)$$

$\eta$ is the learning rate. The gradient is computed over the full dataset, which is expensive for large $n$.
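As a minimal NumPy sketch of the full-batch update on a least-squares objective (the problem setup, variable names, and step count here are illustrative choices, not from the source):

```python
import numpy as np

# Least-squares objective f(theta) = (1/2n) ||X theta - y||^2,
# whose full-batch gradient is (1/n) X^T (X theta - y).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
true_theta = np.array([1.0, -2.0, 0.5])
y = X @ true_theta

def batch_gradient_descent(X, y, eta=0.1, steps=500):
    n, d = X.shape
    theta = np.zeros(d)
    for _ in range(steps):
        grad = X.T @ (X @ theta - y) / n  # gradient over the FULL dataset
        theta -= eta * grad               # step against the gradient
    return theta

theta = batch_gradient_descent(X, y)
```

Each iteration touches all `n` rows of `X`, which is exactly the per-step cost that motivates the stochastic variants below.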
Stochastic Gradient Descent (SGD)
In the classical definition, uses one randomly sampled example $i$ per update:

$$\theta_{t+1} = \theta_t - \eta \, \nabla f_i(\theta_t)$$
This is an unbiased but noisy estimate of the full gradient; the noise can help escape saddle points and some shallow minima.
Complexity per iteration: $O(d)$ for single-example SGD vs $O(nd)$ for batch gradient descent, where $d$ is the parameter dimension and $n$ is the dataset size.
Note on terminology: In modern practice, “SGD” often refers to mini-batch SGD (see below). True single-example SGD is rarely used because it cannot leverage GPU parallelism effectively.
Mini-batch SGD (standard in deep learning): average the gradient over a batch $B$ of examples:

$$\theta_{t+1} = \theta_t - \frac{\eta}{|B|} \sum_{i \in B} \nabla f_i(\theta_t)$$
Typical batch size: 32–512. GPU parallelism makes mini-batch efficient.
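A sketch of the mini-batch loop on the same kind of least-squares problem (batch size, learning rate, and epoch count are illustrative, not prescriptions from the source):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(256, 3))
true_theta = np.array([2.0, -1.0, 0.5])
y = X @ true_theta

def minibatch_sgd(X, y, eta=0.05, batch_size=32, epochs=100):
    n, d = X.shape
    theta = np.zeros(d)
    for _ in range(epochs):
        perm = rng.permutation(n)              # reshuffle each epoch
        for start in range(0, n, batch_size):
            idx = perm[start:start + batch_size]
            Xb, yb = X[idx], y[idx]
            # average gradient over the mini-batch B
            grad = Xb.T @ (Xb @ theta - yb) / len(idx)
            theta -= eta * grad
    return theta

theta = minibatch_sgd(X, y)
```

Each update costs $O(|B| \cdot d)$ and is a noisy but unbiased estimate of the full gradient, matching the description above.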
Convergence Rate
For an $L$-smooth, $\mu$-strongly convex objective with fixed $\eta = 1/L$:

$$f(\theta_t) - f(\theta^*) \le \left(1 - \frac{\mu}{L}\right)^t \left(f(\theta_0) - f(\theta^*)\right)$$

Linear convergence; the condition number $\kappa = L/\mu$ governs speed.
For convex (not strongly convex) objectives: $O(1/t)$ convergence.
Momentum
Accumulates an exponential moving average of gradients to dampen oscillations:

$$v_{t+1} = \beta v_t + \nabla f(\theta_t), \qquad \theta_{t+1} = \theta_t - \eta \, v_{t+1}$$

Typical $\beta = 0.9$. Nesterov momentum evaluates the gradient at the "lookahead" point $\theta_t - \eta \beta v_t$, improving the convergence rate to $O(1/t^2)$ for convex problems.
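Both variants can be sketched in a few lines; the ill-conditioned quadratic test problem and all hyperparameter values here are illustrative choices of mine:

```python
import numpy as np

# Quadratic f(theta) = 0.5 theta^T A theta, gradient A theta.
A = np.diag([1.0, 50.0])  # condition number kappa = 50

def momentum_descent(grad, theta0, eta=0.02, beta=0.9,
                     steps=300, nesterov=False):
    theta, v = theta0.copy(), np.zeros_like(theta0)
    for _ in range(steps):
        # Nesterov evaluates the gradient at the lookahead point
        g = grad(theta - eta * beta * v) if nesterov else grad(theta)
        v = beta * v + g          # accumulate velocity
        theta = theta - eta * v   # step along the velocity
    return theta

theta_hb = momentum_descent(lambda t: A @ t, np.array([1.0, 1.0]))
theta_nag = momentum_descent(lambda t: A @ t, np.array([1.0, 1.0]),
                             nesterov=True)
```

On such ill-conditioned quadratics, plain gradient descent with the same $\eta$ oscillates along the steep axis; the velocity term averages those oscillations out.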
RMSProp
Divides the learning rate by a running average of squared gradients, adapting per-parameter (squaring is elementwise):

$$s_{t+1} = \rho s_t + (1 - \rho)\,(\nabla f(\theta_t))^2, \qquad \theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{s_{t+1}} + \epsilon}\,\nabla f(\theta_t)$$
Effective in non-stationary settings; shrinks the effective step for parameters with consistently large gradients. Typical $\rho = 0.9$.
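A direct transcription of the update, again on an illustrative quadratic with very different curvature per coordinate (the test problem and constants are mine):

```python
import numpy as np

def rmsprop(grad, theta0, eta=0.01, rho=0.9, eps=1e-8, steps=2000):
    theta = theta0.copy()
    s = np.zeros_like(theta0)  # running average of squared gradients
    for _ in range(steps):
        g = grad(theta)
        s = rho * s + (1 - rho) * g**2
        # per-parameter step: large recent g^2 -> smaller effective LR
        theta = theta - eta * g / (np.sqrt(s) + eps)
    return theta

# Curvatures differ by four orders of magnitude across coordinates
A = np.diag([0.01, 100.0])
theta = rmsprop(lambda t: A @ t, np.array([1.0, 1.0]))
```

Note that the normalisation makes progress along the flat coordinate roughly as fast as along the steep one, which is the point of the per-parameter adaptation; near the optimum the iterates hover within a band of width on the order of $\eta$ rather than converging exactly.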
Adam (Adaptive Moment Estimation)
Combines momentum (first moment) and RMSProp (second moment) with bias correction. With $g_t = \nabla f(\theta_t)$:

$$m_t = \beta_1 m_{t-1} + (1 - \beta_1)\, g_t, \qquad v_t = \beta_2 v_{t-1} + (1 - \beta_2)\, g_t^2$$

$$\hat m_t = \frac{m_t}{1 - \beta_1^t}, \qquad \hat v_t = \frac{v_t}{1 - \beta_2^t}, \qquad \theta_{t+1} = \theta_t - \frac{\eta\, \hat m_t}{\sqrt{\hat v_t} + \epsilon}$$

Defaults: $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$, $\eta = 10^{-3}$.
AdamW: decouples weight decay $\lambda$ from the gradient update (correct form):

$$\theta_{t+1} = \theta_t - \eta \left( \frac{\hat m_t}{\sqrt{\hat v_t} + \epsilon} + \lambda \theta_t \right)$$
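The two updates differ only in where the decay term enters, so one sketch covers both; with `weight_decay=0` this is plain Adam, otherwise the decoupled AdamW form (the quadratic test problem is an illustrative choice):

```python
import numpy as np

def adam(grad, theta0, eta=1e-3, beta1=0.9, beta2=0.999, eps=1e-8,
         weight_decay=0.0, steps=5000):
    theta = theta0.copy()
    m = np.zeros_like(theta0)  # first moment (EMA of gradients)
    v = np.zeros_like(theta0)  # second moment (EMA of squared gradients)
    for t in range(1, steps + 1):
        g = grad(theta)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g**2
        m_hat = m / (1 - beta1**t)  # bias correction: moments start at 0
        v_hat = v / (1 - beta2**t)
        # decoupled weight decay acts on theta directly, not through g
        theta = theta - eta * (m_hat / (np.sqrt(v_hat) + eps)
                               + weight_decay * theta)
    return theta

A = np.diag([1.0, 10.0])
theta = adam(lambda t: A @ t, np.array([1.0, 1.0]))
```

Folding the decay into `g` instead (classic "L2 in Adam") would divide it by $\sqrt{\hat v_t}$, which is exactly the coupling AdamW removes.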
Learning Rate Schedules
| Schedule | Description |
|---|---|
| Step decay | Multiply $\eta$ by $\gamma < 1$ every $k$ epochs |
| Cosine annealing | $\eta_t = \eta_{\min} + \frac{1}{2}(\eta_{\max} - \eta_{\min})\left(1 + \cos\frac{\pi t}{T}\right)$ |
| Warmup | Linear ramp from 0 to $\eta$ over the first $k$ steps |
| Cyclic LR (CLR) | Oscillate between $\eta_{\min}$ and $\eta_{\max}$ |
Warmup + cosine decay is standard for transformer training.
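The standard warmup-plus-cosine combination is easy to write as a pure function of the step index (function name and parameter names here are mine):

```python
import math

def warmup_cosine_lr(step, total_steps, warmup_steps, eta_max, eta_min=0.0):
    """Linear warmup from 0 to eta_max, then cosine decay to eta_min."""
    if step < warmup_steps:
        return eta_max * step / warmup_steps  # linear ramp
    # progress through the decay phase, in [0, 1]
    p = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return eta_min + 0.5 * (eta_max - eta_min) * (1 + math.cos(math.pi * p))

lrs = [warmup_cosine_lr(s, total_steps=1000, warmup_steps=100, eta_max=3e-4)
       for s in range(1000)]
```

Training loops typically call such a function once per optimizer step and pass the result as the current learning rate.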
Challenges in Non-Convex Optimization
- Saddle points: gradient is zero but not a minimum; second-order methods or noise escape these.
- Vanishing/exploding gradients: deep networks can have exponentially small/large gradients; mitigated by normalisation, residual connections, gradient clipping.
- Learning rate sensitivity: too large an $\eta$ diverges, too small converges slowly; use learning rate finders or schedules.
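The gradient-clipping mitigation mentioned above is usually applied by global norm across all parameter tensors; a sketch (the helper name and example gradients are mine):

```python
import numpy as np

def clip_by_global_norm(grads, max_norm=1.0):
    """Rescale a list of gradient arrays so their combined L2 norm
    is at most max_norm (mitigates exploding gradients)."""
    total_norm = np.sqrt(sum(float(np.sum(g * g)) for g in grads))
    if total_norm > max_norm:
        scale = max_norm / total_norm  # shrink all tensors uniformly
        grads = [g * scale for g in grads]
    return grads, total_norm

grads = [np.array([3.0, 4.0]), np.array([12.0])]  # global norm = 13
clipped, norm = clip_by_global_norm(grads, max_norm=1.0)
```

Rescaling uniformly preserves the gradient's direction, unlike clipping each component independently.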
Applications
- Training any differentiable model: linear regression, logistic regression, neural networks
- Specific optimizer choice matters: Adam is default for deep learning; SGD with momentum can generalise better for image models (via implicit regularisation from noise)
Trade-offs
| Optimizer | Pros | Cons |
|---|---|---|
| SGD (+ momentum) | Good generalisation, well-studied theory | Sensitive to LR, slow on ill-conditioned problems |
| Adam | Fast convergence, robust to LR | Can generalise worse than SGD for image models; weight decay must use AdamW |
| RMSProp | Good for RNNs | No bias correction |