Optimization Algorithms

Definition

Iterative algorithms that update model parameters to minimize a training loss. They differ in how the gradient signal is used: how many examples contribute per step, whether past gradients are accumulated, and whether per-parameter learning rates are adapted.

Intuition

Imagine navigating a hilly loss surface. At each step you estimate the local slope and take a step downhill. The key questions are: how accurate is that slope estimate (batch size), how fast do you move (learning rate), and do you build up momentum to glide through valleys (momentum and adaptive methods)?

Momentum smooths out oscillations by accumulating a velocity vector — frequent gradient directions build speed; noisy orthogonal fluctuations cancel out.

RMSProp equalizes update sizes across parameters by dividing by the recent RMS of each parameter’s gradient — parameters with large recent gradients get smaller steps and vice versa.

Adam combines both: a momentum term for direction, an RMSProp term for scale normalization, and bias correction for the cold-start problem of zero-initialized moment estimates.

Formal Description

Batch Variants of Gradient Descent

Let L(θ) be the loss. The general update rule is θ_{t+1} = θ_t − η ∇_θ L(θ_t), where η is the learning rate and the gradient is estimated from a batch of m examples.

| Variant | Batch size | Notes |
|---|---|---|
| Full-batch GD | m = N (entire dataset) | Exact gradient; slow, impractical for large data |
| Stochastic GD (SGD) | m = 1 | Very noisy; many updates per epoch; no GPU parallelism benefit |
| Mini-batch GD | 1 < m < N | Standard; typical m = 32–512 (powers of 2 for GPU efficiency) |

Smaller batches act as a regularizer (noise); larger batches converge to sharper minima that may generalize worse.
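The mini-batch loop above can be sketched in a few lines of NumPy. This is an illustrative toy (the function name, the linear-regression objective, and all hyperparameters are assumptions for the example, not from the source): reshuffle each epoch, slice out a batch, compute the batch gradient, step.

```python
import numpy as np

def minibatch_sgd(X, y, lr=0.1, batch_size=32, epochs=50, seed=0):
    """Minimize mean-squared error of a linear model with mini-batch SGD."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    theta = np.zeros(d)
    for _ in range(epochs):
        perm = rng.permutation(n)                # reshuffle each epoch
        for start in range(0, n, batch_size):
            idx = perm[start:start + batch_size]
            Xb, yb = X[idx], y[idx]
            # MSE gradient estimated on this batch only
            grad = 2 * Xb.T @ (Xb @ theta - yb) / len(idx)
            theta -= lr * grad
    return theta

# Toy check: recover known weights from noiseless data
rng = np.random.default_rng(1)
X = rng.normal(size=(256, 3))
true_theta = np.array([1.0, -2.0, 0.5])
y = X @ true_theta
theta = minibatch_sgd(X, y)
```

With batch_size=1 this degenerates to SGD, and with batch_size=n to full-batch GD, so the three table rows differ only in the slice length.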


Learning Rate Schedules

Decaying the learning rate η over training improves final convergence:

  • Step decay: η_t = η₀ · γ^⌊t/s⌋ (multiply by a factor γ < 1 every s steps)
  • Exponential decay: η_t = η₀ · e^(−kt)
  • Cosine annealing: η_t = η_min + ½(η₀ − η_min)(1 + cos(πt/T))

Linear warmup (ramp from 0 over the first few thousand steps) is standard in large-scale training to avoid early instability.
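A common combination of the pieces above is linear warmup followed by cosine annealing. A minimal sketch (function name and defaults are illustrative assumptions):

```python
import math

def lr_schedule(step, total_steps, base_lr=3e-4, warmup_steps=1000, min_lr=0.0):
    """Linear warmup from 0 to base_lr, then cosine decay down to min_lr."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps             # linear ramp
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * progress))
```

The schedule is a pure function of the step count, which makes it easy to resume training from a checkpoint at the correct learning rate.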


Momentum

Accumulate a velocity vector and step along it:

v_t = β v_{t−1} + ∇_θ L(θ_t)
θ_{t+1} = θ_t − η v_t

Typical β = 0.9. Initialise v₀ = 0.
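As a minimal sketch of one momentum update (the toy quadratic objective and step counts are illustrative assumptions):

```python
import numpy as np

def momentum_step(theta, velocity, grad, lr=0.05, beta=0.9):
    """One momentum update: accumulate velocity, then step downhill along it."""
    velocity = beta * velocity + grad      # exponential accumulation of gradients
    theta = theta - lr * velocity
    return theta, velocity

# Minimize f(θ) = ½‖θ‖², whose gradient is simply θ
theta = np.array([5.0, -3.0])
v = np.zeros_like(theta)                   # v₀ = 0
for _ in range(200):
    theta, v = momentum_step(theta, v, grad=theta)
```

Because consistent gradient directions keep adding into `velocity`, the effective step length in a persistent direction approaches lr/(1 − β), i.e. roughly 10× the base step for β = 0.9.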


RMSProp

Track a running average of squared gradients and normalize each coordinate by its RMS:

s_t = ρ s_{t−1} + (1 − ρ) (∇_θ L(θ_t))²   (element-wise)
θ_{t+1} = θ_t − η ∇_θ L(θ_t) / (√s_t + ε)

Typical ρ = 0.9, ε = 10⁻⁸.
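A minimal NumPy sketch of the update, plus a toy demonstration of the equalization effect described above (the constant-gradient setup is an illustrative assumption):

```python
import numpy as np

def rmsprop_step(theta, s, grad, lr=1e-3, rho=0.9, eps=1e-8):
    """One RMSProp update: scale each coordinate by its recent RMS gradient."""
    s = rho * s + (1 - rho) * grad**2              # running mean of squared grads
    theta = theta - lr * grad / (np.sqrt(s) + eps)
    return theta, s

# Two coordinates with gradients differing by 10^4 in magnitude
theta, s = np.zeros(2), np.zeros(2)
g = np.array([100.0, 0.01])
for _ in range(100):
    theta, s = rmsprop_step(theta, s, g)
```

After `s` adapts, each coordinate's step size approaches lr regardless of the raw gradient magnitude, so both coordinates end up displaced by nearly the same amount.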


Adam (Adaptive Moment Estimation)

Maintain 1st (momentum) and 2nd (RMSProp) moment estimates with bias correction, where g_t = ∇_θ L(θ_t):

m_t = β₁ m_{t−1} + (1 − β₁) g_t
v_t = β₂ v_{t−1} + (1 − β₂) g_t²   (element-wise)
m̂_t = m_t / (1 − β₁ᵗ),  v̂_t = v_t / (1 − β₂ᵗ)
θ_{t+1} = θ_t − η m̂_t / (√v̂_t + ε)

Default hyperparameters: η = 10⁻³, β₁ = 0.9, β₂ = 0.999, ε = 10⁻⁸.
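Putting the moment estimates and bias correction together, one Adam step can be sketched as (the function name and example gradients are illustrative assumptions):

```python
import numpy as np

def adam_step(theta, m, v, grad, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; t is the 1-indexed step count used for bias correction."""
    m = beta1 * m + (1 - beta1) * grad         # 1st moment (momentum direction)
    v = beta2 * v + (1 - beta2) * grad**2      # 2nd moment (RMSProp-style scale)
    m_hat = m / (1 - beta1**t)                 # correct zero-initialization bias
    v_hat = v / (1 - beta2**t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# On the very first step the corrections make m̂ = g and v̂ = g²,
# so the update is ≈ lr · sign(g) in each coordinate
theta, m, v = np.zeros(2), np.zeros(2), np.zeros(2)
theta, m, v = adam_step(theta, m, v, grad=np.array([10.0, -0.5]), t=1)
```

Without the 1/(1 − βᵗ) corrections, the zero-initialized moments would shrink the first updates by roughly (1 − β₁), the cold-start problem mentioned in the intuition section.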

AdamW is a variant that correctly decouples weight decay from the adaptive update (standard in transformer training).

Applications

  • Adam/AdamW: default for transformers, NLP, most research
  • SGD + momentum: still competitive in computer vision; can reach lower test error with careful LR schedule tuning
  • RMSProp: historically popular for RNNs; largely superseded by Adam

Trade-offs

  • Generalization gap: Adam can generalize worse than SGD+momentum on some image benchmarks due to convergence to sharper minima
  • Memory: Adam requires two extra moment vectors per parameter — significant for very large models
  • Epsilon sensitivity: an ε that is too small causes instability when √v̂_t ≈ 0 (near-zero gradients); too large dampens the adaptation
  • LR schedules still matter even with Adam; the adaptive rate does not eliminate the need for a good base LR