Optimization Algorithms

Definition

Iterative algorithms that update model parameters to minimize a training loss. They differ in how the gradient signal is used: how many examples contribute per step, whether past gradients are accumulated, and whether per-parameter learning rates are adapted.

Intuition

Imagine navigating a hilly loss surface. At each step you estimate the local slope and take a step downhill. The key questions are: how accurate is that slope estimate (batch size), how fast do you move (learning rate), and do you build up momentum to glide through valleys (momentum and adaptive methods)?

Momentum smooths out oscillations by accumulating a velocity vector — frequent gradient directions build speed; noisy orthogonal fluctuations cancel out.

RMSProp equalizes update sizes across parameters by dividing by the recent RMS of each parameter’s gradient — parameters with large recent gradients get smaller steps and vice versa.

Adam combines both: a momentum term for direction, an RMSProp term for scale normalization, and bias correction for the cold-start problem of zero-initialized moment estimates.

Formal Description

Batch Variants of Gradient Descent

Let L(θ) be the loss. The general update rule is θ_{t+1} = θ_t − η ∇_θ L(θ_t), where η is the learning rate and the gradient is estimated from a batch of m examples.

| Variant | Batch size | Notes |
|---|---|---|
| Full-batch GD | m = N (entire dataset) | Exact gradient; slow, impractical for large data |
| Stochastic GD (SGD) | m = 1 | Very noisy; many updates per epoch; no GPU parallelism benefit |
| Mini-batch GD | 1 < m < N | Standard; typical m = 32–512 (powers of 2 for GPU efficiency) |

Smaller batches act as a regularizer (noise); larger batches converge to sharper minima that may generalize worse.
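The mini-batch loop above can be sketched in a few lines of NumPy. This is an illustrative toy (the function name, the linear-regression objective, and all hyperparameters are assumptions for the example, not from the source): reshuffle each epoch, slice out a batch, compute the batch gradient, step.

```python
import numpy as np

def minibatch_sgd(X, y, lr=0.1, batch_size=32, epochs=50, seed=0):
    """Minimize mean-squared error of a linear model with mini-batch SGD."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    theta = np.zeros(d)
    for _ in range(epochs):
        perm = rng.permutation(n)                # reshuffle each epoch
        for start in range(0, n, batch_size):
            idx = perm[start:start + batch_size]
            Xb, yb = X[idx], y[idx]
            # MSE gradient estimated on this batch only
            grad = 2 * Xb.T @ (Xb @ theta - yb) / len(idx)
            theta -= lr * grad
    return theta

# Toy check: recover known weights from noiseless data
rng = np.random.default_rng(1)
X = rng.normal(size=(256, 3))
true_theta = np.array([1.0, -2.0, 0.5])
y = X @ true_theta
theta = minibatch_sgd(X, y)
```

With batch_size=1 this degenerates to SGD, and with batch_size=n to full-batch GD, so the three table rows differ only in the slice length.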


Learning Rate Schedules

Decaying the learning rate η over training improves final convergence:

  • Step decay: η_t = η₀ · γ^⌊t/s⌋ (multiply by a factor γ < 1 every s steps)
  • Exponential decay: η_t = η₀ · e^(−kt)
  • Cosine annealing: η_t = η_min + ½(η₀ − η_min)(1 + cos(πt/T))

Linear warmup (ramp from 0 over the first few thousand steps) is standard in large-scale training to avoid early instability.
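A common combination of the pieces above is linear warmup followed by cosine annealing. A minimal sketch (function name and defaults are illustrative assumptions):

```python
import math

def lr_schedule(step, total_steps, base_lr=3e-4, warmup_steps=1000, min_lr=0.0):
    """Linear warmup from 0 to base_lr, then cosine decay down to min_lr."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps             # linear ramp
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * progress))
```

The schedule is a pure function of the step count, which makes it easy to resume training from a checkpoint at the correct learning rate.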


Momentum

Accumulate a velocity vector and step along it:

v_t = β v_{t−1} + ∇_θ L(θ_t)
θ_{t+1} = θ_t − η v_t

Typical β = 0.9. Initialise v₀ = 0.
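As a minimal sketch of one momentum update (the toy quadratic objective and step counts are illustrative assumptions):

```python
import numpy as np

def momentum_step(theta, velocity, grad, lr=0.05, beta=0.9):
    """One momentum update: accumulate velocity, then step downhill along it."""
    velocity = beta * velocity + grad      # exponential accumulation of gradients
    theta = theta - lr * velocity
    return theta, velocity

# Minimize f(θ) = ½‖θ‖², whose gradient is simply θ
theta = np.array([5.0, -3.0])
v = np.zeros_like(theta)                   # v₀ = 0
for _ in range(200):
    theta, v = momentum_step(theta, v, grad=theta)
```

Because consistent gradient directions keep adding into `velocity`, the effective step length in a persistent direction approaches lr/(1 − β), i.e. roughly 10× the base step for β = 0.9.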


RMSProp

Track a running average of squared gradients and normalize each coordinate by its RMS:

s_t = ρ s_{t−1} + (1 − ρ) (∇_θ L(θ_t))²   (element-wise)
θ_{t+1} = θ_t − η ∇_θ L(θ_t) / (√s_t + ε)

Typical ρ = 0.9, ε = 10⁻⁸.
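A minimal NumPy sketch of the update, plus a toy demonstration of the equalization effect described above (the constant-gradient setup is an illustrative assumption):

```python
import numpy as np

def rmsprop_step(theta, s, grad, lr=1e-3, rho=0.9, eps=1e-8):
    """One RMSProp update: scale each coordinate by its recent RMS gradient."""
    s = rho * s + (1 - rho) * grad**2              # running mean of squared grads
    theta = theta - lr * grad / (np.sqrt(s) + eps)
    return theta, s

# Two coordinates with gradients differing by 10^4 in magnitude
theta, s = np.zeros(2), np.zeros(2)
g = np.array([100.0, 0.01])
for _ in range(100):
    theta, s = rmsprop_step(theta, s, g)
```

After `s` adapts, each coordinate's step size approaches lr regardless of the raw gradient magnitude, so both coordinates end up displaced by nearly the same amount.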


Adam (Adaptive Moment Estimation)

Maintain 1st (momentum) and 2nd (RMSProp) moment estimates with bias correction, where g_t = ∇_θ L(θ_t):

m_t = β₁ m_{t−1} + (1 − β₁) g_t
v_t = β₂ v_{t−1} + (1 − β₂) g_t²   (element-wise)
m̂_t = m_t / (1 − β₁ᵗ),  v̂_t = v_t / (1 − β₂ᵗ)
θ_{t+1} = θ_t − η m̂_t / (√v̂_t + ε)

Default hyperparameters: η = 10⁻³, β₁ = 0.9, β₂ = 0.999, ε = 10⁻⁸.
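Putting the moment estimates and bias correction together, one Adam step can be sketched as (the function name and example gradients are illustrative assumptions):

```python
import numpy as np

def adam_step(theta, m, v, grad, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; t is the 1-indexed step count used for bias correction."""
    m = beta1 * m + (1 - beta1) * grad         # 1st moment (momentum direction)
    v = beta2 * v + (1 - beta2) * grad**2      # 2nd moment (RMSProp-style scale)
    m_hat = m / (1 - beta1**t)                 # correct zero-initialization bias
    v_hat = v / (1 - beta2**t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# On the very first step the corrections make m̂ = g and v̂ = g²,
# so the update is ≈ lr · sign(g) in each coordinate
theta, m, v = np.zeros(2), np.zeros(2), np.zeros(2)
theta, m, v = adam_step(theta, m, v, grad=np.array([10.0, -0.5]), t=1)
```

Without the 1/(1 − βᵗ) corrections, the zero-initialized moments would shrink the first updates by roughly (1 − β₁), the cold-start problem mentioned in the intuition section.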

AdamW is a variant that correctly decouples weight decay from the adaptive update (standard in transformer training).

Applications

  • Adam/AdamW: default for transformers, NLP, most research
  • SGD + momentum: still competitive in computer vision; can reach lower test error with careful LR schedule tuning
  • RMSProp: historically popular for RNNs; largely superseded by Adam

Trade-offs

  • Generalization gap: Adam can generalize worse than SGD+momentum on some image benchmarks due to convergence to sharper minima
  • Memory: Adam requires two extra moment vectors per parameter — significant for very large models
  • Epsilon sensitivity: an ε that is too small causes instability when √v̂_t ≈ 0 (near-zero gradients); too large dampens the adaptation
  • LR schedules still matter even with Adam; the adaptive rate does not eliminate the need for a good base LR