Adaptive Optimizers

Definition

Optimizers that adapt per-parameter learning rates using estimates of gradient moments. Rather than using a single global learning rate, they maintain running statistics of past gradients and scale each parameter update individually.

Intuition

Momentum smooths out oscillations by accumulating a velocity vector: frequent gradient directions build up speed, noisy orthogonal directions cancel out. This helps navigate ravines where curvature differs sharply across dimensions.

RMSProp scales updates by the recent magnitude of gradients per parameter: parameters with large recent gradients get smaller steps, parameters with small gradients get larger steps. This equalizes update sizes across parameters with very different gradient scales.

Adam combines both: it uses a momentum term (1st moment) to smooth direction and an RMSProp-like term (2nd moment) to normalize scale — giving it the benefits of both. Bias correction compensates for the fact that the moment estimates are initialized at zero and are therefore too small early in training.

Formal Description

Let g_t be the gradient of the loss with respect to the parameters θ at step t.


Momentum:

    v_t = β v_{t-1} + g_t
    θ_t = θ_{t-1} − η v_t

Typical β = 0.9. Initialise v_0 = 0.
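A minimal sketch of the update above, using NumPy on a toy quadratic (the function names and hyperparameter values here are illustrative, not from the source):

```python
import numpy as np

def momentum_step(theta, v, grad, lr=0.01, beta=0.9):
    """One SGD-with-momentum update: accumulate velocity, then step."""
    v = beta * v + grad          # velocity builds up along persistent directions
    theta = theta - lr * v       # step along the smoothed direction
    return theta, v

# Minimise f(theta) = theta^2 (gradient 2*theta) starting from theta = 5
theta, v = np.array([5.0]), np.zeros(1)
for _ in range(200):
    theta, v = momentum_step(theta, v, 2.0 * theta)
print(theta)  # close to the minimum at 0
```

Note that the velocity makes the effective step size roughly η/(1 − β) once gradients point consistently in one direction, which is why β is usually tuned together with η.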


RMSProp:

    s_t = ρ s_{t-1} + (1 − ρ) g_t²
    θ_t = θ_{t-1} − η g_t / (√s_t + ε)

Typical ρ = 0.9, ε = 10⁻⁸. Initialise s_0 = 0.
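A sketch of this update on two parameters with a 100x gradient-scale gap, showing the per-parameter normalisation (NumPy; function names and the toy objective are illustrative):

```python
import numpy as np

def rmsprop_step(theta, s, grad, lr=0.01, rho=0.9, eps=1e-8):
    """One RMSProp update: divide by the running RMS of recent gradients."""
    s = rho * s + (1.0 - rho) * grad**2           # leaky average of squared gradients
    theta = theta - lr * grad / (np.sqrt(s) + eps)
    return theta, s

# f(a, b) = 100*a^2 + b^2: coordinate a has gradients ~100x larger than b
theta, s = np.array([1.0, 1.0]), np.zeros(2)
for _ in range(500):
    grad = np.array([200.0 * theta[0], 2.0 * theta[1]])
    theta, s = rmsprop_step(theta, s, grad)
print(theta)  # both coordinates shrink despite the scale gap
```

Because each update is divided by the gradient's own recent magnitude, both coordinates take steps of roughly η regardless of their raw gradient scale.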


Adam (Adaptive Moment Estimation):

Maintain both 1st and 2nd moment estimates:

    m_t = β₁ m_{t-1} + (1 − β₁) g_t
    v_t = β₂ v_{t-1} + (1 − β₂) g_t²

Bias correction (compensates for zero initialisation):

    m̂_t = m_t / (1 − β₁ᵗ)
    v̂_t = v_t / (1 − β₂ᵗ)

Parameter update:

    θ_t = θ_{t-1} − η m̂_t / (√v̂_t + ε)

Default hyperparameters: η = 0.001, β₁ = 0.9, β₂ = 0.999, ε = 10⁻⁸.
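The full Adam update can be sketched in a few lines of NumPy with the default hyperparameters (the step counter t starts at 1 so the bias-correction denominators are nonzero; names are illustrative):

```python
import numpy as np

def adam_step(theta, m, v, grad, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update with bias correction (t counts from 1)."""
    m = b1 * m + (1 - b1) * grad             # 1st moment: momentum-style smoothing
    v = b2 * v + (1 - b2) * grad**2          # 2nd moment: RMSProp-style scaling
    m_hat = m / (1 - b1**t)                  # bias-corrected estimates; without this,
    v_hat = v / (1 - b2**t)                  # early steps would be far too small
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Minimise f(theta) = theta^2 (gradient 2*theta) from theta = 5
theta, m, v = np.array([5.0]), np.zeros(1), np.zeros(1)
for t in range(1, 8001):
    theta, m, v = adam_step(theta, m, v, 2.0 * theta, t)
print(theta)
```

Note how the normalisation makes each step roughly η in magnitude regardless of gradient scale, which is why Adam's default η = 0.001 transfers across problems better than a raw SGD learning rate.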

Applications

  • Training deep networks across virtually all domains (vision, NLP, RL)
  • Adam is the default choice in most modern research and production work
  • Momentum (with SGD) is still widely used in computer vision where careful tuning yields strong generalization
  • RMSProp was historically popular for RNNs before Adam became dominant

Trade-offs

  • Generalization: Adam can generalize slightly worse than SGD+momentum on some tasks (e.g., image classification benchmarks); SGD+momentum with a tuned schedule sometimes reaches lower test error
  • Epsilon sensitivity: the stabilizer ε dominates the denominator when the second-moment estimate v̂_t is very small; an ε that is too small can cause instability, while one that is too large dampens the adaptation
  • Momentum hyperparameter: β (β₁ in Adam) governs how much past gradients matter; high values (0.95+) can cause the optimizer to overshoot in quickly changing loss landscapes
  • Memory overhead: Adam stores two extra gradient moment vectors per parameter, doubling memory vs. plain SGD (significant for very large models)
  • Learning rate schedules still matter with Adam; cosine annealing or linear warmup + decay are common choices — see gradient_descent
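The warmup-plus-cosine pairing mentioned above can be sketched as a simple step-to-learning-rate function (the function name, base rate, and warmup length are illustrative choices, not prescribed values):

```python
import math

def lr_schedule(step, total_steps, base_lr=3e-4, warmup_steps=100):
    """Linear warmup to base_lr, then cosine decay to zero."""
    if step < warmup_steps:
        # ramp up linearly so early, noisy moment estimates take small steps
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

# The schedule peaks at base_lr right after warmup and decays to ~0 at the end
print(lr_schedule(100, 1000), lr_schedule(1000, 1000))
```

In practice this scalar would multiply the η in the Adam update at every step; warmup is especially useful with Adam because the second-moment estimate is unreliable in the first few dozen steps.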