Adaptive Optimizers

Definition

Optimizers that adapt per-parameter learning rates using estimates of gradient moments. Rather than using a single global learning rate, they maintain running statistics of past gradients and scale each parameter update individually.

Intuition

Momentum smooths out oscillations by accumulating a velocity vector: frequent gradient directions build up speed, noisy orthogonal directions cancel out. This helps navigate ravines where curvature differs sharply across dimensions.

RMSProp scales updates by the recent magnitude of gradients per parameter: parameters with large recent gradients get smaller steps, parameters with small gradients get larger steps. This equalizes update sizes across parameters with very different gradient scales.

Adam combines both: it uses a momentum term (1st moment) to smooth direction and an RMSProp-like term (2nd moment) to normalize scale — giving it the benefits of both. Bias correction compensates for the fact that the moment estimates are initialized at zero and are therefore too small early in training.

Formal Description

Let g_t be the gradient of the loss with respect to the parameters θ at step t.


Momentum:

    v_t = β v_{t-1} + g_t
    θ_t = θ_{t-1} − η v_t

Typical β = 0.9. Initialise v_0 = 0.
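A minimal sketch of the update above, using NumPy on a toy quadratic (the function names and hyperparameter values here are illustrative, not from the source):

```python
import numpy as np

def momentum_step(theta, v, grad, lr=0.01, beta=0.9):
    """One SGD-with-momentum update: accumulate velocity, then step."""
    v = beta * v + grad          # velocity builds up along persistent directions
    theta = theta - lr * v       # step along the smoothed direction
    return theta, v

# Minimise f(theta) = theta^2 (gradient 2*theta) starting from theta = 5
theta, v = np.array([5.0]), np.zeros(1)
for _ in range(200):
    theta, v = momentum_step(theta, v, 2.0 * theta)
print(theta)  # close to the minimum at 0
```

Note that the velocity makes the effective step size roughly η/(1 − β) once gradients point consistently in one direction, which is why β is usually tuned together with η.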


RMSProp:

    s_t = ρ s_{t-1} + (1 − ρ) g_t²
    θ_t = θ_{t-1} − η g_t / (√s_t + ε)

Typical ρ = 0.9, ε = 10⁻⁸. Initialise s_0 = 0.
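A sketch of this update on two parameters with a 100x gradient-scale gap, showing the per-parameter normalisation (NumPy; function names and the toy objective are illustrative):

```python
import numpy as np

def rmsprop_step(theta, s, grad, lr=0.01, rho=0.9, eps=1e-8):
    """One RMSProp update: divide by the running RMS of recent gradients."""
    s = rho * s + (1.0 - rho) * grad**2           # leaky average of squared gradients
    theta = theta - lr * grad / (np.sqrt(s) + eps)
    return theta, s

# f(a, b) = 100*a^2 + b^2: coordinate a has gradients ~100x larger than b
theta, s = np.array([1.0, 1.0]), np.zeros(2)
for _ in range(500):
    grad = np.array([200.0 * theta[0], 2.0 * theta[1]])
    theta, s = rmsprop_step(theta, s, grad)
print(theta)  # both coordinates shrink despite the scale gap
```

Because each update is divided by the gradient's own recent magnitude, both coordinates take steps of roughly η regardless of their raw gradient scale.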


Adam (Adaptive Moment Estimation):

Maintain both 1st and 2nd moment estimates:

    m_t = β₁ m_{t-1} + (1 − β₁) g_t
    v_t = β₂ v_{t-1} + (1 − β₂) g_t²

Bias correction (compensates for zero initialisation):

    m̂_t = m_t / (1 − β₁ᵗ)
    v̂_t = v_t / (1 − β₂ᵗ)

Parameter update:

    θ_t = θ_{t-1} − η m̂_t / (√v̂_t + ε)

Default hyperparameters: η = 0.001, β₁ = 0.9, β₂ = 0.999, ε = 10⁻⁸.
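The full Adam update can be sketched in a few lines of NumPy with the default hyperparameters (the step counter t starts at 1 so the bias-correction denominators are nonzero; names are illustrative):

```python
import numpy as np

def adam_step(theta, m, v, grad, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update with bias correction (t counts from 1)."""
    m = b1 * m + (1 - b1) * grad             # 1st moment: momentum-style smoothing
    v = b2 * v + (1 - b2) * grad**2          # 2nd moment: RMSProp-style scaling
    m_hat = m / (1 - b1**t)                  # bias-corrected estimates; without this,
    v_hat = v / (1 - b2**t)                  # early steps would be far too small
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Minimise f(theta) = theta^2 (gradient 2*theta) from theta = 5
theta, m, v = np.array([5.0]), np.zeros(1), np.zeros(1)
for t in range(1, 8001):
    theta, m, v = adam_step(theta, m, v, 2.0 * theta, t)
print(theta)
```

Note how the normalisation makes each step roughly η in magnitude regardless of gradient scale, which is why Adam's default η = 0.001 transfers across problems better than a raw SGD learning rate.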

Applications

  • Training deep networks across virtually all domains (vision, NLP, RL)
  • Adam is the default choice in most modern research and production work
  • Momentum (with SGD) is still widely used in computer vision where careful tuning yields strong generalization
  • RMSProp was historically popular for RNNs before Adam became dominant

Trade-offs

  • Generalization: Adam can generalize slightly worse than SGD+momentum on some tasks (e.g., image classification benchmarks); SGD+momentum with a tuned schedule sometimes reaches lower test error
  • Epsilon sensitivity: the stabilizer ε dominates the denominator when the second-moment estimate v̂_t is very small; an ε that is too small can cause instability, while one that is too large dampens the adaptation
  • Momentum hyperparameter: β (β₁ in Adam) governs how much past gradients matter; high values (0.95+) can cause the optimizer to overshoot in quickly changing loss landscapes
  • Memory overhead: Adam stores two extra gradient moment vectors per parameter, doubling memory vs. plain SGD (significant for very large models)
  • Learning rate schedules still matter with Adam; cosine annealing or linear warmup + decay are common choices — see gradient_descent
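The warmup-plus-cosine pairing mentioned above can be sketched as a simple step-to-learning-rate function (the function name, base rate, and warmup length are illustrative choices, not prescribed values):

```python
import math

def lr_schedule(step, total_steps, base_lr=3e-4, warmup_steps=100):
    """Linear warmup to base_lr, then cosine decay to zero."""
    if step < warmup_steps:
        # ramp up linearly so early, noisy moment estimates take small steps
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

# The schedule peaks at base_lr right after warmup and decays to ~0 at the end
print(lr_schedule(100, 1000), lr_schedule(1000, 1000))
```

In practice this scalar would multiply the η in the Adam update at every step; warmup is especially useful with Adam because the second-moment estimate is unreliable in the first few dozen steps.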