Gradient Descent

Definition

Iterative optimization algorithm that updates parameters by following the negative gradient of a loss function. Variants differ in how many training examples are used per gradient estimate: full-batch, mini-batch, or stochastic (single example).

Intuition

Imagine standing on a hilly loss surface. At each step you look around, find the direction of steepest descent, and take a step in that direction. The learning rate controls step size. Too large and you overshoot valleys; too small and convergence is slow. Using mini-batches is like estimating the slope from a random sample of the terrain—noisier but much faster per update.

Formal Description

Cost function. Given N training examples with inputs x_i and labels y_i, and a model f with parameters θ:

    J(θ) = (1/N) Σ_{i=1}^{N} L(f(x_i; θ), y_i)

General update rule, with learning rate η:

    θ ← θ − η ∇_θ J(θ)
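A minimal sketch of the update loop in plain Python (the quadratic objective, learning rate, and step count below are illustrative assumptions, not from the text):

```python
def gradient_descent(grad, theta0, lr=0.1, steps=100):
    """Repeatedly apply theta <- theta - lr * grad(theta)."""
    theta = theta0
    for _ in range(steps):
        theta = theta - lr * grad(theta)
    return theta

# Minimize J(theta) = (theta - 3)^2, whose gradient is 2 * (theta - 3).
# The minimizer is theta = 3, which the iterates approach geometrically.
theta_star = gradient_descent(lambda t: 2 * (t - 3.0), theta0=0.0)
```

With lr=0.1 the error shrinks by a factor of 0.8 per step, so 100 steps bring theta within numerical tolerance of 3.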

Batch variants:

Variant             | Batch size    | Gradient estimate        | Notes
Full-batch GD       | N             | Exact over dataset       | Smooth, slow per update
Stochastic GD (SGD) | 1             | Single example           | Very noisy, many updates
Mini-batch GD       | m (1 < m < N) | Average over m examples  | Standard in practice

Mini-batch update using batch B of size m = |B|:

    θ ← θ − (η/m) Σ_{i∈B} ∇_θ L(f(x_i; θ), y_i)
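The mini-batch update can be sketched with NumPy; the synthetic linear-regression task, learning rate, batch size, and epoch count below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: y = 2x + 1 plus small noise (illustrative).
X = rng.uniform(-1.0, 1.0, size=(256, 1))
y = 2.0 * X[:, 0] + 1.0 + 0.01 * rng.normal(size=256)

w, b = 0.0, 0.0
lr, batch_size = 0.1, 32

for epoch in range(200):
    perm = rng.permutation(len(X))            # reshuffle each epoch
    for start in range(0, len(X), batch_size):
        idx = perm[start:start + batch_size]  # indices of batch B
        xb, yb = X[idx, 0], y[idx]
        err = (w * xb + b) - yb               # gradient of 0.5 * squared error
        w -= lr * np.mean(err * xb)           # average gradient over the batch
        b -= lr * np.mean(err)
```

Each inner step averages the per-example gradients over the batch before applying the update, exactly as in the formula above; w and b converge close to the generating values 2 and 1.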

Mini-batch size trade-offs:

  • Typical m: 32–256 (powers of 2 for GPU efficiency)
  • Smaller m: more gradient noise (acts as a regularizer), more updates per epoch, poorer hardware utilization
  • Larger m: smoother gradients, better hardware utilization, less regularizing noise, higher memory use
  • m = 1 (SGD): maximum noise, no GPU parallelism benefit
  • m = N (full-batch): exact gradient, no noise, impractical for large datasets

Learning rate schedules. Decaying η over training improves final convergence:

Step decay — reduce η by a factor γ every s epochs:

    η_t = η_0 · γ^⌊t/s⌋

Exponential decay:

    η_t = η_0 · e^(−k t)

1/t decay:

    η_t = η_0 / (1 + k t)

Cosine annealing — anneal from η_0 down to η_min over T steps:

    η_t = η_min + (1/2)(η_0 − η_min)(1 + cos(π t / T))
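The four schedules translate directly into code; function names and hyperparameter symbols follow the formulas above, and any example values are illustrative:

```python
import math

def step_decay(eta0, gamma, s, t):
    """Multiply eta0 by gamma once every s epochs."""
    return eta0 * gamma ** (t // s)

def exponential_decay(eta0, k, t):
    """Smooth exponential decay: eta0 * exp(-k * t)."""
    return eta0 * math.exp(-k * t)

def inv_t_decay(eta0, k, t):
    """1/t decay: eta0 / (1 + k * t)."""
    return eta0 / (1 + k * t)

def cosine_annealing(eta0, eta_min, T, t):
    """Half-cosine anneal from eta0 at t=0 down to eta_min at t=T."""
    return eta_min + 0.5 * (eta0 - eta_min) * (1 + math.cos(math.pi * t / T))
```

For example, step_decay(0.1, gamma=0.5, s=10, t=25) halves the rate twice, and cosine_annealing reaches exactly eta_min at t = T.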

Applications

  • Training all neural network models (DNNs, CNNs, RNNs, Transformers)
  • Minimizing any differentiable objective
  • Learning rate schedules are standard practice in competitive deep learning

Trade-offs

  • Learning rate sensitivity: too high → divergence or oscillation; too low → slow convergence or getting stuck in poor local minima
  • Batch size vs. noise vs. compute: small batches add beneficial noise but reduce GPU utilization; large batches tend to converge to sharper minima (potentially worse generalization)
  • Schedule tuning: requires additional hyperparameter choices (decay rate, warmup steps, minimum η); cosine annealing with warm restarts is often a strong default
  • Plain GD (no momentum) is rarely used; adaptive optimizers (e.g., Adam) are typically preferred