Gradient Descent
Definition
Iterative optimization algorithm that updates parameters by following the negative gradient of a loss function. Variants differ in how many training examples are used per gradient estimate: full-batch, mini-batch, or stochastic (single example).
Intuition
Imagine standing on a hilly loss surface. At each step you look around, find the direction of steepest descent, and take a step in that direction. The learning rate controls step size. Too large and you overshoot valleys; too small and convergence is slow. Using mini-batches is like estimating the slope from a random sample of the terrain—noisier but much faster per update.
Formal Description
Cost function. Given $N$ training examples with inputs $x^{(i)}$ and labels $y^{(i)}$:

$$J(\theta) = \frac{1}{N} \sum_{i=1}^{N} \ell\big(f_\theta(x^{(i)}),\, y^{(i)}\big)$$

General update rule, with learning rate $\eta$:

$$\theta \leftarrow \theta - \eta\, \nabla_\theta J(\theta)$$
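A minimal sketch of this update on a one-dimensional quadratic (the function, starting point, learning rate, and step count are illustrative choices, not prescribed by the text):

```python
# Minimize f(theta) = (theta - 3)^2; its gradient is 2 * (theta - 3).
def grad(theta):
    return 2.0 * (theta - 3.0)

theta = 0.0   # initial parameter
eta = 0.1     # learning rate (illustrative)
for _ in range(100):
    theta = theta - eta * grad(theta)   # theta <- theta - eta * grad_J(theta)

print(theta)  # approaches the minimizer theta* = 3
```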
Batch variants:
| Variant | Batch size | Gradient estimate | Notes |
|---|---|---|---|
| Full-batch GD | $N$ (entire dataset) | Exact over dataset | Smooth, slow per update |
| Stochastic GD (SGD) | $1$ | Single example | Very noisy, many updates |
| Mini-batch GD | $B$, with $1 < B < N$ | Average over $B$ examples | Standard in practice |
Mini-batch update using batch $\mathcal{B}$ of size $B$:

$$\theta \leftarrow \theta - \eta \cdot \frac{1}{B} \sum_{i \in \mathcal{B}} \nabla_\theta\, \ell\big(f_\theta(x^{(i)}),\, y^{(i)}\big)$$
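The mini-batch loop can be sketched on a synthetic linear-regression problem (the data dimensions, noise level, batch size, learning rate, and epoch count are all assumed for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic linear-regression data: y = X @ w_true + noise (illustrative).
N, d, B = 1000, 3, 32
w_true = np.array([2.0, -1.0, 0.5])
X = rng.normal(size=(N, d))
y = X @ w_true + 0.01 * rng.normal(size=N)

w = np.zeros(d)          # parameters theta
eta = 0.1                # learning rate
for epoch in range(20):
    perm = rng.permutation(N)                  # reshuffle each epoch
    for start in range(0, N, B):
        idx = perm[start:start + B]            # current mini-batch
        Xb, yb = X[idx], y[idx]
        residual = Xb @ w - yb
        g = (2.0 / len(idx)) * (Xb.T @ residual)   # batch-averaged MSE gradient
        w -= eta * g                               # the mini-batch update

print(np.round(w, 3))    # should land close to w_true
```

Reshuffling before each epoch keeps the mini-batches independent draws from the training set, which is what makes the batch gradient an unbiased estimate of the full gradient.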
Mini-batch size trade-offs:
- Typical $B$: a power of 2 (e.g., 32–256) for GPU efficiency
- Smaller $B$: more gradient noise, acts as a regularizer, more updates per epoch, slower per step
- Larger $B$: smoother gradients, faster per step, less regularizing noise, higher memory use
- $B = 1$ (SGD): maximum noise, no GPU parallelism benefit
- $B = N$ (full-batch): exact gradient, no noise, impractical for large datasets
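The noise trade-off above can be checked numerically: for a simple quadratic per-example loss, the mini-batch gradient is an unbiased estimate of the full-batch gradient, and its standard deviation shrinks roughly as $1/\sqrt{B}$. The targets, sample counts, and batch sizes below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# Per-example loss l_i(theta) = (theta - t_i)^2, so the per-example
# gradient at theta is 2 * (theta - t_i). Targets t_i are synthetic.
targets = rng.normal(loc=3.0, scale=1.0, size=10_000)
theta = 0.0
full_grad = np.mean(2.0 * (theta - targets))      # exact full-batch gradient

def minibatch_grad(B):
    idx = rng.choice(targets.size, size=B, replace=False)
    return np.mean(2.0 * (theta - targets[idx]))

means, stds = {}, {}
for B in (1, 32, 1024):
    est = np.array([minibatch_grad(B) for _ in range(2000)])
    means[B], stds[B] = est.mean(), est.std()
    print(f"B={B}: mean={means[B]:+.2f}, std={stds[B]:.3f}")
# The mean stays near full_grad for every B (unbiased estimate);
# the std shrinks roughly as 1/sqrt(B) (smaller batches are noisier).
```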
Learning rate schedules. Decaying $\eta$ over training improves final convergence:

Step decay (reduce $\eta$ by a factor $\gamma$ every $s$ epochs):

$$\eta_t = \eta_0 \, \gamma^{\lfloor t/s \rfloor}$$

Exponential decay:

$$\eta_t = \eta_0 \, e^{-kt}$$

1/t decay:

$$\eta_t = \frac{\eta_0}{1 + kt}$$

Cosine annealing over $T$ steps:

$$\eta_t = \eta_{\min} + \frac{1}{2}(\eta_0 - \eta_{\min})\left(1 + \cos\frac{\pi t}{T}\right)$$
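The four schedules can be sketched directly; the values of $\eta_0$, $\gamma$, $s$, $k$, $\eta_{\min}$, and $T$ below are illustrative assumptions:

```python
import math

eta0 = 0.1   # initial learning rate (assumed)

def step_decay(t, gamma=0.5, s=10):
    # Multiply by gamma once every s epochs.
    return eta0 * gamma ** (t // s)

def exponential_decay(t, k=0.05):
    return eta0 * math.exp(-k * t)

def one_over_t_decay(t, k=0.1):
    return eta0 / (1.0 + k * t)

def cosine_annealing(t, T=100, eta_min=0.0):
    # Smoothly decays from eta0 at t=0 down to eta_min at t=T.
    return eta_min + 0.5 * (eta0 - eta_min) * (1 + math.cos(math.pi * t / T))

for t in (0, 50, 100):
    print(t, step_decay(t), cosine_annealing(t))
```

Cosine annealing reaches exactly $\eta_{\min}$ at $t = T$, which is why it pairs naturally with warm restarts: reset $t$ to 0 and decay again.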
Applications
- Training all neural network models (DNNs, CNNs, RNNs, Transformers)
- Any differentiable objective minimization
- Learning rate schedules are standard practice in competitive deep learning
Trade-offs
- Learning rate sensitivity: too high → divergence or oscillation; too low → slow convergence and risk of stalling in plateaus or poor local minima
- Batch size vs. noise vs. compute: small batches add beneficial noise but reduce GPU utilization; large batches converge to sharper minima (potentially worse generalization)
- Schedule tuning: requires additional hyperparameter choices (decay rate, warmup steps, minimum $\eta$); cosine annealing with warm restarts is often a strong default
- Plain GD (no momentum) is rarely used; SGD with momentum or adaptive optimizers (e.g., Adam) are typically preferred