Scaling Laws

Purpose

How model performance scales with compute, data, and parameters — and how to use this to make training decisions.

Chinchilla Scaling Laws

[Hoffmann et al. 2022] showed that for compute-optimal training, tokens ≈ 20× parameters.

Key formula: for a compute budget C (FLOPs), using C ≈ 6ND and D ≈ 20N, the optimum is N* ≈ (C / 120)^0.5 and D* ≈ 20 × N*.
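A minimal sketch of this rule of thumb, assuming the standard C ≈ 6ND training-FLOP approximation and the 20 tokens-per-parameter heuristic (the 5.76e23 example budget is roughly Chinchilla's own):

```python
import math

def chinchilla_optimal(compute_flops):
    """Compute-optimal parameter and token counts for a FLOP budget,
    assuming C ~= 6*N*D and the Chinchilla heuristic D ~= 20*N."""
    # C = 6 * N * (20 * N) = 120 * N^2  =>  N = sqrt(C / 120)
    n = math.sqrt(compute_flops / 120)
    d = 20 * n
    return n, d

# Roughly Chinchilla's budget: ~5.76e23 FLOPs -> ~70B params, ~1.4T tokens
n, d = chinchilla_optimal(5.76e23)
print(f"N* ~= {n:.2e} params, D* ~= {d:.2e} tokens")
```

The output lands near Chinchilla's actual configuration (70B parameters, 1.4T tokens), a useful sanity check on the approximation.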

Prior work (Kaplan et al. 2020) recommended scaling parameters much faster than data, leaving models like GPT-3 under-trained; Chinchilla corrected this.

Power Law Relationships

Loss scales as a power law in:

  • Model parameters N: L(N) ∝ N^{-α}
  • Dataset tokens D: L(D) ∝ D^{-β}
  • Compute C: L(C) ∝ C^{-γ}

Fitted loss exponents (Chinchilla): α ≈ 0.34, β ≈ 0.28. The often-quoted ≈ 0.5 values are the compute-allocation exponents (N* ∝ C^0.5, D* ∝ C^0.5), not the loss exponents.
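The single-variable laws above are slices of the joint parametric fit L(N, D) = E + A·N^(−α) + B·D^(−β). A sketch using the (approximate) coefficients reported by Hoffmann et al. 2022:

```python
def chinchilla_loss(n_params, n_tokens,
                    E=1.69, A=406.4, B=410.7, alpha=0.34, beta=0.28):
    """Parametric loss L(N, D) = E + A/N^alpha + B/D^beta, with
    coefficients approximately as fitted by Hoffmann et al. 2022.
    E is the irreducible loss; the other two terms shrink as model
    size and data grow."""
    return E + A / n_params**alpha + B / n_tokens**beta

# Chinchilla-scale run: 70B params, 1.4T tokens
loss = chinchilla_loss(70e9, 1.4e12)
print(round(loss, 3))
```

Note the additive form: past a point, adding parameters without adding tokens leaves the B/D^β term dominant, which is exactly the under-training failure mode Chinchilla identified.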

Emergent Abilities

Certain capabilities (multi-step arithmetic, chain-of-thought, in-context learning) appear abruptly above threshold scales. Not predicted by smooth power laws — may reflect phase transitions or evaluation metric discontinuities.

Inference-Time Compute Scaling

Recent work (OpenAI o1, DeepSeek-R1) shows that scaling test-time compute via chain-of-thought and search can partially substitute for pre-training scale. This traces a separate scaling curve: performance improves smoothly (often roughly log-linearly) with inference-time FLOPs.
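A toy model of one test-time scaling mechanism — self-consistency-style sample-and-vote, not the actual o1/R1 method — shows how accuracy rises with the number of samples (i.e., with inference compute), assuming each independent sample is correct with probability p:

```python
from math import comb

def majority_vote_accuracy(p, k):
    """Probability that a strict majority of k independent samples is
    correct, each correct with probability p. A toy model of
    sample-and-vote test-time compute scaling; use odd k so ties
    cannot occur."""
    return sum(comb(k, i) * p**i * (1 - p)**(k - i)
               for i in range(k // 2 + 1, k + 1))

for k in (1, 5, 25):
    print(k, round(majority_vote_accuracy(0.6, k), 3))
```

Accuracy climbs from 0.6 toward 1 as k grows — more samples buy better answers, with diminishing returns per sample, which is the qualitative shape of the inference-time scaling curves.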

Practical Implications

  • Choose N and D jointly for a compute budget — don’t just maximise N
  • Smaller models trained on more data often outperform larger undertrained models at equivalent FLOPs
  • Training beyond Chinchilla-optimal (more tokens per parameter, LLaMA-style) costs extra training FLOPs but yields a smaller model that is cheaper at every inference call
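The training/inference trade-off in the last bullet can be sketched with back-of-envelope arithmetic, assuming the standard ~6ND training and ~2N per-token inference FLOP approximations (the 70B/1.4T and 8B figures are illustrative, not from the source):

```python
def train_flops(n, d):
    """Standard approximation: ~6 FLOPs per parameter per training token."""
    return 6 * n * d

def infer_flops_per_token(n):
    """Forward pass only: ~2 FLOPs per parameter per generated token."""
    return 2 * n

# Chinchilla-optimal 70B model vs a smaller model overtrained on the same budget
budget = train_flops(70e9, 1.4e12)   # ~5.9e23 FLOPs
small_n = 8e9
small_d = budget / (6 * small_n)     # tokens affordable for the 8B model

print(f"8B model can train on {small_d:.2e} tokens for the same budget")
print(f"inference cost ratio (8B vs 70B): "
      f"{infer_flops_per_token(small_n) / infer_flops_per_token(70e9):.2f}")
```

The 8B model sees far more than its Chinchilla-optimal 160B tokens, ending with somewhat higher loss than the 70B model, but every inference token costs roughly 9× less.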

References