Data Splits and Distribution

Definition

The practice of partitioning data into train/dev/test sets to enable unbiased model iteration and evaluation, and managing the consequences of distribution mismatch between sets.

Intuition

The dev set guides model selection; the test set estimates real-world performance; they must reflect the actual deployment distribution — otherwise you’re optimizing for the wrong target.

Formal Description

Train/dev/test partition:

  • Train: used for optimization (gradient updates, parameter fitting)
  • Dev (validation): used for hyperparameter selection and model comparison; may be evaluated many times
  • Test: used for final unbiased evaluation — touch only once

Sizes:

  • Traditional split: 60/20/20 for small datasets
  • With millions of examples, 98/1/1 is reasonable — dev/test only need to be large enough for statistical significance

Distribution requirements:

  • Dev and test must come from the same distribution
  • Train can be augmented from other sources (web crawl, synthetic data, etc.)

Distribution mismatch: if , improvements in dev/test error may not reflect real-world improvement; models may overfit to the training distribution.

Training-dev set: a held-out set sampled from the training distribution (not dev/test distribution); used to distinguish variance from mismatch:

GapInterpretation
Train error ≈ training-dev error ≪ dev errorMismatch (not variance)
Train error ≪ training-dev error ≈ dev errorVariance (not mismatch)
Train error ≪ training-dev error ≪ dev errorBoth variance and mismatch

Applications

All supervised ML projects; especially important in industry where training data is crawled or synthetic but deployment is user-facing.

Trade-offs

  • Too-small dev set → noisy model comparisons
  • Mismatched dev/test → optimizing for the wrong distribution
  • Leaking test set → overly optimistic reported performance