Data Splits and Distribution

Definition

The practice of partitioning data into train/dev/test sets to enable unbiased model iteration and evaluation, and managing the consequences of distribution mismatch between sets.

Intuition

The dev set guides model selection and the test set estimates real-world performance, so both must reflect the actual deployment distribution; otherwise you’re optimizing for the wrong target.

Formal Description

Train/dev/test partition:

Train: used for optimization (gradient updates, parameter fitting)
Dev (validation): used for hyperparameter selection and model comparison; may be evaluated many times
Test: used for final unbiased evaluation — touch only once
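
The three roles above can be sketched as a single shuffled partition of example indices. This is a minimal illustration, not a prescribed implementation: the function name `split_indices`, its default fractions, and the fixed seed are assumptions made for the example.

```python
import random

def split_indices(n, dev_frac=0.2, test_frac=0.2, seed=0):
    """Shuffle indices 0..n-1 and partition them into train/dev/test lists."""
    if not 0 <= dev_frac + test_frac < 1:
        raise ValueError("dev_frac + test_frac must be in [0, 1)")
    idx = list(range(n))
    random.Random(seed).shuffle(idx)  # fixed seed keeps the split reproducible
    n_dev = round(n * dev_frac)
    n_test = round(n * test_frac)
    dev = idx[:n_dev]                     # evaluated many times during iteration
    test = idx[n_dev:n_dev + n_test]      # set aside for the final evaluation
    train = idx[n_dev + n_test:]          # everything else is used for fitting
    return train, dev, test

# Hypothetical dataset of 1,000 examples with the default 60/20/20 split.
train, dev, test = split_indices(1000)
```

Splitting indices rather than the data itself keeps the sketch independent of how examples are stored, and a fixed seed means the test set stays the same across runs.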

Sizes:

Traditional split: 60/20/20 for small datasets
With millions of examples, 98/1/1 is reasonable: dev and test only need to be large enough to measure the differences between models that matter to you (1% of 1,000,000 is still 10,000 examples)
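
One rough way to judge whether a dev or test set is "large enough" is the binomial standard error of an error rate measured on it. The sketch below assumes i.i.d. examples and a normal approximation; both helper names are hypothetical and the 95% level (`z=1.96`) is an arbitrary choice for illustration.

```python
import math

def error_rate_std_error(error_rate, n):
    """Binomial standard error of an error rate measured on n i.i.d. examples."""
    return math.sqrt(error_rate * (1.0 - error_rate) / n)

def dev_size_for_half_width(error_rate, half_width, z=1.96):
    """Smallest n whose normal-approximation interval is within +/- half_width."""
    return math.ceil(z * z * error_rate * (1.0 - error_rate) / (half_width ** 2))

# Assumed true error rate of 10%, measured on dev sets of two different sizes.
se_large = error_rate_std_error(0.10, 10_000)  # e.g. 1% of 1,000,000 examples
se_small = error_rate_std_error(0.10, 200)     # e.g. 20% of 1,000 examples
n_needed = dev_size_for_half_width(0.10, 0.01)
```

Under these assumptions the standard error is about 0.3 percentage points with 10,000 examples and about 2.1 points with 200, which is why a small percentage of a very large dataset can still be enough.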

Distribution requirements:

Dev and test must come from the same distribution
Train can be augmented from other sources (web crawl, synthetic data, etc.)
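
A minimal sketch of these two requirements: dev and test are drawn only from deployment-like ("target") data, while the training set takes the extra sources. Putting the leftover target examples into the training set is an assumption of this example, not something stated above, and `build_splits` and the toy data are hypothetical.

```python
import random

def build_splits(target_examples, extra_examples, n_dev, n_test, seed=0):
    """Draw dev/test from target data only; train gets the rest plus extra sources."""
    if n_dev + n_test > len(target_examples):
        raise ValueError("not enough target examples for dev and test")
    target = list(target_examples)
    random.Random(seed).shuffle(target)
    dev = target[:n_dev]                     # same distribution as test
    test = target[n_dev:n_dev + n_test]      # same distribution as dev
    # Assumption: leftover target examples join the extra sources in train.
    train = target[n_dev + n_test:] + list(extra_examples)
    random.Random(seed + 1).shuffle(train)
    return train, dev, test

# Hypothetical data: 10,000 user-facing examples and 200,000 crawled ones.
target = [("app", i) for i in range(10_000)]
crawled = [("web", i) for i in range(200_000)]
train, dev, test = build_splits(target, crawled, n_dev=2_500, n_test=2_500)
```

Because dev and test come from one shuffled pool, they share a distribution by construction; only the training set is allowed to differ.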

Distribution mismatch: if $P_{\text{train}} \neq P_{\text{dev/test}}$, improvements in training error may not carry over to dev/test error, and so may not reflect real-world improvement; models may overfit to the training distribution.

Training-dev set: a held-out set sampled from the training distribution (not dev/test distribution); used to distinguish variance from mismatch:

Error pattern	Interpretation
Train error ≈ training-dev error ≪ dev error	Mismatch (not variance)
Train error ≪ training-dev error ≈ dev error	Variance (not mismatch)
Train error ≪ training-dev error ≪ dev error	Both variance and mismatch
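
The table can be read as two successive gaps: train to training-dev (same distribution, unseen data) and training-dev to dev (different distribution). The sketch below turns that reading into code. The threshold `tol` that separates "≈" from "≪" is an arbitrary assumption, and `diagnose_gaps` is a hypothetical helper; in practice the threshold depends on the task and on the noise in each estimate.

```python
def diagnose_gaps(train_err, train_dev_err, dev_err, tol=0.01):
    """Label the train -> training-dev -> dev gaps as variance and/or mismatch."""
    variance_gap = train_dev_err - train_err  # same distribution, unseen data
    mismatch_gap = dev_err - train_dev_err    # unseen data, different distribution
    findings = []
    if variance_gap > tol:   # assumed cutoff for a "large" gap
        findings.append("variance")
    if mismatch_gap > tol:
        findings.append("mismatch")
    return {
        "variance_gap": variance_gap,
        "mismatch_gap": mismatch_gap,
        "findings": findings,
    }

# Hypothetical error rates matching the three rows of the table.
row_mismatch = diagnose_gaps(0.01, 0.015, 0.10)
row_variance = diagnose_gaps(0.01, 0.095, 0.10)
row_both = diagnose_gaps(0.01, 0.05, 0.10)
```

An empty `findings` list means neither gap exceeds the threshold; it says nothing about bias, which would need a comparison against a target error level that this sketch does not model.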

Applications

All supervised ML projects; especially important in industry where training data is crawled or synthetic but deployment is user-facing.

Trade-offs

Too-small dev set → noisy model comparisons
Mismatched dev/test → optimizing for the wrong distribution
Leaking test set → overly optimistic reported performance
