VC Dimension
Definition
The Vapnik-Chervonenkis (VC) dimension of a hypothesis class $\mathcal{H} \subseteq \{0,1\}^{\mathcal{X}}$ over input space $\mathcal{X}$ is the size of the largest set $C \subseteq \mathcal{X}$ that $\mathcal{H}$ shatters:

$$\mathrm{VCdim}(\mathcal{H}) = \max\{|C| : \mathcal{H} \text{ shatters } C\}$$

A set $C = \{x_1, \dots, x_m\}$ is shattered by $\mathcal{H}$ if every possible binary labelling of $C$ is realised by some $h \in \mathcal{H}$, i.e. $|\{(h(x_1), \dots, h(x_m)) : h \in \mathcal{H}\}| = 2^m$.
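Shattering can be checked by brute force for small cases. A minimal sketch, assuming threshold classifiers $h_t(x) = \mathbb{1}[x \ge t]$ on the real line (the finite grid of thresholds is an illustrative stand-in for the full class):

```python
def shatters(points, hypotheses):
    """True iff the hypotheses realise all 2^|points| binary labellings."""
    realised = {tuple(h(x) for x in points) for h in hypotheses}
    return len(realised) == 2 ** len(points)

# Threshold classifiers on the line: h_t(x) = 1 if x >= t else 0.
# A finite grid of thresholds is enough to separate finitely many points.
H = [lambda x, t=t / 10: int(x >= t) for t in range(-20, 21)]

print(shatters([0.5], H))       # True: one point gets both labels, so VCdim >= 1
print(shatters([0.3, 0.7], H))  # False: labelling (1, 0) is impossible, so VCdim = 1
```

The second call fails because a threshold can never label a smaller point 1 and a larger point 0, which is exactly why thresholds have VC dimension 1.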
Intuition
VC dimension measures the “expressive capacity” of a hypothesis class: how complex a binary pattern can it learn? If you can find a set of $d$ points that $\mathcal{H}$ labels in all $2^d$ possible ways, then $\mathrm{VCdim}(\mathcal{H}) \ge d$. But if for every set of $d+1$ points there’s some labelling $\mathcal{H}$ can’t produce, then $\mathrm{VCdim}(\mathcal{H}) \le d$.
Higher VC dimension = more expressive = needs more data to learn without overfitting. The VC dimension is the right notion of “degrees of freedom” for a hypothesis class.
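A concrete instance: intervals on $\mathbb{R}$ have VC dimension exactly 2. Any two points can be labelled all four ways, but for three ordered points the labelling $(1, 0, 1)$ is unrealisable. A small sketch (the realisability test for intervals is exact: an interval realises a labelling iff no negative point lies between the positives):

```python
from itertools import product

def interval_realises(points, labels):
    """An interval classifier 1[a <= x <= b] realises `labels` iff no
    negative point lies between the smallest and largest positives."""
    pos = [x for x, y in zip(points, labels) if y == 1]
    if not pos:
        return True  # a degenerate empty interval labels everything 0
    lo, hi = min(pos), max(pos)
    return all(not (lo <= x <= hi) for x, y in zip(points, labels) if y == 0)

def shattered_by_intervals(points):
    return all(interval_realises(points, labels)
               for labels in product([0, 1], repeat=len(points)))

print(shattered_by_intervals([1.0, 2.0]))       # True  -> VCdim >= 2
print(shattered_by_intervals([1.0, 2.0, 3.0]))  # False -> VCdim < 3
```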
Formal Description
Growth function $\tau_{\mathcal{H}}(m)$: the maximum number of distinct labellings that $\mathcal{H}$ can produce on any $m$ points:

$$\tau_{\mathcal{H}}(m) = \max_{C \subseteq \mathcal{X},\, |C| = m} \left|\{(h(x_1), \dots, h(x_m)) : h \in \mathcal{H}\}\right|$$

Sauer-Shelah lemma: if $\mathrm{VCdim}(\mathcal{H}) = d$, then:

$$\tau_{\mathcal{H}}(m) \le \sum_{i=0}^{d} \binom{m}{i} \le \left(\frac{em}{d}\right)^d \quad \text{for } m \ge d$$
This transitions from exponential growth ($\tau_{\mathcal{H}}(m) = 2^m$ when $m \le d$) to polynomial growth ($O(m^d)$ when $m > d$). The key insight: once you have more data than the VC dimension, the class is “effectively finite”, enabling generalisation.
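The transition is visible numerically. For intervals on the line ($d = 2$), a brute-force count of the distinct labellings of $m$ collinear points can be compared against the Sauer-Shelah sum and $2^m$ (a small sketch; for this particular class the bound happens to hold with equality):

```python
from math import comb

def tau(m):
    """Distinct labellings of m ordered points by interval classifiers."""
    labellings = {(0,) * m}  # the empty interval labels everything 0
    for i in range(m):
        for j in range(i, m):
            labellings.add(tuple(int(i <= k <= j) for k in range(m)))
    return len(labellings)

d = 2  # VC dimension of intervals on the real line
for m in range(1, 7):
    sauer = sum(comb(m, i) for i in range(d + 1))
    print(f"m={m}: tau={tau(m)}  sauer_bound={sauer}  2^m={2 ** m}")
```

For $m \le 2$ the count matches $2^m$; from $m = 3$ onward it grows only quadratically while $2^m$ keeps doubling.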
Fundamental theorem of statistical learning: $\mathcal{H}$ is agnostic PAC learnable if and only if $\mathrm{VCdim}(\mathcal{H}) < \infty$. The sample complexity is:

$$m(\epsilon, \delta) = \Theta\left(\frac{d + \log(1/\delta)}{\epsilon^2}\right)$$

where $d = \mathrm{VCdim}(\mathcal{H})$.
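Plugging numbers into the $\Theta$ expression gives a feel for the scaling. The theorem leaves the constant unspecified; `C = 1.0` below is an arbitrary illustrative choice, so the absolute values are not meaningful, only the ratios:

```python
from math import ceil, log

def sample_complexity(d, eps, delta, C=1.0):
    """m = C * (d + ln(1/delta)) / eps^2, the agnostic PAC scaling.
    C stands in for the theorem's unspecified constant."""
    return ceil(C * (d + log(1 / delta)) / eps ** 2)

# Halving eps quadruples the sample requirement; doubling d roughly doubles it.
print(sample_complexity(d=10, eps=0.10, delta=0.05))  # 1300
print(sample_complexity(d=10, eps=0.05, delta=0.05))  # 5199
print(sample_complexity(d=20, eps=0.10, delta=0.05))  # 2300
```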
Standard VC dimensions:
| Hypothesis class | VC dimension |
|---|---|
| Halfspaces in $\mathbb{R}^n$ (with bias) | $n + 1$ |
| Axis-aligned rectangles in $\mathbb{R}^2$ | $4$ |
| Polynomials of degree $\le n$ in $\mathbb{R}$ (sign-thresholded) | $n + 1$ |
| Neural net, one hidden layer of threshold units, $W$ parameters | $O(W \log W)$ |
| Finite class $\mathcal{H}$ | $\le \log_2 \lvert\mathcal{H}\rvert$ |
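The rectangle entry can be verified by brute force, because the realisability test for axis-aligned rectangles is exact: a labelling is achievable iff the bounding box of the positive points contains no negative point. A sketch using the standard “diamond” configuration of four shattered points:

```python
from itertools import product

def rect_realises(points, labels):
    """An axis-aligned rectangle realises `labels` iff the bounding box
    of the positive points contains no negative point."""
    pos = [p for p, y in zip(points, labels) if y == 1]
    if not pos:
        return True  # an empty rectangle labels everything 0
    xs, ys = [p[0] for p in pos], [p[1] for p in pos]
    x0, x1, y0, y1 = min(xs), max(xs), min(ys), max(ys)
    return all(not (x0 <= p[0] <= x1 and y0 <= p[1] <= y1)
               for p, y in zip(points, labels) if y == 0)

def shattered(points):
    return all(rect_realises(points, labels)
               for labels in product([0, 1], repeat=len(points)))

diamond = [(0, 1), (0, -1), (1, 0), (-1, 0)]
print(shattered(diamond))             # True -> VCdim >= 4
print(shattered(diamond + [(0, 0)]))  # False: outer points 1, centre 0 fails
```

Adding the centre point breaks shattering because any rectangle containing the four outer points must also contain the centre; in fact no set of 5 points can be shattered, so the VC dimension is exactly 4.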
Infinite VC dimension: if no finite shattered set size is maximal ($\mathrm{VCdim}(\mathcal{H}) = \infty$), $\mathcal{H}$ is not PAC learnable (e.g., the class of all functions $\mathcal{X} \to \{0, 1\}$ on an infinite domain).
VC bound: with $m$ samples and $\mathrm{VCdim}(\mathcal{H}) = d$, with probability $\ge 1 - \delta$, for every $h \in \mathcal{H}$:

$$L_{\mathcal{D}}(h) \le L_S(h) + \sqrt{\frac{d\left(\ln(2m/d) + 1\right) + \ln(4/\delta)}{m}}$$

(one standard form; the constants differ across derivations), where $L_S$ is the empirical error and $L_{\mathcal{D}}$ the true error.
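Evaluating a standard VC gap term $\sqrt{(d(\ln(2m/d) + 1) + \ln(4/\delta))/m}$ numerically shows how it shrinks with $m$. Constants differ across derivations, so treat the values as orders of magnitude rather than tight guarantees:

```python
from math import log, sqrt

def vc_gap(d, m, delta):
    """Gap term sqrt((d*(ln(2m/d) + 1) + ln(4/delta)) / m)."""
    return sqrt((d * (log(2 * m / d) + 1) + log(4 / delta)) / m)

# The gap shrinks roughly like sqrt(d log m / m): more data, tighter bound.
for m in (1_000, 10_000, 100_000):
    print(f"m={m}: gap ~ {vc_gap(d=50, m=m, delta=0.05):.3f}")
```

With $d = 50$, the gap is near-vacuous at a thousand samples and only becomes informative in the tens of thousands, which illustrates why high-capacity classes need a lot of data under worst-case analysis.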
Applications
| Application | Role of VC dimension |
|---|---|
| Model selection | Higher-capacity models (larger $\mathrm{VCdim}$) require more data |
| SVM generalisation | Support vector margin theory bounds generalisation via a normalised VC measure |
| Neural network theory | Networks with $W$ parameters have VC dimension on the order of $W \log W$ (threshold units), yet in practice generalise much better than this bound implies |
| Regularization theory | Constraining model complexity ↔ reducing effective VC dimension |
Trade-offs
- VC dimension is a worst-case measure; it ignores the actual data distribution. Rademacher complexity and PAC-Bayes give tighter distribution-dependent bounds.
- Modern deep networks have enormous VC dimension yet generalise well — classical VC theory doesn’t fully explain this; double-descent and implicit regularisation are active research areas.
- Computing VC dimension exactly is often NP-hard; it is generally used as a theoretical tool rather than a practical quantity.
Links
- PAC Learning — PAC learnability ↔ finite VC dimension
- Generalization Bounds and Rademacher Complexity — tighter, data-dependent bounds
- Bias-Variance Analysis — practical analogue; VC dimension ↔ model complexity
- Probability Theory — Sauer-Shelah lemma uses combinatorial probability