VC Dimension
Definition
The Vapnik-Chervonenkis (VC) dimension of a hypothesis class $\mathcal{H} \subseteq \{0,1\}^{\mathcal{X}}$ over input space $\mathcal{X}$ is the size of the largest set $C \subseteq \mathcal{X}$ that $\mathcal{H}$ shatters:

$$\mathrm{VCdim}(\mathcal{H}) = \max\{|C| : \mathcal{H} \text{ shatters } C\}$$

A set $C = \{x_1, \dots, x_m\}$ is shattered by $\mathcal{H}$ if every possible binary labelling of $C$ is realised by some $h \in \mathcal{H}$, i.e. $|\{(h(x_1), \dots, h(x_m)) : h \in \mathcal{H}\}| = 2^m$.
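Shattering can be checked by brute force for small cases. A minimal sketch, assuming threshold classifiers $h_t(x) = \mathbb{1}[x \ge t]$ on the real line (the finite grid of thresholds is an illustrative stand-in for the full class):

```python
def shatters(points, hypotheses):
    """True iff the hypotheses realise all 2^|points| binary labellings."""
    realised = {tuple(h(x) for x in points) for h in hypotheses}
    return len(realised) == 2 ** len(points)

# Threshold classifiers on the line: h_t(x) = 1 if x >= t else 0.
# A finite grid of thresholds is enough to separate finitely many points.
H = [lambda x, t=t / 10: int(x >= t) for t in range(-20, 21)]

print(shatters([0.5], H))       # True: one point gets both labels, so VCdim >= 1
print(shatters([0.3, 0.7], H))  # False: labelling (1, 0) is impossible, so VCdim = 1
```

The second call fails because a threshold can never label a smaller point 1 and a larger point 0, which is exactly why thresholds have VC dimension 1.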
Intuition
VC dimension measures the “expressive capacity” of a hypothesis class: how complex a binary pattern can it learn? If you can find a set of $d$ points that $\mathcal{H}$ labels in all $2^d$ possible ways, then $\mathrm{VCdim}(\mathcal{H}) \ge d$. But if for every set of $d+1$ points there’s some labelling $\mathcal{H}$ can’t produce, then $\mathrm{VCdim}(\mathcal{H}) \le d$.
Higher VC dimension = more expressive = needs more data to learn without overfitting. The VC dimension is the right notion of “degrees of freedom” for a hypothesis class.
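A concrete instance: intervals on $\mathbb{R}$ have VC dimension exactly 2. Any two points can be labelled all four ways, but for three ordered points the labelling $(1, 0, 1)$ is unrealisable. A small sketch (the realisability test for intervals is exact: an interval realises a labelling iff no negative point lies between the positives):

```python
from itertools import product

def interval_realises(points, labels):
    """An interval classifier 1[a <= x <= b] realises `labels` iff no
    negative point lies between the smallest and largest positives."""
    pos = [x for x, y in zip(points, labels) if y == 1]
    if not pos:
        return True  # a degenerate empty interval labels everything 0
    lo, hi = min(pos), max(pos)
    return all(not (lo <= x <= hi) for x, y in zip(points, labels) if y == 0)

def shattered_by_intervals(points):
    return all(interval_realises(points, labels)
               for labels in product([0, 1], repeat=len(points)))

print(shattered_by_intervals([1.0, 2.0]))       # True  -> VCdim >= 2
print(shattered_by_intervals([1.0, 2.0, 3.0]))  # False -> VCdim < 3
```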
Formal Description
Growth function $\tau_{\mathcal{H}}(m)$: the maximum number of distinct labellings that $\mathcal{H}$ can produce on any $m$ points:

$$\tau_{\mathcal{H}}(m) = \max_{C \subseteq \mathcal{X},\, |C| = m} \left|\{(h(x_1), \dots, h(x_m)) : h \in \mathcal{H}\}\right|$$

Sauer-Shelah lemma: if $\mathrm{VCdim}(\mathcal{H}) = d$, then:

$$\tau_{\mathcal{H}}(m) \le \sum_{i=0}^{d} \binom{m}{i} \le \left(\frac{em}{d}\right)^d \quad \text{for } m \ge d$$
This transitions from exponential growth ($\tau_{\mathcal{H}}(m) = 2^m$ when $m \le d$) to polynomial growth ($O(m^d)$ when $m > d$). The key insight: once you have more data than the VC dimension, the class is “effectively finite”, enabling generalisation.
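The transition is visible numerically. For intervals on the line ($d = 2$), a brute-force count of the distinct labellings of $m$ collinear points can be compared against the Sauer-Shelah sum and $2^m$ (a small sketch; for this particular class the bound happens to hold with equality):

```python
from math import comb

def tau(m):
    """Distinct labellings of m ordered points by interval classifiers."""
    labellings = {(0,) * m}  # the empty interval labels everything 0
    for i in range(m):
        for j in range(i, m):
            labellings.add(tuple(int(i <= k <= j) for k in range(m)))
    return len(labellings)

d = 2  # VC dimension of intervals on the real line
for m in range(1, 7):
    sauer = sum(comb(m, i) for i in range(d + 1))
    print(f"m={m}: tau={tau(m)}  sauer_bound={sauer}  2^m={2 ** m}")
```

For $m \le 2$ the count matches $2^m$; from $m = 3$ onward it grows only quadratically while $2^m$ keeps doubling.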
Fundamental theorem of statistical learning: $\mathcal{H}$ is agnostic PAC learnable if and only if $\mathrm{VCdim}(\mathcal{H}) < \infty$. The sample complexity is:

$$m(\epsilon, \delta) = \Theta\left(\frac{d + \log(1/\delta)}{\epsilon^2}\right)$$

where $d = \mathrm{VCdim}(\mathcal{H})$.
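Plugging numbers into the $\Theta$ expression gives a feel for the scaling. The theorem leaves the constant unspecified; `C = 1.0` below is an arbitrary illustrative choice, so the absolute values are not meaningful, only the ratios:

```python
from math import ceil, log

def sample_complexity(d, eps, delta, C=1.0):
    """m = C * (d + ln(1/delta)) / eps^2, the agnostic PAC scaling.
    C stands in for the theorem's unspecified constant."""
    return ceil(C * (d + log(1 / delta)) / eps ** 2)

# Halving eps quadruples the sample requirement; doubling d roughly doubles it.
print(sample_complexity(d=10, eps=0.10, delta=0.05))  # 1300
print(sample_complexity(d=10, eps=0.05, delta=0.05))  # 5199
print(sample_complexity(d=20, eps=0.10, delta=0.05))  # 2300
```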
Standard VC dimensions:
| Hypothesis class | VC dimension |
|---|---|
| Halfspaces in $\mathbb{R}^n$ (with bias) | $n + 1$ |
| Axis-aligned rectangles in $\mathbb{R}^2$ | $4$ |
| Polynomials of degree $\le n$ in $\mathbb{R}$ (sign-thresholded) | $n + 1$ |
| Neural net, one hidden layer of threshold units, $W$ parameters | $O(W \log W)$ |
| Finite class $\mathcal{H}$ | $\le \log_2 \lvert\mathcal{H}\rvert$ |
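The rectangle entry can be verified by brute force, because the realisability test for axis-aligned rectangles is exact: a labelling is achievable iff the bounding box of the positive points contains no negative point. A sketch using the standard “diamond” configuration of four shattered points:

```python
from itertools import product

def rect_realises(points, labels):
    """An axis-aligned rectangle realises `labels` iff the bounding box
    of the positive points contains no negative point."""
    pos = [p for p, y in zip(points, labels) if y == 1]
    if not pos:
        return True  # an empty rectangle labels everything 0
    xs, ys = [p[0] for p in pos], [p[1] for p in pos]
    x0, x1, y0, y1 = min(xs), max(xs), min(ys), max(ys)
    return all(not (x0 <= p[0] <= x1 and y0 <= p[1] <= y1)
               for p, y in zip(points, labels) if y == 0)

def shattered(points):
    return all(rect_realises(points, labels)
               for labels in product([0, 1], repeat=len(points)))

diamond = [(0, 1), (0, -1), (1, 0), (-1, 0)]
print(shattered(diamond))             # True -> VCdim >= 4
print(shattered(diamond + [(0, 0)]))  # False: outer points 1, centre 0 fails
```

Adding the centre point breaks shattering because any rectangle containing the four outer points must also contain the centre; in fact no set of 5 points can be shattered, so the VC dimension is exactly 4.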
Infinite VC dimension: if no finite shattered set size is maximal ($\mathrm{VCdim}(\mathcal{H}) = \infty$), $\mathcal{H}$ is not PAC learnable (e.g., the class of all functions $\mathcal{X} \to \{0, 1\}$ on an infinite domain).
VC bound: with $m$ samples and $\mathrm{VCdim}(\mathcal{H}) = d$, with probability $\ge 1 - \delta$, for every $h \in \mathcal{H}$:

$$L_{\mathcal{D}}(h) \le L_S(h) + \sqrt{\frac{d\left(\ln(2m/d) + 1\right) + \ln(4/\delta)}{m}}$$

(one standard form; the constants differ across derivations), where $L_S$ is the empirical error and $L_{\mathcal{D}}$ the true error.
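Evaluating a standard VC gap term $\sqrt{(d(\ln(2m/d) + 1) + \ln(4/\delta))/m}$ numerically shows how it shrinks with $m$. Constants differ across derivations, so treat the values as orders of magnitude rather than tight guarantees:

```python
from math import log, sqrt

def vc_gap(d, m, delta):
    """Gap term sqrt((d*(ln(2m/d) + 1) + ln(4/delta)) / m)."""
    return sqrt((d * (log(2 * m / d) + 1) + log(4 / delta)) / m)

# The gap shrinks roughly like sqrt(d log m / m): more data, tighter bound.
for m in (1_000, 10_000, 100_000):
    print(f"m={m}: gap ~ {vc_gap(d=50, m=m, delta=0.05):.3f}")
```

With $d = 50$, the gap is near-vacuous at a thousand samples and only becomes informative in the tens of thousands, which illustrates why high-capacity classes need a lot of data under worst-case analysis.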
Applications
| Application | Role of VC dimension |
|---|---|
| Model selection | Higher-capacity models (larger $\mathrm{VCdim}$) require more data |
| SVM generalisation | Support vector margin theory bounds generalisation via a normalised VC measure |
| Neural network theory | Networks with $W$ parameters have VC dimension on the order of $W \log W$ (threshold units), yet in practice generalise much better than this bound implies |
| Regularization theory | Constraining model complexity ↔ reducing effective VC dimension |
Trade-offs
- VC dimension is a worst-case measure; it ignores the actual data distribution. Rademacher complexity and PAC-Bayes give tighter distribution-dependent bounds.
- Modern deep networks have enormous VC dimension yet generalise well — classical VC theory doesn’t fully explain this; double-descent and implicit regularisation are active research areas.
- Computing VC dimension exactly is often NP-hard; it is generally used as a theoretical tool rather than a practical quantity.
Links
- PAC Learning — PAC learnability ↔ finite VC dimension
- Generalization Bounds and Rademacher Complexity — tighter, data-dependent bounds
- Bias-Variance Analysis — practical analogue; VC dimension ↔ model complexity
- Probability Theory — Sauer-Shelah lemma uses combinatorial probability