Generalization Bounds and Rademacher Complexity

Definition

Generalization bound: a probabilistic upper bound on the true risk $R(h)$ of a learned hypothesis $h$, given only the empirical risk $\hat{R}(h)$ on training data.

Rademacher complexity of a hypothesis class $\mathcal{H}$ with respect to a sample $S = (x_1, \dots, x_m)$:

$$\hat{\mathcal{R}}_S(\mathcal{H}) = \mathbb{E}_{\sigma}\left[\sup_{h \in \mathcal{H}} \frac{1}{m} \sum_{i=1}^{m} \sigma_i \, h(x_i)\right]$$

where $\sigma_1, \dots, \sigma_m$ are i.i.d. Rademacher (uniform ±1) random variables.

$\mathcal{R}_m(\mathcal{H}) = \mathbb{E}_{S \sim \mathcal{D}^m}\big[\hat{\mathcal{R}}_S(\mathcal{H})\big]$ is the expected Rademacher complexity.
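The definition above can be estimated by Monte Carlo: draw random sign vectors and record the best correlation any hypothesis achieves. A minimal sketch, assuming a finite class of 1-D threshold classifiers $h_t(x) = \mathrm{sign}(x - t)$ (the sample, class, and function names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
m = 50
xs = rng.uniform(0, 1, size=m)          # sample S
thresholds = np.linspace(0, 1, 21)      # finite class: h_t(x) = sign(x - t)

def empirical_rademacher(xs, thresholds, n_trials=2000):
    # predictions of every hypothesis on the sample: shape (|H|, m)
    preds = np.where(xs[None, :] >= thresholds[:, None], 1.0, -1.0)
    total = 0.0
    for _ in range(n_trials):
        sigma = rng.choice([-1.0, 1.0], size=len(xs))  # Rademacher signs
        # sup over h of (1/m) * sum_i sigma_i * h(x_i)
        total += np.max(preds @ sigma) / len(xs)
    return total / n_trials

print(round(empirical_rademacher(xs, thresholds), 3))
```

For a finite class, Massart's lemma predicts a value on the order of $\sqrt{2\log|\mathcal{H}|/m}$, which the estimate should roughly match.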

Intuition

The generalization gap ($R(h) - \hat{R}(h)$) is bounded by how well the hypothesis class can “fit random noise”. Rademacher complexity measures exactly this: how much correlation can the best $h \in \mathcal{H}$ achieve with a random labelling? If the class is so expressive that it can fit any labelling, the Rademacher complexity is 1: the class is memorising noise and will not generalise.

This is a data-dependent, distribution-sensitive bound — tighter than worst-case VC bounds.

Formal Description

Empirical risk minimization (ERM) bound: for ERM on a finite class $\mathcal{H}$ with $m$ samples, with probability $\ge 1 - \delta$:

$$R(h) \le \hat{R}(h) + \sqrt{\frac{\log|\mathcal{H}| + \log(2/\delta)}{2m}}$$
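The finite-class gap term is simple arithmetic; a quick check, with $|\mathcal{H}| = 1000$, $m = 10{,}000$, $\delta = 0.05$ chosen purely for illustration:

```python
import math

def finite_class_gap(h_size, m, delta):
    # sqrt((log|H| + log(2/delta)) / (2m))
    return math.sqrt((math.log(h_size) + math.log(2 / delta)) / (2 * m))

print(round(finite_class_gap(1000, 10_000, 0.05), 4))  # → prints 0.023
```

Note the logarithmic dependence on $|\mathcal{H}|$: squaring the class size only adds a constant inside the square root.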
Rademacher generalisation bound: for any $h \in \mathcal{H}$, with probability $\ge 1 - \delta$:

$$R(h) \le \hat{R}(h) + 2\,\mathcal{R}_m(\mathcal{H}) + \sqrt{\frac{\log(1/\delta)}{2m}}$$
Uniform convergence: $\mathcal{H}$ satisfies uniform convergence if for all $\epsilon > 0$:

$$\Pr\left[\sup_{h \in \mathcal{H}} \big|R(h) - \hat{R}(h)\big| > \epsilon\right] \to 0$$

when $m \to \infty$. This is the key condition for ERM to work.
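Uniform convergence can be observed directly for a small class: the worst-case deviation between empirical and true means shrinks as the sample grows. A sketch using the same illustrative threshold class, where the true mean of $h_t(x) = \mathbb{1}[x \ge t]$ under $x \sim \mathrm{Uniform}(0,1)$ is $1 - t$:

```python
import numpy as np

rng = np.random.default_rng(4)
thresholds = np.linspace(0, 1, 21)       # finite class of 1-D thresholds

def sup_deviation(m, trials=200):
    # average over trials of sup_h |true mean - empirical mean|
    devs = []
    for _ in range(trials):
        xs = rng.uniform(0, 1, size=m)
        emp = np.mean(xs[None, :] >= thresholds[:, None], axis=1)
        devs.append(np.max(np.abs(emp - (1 - thresholds))))
    return float(np.mean(devs))

print(round(sup_deviation(100), 3), round(sup_deviation(10_000), 3))
```

The second number should be roughly $10\times$ smaller than the first, consistent with a $1/\sqrt{m}$ rate.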

McDiarmid’s inequality (stability): if replacing any single sample changes the empirical risk by at most $c$, then by concentration:

$$\Pr\left[\big|\hat{R}(h) - \mathbb{E}[\hat{R}(h)]\big| \ge \epsilon\right] \le 2\exp\left(-\frac{2\epsilon^2}{m c^2}\right)$$
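For an empirical risk that is a mean of $m$ losses in $[0,1]$, $c = 1/m$ and the bound becomes $2\exp(-2m\epsilon^2)$ (Hoeffding's inequality). A quick simulation comparing the bound to the observed deviation frequency, assuming Bernoulli losses with a made-up true risk of 0.3:

```python
import numpy as np

rng = np.random.default_rng(1)
m, eps, trials = 200, 0.05, 20_000
p = 0.3                                   # true risk of a fixed hypothesis

# empirical risks over many independent samples (0/1 losses -> binomial mean)
risks = rng.binomial(m, p, size=trials) / m
observed = np.mean(np.abs(risks - p) >= eps)

# McDiarmid with c = 1/m: 2 * exp(-2 * m * eps^2)
bound = 2 * np.exp(-2 * m * eps**2)
print(round(float(observed), 3), round(float(bound), 3))
```

The bound holds but is loose, which is typical of worst-case concentration inequalities.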
Rademacher complexity of linear classes: for $\mathcal{H} = \{x \mapsto \langle w, x \rangle : \|w\|_2 \le B\}$ with $\max_i \|x_i\|_2 \le R$:

$$\hat{\mathcal{R}}_S(\mathcal{H}) \le \frac{BR}{\sqrt{m}}$$
This shows that larger weight norms or larger input norms inflate the complexity, while more data shrinks it.
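The $BR/\sqrt{m}$ bound can be checked numerically: for the linear class, the sup over the ball has a closed form, $\sup_{\|w\| \le B} \frac{1}{m}\sum_i \sigma_i \langle w, x_i \rangle = \frac{B}{m}\|\sum_i \sigma_i x_i\|$. A sketch with illustrative constants ($B = 2$, inputs normalised so $R = 1$):

```python
import numpy as np

rng = np.random.default_rng(2)
m, d, B = 100, 5, 2.0
X = rng.normal(size=(m, d))
X = X / np.linalg.norm(X, axis=1, keepdims=True)   # force ||x_i|| = 1, so R = 1
R = 1.0

# sup_{||w|| <= B} (1/m) sum_i sigma_i <w, x_i> = (B/m) * ||sum_i sigma_i x_i||
estimates = []
for _ in range(5000):
    sigma = rng.choice([-1.0, 1.0], size=m)
    estimates.append(B / m * np.linalg.norm(sigma @ X))
rad = float(np.mean(estimates))

print(round(rad, 3), round(B * R / np.sqrt(m), 3))  # estimate vs bound BR/sqrt(m)
```

The estimate should sit just below the bound; the bound comes from Jensen's inequality applied to $\mathbb{E}\|\sum_i \sigma_i x_i\| \le \sqrt{\sum_i \|x_i\|^2}$, so it is nearly tight here.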

Structural risk minimization (SRM): instead of fixing $\mathcal{H}$, consider a hierarchy $\mathcal{H}_1 \subseteq \mathcal{H}_2 \subseteq \cdots$. For each level $k$, add a complexity penalty proportional to $\mathcal{R}_m(\mathcal{H}_k)$. Minimise the penalised objective:

$$\hat{h} = \arg\min_{k,\; h \in \mathcal{H}_k} \left[\hat{R}(h) + \mathrm{penalty}(\mathcal{H}_k, m)\right]$$
This formalises the bias-variance tradeoff: too simple (high bias, low complexity) vs too complex (low bias, high variance).
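SRM can be sketched as explicit model selection over a nested hierarchy; here a toy version over polynomial-degree classes, with a penalty of the form $\sqrt{k/m}$ standing in for the Rademacher penalty (the penalty's constants and the data-generating process are illustrative, not from a theorem):

```python
import numpy as np

rng = np.random.default_rng(3)
m = 60
x = rng.uniform(-1, 1, size=m)
y = np.sin(2 * x) + 0.3 * rng.normal(size=m)   # noisy target

def empirical_risk(k):
    # least-squares fit with a degree-k polynomial, squared-error risk
    coeffs = np.polyfit(x, y, k)
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))

# penalised objective: R_hat(h_k) + penalty(k, m), minimised over levels k
scores = {k: empirical_risk(k) + np.sqrt(k / m) for k in range(1, 10)}
best_k = min(scores, key=scores.get)
print(best_k)
```

Empirical risk alone is non-increasing in $k$ (the classes are nested), so without the penalty the arg-min would always pick the most complex level; the penalty is what makes the selection meaningful.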

PAC-Bayes bounds: for a posterior $Q$ over $\mathcal{H}$ and prior $P$, with probability $\ge 1 - \delta$:

$$\mathbb{E}_{h \sim Q}[R(h)] \le \mathbb{E}_{h \sim Q}[\hat{R}(h)] + \sqrt{\frac{\mathrm{KL}(Q \,\|\, P) + \log(m/\delta)}{2(m-1)}}$$
PAC-Bayes bounds are often the tightest available for deep networks (when the prior $P$ is chosen carefully).
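The PAC-Bayes gap term is simple arithmetic once $\mathrm{KL}(Q\|P)$ is known; for isotropic Gaussian posterior and prior the KL has a closed form. A sketch with illustrative dimensions and scales (none of these values come from a real network):

```python
import math

def gaussian_kl(mu_norm, sigma_q, sigma_p, d):
    # KL(N(mu, sigma_q^2 I) || N(0, sigma_p^2 I)) in d dims, with ||mu|| = mu_norm
    return 0.5 * (d * sigma_q**2 / sigma_p**2
                  + mu_norm**2 / sigma_p**2
                  - d
                  + d * math.log(sigma_p**2 / sigma_q**2))

def pac_bayes_gap(kl, m, delta):
    # sqrt((KL(Q||P) + log(m/delta)) / (2(m-1)))
    return math.sqrt((kl + math.log(m / delta)) / (2 * (m - 1)))

kl = gaussian_kl(mu_norm=0.5, sigma_q=0.1, sigma_p=1.0, d=100)
print(round(pac_bayes_gap(kl, m=10_000, delta=0.05), 4))  # → prints 0.0983
```

Note how the KL term rewards posteriors that stay close to the prior: shrinking $\sigma_q$ or moving the posterior mean away from the prior inflates the gap.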

Applications

| Concept | How bounds apply |
| --- | --- |
| Regularization ($L_1$, $L_2$) | Constrains weight norms → smaller Rademacher complexity |
| Dropout | Effectively constrains hypothesis class capacity at training time |
| Early stopping | Limits effective complexity of the gradient descent trajectory |
| Cross-validation | Empirical proxy for true risk; formal bound via $k$-splits |
| Deep network theory | PAC-Bayes and margin bounds partially explain generalisation |

Trade-offs

  • Classical VC and Rademacher bounds are often vacuous for deep networks (the bound > 1); they serve as theoretical insight rather than practical guidance.
  • Data-dependent Rademacher bounds are tighter than VC bounds but still require knowing $\mathcal{H}$ exactly.
  • PAC-Bayes bounds require choosing a prior $P$; with a good prior (matching the inductive bias), they can be quite tight.
  • The i.i.d. assumption is fundamental; distribution shift invalidates all standard generalisation bounds.