# Generalization Bounds and Rademacher Complexity

## Definition
Generalization bound: a probabilistic upper bound on the true risk $R(h)$ of a learned hypothesis $h$, given only its empirical risk $\hat{R}(h)$ on the training data.
Rademacher complexity of a hypothesis class $\mathcal{H}$ with respect to a sample $S = (x_1, \dots, x_m)$:

$$\hat{\mathfrak{R}}_S(\mathcal{H}) = \mathbb{E}_{\sigma}\left[\sup_{h \in \mathcal{H}} \frac{1}{m} \sum_{i=1}^{m} \sigma_i h(x_i)\right]$$

where $\sigma_1, \dots, \sigma_m$ are i.i.d. Rademacher (uniform $\pm 1$) random variables. $\mathfrak{R}_m(\mathcal{H}) = \mathbb{E}_{S \sim \mathcal{D}^m}[\hat{\mathfrak{R}}_S(\mathcal{H})]$ is the expected Rademacher complexity.
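As a sanity check, the empirical Rademacher complexity of a small finite class can be estimated by Monte Carlo over random sign vectors. A minimal NumPy sketch (the threshold class, sample points, and draw count below are illustrative choices, not from the text):

```python
import numpy as np

def empirical_rademacher(predictions, n_draws=20000, seed=0):
    """Monte Carlo estimate of the empirical Rademacher complexity.

    predictions: (|H|, m) array with predictions[h, i] = h(x_i) in {-1, +1}.
    Averages sup_h (1/m) sum_i sigma_i h(x_i) over random sign vectors sigma.
    """
    rng = np.random.default_rng(seed)
    n_hyp, m = predictions.shape
    total = 0.0
    for _ in range(n_draws):
        sigma = rng.choice([-1.0, 1.0], size=m)
        total += np.max(predictions @ sigma) / m
    return total / n_draws

# Hypothetical finite class: 1-D threshold classifiers h_t(x) = sign(x - t).
x = np.linspace(-1, 1, 20)                  # m = 20 sample points
thresholds = np.linspace(-1.1, 1.1, 12)     # |H| = 12 hypotheses
preds = np.sign(x[None, :] - thresholds[:, None])
print(f"estimated Rademacher complexity: {empirical_rademacher(preds):.3f}")
```

The estimate sits well below 1 because 12 step functions can only weakly track a random labelling of 20 points.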
## Intuition
The generalization gap $R(h) - \hat{R}(h)$ is bounded by how well the hypothesis class can “fit random noise”. Rademacher complexity measures exactly this: how much correlation can the best $h \in \mathcal{H}$ achieve with a random labelling? If the class is so expressive that it can fit any labelling, the Rademacher complexity is 1: the class is memorising noise and will not generalise.
This is a data-dependent, distribution-sensitive bound — tighter than worst-case VC bounds.
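The “fits any labelling” extreme can be checked directly: if the class realises all $2^m$ sign patterns on the sample, some hypothesis matches the random labels exactly on every draw, so the supremum correlation is always 1. A tiny illustrative check ($m = 4$ is an arbitrary choice):

```python
import itertools
import numpy as np

m = 4
# All 2^m possible labelings of the sample: a maximally expressive class.
all_patterns = np.array(list(itertools.product([-1.0, 1.0], repeat=m)))

rng = np.random.default_rng(0)
vals = []
for _ in range(5000):
    sigma = rng.choice([-1.0, 1.0], size=m)
    vals.append(np.max(all_patterns @ sigma) / m)  # sup_h correlation with sigma

print(np.mean(vals))  # 1.0: some pattern equals sigma coordinate-wise
```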
## Formal Description
Empirical risk minimization (ERM) bound: for ERM over a finite class $\mathcal{H}$ with $m$ samples, with probability at least $1 - \delta$, for every $h \in \mathcal{H}$:

$$R(h) \le \hat{R}(h) + \sqrt{\frac{\log |\mathcal{H}| + \log(1/\delta)}{2m}}$$
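Plugging numbers into the finite-class bound $R(h) \le \hat{R}(h) + \sqrt{(\log|\mathcal{H}| + \log(1/\delta))/(2m)}$ shows how the slack scales; the figures below (class size, sample count, empirical risk) are invented for illustration:

```python
import math

def erm_finite_bound(emp_risk, n_hypotheses, m, delta):
    """Finite-class bound: R(h) <= emp_risk + sqrt((log|H| + log(1/delta)) / (2m))."""
    return emp_risk + math.sqrt((math.log(n_hypotheses) + math.log(1 / delta)) / (2 * m))

# Hypothetical numbers: 1000 hypotheses, 10,000 samples, 95% confidence.
bound = erm_finite_bound(emp_risk=0.08, n_hypotheses=1000, m=10_000, delta=0.05)
print(f"true risk <= {bound:.4f} with probability >= 0.95")
```

Note the slack only grows logarithmically in $|\mathcal{H}|$ but shrinks as $1/\sqrt{m}$.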
Rademacher generalisation bound: for any $h \in \mathcal{H}$, with probability at least $1 - \delta$:

$$R(h) \le \hat{R}(h) + 2\mathfrak{R}_m(\mathcal{H}) + \sqrt{\frac{\log(1/\delta)}{2m}}$$
Uniform convergence: $\mathcal{H}$ satisfies uniform convergence if for all $\epsilon, \delta > 0$:

$$\Pr\left[\sup_{h \in \mathcal{H}} \left| R(h) - \hat{R}(h) \right| > \epsilon \right] \le \delta$$

when $m \ge m_{\mathcal{H}}(\epsilon, \delta)$. This is the key condition for ERM to work.
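For a finite class, inverting the Hoeffding-plus-union-bound argument gives the sample size $m \ge (\log|\mathcal{H}| + \log(2/\delta))/(2\epsilon^2)$ needed for uniform convergence; a small calculator sketch (two-sided deviation, hence the $2/\delta$; the numbers are hypothetical):

```python
import math

def sample_complexity(n_hypotheses, epsilon, delta):
    """Samples needed so that sup_h |R(h) - R_hat(h)| <= epsilon w.p. >= 1 - delta,
    via Hoeffding plus a union bound over a finite class (two-sided deviations)."""
    return math.ceil((math.log(n_hypotheses) + math.log(2 / delta)) / (2 * epsilon ** 2))

# Hypothetical setting: |H| = 10^6, epsilon = 0.01, delta = 0.05.
print(sample_complexity(10**6, 0.01, 0.05))
```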
McDiarmid’s inequality (stability): if replacing any single sample changes a function $f(S)$ of the sample (e.g. the empirical risk) by at most $c_i$, then by concentration:

$$\Pr\left[ f(S) - \mathbb{E}[f(S)] \ge \epsilon \right] \le \exp\left( \frac{-2\epsilon^2}{\sum_{i=1}^{m} c_i^2} \right)$$
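For the empirical risk with a loss in $[0, 1]$, swapping one sample moves the average by at most $c_i = 1/m$, so McDiarmid reduces to a Hoeffding-style tail. A quick numeric check with hypothetical numbers:

```python
import math

def mcdiarmid_tail(epsilon, c):
    """McDiarmid tail: P[f(S) - E f(S) >= eps] <= exp(-2 eps^2 / sum_i c_i^2)."""
    return math.exp(-2 * epsilon ** 2 / sum(ci ** 2 for ci in c))

# Empirical risk with 0/1 loss on m samples: each c_i = 1/m.
m = 1000
p = mcdiarmid_tail(epsilon=0.05, c=[1 / m] * m)
print(f"P[deviation >= 0.05] <= {p:.4f}")
```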
Rademacher complexity of linear classes: for $\mathcal{H} = \{ x \mapsto \langle w, x \rangle : \|w\|_2 \le B \}$ with $\|x_i\|_2 \le R$:

$$\hat{\mathfrak{R}}_S(\mathcal{H}) \le \frac{BR}{\sqrt{m}}$$

This shows that larger weight norms $B$ or input norms $R$ inflate the complexity, while more data shrinks it at rate $1/\sqrt{m}$.
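For the linear class the supremum has a closed form, $\sup_{\|w\|_2 \le B} \frac{1}{m}\sum_i \sigma_i \langle w, x_i\rangle = \frac{B}{m}\big\|\sum_i \sigma_i x_i\big\|_2$, so the $BR/\sqrt{m}$ bound can be checked by Monte Carlo. The dimensions and norms below are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
m, d, B = 200, 5, 2.0
X = rng.normal(size=(m, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)   # force ||x_i|| = R = 1
R = 1.0

# Closed-form supremum: the best w aligns with sum_i sigma_i x_i, so the
# empirical Rademacher complexity is (B/m) * E_sigma || sum_i sigma_i x_i ||.
draws = [np.linalg.norm(rng.choice([-1.0, 1.0], size=m) @ X) for _ in range(2000)]
estimate = B * np.mean(draws) / m
bound = B * R / np.sqrt(m)

print(f"MC estimate: {estimate:.4f}  <=  bound B*R/sqrt(m): {bound:.4f}")
```

The estimate lands just under the bound, since the bound uses Jensen's inequality on $\mathbb{E}\|\sum_i \sigma_i x_i\|$.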
Structural risk minimization (SRM): instead of fixing $\mathcal{H}$, consider a hierarchy $\mathcal{H}_1 \subset \mathcal{H}_2 \subset \cdots$. For each level $k$, add a complexity penalty proportional to $\mathfrak{R}_m(\mathcal{H}_k)$. Minimise the penalised objective:

$$\hat{h} = \arg\min_{k,\; h \in \mathcal{H}_k} \left[ \hat{R}(h) + 2\mathfrak{R}_m(\mathcal{H}_k) + \sqrt{\frac{\log k}{m}} \right]$$

This formalises the bias-variance tradeoff: too simple (high bias, low complexity) vs too complex (low bias, high variance).
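A toy SRM sketch, assuming nested threshold classes $\mathcal{H}_k$ given by grids of $2^k + 1$ candidate thresholds, with the finite-class penalty $\sqrt{(\log|\mathcal{H}_k| + \log(1/\delta))/(2m)}$ standing in for the complexity term (the data, noise rate, and grid sizes are invented):

```python
import math
import numpy as np

rng = np.random.default_rng(0)
m = 500
x = rng.uniform(-1, 1, size=m)
y = np.sign(x - 0.3)                 # true threshold at 0.3
flip = rng.random(m) < 0.1           # 10% label noise
y[flip] *= -1

def emp_risk(t):
    return np.mean(np.sign(x - t) != y)

delta = 0.05
best = None
for k in range(1, 9):                # nested grids H_1 ⊂ H_2 ⊂ ... of thresholds
    grid = np.linspace(-1, 1, 2 ** k + 1)
    risks = [emp_risk(t) for t in grid]
    penalty = math.sqrt((math.log(len(grid)) + math.log(1 / delta)) / (2 * m))
    score = min(risks) + penalty
    if best is None or score < best[0]:
        best = (score, k, grid[int(np.argmin(risks))])

score, k, t_hat = best
print(f"SRM picks level k={k}, threshold ~ {t_hat:.2f}, penalised risk {score:.3f}")
```

Finer grids always lower the empirical risk, but the penalty grows with $\log|\mathcal{H}_k|$, so the winning level balances the two, which is the bias-variance tradeoff in miniature.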
PAC-Bayes bounds: for a posterior $Q$ over $\mathcal{H}$ and prior $P$, with probability at least $1 - \delta$:

$$\mathbb{E}_{h \sim Q}[R(h)] \le \mathbb{E}_{h \sim Q}[\hat{R}(h)] + \sqrt{\frac{\mathrm{KL}(Q \,\|\, P) + \log(2\sqrt{m}/\delta)}{2m}}$$

PAC-Bayes bounds are often the tightest available for deep networks (when $P$ is chosen carefully).
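To see the magnitudes involved: for isotropic Gaussian posterior and prior the KL term has a closed form, and a McAllester-style bound $\mathbb{E}_Q \hat{R} + \sqrt{(\mathrm{KL}(Q\|P) + \log(2\sqrt{m}/\delta))/(2m)}$ can be evaluated directly. Every number below (dimension, distance, risks) is hypothetical:

```python
import math

def kl_isotropic_gaussians(dist, sigma_q, sigma_p, d):
    """KL(N(mu_q, sq^2 I) || N(mu_p, sp^2 I)) in d dims; dist = ||mu_q - mu_p||."""
    return 0.5 * (d * sigma_q**2 / sigma_p**2
                  + dist**2 / sigma_p**2
                  - d
                  + d * math.log(sigma_p**2 / sigma_q**2))

def pac_bayes_bound(emp_risk, kl, m, delta):
    """McAllester-style bound on the Q-average true risk."""
    return emp_risk + math.sqrt((kl + math.log(2 * math.sqrt(m) / delta)) / (2 * m))

# Hypothetical deep-net-style numbers: posterior centred at trained weights,
# prior at initialisation, distance measured in weight space.
kl = kl_isotropic_gaussians(dist=5.0, sigma_q=0.1, sigma_p=0.1, d=10_000)
bound = pac_bayes_bound(emp_risk=0.02, kl=kl, m=60_000, delta=0.05)
print(f"KL = {kl:.1f}, bound on expected true risk = {bound:.4f}")
```

Here the KL term dominates the confidence term, which is why compressing the posterior toward the prior is the main lever for tightening PAC-Bayes bounds.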
## Applications
| Concept | How bounds apply |
|---|---|
| Regularization ($L_1$, $L_2$) | Constrains weight norms → smaller Rademacher complexity |
| Dropout | Effectively constrains hypothesis class capacity at training time |
| Early stopping | Limits effective complexity of gradient descent trajectory |
| Cross-validation | Empirical proxy for true risk; formal bound via $k$-fold splits |
| Deep network theory | PAC-Bayes and margin bounds partially explain generalisation |
## Trade-offs
- Classical VC and Rademacher bounds are often vacuous for deep networks (the bound > 1); they serve as theoretical insight rather than practical guidance.
- Data-dependent Rademacher bounds are tighter than VC bounds but still require computing $\hat{\mathfrak{R}}_S(\mathcal{H})$, which is intractable for most rich classes.
- PAC-Bayes bounds require choosing a prior $P$; with a good prior (matching the inductive bias), they can be quite tight.
- The i.i.d. assumption is fundamental; distribution shift invalidates all standard generalisation bounds.
## Links
- PAC Learning — sample complexity is derived from generalisation bounds
- VC Dimension — VC bound is a special case; Rademacher is tighter
- Bias-Variance Analysis — practical interpretation of approximation + estimation error
- Probability Theory — McDiarmid, Hoeffding inequalities underpin the proofs
- Convex Optimization — SRM requires solving a regularised objective