# Generalization Bounds and Rademacher Complexity

## Definition
Generalization bound: a probabilistic upper bound on the true risk $R(h)$ of a learned hypothesis $h$, given only its empirical risk $\hat{R}(h)$ on the training data.
Rademacher complexity of a hypothesis class $\mathcal{H}$ with respect to a sample $S = (x_1, \dots, x_m)$:

$$\hat{\mathfrak{R}}_S(\mathcal{H}) = \mathbb{E}_{\sigma}\left[\sup_{h \in \mathcal{H}} \frac{1}{m} \sum_{i=1}^{m} \sigma_i h(x_i)\right]$$

where $\sigma_1, \dots, \sigma_m$ are i.i.d. Rademacher (uniform $\pm 1$) random variables. $\mathfrak{R}_m(\mathcal{H}) = \mathbb{E}_{S \sim \mathcal{D}^m}[\hat{\mathfrak{R}}_S(\mathcal{H})]$ is the expected Rademacher complexity.
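As a sanity check, the empirical Rademacher complexity of a small finite class can be estimated by Monte Carlo over random sign vectors. A minimal NumPy sketch (the threshold class, sample points, and draw count below are illustrative choices, not from the text):

```python
import numpy as np

def empirical_rademacher(predictions, n_draws=20000, seed=0):
    """Monte Carlo estimate of the empirical Rademacher complexity.

    predictions: (|H|, m) array with predictions[h, i] = h(x_i) in {-1, +1}.
    Averages sup_h (1/m) sum_i sigma_i h(x_i) over random sign vectors sigma.
    """
    rng = np.random.default_rng(seed)
    n_hyp, m = predictions.shape
    total = 0.0
    for _ in range(n_draws):
        sigma = rng.choice([-1.0, 1.0], size=m)
        total += np.max(predictions @ sigma) / m
    return total / n_draws

# Hypothetical finite class: 1-D threshold classifiers h_t(x) = sign(x - t).
x = np.linspace(-1, 1, 20)                  # m = 20 sample points
thresholds = np.linspace(-1.1, 1.1, 12)     # |H| = 12 hypotheses
preds = np.sign(x[None, :] - thresholds[:, None])
print(f"estimated Rademacher complexity: {empirical_rademacher(preds):.3f}")
```

The estimate sits well below 1 because 12 step functions can only weakly track a random labelling of 20 points.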
## Intuition
The generalization gap $R(h) - \hat{R}(h)$ is bounded by how well the hypothesis class can “fit random noise”. Rademacher complexity measures exactly this: how much correlation can the best $h \in \mathcal{H}$ achieve with a random labelling? If the class is so expressive that it can fit any labelling, the Rademacher complexity is 1: the class is memorising noise and will not generalise.
This is a data-dependent, distribution-sensitive bound — tighter than worst-case VC bounds.
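The “fits any labelling” extreme can be checked directly: if the class realises all $2^m$ sign patterns on the sample, some hypothesis matches the random labels exactly on every draw, so the supremum correlation is always 1. A tiny illustrative check ($m = 4$ is an arbitrary choice):

```python
import itertools
import numpy as np

m = 4
# All 2^m possible labelings of the sample: a maximally expressive class.
all_patterns = np.array(list(itertools.product([-1.0, 1.0], repeat=m)))

rng = np.random.default_rng(0)
vals = []
for _ in range(5000):
    sigma = rng.choice([-1.0, 1.0], size=m)
    vals.append(np.max(all_patterns @ sigma) / m)  # sup_h correlation with sigma

print(np.mean(vals))  # 1.0: some pattern equals sigma coordinate-wise
```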
## Formal Description
Empirical risk minimization (ERM) bound: for ERM over a finite class $\mathcal{H}$ with $m$ samples, with probability at least $1 - \delta$, for every $h \in \mathcal{H}$:

$$R(h) \le \hat{R}(h) + \sqrt{\frac{\log |\mathcal{H}| + \log(1/\delta)}{2m}}$$
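Plugging numbers into the finite-class bound $R(h) \le \hat{R}(h) + \sqrt{(\log|\mathcal{H}| + \log(1/\delta))/(2m)}$ shows how the slack scales; the figures below (class size, sample count, empirical risk) are invented for illustration:

```python
import math

def erm_finite_bound(emp_risk, n_hypotheses, m, delta):
    """Finite-class bound: R(h) <= emp_risk + sqrt((log|H| + log(1/delta)) / (2m))."""
    return emp_risk + math.sqrt((math.log(n_hypotheses) + math.log(1 / delta)) / (2 * m))

# Hypothetical numbers: 1000 hypotheses, 10,000 samples, 95% confidence.
bound = erm_finite_bound(emp_risk=0.08, n_hypotheses=1000, m=10_000, delta=0.05)
print(f"true risk <= {bound:.4f} with probability >= 0.95")
```

Note the slack only grows logarithmically in $|\mathcal{H}|$ but shrinks as $1/\sqrt{m}$.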
Rademacher generalisation bound: for any $h \in \mathcal{H}$, with probability at least $1 - \delta$:

$$R(h) \le \hat{R}(h) + 2\mathfrak{R}_m(\mathcal{H}) + \sqrt{\frac{\log(1/\delta)}{2m}}$$
Uniform convergence: $\mathcal{H}$ satisfies uniform convergence if for all $\epsilon, \delta > 0$:

$$\Pr\left[\sup_{h \in \mathcal{H}} \left| R(h) - \hat{R}(h) \right| > \epsilon \right] \le \delta$$

when $m \ge m_{\mathcal{H}}(\epsilon, \delta)$. This is the key condition for ERM to work.
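For a finite class, inverting the Hoeffding-plus-union-bound argument gives the sample size $m \ge (\log|\mathcal{H}| + \log(2/\delta))/(2\epsilon^2)$ needed for uniform convergence; a small calculator sketch (two-sided deviation, hence the $2/\delta$; the numbers are hypothetical):

```python
import math

def sample_complexity(n_hypotheses, epsilon, delta):
    """Samples needed so that sup_h |R(h) - R_hat(h)| <= epsilon w.p. >= 1 - delta,
    via Hoeffding plus a union bound over a finite class (two-sided deviations)."""
    return math.ceil((math.log(n_hypotheses) + math.log(2 / delta)) / (2 * epsilon ** 2))

# Hypothetical setting: |H| = 10^6, epsilon = 0.01, delta = 0.05.
print(sample_complexity(10**6, 0.01, 0.05))
```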
McDiarmid’s inequality (stability): if replacing any single sample changes a function $f(S)$ of the sample (e.g. the empirical risk) by at most $c_i$, then by concentration:

$$\Pr\left[ f(S) - \mathbb{E}[f(S)] \ge \epsilon \right] \le \exp\left( \frac{-2\epsilon^2}{\sum_{i=1}^{m} c_i^2} \right)$$
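For the empirical risk with a loss in $[0, 1]$, swapping one sample moves the average by at most $c_i = 1/m$, so McDiarmid reduces to a Hoeffding-style tail. A quick numeric check with hypothetical numbers:

```python
import math

def mcdiarmid_tail(epsilon, c):
    """McDiarmid tail: P[f(S) - E f(S) >= eps] <= exp(-2 eps^2 / sum_i c_i^2)."""
    return math.exp(-2 * epsilon ** 2 / sum(ci ** 2 for ci in c))

# Empirical risk with 0/1 loss on m samples: each c_i = 1/m.
m = 1000
p = mcdiarmid_tail(epsilon=0.05, c=[1 / m] * m)
print(f"P[deviation >= 0.05] <= {p:.4f}")
```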
Rademacher complexity of linear classes: for $\mathcal{H} = \{ x \mapsto \langle w, x \rangle : \|w\|_2 \le B \}$ with $\|x_i\|_2 \le R$:

$$\hat{\mathfrak{R}}_S(\mathcal{H}) \le \frac{BR}{\sqrt{m}}$$

This shows that larger weight norms $B$ or input norms $R$ inflate the complexity, while more data shrinks it at rate $1/\sqrt{m}$.
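For the linear class the supremum has a closed form, $\sup_{\|w\|_2 \le B} \frac{1}{m}\sum_i \sigma_i \langle w, x_i\rangle = \frac{B}{m}\big\|\sum_i \sigma_i x_i\big\|_2$, so the $BR/\sqrt{m}$ bound can be checked by Monte Carlo. The dimensions and norms below are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
m, d, B = 200, 5, 2.0
X = rng.normal(size=(m, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)   # force ||x_i|| = R = 1
R = 1.0

# Closed-form supremum: the best w aligns with sum_i sigma_i x_i, so the
# empirical Rademacher complexity is (B/m) * E_sigma || sum_i sigma_i x_i ||.
draws = [np.linalg.norm(rng.choice([-1.0, 1.0], size=m) @ X) for _ in range(2000)]
estimate = B * np.mean(draws) / m
bound = B * R / np.sqrt(m)

print(f"MC estimate: {estimate:.4f}  <=  bound B*R/sqrt(m): {bound:.4f}")
```

The estimate lands just under the bound, since the bound uses Jensen's inequality on $\mathbb{E}\|\sum_i \sigma_i x_i\|$.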
Structural risk minimization (SRM): instead of fixing $\mathcal{H}$, consider a hierarchy $\mathcal{H}_1 \subset \mathcal{H}_2 \subset \cdots$. For each level $k$, add a complexity penalty proportional to $\mathfrak{R}_m(\mathcal{H}_k)$. Minimise the penalised objective:

$$\hat{h} = \arg\min_{k,\; h \in \mathcal{H}_k} \left[ \hat{R}(h) + 2\mathfrak{R}_m(\mathcal{H}_k) + \sqrt{\frac{\log k}{m}} \right]$$

This formalises the bias-variance tradeoff: too simple (high bias, low complexity) vs too complex (low bias, high variance).
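A toy SRM sketch, assuming nested threshold classes $\mathcal{H}_k$ given by grids of $2^k + 1$ candidate thresholds, with the finite-class penalty $\sqrt{(\log|\mathcal{H}_k| + \log(1/\delta))/(2m)}$ standing in for the complexity term (the data, noise rate, and grid sizes are invented):

```python
import math
import numpy as np

rng = np.random.default_rng(0)
m = 500
x = rng.uniform(-1, 1, size=m)
y = np.sign(x - 0.3)                 # true threshold at 0.3
flip = rng.random(m) < 0.1           # 10% label noise
y[flip] *= -1

def emp_risk(t):
    return np.mean(np.sign(x - t) != y)

delta = 0.05
best = None
for k in range(1, 9):                # nested grids H_1 ⊂ H_2 ⊂ ... of thresholds
    grid = np.linspace(-1, 1, 2 ** k + 1)
    risks = [emp_risk(t) for t in grid]
    penalty = math.sqrt((math.log(len(grid)) + math.log(1 / delta)) / (2 * m))
    score = min(risks) + penalty
    if best is None or score < best[0]:
        best = (score, k, grid[int(np.argmin(risks))])

score, k, t_hat = best
print(f"SRM picks level k={k}, threshold ~ {t_hat:.2f}, penalised risk {score:.3f}")
```

Finer grids always lower the empirical risk, but the penalty grows with $\log|\mathcal{H}_k|$, so the winning level balances the two, which is the bias-variance tradeoff in miniature.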
PAC-Bayes bounds: for a posterior $Q$ over $\mathcal{H}$ and prior $P$, with probability at least $1 - \delta$:

$$\mathbb{E}_{h \sim Q}[R(h)] \le \mathbb{E}_{h \sim Q}[\hat{R}(h)] + \sqrt{\frac{\mathrm{KL}(Q \,\|\, P) + \log(2\sqrt{m}/\delta)}{2m}}$$

PAC-Bayes bounds are often the tightest available for deep networks (when $P$ is chosen carefully).
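To see the magnitudes involved: for isotropic Gaussian posterior and prior the KL term has a closed form, and a McAllester-style bound $\mathbb{E}_Q \hat{R} + \sqrt{(\mathrm{KL}(Q\|P) + \log(2\sqrt{m}/\delta))/(2m)}$ can be evaluated directly. Every number below (dimension, distance, risks) is hypothetical:

```python
import math

def kl_isotropic_gaussians(dist, sigma_q, sigma_p, d):
    """KL(N(mu_q, sq^2 I) || N(mu_p, sp^2 I)) in d dims; dist = ||mu_q - mu_p||."""
    return 0.5 * (d * sigma_q**2 / sigma_p**2
                  + dist**2 / sigma_p**2
                  - d
                  + d * math.log(sigma_p**2 / sigma_q**2))

def pac_bayes_bound(emp_risk, kl, m, delta):
    """McAllester-style bound on the Q-average true risk."""
    return emp_risk + math.sqrt((kl + math.log(2 * math.sqrt(m) / delta)) / (2 * m))

# Hypothetical deep-net-style numbers: posterior centred at trained weights,
# prior at initialisation, distance measured in weight space.
kl = kl_isotropic_gaussians(dist=5.0, sigma_q=0.1, sigma_p=0.1, d=10_000)
bound = pac_bayes_bound(emp_risk=0.02, kl=kl, m=60_000, delta=0.05)
print(f"KL = {kl:.1f}, bound on expected true risk = {bound:.4f}")
```

Here the KL term dominates the confidence term, which is why compressing the posterior toward the prior is the main lever for tightening PAC-Bayes bounds.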
## Applications
| Concept | How bounds apply |
|---|---|
| Regularization ($L_1$, $L_2$) | Constrains weight norms → smaller Rademacher complexity |
| Dropout | Effectively constrains hypothesis class capacity at training time |
| Early stopping | Limits effective complexity of gradient descent trajectory |
| Cross-validation | Empirical proxy for true risk; formal bound via $k$-fold splits |
| Deep network theory | PAC-Bayes and margin bounds partially explain generalisation |
## Trade-offs
- Classical VC and Rademacher bounds are often vacuous for deep networks (the bound > 1); they serve as theoretical insight rather than practical guidance.
- Data-dependent Rademacher bounds are tighter than VC bounds but still require computing $\hat{\mathfrak{R}}_S(\mathcal{H})$, which is intractable for most rich classes.
- PAC-Bayes bounds require choosing a prior $P$; with a good prior (matching the inductive bias), they can be quite tight.
- The i.i.d. assumption is fundamental; distribution shift invalidates all standard generalisation bounds.
## Links
- PAC Learning — sample complexity is derived from generalisation bounds
- VC Dimension — VC bound is a special case; Rademacher is tighter
- Bias-Variance Analysis — practical interpretation of approximation + estimation error
- Probability Theory — McDiarmid, Hoeffding inequalities underpin the proofs
- Convex Optimization — SRM requires solving a regularised objective