PAC Learning
Definition
A concept class $\mathcal{C}$ is probably approximately correct (PAC) learnable if there exist an algorithm $A$ and a polynomial $p$ such that for all $\epsilon, \delta \in (0, 1)$, every concept $c \in \mathcal{C}$, and every distribution $\mathcal{D}$ over the input space, if $A$ receives at least $p(1/\epsilon, 1/\delta)$ i.i.d. samples from $\mathcal{D}$ labelled by $c$, then with probability at least $1 - \delta$, $A$ outputs a hypothesis $h$ with true error $L_{\mathcal{D}}(h) \le \epsilon$.
- $\epsilon$: accuracy parameter (maximum tolerated error)
- $\delta$: confidence parameter (allowed failure probability)
- Probably: succeeds with probability at least $1 - \delta$
- Approximately correct: error at most $\epsilon$
Intuition
PAC learning formalises “what does it mean to learn from data?” Instead of demanding a perfect classifier, it asks: can you, with enough data, produce a classifier that is almost certainly good enough? The framework cleanly separates the statistical question (how many samples?) from the computational question (how fast can you find the hypothesis?).
Think of it as a contract: the learner doesn’t know the true concept $c$ or the data distribution $\mathcal{D}$, but given enough examples, it commits to outputting something close to $c$ with high confidence.
Formal Description
Setup:
- Input space $\mathcal{X}$, label space $\mathcal{Y} = \{0, 1\}$
- Concept class $\mathcal{C}$ (all learnable target functions)
- Hypothesis class $\mathcal{H}$ (all functions the learner can output; may differ from $\mathcal{C}$)
- True risk: $L_{\mathcal{D}}(h) = \Pr_{x \sim \mathcal{D}}[h(x) \neq c(x)]$
- Empirical risk: $\hat{L}_S(h) = \frac{1}{m} \sum_{i=1}^{m} \mathbf{1}[h(x_i) \neq y_i]$
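A minimal sketch of the two risk definitions, using 1-D threshold concepts $h_t(x) = \mathbf{1}[x \ge t]$ under the uniform distribution on $[0, 1]$ (the concept family, thresholds, and sample size here are illustrative, not from the source):

```python
import random

def threshold(t):
    """Threshold concept h_t(x) = 1 if x >= t else 0."""
    return lambda x: 1 if x >= t else 0

c_star = threshold(0.5)   # true concept c
h = threshold(0.6)        # candidate hypothesis

# True risk under Uniform[0, 1]: h and c_star disagree exactly
# on the interval [0.5, 0.6), so L_D(h) = 0.1.
true_risk = 0.1

# Empirical risk on m i.i.d. samples labelled by c_star.
random.seed(0)
m = 10_000
sample = [random.random() for _ in range(m)]
emp_risk = sum(h(x) != c_star(x) for x in sample) / m

print(true_risk, round(emp_risk, 3))  # empirical risk concentrates near 0.1
```

The gap between `emp_risk` and `true_risk` shrinks at rate $O(1/\sqrt{m})$, which is exactly what uniform-convergence arguments quantify.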
Realizable PAC learning: assumes $c \in \mathcal{H}$ with $L_{\mathcal{D}}(c) = 0$ (target in the class). Standard result: for finite $\mathcal{H}$, empirical risk minimization (ERM) satisfies the PAC bound with $m \ge \frac{1}{\epsilon}\left(\ln|\mathcal{H}| + \ln\frac{1}{\delta}\right)$
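Plugging numbers into the finite-class realizable bound makes its mild $\ln|\mathcal{H}|$ dependence concrete (the class size and $\epsilon, \delta$ values below are illustrative):

```python
import math

def realizable_m(h_size, eps, delta):
    """Finite-class realizable sample complexity:
    m >= (ln|H| + ln(1/delta)) / epsilon."""
    return math.ceil((math.log(h_size) + math.log(1 / delta)) / eps)

# A class of 2**20 (~1M) hypotheses needs only a few hundred samples
# for eps = 0.05, delta = 0.01, because |H| enters logarithmically.
print(realizable_m(2**20, 0.05, 0.01))  # → 370
```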
Agnostic PAC learning: no assumption that $\mathcal{H}$ contains a perfect hypothesis. Goal: find $h \in \mathcal{H}$ satisfying $L_{\mathcal{D}}(h) \le \min_{h' \in \mathcal{H}} L_{\mathcal{D}}(h') + \epsilon$
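One standard finite-class agnostic bound (via Hoeffding’s inequality plus a union bound, not stated in the source) is $m \ge \frac{2}{\epsilon^2}\left(\ln|\mathcal{H}| + \ln\frac{2}{\delta}\right)$; the $1/\epsilon^2$ dependence, versus $1/\epsilon$ in the realizable case, is what makes agnostic learning harder:

```python
import math

def agnostic_m(h_size, eps, delta):
    """Finite-class agnostic sample complexity (Hoeffding + union bound):
    m >= 2 (ln|H| + ln(2/delta)) / epsilon**2."""
    return math.ceil(2 * (math.log(h_size) + math.log(2 / delta)) / eps**2)

# Halving epsilon roughly quadruples the requirement (1/eps^2 scaling),
# whereas the realizable 1/eps bound would only double it.
print(agnostic_m(2**20, 0.05, 0.01), agnostic_m(2**20, 0.025, 0.01))
```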
Agnostic PAC is strictly harder; sample complexity depends on VC dimension (see vc_dimension).
Consistent learner: an algorithm that always returns $h \in \mathcal{H}$ with zero empirical error on the training set. For realizable PAC, any consistent learner is a valid PAC learner.
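A consistent learner can be sketched concretely for 1-D thresholds: returning the smallest positive training example as the threshold gives zero empirical error in the realizable setting, and its true error shrinks as $O(1/m)$ (the target threshold and sample size below are illustrative):

```python
import random

def erm_threshold(sample):
    """Consistent ERM for thresholds h_t(x) = 1[x >= t]:
    return the smallest positively labelled point."""
    positives = [x for x, y in sample if y == 1]
    return min(positives) if positives else 1.0

random.seed(1)
t_star = 0.3  # unknown target threshold
def label(x):
    return 1 if x >= t_star else 0

m = 2000
train = [(x, label(x)) for x in (random.random() for _ in range(m))]
t_hat = erm_threshold(train)

# Consistency: zero empirical error on the training set.
train_err = sum((1 if x >= t_hat else 0) != y for x, y in train) / m

# True error = mass of the disagreement interval [t_star, t_hat).
true_err = t_hat - t_star
print(train_err, true_err)
```

With 2000 samples the learned threshold lands within a hair of $t^\ast$, illustrating the PAC guarantee for a consistent learner.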
Occam’s razor / compression: simpler hypotheses (shorter description length $|h|$) require fewer samples. This is the formal justification for preferring simple models.
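The Occam argument can be made quantitative: hypotheses describable in $b$ bits form a class of size at most $2^b$, so the realizable bound becomes $m \ge \frac{1}{\epsilon}\left(b \ln 2 + \ln\frac{1}{\delta}\right)$ (a standard corollary; the bit counts below are illustrative):

```python
import math

def occam_m(bits, eps, delta):
    """Occam bound: description length b bits -> |H| <= 2**b, so
    m >= (b ln 2 + ln(1/delta)) / epsilon."""
    return math.ceil((bits * math.log(2) + math.log(1 / delta)) / eps)

# Halving the description length roughly halves the sample requirement.
print(occam_m(64, 0.05, 0.01), occam_m(32, 0.05, 0.01))  # → 980 536
```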
Computational vs statistical learnability: PAC is a statistical framework; even if a class is statistically PAC learnable, finding the optimal $h$ may be NP-hard (e.g., learning intersections of halfspaces).
Applications
| Concept | Connection to PAC |
|---|---|
| Sample complexity bounds | Lower bound on $m$ to achieve $(\epsilon, \delta)$-PAC |
| Regularization | Constraining $\mathcal{H}$ (smaller $|\mathcal{H}|$) reduces required $m$ |
| Model selection | Simpler models ↔ smaller $|\mathcal{H}|$ |
| Cross-validation | Empirical substitute for estimating true risk |
Trade-offs
- PAC bounds are often loose (vacuous for realistic $|\mathcal{H}|$ and $m$); data-dependent bounds (e.g., Rademacher complexity) are tighter in practice.
- The framework assumes i.i.d. data; distribution shift invalidates PAC guarantees.
- PAC learning ignores computational cost; a class can be PAC learnable but computationally hard to learn.
Links
- VC Dimension — infinite hypothesis class version of PAC sample complexity
- Generalization Bounds and Rademacher Complexity — tighter, data-dependent bounds
- Bias-Variance Analysis — practical interpretation of approximation-estimation tradeoff
- Bayesian Inference — alternative framework for quantifying learning uncertainty