PAC Learning
Definition
A concept class $\mathcal{C}$ is probably approximately correct (PAC) learnable if there exist an algorithm $A$ and a polynomial $p$ such that for all $\epsilon, \delta \in (0, 1)$, every concept $c \in \mathcal{C}$, and every distribution $\mathcal{D}$ over the input space, if $A$ receives at least $p(1/\epsilon, 1/\delta)$ i.i.d. samples from $\mathcal{D}$ labelled by $c$, then with probability at least $1 - \delta$, $A$ outputs a hypothesis $h$ with true error $L_{\mathcal{D}}(h) \le \epsilon$.
- $\epsilon$: accuracy parameter (maximum tolerated error)
- $\delta$: confidence parameter (allowed failure probability)
- Probably: succeeds with probability at least $1 - \delta$
- Approximately correct: error at most $\epsilon$
Intuition
PAC learning formalises “what does it mean to learn from data?” Instead of demanding a perfect classifier, it asks: can you, with enough data, produce a classifier that is almost certainly good enough? The framework cleanly separates the statistical question (how many samples?) from the computational question (how fast can you find the hypothesis?).
Think of it as a contract: the learner doesn’t know the true concept $c$ or the data distribution $\mathcal{D}$, but given enough examples, it commits to outputting something close to $c$ with high confidence.
Formal Description
Setup:
- Input space $\mathcal{X}$, label space $\mathcal{Y} = \{0, 1\}$
- Concept class $\mathcal{C}$ (all learnable target functions)
- Hypothesis class $\mathcal{H}$ (all functions the learner can output; may differ from $\mathcal{C}$)
- True risk: $L_{\mathcal{D}}(h) = \Pr_{x \sim \mathcal{D}}[h(x) \neq c(x)]$
- Empirical risk: $\hat{L}_S(h) = \frac{1}{m} \sum_{i=1}^{m} \mathbf{1}[h(x_i) \neq y_i]$
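A minimal sketch of the two risk definitions, using 1-D threshold concepts $h_t(x) = \mathbf{1}[x \ge t]$ under the uniform distribution on $[0, 1]$ (the concept family, thresholds, and sample size here are illustrative, not from the source):

```python
import random

def threshold(t):
    """Threshold concept h_t(x) = 1 if x >= t else 0."""
    return lambda x: 1 if x >= t else 0

c_star = threshold(0.5)   # true concept c
h = threshold(0.6)        # candidate hypothesis

# True risk under Uniform[0, 1]: h and c_star disagree exactly
# on the interval [0.5, 0.6), so L_D(h) = 0.1.
true_risk = 0.1

# Empirical risk on m i.i.d. samples labelled by c_star.
random.seed(0)
m = 10_000
sample = [random.random() for _ in range(m)]
emp_risk = sum(h(x) != c_star(x) for x in sample) / m

print(true_risk, round(emp_risk, 3))  # empirical risk concentrates near 0.1
```

The gap between `emp_risk` and `true_risk` shrinks at rate $O(1/\sqrt{m})$, which is exactly what uniform-convergence arguments quantify.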
Realizable PAC learning: assumes $c \in \mathcal{H}$ with $L_{\mathcal{D}}(c) = 0$ (target in the class). Standard result: for finite $\mathcal{H}$, empirical risk minimization (ERM) satisfies the PAC bound with $m \ge \frac{1}{\epsilon}\left(\ln|\mathcal{H}| + \ln\frac{1}{\delta}\right)$
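Plugging numbers into the finite-class realizable bound makes its mild $\ln|\mathcal{H}|$ dependence concrete (the class size and $\epsilon, \delta$ values below are illustrative):

```python
import math

def realizable_m(h_size, eps, delta):
    """Finite-class realizable sample complexity:
    m >= (ln|H| + ln(1/delta)) / epsilon."""
    return math.ceil((math.log(h_size) + math.log(1 / delta)) / eps)

# A class of 2**20 (~1M) hypotheses needs only a few hundred samples
# for eps = 0.05, delta = 0.01, because |H| enters logarithmically.
print(realizable_m(2**20, 0.05, 0.01))  # → 370
```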
Agnostic PAC learning: no assumption that $\mathcal{H}$ contains a perfect hypothesis. Goal: find $h \in \mathcal{H}$ satisfying $L_{\mathcal{D}}(h) \le \min_{h' \in \mathcal{H}} L_{\mathcal{D}}(h') + \epsilon$
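One standard finite-class agnostic bound (via Hoeffding’s inequality plus a union bound, not stated in the source) is $m \ge \frac{2}{\epsilon^2}\left(\ln|\mathcal{H}| + \ln\frac{2}{\delta}\right)$; the $1/\epsilon^2$ dependence, versus $1/\epsilon$ in the realizable case, is what makes agnostic learning harder:

```python
import math

def agnostic_m(h_size, eps, delta):
    """Finite-class agnostic sample complexity (Hoeffding + union bound):
    m >= 2 (ln|H| + ln(2/delta)) / epsilon**2."""
    return math.ceil(2 * (math.log(h_size) + math.log(2 / delta)) / eps**2)

# Halving epsilon roughly quadruples the requirement (1/eps^2 scaling),
# whereas the realizable 1/eps bound would only double it.
print(agnostic_m(2**20, 0.05, 0.01), agnostic_m(2**20, 0.025, 0.01))
```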
Agnostic PAC is strictly harder; sample complexity depends on VC dimension (see vc_dimension).
Consistent learner: an algorithm that always returns $h \in \mathcal{H}$ with zero empirical error on the training set. For realizable PAC, any consistent learner is a valid PAC learner.
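A consistent learner can be sketched concretely for 1-D thresholds: returning the smallest positive training example as the threshold gives zero empirical error in the realizable setting, and its true error shrinks as $O(1/m)$ (the target threshold and sample size below are illustrative):

```python
import random

def erm_threshold(sample):
    """Consistent ERM for thresholds h_t(x) = 1[x >= t]:
    return the smallest positively labelled point."""
    positives = [x for x, y in sample if y == 1]
    return min(positives) if positives else 1.0

random.seed(1)
t_star = 0.3  # unknown target threshold
def label(x):
    return 1 if x >= t_star else 0

m = 2000
train = [(x, label(x)) for x in (random.random() for _ in range(m))]
t_hat = erm_threshold(train)

# Consistency: zero empirical error on the training set.
train_err = sum((1 if x >= t_hat else 0) != y for x, y in train) / m

# True error = mass of the disagreement interval [t_star, t_hat).
true_err = t_hat - t_star
print(train_err, true_err)
```

With 2000 samples the learned threshold lands within a hair of $t^\ast$, illustrating the PAC guarantee for a consistent learner.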
Occam’s razor / compression: simpler hypotheses (shorter description length $|h|$) require fewer samples. This is the formal justification for preferring simple models.
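The Occam argument can be made quantitative: hypotheses describable in $b$ bits form a class of size at most $2^b$, so the realizable bound becomes $m \ge \frac{1}{\epsilon}\left(b \ln 2 + \ln\frac{1}{\delta}\right)$ (a standard corollary; the bit counts below are illustrative):

```python
import math

def occam_m(bits, eps, delta):
    """Occam bound: description length b bits -> |H| <= 2**b, so
    m >= (b ln 2 + ln(1/delta)) / epsilon."""
    return math.ceil((bits * math.log(2) + math.log(1 / delta)) / eps)

# Halving the description length roughly halves the sample requirement.
print(occam_m(64, 0.05, 0.01), occam_m(32, 0.05, 0.01))  # → 980 536
```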
Computational vs statistical learnability: PAC is a statistical framework; even if a class is statistically PAC learnable, finding the optimal $h$ may be NP-hard (e.g., learning intersections of halfspaces).
Applications
| Concept | Connection to PAC |
|---|---|
| Sample complexity bounds | Lower bound on $m$ to achieve $(\epsilon, \delta)$-PAC |
| Regularization | Constraining $\mathcal{H}$ (smaller $|\mathcal{H}|$) reduces required $m$ |
| Model selection | Simpler models ↔ smaller $|\mathcal{H}|$ |
| Cross-validation | Empirical substitute for estimating true risk |
Trade-offs
- PAC bounds are often loose (vacuous for realistic $|\mathcal{H}|$ and $m$); data-dependent bounds (e.g., Rademacher complexity) are tighter in practice.
- The framework assumes i.i.d. data; distribution shift invalidates PAC guarantees.
- PAC learning ignores computational cost; a class can be PAC learnable but computationally hard to learn.
Links
- VC Dimension — infinite hypothesis class version of PAC sample complexity
- Generalization Bounds and Rademacher Complexity — tighter, data-dependent bounds
- Bias-Variance Analysis — practical interpretation of approximation-estimation tradeoff
- Bayesian Inference — alternative framework for quantifying learning uncertainty