Bayesian Inference
Definition
A statistical framework that treats model parameters as random variables and updates a prior belief distribution with observed data to produce a posterior distribution using Bayes’ theorem.
Intuition
Frequentist inference asks “what does this data say about a fixed unknown parameter?” Bayesian inference asks “after seeing this data, what should I believe about the parameter?” The prior encodes what is known before the data; the likelihood encodes what the data says; Bayes’ theorem combines them into the posterior.
Formal Description
Bayes’ theorem for inference:

$$p(\theta \mid D) = \frac{p(D \mid \theta)\, p(\theta)}{p(D)} = \frac{\text{likelihood} \times \text{prior}}{\text{evidence}}$$

The evidence $p(D) = \int p(D \mid \theta)\, p(\theta)\, d\theta$ normalises the posterior but is often intractable.
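On a discrete grid the evidence is just a sum, so the full update can be computed exactly. A minimal sketch for a coin-bias parameter with a uniform prior (the data and grid are illustrative, not from the text):

```python
import numpy as np

# Hypothetical example: Bayes' theorem on a grid for a coin-bias theta,
# with a uniform prior and a Bernoulli likelihood.
theta = np.linspace(0.001, 0.999, 999)     # grid over the parameter
prior = np.ones_like(theta) / theta.size   # uniform prior belief

data = [1, 1, 0, 1]  # observed flips (1 = heads)
likelihood = np.prod([theta**x * (1 - theta)**(1 - x) for x in data], axis=0)

unnormalised = likelihood * prior
posterior = unnormalised / unnormalised.sum()  # the sum plays the role of the evidence

print(round(theta[np.argmax(posterior)], 3))  # 0.75: under a flat prior the mode equals the MLE
```

In higher dimensions the grid grows exponentially, which is exactly why the approximate-inference methods below exist.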
Maximum Likelihood Estimation (MLE): finds the $\theta$ that maximises the likelihood, ignoring the prior:

$$\hat{\theta}_{\text{MLE}} = \arg\max_{\theta}\, p(D \mid \theta)$$
Maximum A Posteriori (MAP): finds the mode of the posterior:

$$\hat{\theta}_{\text{MAP}} = \arg\max_{\theta}\, p(D \mid \theta)\, p(\theta)$$
MAP with a Gaussian prior is equivalent to L2-regularised MLE (the prior acts as a regulariser).
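The equivalence can be checked numerically. A sketch under assumed settings (linear-Gaussian model, noise variance `sigma2`, prior variance `tau2`): the MAP weights coincide with ridge regression with $\lambda = \sigma^2/\tau^2$.

```python
import numpy as np

# Sketch (assumed setup): y = X w + noise with noise variance sigma2,
# and a Gaussian prior w ~ N(0, tau2 I).  The MAP estimate is the ridge
# solution with lambda = sigma2 / tau2.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
w_true = np.array([1.0, -2.0, 0.5])       # hypothetical ground-truth weights
y = X @ w_true + 0.1 * rng.normal(size=50)

sigma2, tau2 = 0.01, 1.0
lam = sigma2 / tau2

# Ridge / MAP closed form: (X^T X + lam I)^{-1} X^T y
w_map = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ y)
print(w_map)  # close to w_true, shrunk slightly toward the zero-mean prior
```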
Posterior predictive distribution:

$$p(x^{*} \mid D) = \int p(x^{*} \mid \theta)\, p(\theta \mid D)\, d\theta$$

This integrates over parameter uncertainty, which is more principled than plugging in a point estimate.
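When the integral has no closed form, it can be approximated by averaging the likelihood over posterior draws. A sketch for the Beta-Bernoulli case (numbers chosen for illustration), where the exact answer is known and can be compared against:

```python
import numpy as np

# Sketch: Monte Carlo posterior predictive for a Beta-Bernoulli model.
# After k heads in n flips with a Beta(1, 1) prior, the posterior is
# Beta(1 + k, 1 + n - k); averaging p(head | theta) over posterior draws
# approximates the exact predictive p(head | D) = (1 + k) / (2 + n).
rng = np.random.default_rng(0)
k, n = 7, 10
draws = rng.beta(1 + k, 1 + n - k, size=100_000)  # samples from the posterior
p_head = draws.mean()  # E[theta | D]: parameter uncertainty integrated out
print(round(p_head, 2))  # 0.67, matching (1 + k) / (2 + n) = 8/12
```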
Conjugate Priors
A prior is conjugate to a likelihood if the posterior is in the same distributional family as the prior. This makes updates analytical:
| Likelihood | Conjugate prior | Posterior |
|---|---|---|
| Bernoulli($\theta$) | Beta($\alpha, \beta$) | Beta($\alpha + \sum_i x_i,\ \beta + n - \sum_i x_i$) |
| Poisson($\lambda$) | Gamma($\alpha, \beta$) | Gamma($\alpha + \sum_i x_i,\ \beta + n$) |
| Gaussian (known $\sigma^2$) | Gaussian($\mu_0, \sigma_0^2$) | Gaussian |
| Categorical($\theta$) | Dirichlet($\alpha_1, \dots, \alpha_K$) | Dirichlet($\alpha_1 + n_1, \dots, \alpha_K + n_K$) |
Beta-Binomial example: after observing $k$ successes in $n$ trials with prior Beta($\alpha, \beta$):

$$p(\theta \mid D) = \text{Beta}(\alpha + k,\ \beta + n - k)$$

The posterior mean is $\frac{\alpha + k}{\alpha + \beta + n}$, which interpolates between the prior mean $\frac{\alpha}{\alpha + \beta}$ and the MLE $\frac{k}{n}$ as $n$ grows.
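The interpolation is easy to see numerically. A sketch with an illustrative Beta(2, 2) prior and a fixed 75% success rate at increasing sample sizes:

```python
# Sketch: Beta-Binomial update, showing the posterior mean drifting from
# the prior mean toward the MLE as n grows (hypothetical prior Beta(2, 2)).
alpha, beta = 2.0, 2.0          # prior pseudo-counts
prior_mean = alpha / (alpha + beta)   # 0.5

for n, k in [(4, 3), (40, 30), (4000, 3000)]:  # always 75% successes
    post_mean = (alpha + k) / (alpha + beta + n)
    print(n, round(post_mean, 3))  # 0.625, then 0.727, then 0.75: toward the MLE
```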
Approximate Inference
When the posterior is intractable:
- MCMC (Markov Chain Monte Carlo): sample from the posterior (Metropolis-Hastings, Hamiltonian MC via Stan/PyMC).
- Variational inference (VI): approximate the posterior with a tractable family $q(\theta)$ by minimising $\mathrm{KL}\big(q(\theta)\,\|\,p(\theta \mid D)\big)$.
- Laplace approximation: fit a Gaussian at the MAP estimate.
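MCMC needs only the unnormalised posterior, sidestepping the intractable evidence. A minimal random-walk Metropolis-Hastings sketch (not production code; Stan/PyMC handle this properly) targeting the Beta(8, 4) posterior from a hypothetical coin-flip dataset:

```python
import numpy as np

# Sketch: random-walk Metropolis-Hastings targeting the Beta(8, 4)
# posterior arising from 7 heads in 10 flips with a Beta(1, 1) prior.
rng = np.random.default_rng(0)

def log_post(theta):
    """Unnormalised log posterior; the evidence is never needed."""
    if not 0 < theta < 1:
        return -np.inf
    return 7 * np.log(theta) + 3 * np.log(1 - theta)

theta, samples = 0.5, []
for _ in range(20_000):
    prop = theta + 0.1 * rng.normal()  # symmetric Gaussian proposal
    # Accept with probability min(1, p(prop) / p(theta))
    if np.log(rng.uniform()) < log_post(prop) - log_post(theta):
        theta = prop
    samples.append(theta)

burned = samples[5_000:]  # discard burn-in
print(round(np.mean(burned), 2))  # posterior mean; analytic value is 2/3
```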
Applications
- Naive Bayes classifier: uses class-conditional independence to compute $p(y \mid x) \propto p(y) \prod_i p(x_i \mid y)$.
- Bayesian linear regression: places a Gaussian prior on weights; posterior is analytical and gives uncertainty estimates over predictions.
- Topic models (LDA): Dirichlet-Multinomial hierarchy.
- Bayesian hyperparameter optimisation (Gaussian processes): prior over functions, update with evaluated points.
- Online learning: sequential Bayesian updates as new data arrives.
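The Bayesian linear regression case can be sketched end to end, since the posterior is analytical. The setup below (known noise variance `sigma2`, zero-mean isotropic prior with variance `tau2`, synthetic data) is illustrative:

```python
import numpy as np

# Sketch of Bayesian linear regression (assumed known noise variance sigma2
# and prior w ~ N(0, tau2 I)).  The posterior over weights is Gaussian with
# covariance S and mean m; predictions come with uncertainty estimates.
rng = np.random.default_rng(1)
X = rng.normal(size=(30, 2))
w_true = np.array([1.5, -0.5])             # hypothetical ground truth
y = X @ w_true + 0.2 * rng.normal(size=30)

sigma2, tau2 = 0.04, 10.0
S = np.linalg.inv(X.T @ X / sigma2 + np.eye(2) / tau2)  # posterior covariance
m = S @ X.T @ y / sigma2                                # posterior mean

x_new = np.array([1.0, 1.0])
pred_mean = x_new @ m
pred_var = sigma2 + x_new @ S @ x_new  # noise variance + parameter uncertainty
print(pred_mean, pred_var)
```

Note that `pred_var` is always strictly larger than the noise variance alone: the extra term is the parameter uncertainty a point estimate would discard.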
Trade-offs
| Approach | Advantage | Limitation |
|---|---|---|
| Full Bayesian | Calibrated uncertainty, principled | Expensive (MCMC/VI), prior choice matters |
| MAP | Cheap, equivalent to regularised MLE | Point estimate, no uncertainty |
| MLE | Simplest, asymptotically unbiased for large $n$ | Overconfident, no regularisation |
- Prior choice can dominate with small datasets; with large datasets the likelihood overwhelms the prior and results converge to the MLE.