Bayesian Inference

Definition

A statistical framework that treats model parameters as random variables and updates a prior belief distribution with observed data to produce a posterior distribution using Bayes’ theorem.

Intuition

Frequentist inference asks “what does this data say about a fixed unknown parameter?” Bayesian inference asks “after seeing this data, what should I believe about the parameter?” The prior encodes what is known before the data; the likelihood encodes what the data says; Bayes’ theorem combines them into the posterior.

Formal Description

Bayes’ theorem for inference:

p(θ | D) = p(D | θ) p(θ) / p(D)

where p(θ) is the prior, p(D | θ) the likelihood, p(θ | D) the posterior, and p(D) = ∫ p(D | θ) p(θ) dθ the evidence. The evidence normalises the posterior but is often intractable.

Maximum Likelihood Estimation (MLE): finds the θ̂ that maximises the likelihood, ignoring the prior:

θ̂_MLE = argmax_θ p(D | θ)

Maximum A Posteriori (MAP): finds the mode of the posterior:

θ̂_MAP = argmax_θ p(D | θ) p(θ)

MAP with a Gaussian prior is equivalent to L2-regularised MLE (the prior acts as a regulariser).
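A minimal numerical sketch of that equivalence, assuming the simplest case of estimating a 1-D Gaussian mean μ with known noise variance σ² and a zero-mean Gaussian prior of variance τ² (all values below are made up for illustration):

```python
import math

# Estimating a Gaussian mean mu with known noise variance sigma2 and a
# zero-mean Gaussian prior N(0, tau2). The MAP estimate has a closed form,
# and it coincides with the ridge (L2-regularised) estimate whose penalty
# is lam = sigma2 / tau2.
data = [2.1, 1.9, 2.3, 2.0, 2.2]
sigma2, tau2 = 1.0, 0.5
n, s = len(data), sum(data)

# MAP: maximise log p(data | mu) + log p(mu)
mu_map = (s / sigma2) / (n / sigma2 + 1.0 / tau2)

# Ridge: minimise sum((x - mu)^2) + lam * mu^2 with lam = sigma2 / tau2
lam = sigma2 / tau2
mu_ridge = s / (n + lam)

assert math.isclose(mu_map, mu_ridge)  # same estimator, two derivations
```

The stronger the prior (smaller τ²), the larger the effective penalty λ = σ²/τ², pulling the estimate toward the prior mean.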

Posterior predictive distribution:

p(y* | D) = ∫ p(y* | θ) p(θ | D) dθ

This integrates over parameter uncertainty, which is more principled than plugging in a point estimate.

Conjugate Priors

A prior is conjugate to a likelihood if the posterior is in the same distributional family as the prior. This makes updates analytical:

Likelihood | Conjugate prior | Posterior
Bernoulli(θ) | Beta(α, β) | Beta(α + Σxᵢ, β + n − Σxᵢ)
Poisson(λ) | Gamma(α, β) | Gamma(α + Σxᵢ, β + n)
Gaussian (known σ²) | Gaussian(μ₀, σ₀²) | Gaussian
Categorical(π) | Dirichlet(α) | Dirichlet(α + counts)

Beta-Binomial example: after observing k successes in n trials with prior Beta(α, β):

posterior = Beta(α + k, β + n − k)

The posterior mean is (α + k)/(α + β + n), which interpolates between the prior mean α/(α + β) and the MLE k/n as n grows.
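The update is just arithmetic on the Beta parameters, as this small sketch shows (the prior and counts below are illustrative):

```python
from fractions import Fraction

# Beta-Binomial conjugate update: prior Beta(alpha, beta), observe k
# successes in n trials, posterior Beta(alpha + k, beta + n - k).
def beta_binomial_update(alpha, beta, k, n):
    return alpha + k, beta + (n - k)

alpha, beta = 2, 2           # prior mean 1/2
k, n = 7, 10                 # MLE would be 7/10
a_post, b_post = beta_binomial_update(alpha, beta, k, n)

# Posterior mean (alpha + k) / (alpha + beta + n) = 9/14, which sits
# between the prior mean 0.5 and the MLE 0.7.
post_mean = Fraction(a_post, a_post + b_post)
print(a_post, b_post, float(post_mean))
```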

Approximate Inference

When the posterior is intractable:

  • MCMC (Markov Chain Monte Carlo): sample from the posterior (Metropolis-Hastings, Hamiltonian MC via Stan/PyMC).
  • Variational inference (VI): approximate the posterior with a tractable family q(θ) by minimising KL(q(θ) ‖ p(θ | D)).
  • Laplace approximation: fit a Gaussian at the MAP estimate.
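A toy random-walk Metropolis-Hastings sampler, sketched in pure Python (not Stan/PyMC). The target is deliberately chosen as a Beta posterior from the coin example above, so the sampled mean can be checked against the analytic answer; proposal scale, chain length, and burn-in are arbitrary illustrative choices:

```python
import math
import random

# Target: posterior Beta(alpha + k, beta + n - k) of a coin bias theta
# after k successes in n trials, known analytically so we can sanity-check.
alpha, beta, k, n = 2.0, 2.0, 7, 10

def log_post(theta):
    # Unnormalised log posterior; MCMC only needs it up to a constant.
    if not 0.0 < theta < 1.0:
        return -math.inf
    return (alpha + k - 1) * math.log(theta) + (beta + n - k - 1) * math.log(1 - theta)

random.seed(0)
theta, samples = 0.5, []
for _ in range(20000):
    prop = theta + random.gauss(0.0, 0.1)      # symmetric random-walk proposal
    # Accept with probability min(1, p(prop) / p(theta))
    if math.log(random.random()) < log_post(prop) - log_post(theta):
        theta = prop
    samples.append(theta)

post_mean = sum(samples[2000:]) / len(samples[2000:])  # drop burn-in
# Should be close to the analytic mean (alpha + k) / (alpha + beta + n)
```

Because the proposal is symmetric, the Hastings correction cancels and the acceptance ratio reduces to a posterior ratio, which is why the intractable evidence never needs to be computed.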

Applications

  • Naive Bayes classifier: uses class-conditional independence to compute p(y | x) ∝ p(y) Π_j p(x_j | y).
  • Bayesian linear regression: places a Gaussian prior on weights; posterior is analytical and gives uncertainty estimates over predictions.
  • Topic models (LDA): Dirichlet-Multinomial hierarchy.
  • Bayesian hyperparameter optimisation (Gaussian processes): prior over functions, update with evaluated points.
  • Online learning: sequential Bayesian updates as new data arrives.
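The Bayesian linear regression bullet can be sketched in the simplest 1-D case, assuming a model y = w·x + noise with known noise variance σ² and prior w ~ N(0, τ²); the data and variances below are invented for illustration:

```python
# 1-D Bayesian linear regression with a conjugate Gaussian prior on the
# single weight w. Posterior over w and the posterior predictive at a new
# input x* are both Gaussian with closed forms.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [0.1, 0.9, 2.1, 2.9]
sigma2, tau2 = 0.25, 10.0   # noise variance, prior variance

# Conjugate update: posterior precision and mean for w
prec = 1.0 / tau2 + sum(x * x for x in xs) / sigma2
w_mean = (sum(x * y for x, y in zip(xs, ys)) / sigma2) / prec
w_var = 1.0 / prec

# Posterior predictive at x*: integrates out w analytically
x_star = 4.0
pred_mean = w_mean * x_star
pred_var = x_star ** 2 * w_var + sigma2   # parameter + noise uncertainty
```

Note the predictive variance has two terms: residual noise σ² plus a term from the remaining uncertainty in w, which grows with |x*| — exactly the uncertainty a plug-in point estimate would discard.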

Trade-offs

Approach | Advantage | Limitation
Full Bayesian | Calibrated uncertainty, principled | Expensive (MCMC/VI), prior choice matters
MAP | Cheap, equivalent to regularised MLE | Point estimate, no uncertainty
MLE | Simplest, unbiased in large n | Overconfident, no regularisation
  • Prior choice can dominate with small datasets; with large datasets the likelihood overwhelms the prior and results converge to MLE.
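A quick sketch of that bullet using the Beta-Binomial posterior mean: two quite different priors disagree noticeably on small data but converge toward the MLE as n grows (the 0.7 success frequency is an arbitrary example):

```python
# Posterior mean of a coin bias under prior Beta(alpha, beta) after
# k successes in n trials: (alpha + k) / (alpha + beta + n).
def post_mean(alpha, beta, k, n):
    return (alpha + k) / (alpha + beta + n)

for n in (10, 100, 10000):
    k = int(0.7 * n)                 # hold observed frequency at 0.7
    m_flat = post_mean(1, 1, k, n)   # flat prior
    m_strong = post_mean(20, 20, k, n)  # strong prior centred at 0.5
    print(n, round(m_flat, 3), round(m_strong, 3))  # both approach 0.7
```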