Bayesian Inference
Definition
A statistical framework that treats model parameters as random variables and updates a prior belief distribution with observed data to produce a posterior distribution using Bayes’ theorem.
Intuition
Frequentist inference asks “what does this data say about a fixed unknown parameter?” Bayesian inference asks “after seeing this data, what should I believe about the parameter?” The prior encodes what is known before the data; the likelihood encodes what the data says; Bayes’ theorem combines them into the posterior.
Formal Description
Bayes’ theorem for inference:

$$p(\theta \mid D) = \frac{p(D \mid \theta)\, p(\theta)}{p(D)} = \frac{\text{likelihood} \times \text{prior}}{\text{evidence}}$$

The evidence $p(D) = \int p(D \mid \theta)\, p(\theta)\, d\theta$ normalises the posterior but is often intractable.
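On a discrete grid the evidence is just a sum, so the full update can be computed exactly. A minimal sketch for a coin-bias parameter with a uniform prior (the data and grid are illustrative, not from the text):

```python
import numpy as np

# Hypothetical example: Bayes' theorem on a grid for a coin-bias theta,
# with a uniform prior and a Bernoulli likelihood.
theta = np.linspace(0.001, 0.999, 999)     # grid over the parameter
prior = np.ones_like(theta) / theta.size   # uniform prior belief

data = [1, 1, 0, 1]  # observed flips (1 = heads)
likelihood = np.prod([theta**x * (1 - theta)**(1 - x) for x in data], axis=0)

unnormalised = likelihood * prior
posterior = unnormalised / unnormalised.sum()  # the sum plays the role of the evidence

print(round(theta[np.argmax(posterior)], 3))  # 0.75: under a flat prior the mode equals the MLE
```

In higher dimensions the grid grows exponentially, which is exactly why the approximate-inference methods below exist.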
Maximum Likelihood Estimation (MLE): finds the $\theta$ that maximises the likelihood, ignoring the prior:

$$\hat{\theta}_{\text{MLE}} = \arg\max_{\theta}\, p(D \mid \theta)$$
Maximum A Posteriori (MAP): finds the mode of the posterior:

$$\hat{\theta}_{\text{MAP}} = \arg\max_{\theta}\, p(D \mid \theta)\, p(\theta)$$
MAP with a Gaussian prior is equivalent to L2-regularised MLE (the prior acts as a regulariser).
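The equivalence can be checked numerically. A sketch under assumed settings (linear-Gaussian model, noise variance `sigma2`, prior variance `tau2`): the MAP weights coincide with ridge regression with $\lambda = \sigma^2/\tau^2$.

```python
import numpy as np

# Sketch (assumed setup): y = X w + noise with noise variance sigma2,
# and a Gaussian prior w ~ N(0, tau2 I).  The MAP estimate is the ridge
# solution with lambda = sigma2 / tau2.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
w_true = np.array([1.0, -2.0, 0.5])       # hypothetical ground-truth weights
y = X @ w_true + 0.1 * rng.normal(size=50)

sigma2, tau2 = 0.01, 1.0
lam = sigma2 / tau2

# Ridge / MAP closed form: (X^T X + lam I)^{-1} X^T y
w_map = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ y)
print(w_map)  # close to w_true, shrunk slightly toward the zero-mean prior
```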
Posterior predictive distribution:

$$p(x^{*} \mid D) = \int p(x^{*} \mid \theta)\, p(\theta \mid D)\, d\theta$$

This integrates over parameter uncertainty, which is more principled than plugging in a point estimate.
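When the integral has no closed form, it can be approximated by averaging the likelihood over posterior draws. A sketch for the Beta-Bernoulli case (numbers chosen for illustration), where the exact answer is known and can be compared against:

```python
import numpy as np

# Sketch: Monte Carlo posterior predictive for a Beta-Bernoulli model.
# After k heads in n flips with a Beta(1, 1) prior, the posterior is
# Beta(1 + k, 1 + n - k); averaging p(head | theta) over posterior draws
# approximates the exact predictive p(head | D) = (1 + k) / (2 + n).
rng = np.random.default_rng(0)
k, n = 7, 10
draws = rng.beta(1 + k, 1 + n - k, size=100_000)  # samples from the posterior
p_head = draws.mean()  # E[theta | D]: parameter uncertainty integrated out
print(round(p_head, 2))  # 0.67, matching (1 + k) / (2 + n) = 8/12
```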
Conjugate Priors
A prior is conjugate to a likelihood if the posterior is in the same distributional family as the prior. This makes updates analytical:
| Likelihood | Conjugate prior | Posterior |
|---|---|---|
| Bernoulli($\theta$) | Beta($\alpha, \beta$) | Beta($\alpha + \sum_i x_i,\ \beta + n - \sum_i x_i$) |
| Poisson($\lambda$) | Gamma($\alpha, \beta$) | Gamma($\alpha + \sum_i x_i,\ \beta + n$) |
| Gaussian (known $\sigma^2$) | Gaussian($\mu_0, \sigma_0^2$) | Gaussian |
| Categorical($\theta$) | Dirichlet($\alpha_1, \dots, \alpha_K$) | Dirichlet($\alpha_1 + n_1, \dots, \alpha_K + n_K$) |
Beta-Binomial example: after observing $k$ successes in $n$ trials with prior Beta($\alpha, \beta$):

$$p(\theta \mid D) = \text{Beta}(\alpha + k,\ \beta + n - k)$$

The posterior mean is $\frac{\alpha + k}{\alpha + \beta + n}$, which interpolates between the prior mean $\frac{\alpha}{\alpha + \beta}$ and the MLE $\frac{k}{n}$ as $n$ grows.
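The interpolation is easy to see numerically. A sketch with an illustrative Beta(2, 2) prior and a fixed 75% success rate at increasing sample sizes:

```python
# Sketch: Beta-Binomial update, showing the posterior mean drifting from
# the prior mean toward the MLE as n grows (hypothetical prior Beta(2, 2)).
alpha, beta = 2.0, 2.0          # prior pseudo-counts
prior_mean = alpha / (alpha + beta)   # 0.5

for n, k in [(4, 3), (40, 30), (4000, 3000)]:  # always 75% successes
    post_mean = (alpha + k) / (alpha + beta + n)
    print(n, round(post_mean, 3))  # 0.625, then 0.727, then 0.75: toward the MLE
```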
Approximate Inference
When the posterior is intractable:
- MCMC (Markov Chain Monte Carlo): sample from the posterior (Metropolis-Hastings, Hamiltonian MC via Stan/PyMC).
- Variational inference (VI): approximate the posterior with a tractable family $q(\theta)$ by minimising $\mathrm{KL}\big(q(\theta)\,\|\,p(\theta \mid D)\big)$.
- Laplace approximation: fit a Gaussian at the MAP estimate.
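MCMC needs only the unnormalised posterior, sidestepping the intractable evidence. A minimal random-walk Metropolis-Hastings sketch (not production code; Stan/PyMC handle this properly) targeting the Beta(8, 4) posterior from a hypothetical coin-flip dataset:

```python
import numpy as np

# Sketch: random-walk Metropolis-Hastings targeting the Beta(8, 4)
# posterior arising from 7 heads in 10 flips with a Beta(1, 1) prior.
rng = np.random.default_rng(0)

def log_post(theta):
    """Unnormalised log posterior; the evidence is never needed."""
    if not 0 < theta < 1:
        return -np.inf
    return 7 * np.log(theta) + 3 * np.log(1 - theta)

theta, samples = 0.5, []
for _ in range(20_000):
    prop = theta + 0.1 * rng.normal()  # symmetric Gaussian proposal
    # Accept with probability min(1, p(prop) / p(theta))
    if np.log(rng.uniform()) < log_post(prop) - log_post(theta):
        theta = prop
    samples.append(theta)

burned = samples[5_000:]  # discard burn-in
print(round(np.mean(burned), 2))  # posterior mean; analytic value is 2/3
```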
Applications
- Naive Bayes classifier: uses class-conditional independence to compute $p(y \mid x) \propto p(y) \prod_i p(x_i \mid y)$.
- Bayesian linear regression: places a Gaussian prior on weights; posterior is analytical and gives uncertainty estimates over predictions.
- Topic models (LDA): Dirichlet-Multinomial hierarchy.
- Bayesian hyperparameter optimisation (Gaussian processes): prior over functions, update with evaluated points.
- Online learning: sequential Bayesian updates as new data arrives.
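The Bayesian linear regression case can be sketched end to end, since the posterior is analytical. The setup below (known noise variance `sigma2`, zero-mean isotropic prior with variance `tau2`, synthetic data) is illustrative:

```python
import numpy as np

# Sketch of Bayesian linear regression (assumed known noise variance sigma2
# and prior w ~ N(0, tau2 I)).  The posterior over weights is Gaussian with
# covariance S and mean m; predictions come with uncertainty estimates.
rng = np.random.default_rng(1)
X = rng.normal(size=(30, 2))
w_true = np.array([1.5, -0.5])             # hypothetical ground truth
y = X @ w_true + 0.2 * rng.normal(size=30)

sigma2, tau2 = 0.04, 10.0
S = np.linalg.inv(X.T @ X / sigma2 + np.eye(2) / tau2)  # posterior covariance
m = S @ X.T @ y / sigma2                                # posterior mean

x_new = np.array([1.0, 1.0])
pred_mean = x_new @ m
pred_var = sigma2 + x_new @ S @ x_new  # noise variance + parameter uncertainty
print(pred_mean, pred_var)
```

Note that `pred_var` is always strictly larger than the noise variance alone: the extra term is the parameter uncertainty a point estimate would discard.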
Trade-offs
| Approach | Advantage | Limitation |
|---|---|---|
| Full Bayesian | Calibrated uncertainty, principled | Expensive (MCMC/VI), prior choice matters |
| MAP | Cheap, equivalent to regularised MLE | Point estimate, no uncertainty |
| MLE | Simplest, asymptotically unbiased for large $n$ | Overconfident, no regularisation |
- Prior choice can dominate with small datasets; with large datasets the likelihood overwhelms the prior and results converge to the MLE.