Probabilistic Models
Definition
Models that explicitly represent probability distributions over outputs (and sometimes latent variables), enabling calibrated uncertainty estimates and principled inference.
Intuition
Unlike discriminative models that directly learn $p(y \mid x)$, generative probabilistic models model the joint distribution $p(x, y)$ or the full data distribution $p(x)$. This enables not just prediction but density estimation, anomaly detection, and sampling.
Formal Description
Naive Bayes Classifier
Generative model: $p(x, y) = p(y) \prod_{j=1}^{d} p(x_j \mid y)$ — assumes features are conditionally independent given the class.
Prediction: apply Bayes' theorem:

$\hat{y} = \arg\max_{y} \; p(y) \prod_{j=1}^{d} p(x_j \mid y)$
Variants by feature type:
| Variant | $p(x_j \mid y)$ model | Use case |
|---|---|---|
| Gaussian NB | $\mathcal{N}(x_j;\, \mu_{jy}, \sigma_{jy}^2)$ | Continuous features |
| Multinomial NB | Multinomial with $\theta_{jy}$ | Text (word counts) |
| Bernoulli NB | Bernoulli with $p_{jy}$ | Binary features |
| Complement NB | Complement classes' statistics | Text (better for imbalanced classes) |
Laplace smoothing prevents zero probabilities: $\hat{\theta}_{jy} = \frac{N_{jy} + \alpha}{N_y + \alpha d}$, with $\alpha = 1$ for classic Laplace smoothing.
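A quick numeric check of the smoothed estimate, using hypothetical counts:

```python
# Hypothetical counts: word j never appears in class y (N_jy = 0),
# class y has N_y = 100 word occurrences, vocabulary size d = 50, alpha = 1
N_jy, N_y, d, alpha = 0, 100, 50, 1

theta = (N_jy + alpha) / (N_y + alpha * d)
print(theta)  # 1/150 ≈ 0.00667 instead of an impossible hard zero
```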
Strength: extremely fast, works well on high-dimensional sparse text; robust to irrelevant features. Limitation: conditional independence is almost always violated, producing poor probability calibration (needs Platt scaling or isotonic regression).
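A minimal sketch of multinomial NB on toy word counts with scikit-learn; the count matrix and labels below are made up for illustration:

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Toy word-count matrix: 4 documents x 3 vocabulary words
X = np.array([[3, 0, 1],
              [2, 0, 0],
              [0, 4, 2],
              [0, 3, 3]])
y = np.array([0, 0, 1, 1])  # class labels (e.g. ham vs spam)

# alpha=1.0 is Laplace smoothing, preventing zero probabilities
clf = MultinomialNB(alpha=1.0).fit(X, y)

print(clf.predict([[1, 0, 0]]))        # -> [0]
print(clf.predict_proba([[0, 2, 1]]))  # class probabilities (often poorly calibrated)
```

Note that `predict_proba` outputs tend toward 0 or 1 because the independence assumption overcounts evidence, which is exactly the calibration issue mentioned above.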
Gaussian Mixture Model (GMM)
Models data as a mixture of $K$ Gaussians:

$p(x) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k), \qquad \pi_k \ge 0, \; \sum_{k=1}^{K} \pi_k = 1$
EM Algorithm for GMM:
E-step: compute responsibilities (posterior probability of each component given a data point):

$\gamma_{ik} = \frac{\pi_k \, \mathcal{N}(x_i \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \, \mathcal{N}(x_i \mid \mu_j, \Sigma_j)}$
M-step: update parameters using responsibilities as soft assignments:

$N_k = \sum_{i=1}^{n} \gamma_{ik}, \qquad \pi_k = \frac{N_k}{n}, \qquad \mu_k = \frac{1}{N_k} \sum_{i=1}^{n} \gamma_{ik} x_i, \qquad \Sigma_k = \frac{1}{N_k} \sum_{i=1}^{n} \gamma_{ik} (x_i - \mu_k)(x_i - \mu_k)^\top$
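The E- and M-steps above can be sketched directly in NumPy. This is a bare-bones full-covariance EM loop (the initialisation scheme and the small regularisation constant are assumptions, not part of the algorithm's definition):

```python
import numpy as np

def em_gmm(X, K, n_iter=50, seed=0):
    """Minimal EM for a full-covariance GMM; a sketch, not production code."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    # Init: random data points as means, shared data covariance, uniform weights
    mu = X[rng.choice(n, K, replace=False)]
    Sigma = np.array([np.cov(X.T) + 1e-6 * np.eye(d)] * K)
    pi = np.full(K, 1.0 / K)
    for _ in range(n_iter):
        # E-step: weighted component densities, normalised into responsibilities
        dens = np.empty((n, K))
        for k in range(K):
            diff = X - mu[k]
            inv = np.linalg.inv(Sigma[k])
            norm = 1.0 / np.sqrt(((2 * np.pi) ** d) * np.linalg.det(Sigma[k]))
            dens[:, k] = pi[k] * norm * np.exp(-0.5 * np.sum(diff @ inv * diff, axis=1))
        gamma = dens / dens.sum(axis=1, keepdims=True)
        # M-step: weighted updates using the soft assignments gamma
        Nk = gamma.sum(axis=0)
        pi = Nk / n
        mu = (gamma.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mu[k]
            Sigma[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k] + 1e-6 * np.eye(d)
    return pi, mu, Sigma, gamma
```

In practice `sklearn.mixture.GaussianMixture` adds the numerical safeguards (log-space densities, multiple restarts) that this sketch omits.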
Covariance types (`covariance_type` in sklearn):

| Type | Parameters | Constraint |
|---|---|---|
| `full` | $K$ full $d \times d$ matrices | Most flexible, most parameters |
| `tied` | 1 shared full matrix | Shared shape across components |
| `diag` | $K$ diagonal matrices | Axis-aligned ellipsoids |
| `spherical` | $K$ scalars | Spherical components |
Model selection: use BIC to choose $K$: $\mathrm{BIC} = -2 \log \hat{L} + p \log n$, where $\hat{L}$ is the maximised likelihood and $p$ the number of free parameters; the $p \log n$ term penalises model complexity.
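A sketch of BIC-based selection of $K$ with scikit-learn; the synthetic two-blob data below is an assumption chosen so the answer is unambiguous:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Two well-separated blobs, so the BIC minimum should land at K = 2
X = np.vstack([rng.normal(0, 1, (150, 2)), rng.normal(8, 1, (150, 2))])

# Fit GMMs for a range of K and keep the one with the lowest BIC
bics = {k: GaussianMixture(n_components=k, n_init=3, random_state=0).fit(X).bic(X)
        for k in range(1, 6)}
best_k = min(bics, key=bics.get)
print(best_k)
```

Note that sklearn's `bic()` follows the lower-is-better convention, so the minimum over candidates is chosen.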
Relation to k-means: k-means is a special case of GMM EM with spherical, equal-variance components and hard assignments ($\gamma_{ik} \in \{0, 1\}$ instead of soft responsibilities).
Bayesian Linear Regression
Places a Gaussian prior over the weights: $p(w) = \mathcal{N}(w \mid 0, \alpha^{-1} I)$.
With a Gaussian likelihood $p(y \mid X, w) = \mathcal{N}(y \mid Xw, \beta^{-1} I)$, the posterior is Gaussian:

$p(w \mid X, y) = \mathcal{N}(w \mid \mu_N, \Sigma_N), \qquad \Sigma_N = (\alpha I + \beta X^\top X)^{-1}, \qquad \mu_N = \beta \Sigma_N X^\top y$
The MAP estimate is identical to ridge regression with $\lambda = \alpha / \beta$. The full posterior gives predictive uncertainty that increases away from training data — richer than point estimates.
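The closed-form posterior can be sketched in a few lines of NumPy; the precisions `alpha` and `beta` are assumed known here rather than estimated:

```python
import numpy as np

def blr_posterior(X, y, alpha=1.0, beta=25.0):
    """Gaussian posterior over weights: prior N(0, alpha^-1 I), noise precision beta."""
    d = X.shape[1]
    Sigma_N = np.linalg.inv(alpha * np.eye(d) + beta * X.T @ X)
    mu_N = beta * Sigma_N @ X.T @ y
    return mu_N, Sigma_N

def blr_predict(x_star, mu_N, Sigma_N, beta=25.0):
    """Predictive mean and variance at a new input row x_star."""
    mean = x_star @ mu_N
    # Total variance = observation noise + uncertainty in the weights
    var = 1.0 / beta + x_star @ Sigma_N @ x_star
    return mean, var
```

Evaluating `blr_predict` at inputs far from the training data makes the `x_star @ Sigma_N @ x_star` term, and hence the predictive variance, grow, which is the uncertainty behaviour described above.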
Applications
- Naive Bayes: spam filtering, document classification, text categorisation
- GMM: density estimation, soft clustering, anomaly detection, speaker diarisation
- Bayesian linear regression: uncertainty-aware prediction, active learning
Trade-offs
- Naive Bayes: strong independence assumption → poor calibration; but fast, interpretable, good baseline.
- GMM: sensitive to initialisation; EM can converge to local optima; `n_init > 1` is recommended.
- Bayesian regression: analytical, but scales as $O(d^3)$ for the covariance inversion; use sparse priors for high-dimensional problems.