Probabilistic Models

Definition

Models that explicitly represent probability distributions over outputs (and sometimes latent variables), enabling calibrated uncertainty estimates and principled inference.

Intuition

Unlike discriminative models that directly learn $p(y \mid x)$, generative probabilistic models represent the joint distribution $p(x, y)$ or the full data distribution $p(x)$. This enables not just prediction but also density estimation, anomaly detection, and sampling.

Formal Description

Naive Bayes Classifier

Generative model: $p(x, y) = p(y) \prod_{j=1}^{d} p(x_j \mid y)$ — assumes features are conditionally independent given the class.

Prediction: apply Bayes' theorem and take the most probable class:

$$\hat{y} = \arg\max_{y} \; p(y) \prod_{j=1}^{d} p(x_j \mid y)$$

Variants by feature type:

| Variant | $p(x_j \mid y)$ model | Use case |
|---|---|---|
| Gaussian NB | $\mathcal{N}(\mu_{jy}, \sigma^2_{jy})$ | Continuous features |
| Multinomial NB | Multinomial with $\theta_{jy}$ | Text (word counts) |
| Bernoulli NB | Bernoulli with $\theta_{jy}$ | Binary features |
| Complement NB | Statistics of complement classes | Text (better for imbalanced classes) |

Laplace smoothing prevents zero probabilities for features unseen in a class: $\hat{\theta}_{jy} = \frac{N_{jy} + \alpha}{N_y + \alpha d}$, where $N_{jy}$ counts feature $j$ in class $y$, $N_y = \sum_j N_{jy}$, $d$ is the number of features, and $\alpha = 1$ gives Laplace (add-one) smoothing.

Strength: extremely fast, works well on high-dimensional sparse text; robust to irrelevant features. Limitation: conditional independence is almost always violated, producing poor probability calibration (needs Platt scaling or isotonic regression).
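As an illustration, a minimal scikit-learn sketch of multinomial naive Bayes; the toy word counts and the spam/ham labels are invented, and `alpha` is the smoothing parameter described above:

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Toy word-count matrix: rows = documents, columns = vocabulary terms.
X = np.array([
    [3, 0, 1, 0],   # spam-like: heavy on the first terms
    [2, 1, 0, 0],
    [0, 2, 0, 3],   # ham-like: heavy on the last terms
    [0, 1, 1, 2],
])
y = np.array([1, 1, 0, 0])  # 1 = spam, 0 = ham

# alpha=1.0 is Laplace (add-one) smoothing: it keeps words unseen in a
# class during training from zeroing out the whole product of likelihoods.
clf = MultinomialNB(alpha=1.0)
clf.fit(X, y)

new_doc = np.array([[2, 0, 1, 0]])
print(clf.predict(new_doc))        # predicted class label
print(clf.predict_proba(new_doc))  # probabilities (often poorly calibrated)
```

Note that `predict_proba` outputs should be treated as scores rather than calibrated probabilities, per the limitation above.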

Gaussian Mixture Model (GMM)

Models data as a weighted mixture of $K$ Gaussian components:

$$p(x) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k), \qquad \sum_{k=1}^{K} \pi_k = 1, \quad \pi_k \ge 0$$

EM Algorithm for GMM:

E-step: compute responsibilities (posterior probability of each component given a data point):

$$\gamma_{ik} = \frac{\pi_k \, \mathcal{N}(x_i \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \, \mathcal{N}(x_i \mid \mu_j, \Sigma_j)}$$

M-step: update parameters using the responsibilities as soft assignments:

$$N_k = \sum_{i=1}^{n} \gamma_{ik}, \qquad \pi_k = \frac{N_k}{n}, \qquad \mu_k = \frac{1}{N_k} \sum_{i=1}^{n} \gamma_{ik}\, x_i, \qquad \Sigma_k = \frac{1}{N_k} \sum_{i=1}^{n} \gamma_{ik}\, (x_i - \mu_k)(x_i - \mu_k)^{\top}$$
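The two EM steps can be sketched directly in NumPy/SciPy; the toy two-blob data, the initialisation, and the fixed iteration count are illustrative choices, not a production implementation:

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)
# Toy data: two well-separated 2D blobs, 50 points each.
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])
n, K = len(X), 2

# Initialise: uniform weights, identity covariances, means at two data points.
pi = np.full(K, 1 / K)
mu = X[[0, -1]].astype(float)
Sigma = np.array([np.eye(2) for _ in range(K)])

for _ in range(50):
    # E-step: responsibilities gamma[i, k] = posterior of component k for x_i.
    dens = np.column_stack([
        pi[k] * multivariate_normal.pdf(X, mu[k], Sigma[k]) for k in range(K)
    ])
    gamma = dens / dens.sum(axis=1, keepdims=True)

    # M-step: re-estimate weights, means, covariances from soft assignments.
    Nk = gamma.sum(axis=0)
    pi = Nk / n
    mu = (gamma.T @ X) / Nk[:, None]
    for k in range(K):
        d = X - mu[k]
        # Small diagonal jitter keeps the covariances positive definite.
        Sigma[k] = (gamma[:, k, None] * d).T @ d / Nk[k] + 1e-6 * np.eye(2)

print(mu)  # component means should land near the two blob centres
```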

Covariance types (`covariance_type` in sklearn):

| Type | Parameters | Constraint |
|---|---|---|
| `full` | $K$ full matrices | Most flexible, most parameters |
| `tied` | 1 shared matrix | Shared shape across components |
| `diag` | $K$ diagonal matrices | Axis-aligned ellipsoids |
| `spherical` | $K$ scalars | Spherical components |

Model selection: use the Bayesian information criterion (BIC) to choose $K$: $\mathrm{BIC} = -2 \ln \hat{L} + p \ln n$, where $\hat{L}$ is the maximised likelihood and $p$ the number of free parameters; the $p \ln n$ term penalises model complexity.
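A minimal sketch of BIC-based selection of the number of components with scikit-learn's `GaussianMixture` (the two-blob toy data is invented):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Toy data: two well-separated 2D blobs, 100 points each.
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(6, 1, (100, 2))])

# Fit GMMs with K = 1..5 components and keep each model's BIC;
# n_init=3 restarts EM to reduce the risk of a poor local optimum.
bics = []
for k in range(1, 6):
    gmm = GaussianMixture(n_components=k, n_init=3, random_state=0).fit(X)
    bics.append(gmm.bic(X))

# Lowest BIC wins: the fit term and the complexity penalty trade off.
best_k = int(np.argmin(bics)) + 1
print(best_k)
```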

Relation to k-means: k-means is the special case of GMM EM with spherical, equal-variance components and hard assignments ($\gamma_{ik} \in \{0, 1\}$ instead of soft responsibilities).

Bayesian Linear Regression

Places a Gaussian prior over the weights: $p(w) = \mathcal{N}(w \mid 0, \alpha^{-1} I)$, with prior precision $\alpha$.

With a Gaussian likelihood $p(y \mid X, w) = \mathcal{N}(y \mid Xw, \beta^{-1} I)$, the posterior is Gaussian:

$$p(w \mid X, y) = \mathcal{N}(w \mid m, S), \qquad S = \left(\alpha I + \beta X^{\top} X\right)^{-1}, \qquad m = \beta\, S X^{\top} y$$

The MAP estimate is identical to ridge regression with regularisation strength $\lambda = \alpha / \beta$. The full posterior gives predictive uncertainty that increases away from the training data, which is richer than a point estimate.
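A closed-form NumPy sketch of the posterior above, assuming the precisions $\alpha$ and $\beta$ are known and using invented synthetic data:

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic data: y = 2x + small noise (illustrative values only).
X = rng.uniform(-1, 1, (30, 1))
y = 2.0 * X[:, 0] + rng.normal(0, 0.1, 30)

alpha, beta = 1.0, 100.0                 # prior precision, noise precision
Phi = np.hstack([np.ones((30, 1)), X])   # design matrix with a bias column

# Posterior N(m, S): S = (alpha*I + beta*Phi^T Phi)^-1, m = beta * S Phi^T y.
# The MAP m coincides with ridge regression at lambda = alpha / beta.
S = np.linalg.inv(alpha * np.eye(2) + beta * Phi.T @ Phi)
m = beta * S @ Phi.T @ y

def pred_var(x):
    # Predictive variance = noise variance + weight-uncertainty term;
    # the second term grows as x moves away from the training inputs.
    phi = np.array([1.0, x])
    return 1 / beta + phi @ S @ phi

print(m)                              # posterior mean weights, near [0, 2]
print(pred_var(0.0), pred_var(5.0))   # larger variance far from the data
```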

Applications

  • Naive Bayes: spam filtering, document classification, text categorisation
  • GMM: density estimation, soft clustering, anomaly detection, speaker diarisation
  • Bayesian linear regression: uncertainty-aware prediction, active learning

Trade-offs

  • Naive Bayes: strong independence assumption → poor calibration; but fast, interpretable, good baseline.
  • GMM: sensitive to initialisation; EM can converge to local optima; n_init > 1 is recommended.
  • Bayesian regression: analytical, but the posterior covariance inversion scales as $O(d^3)$ in the number of features $d$; use sparse priors for high-dimensional problems.