Probabilistic Models
Definition
Models that explicitly represent probability distributions over outputs (and sometimes latent variables), enabling calibrated uncertainty estimates and principled inference.
Intuition
Unlike discriminative models that directly learn $p(y \mid x)$, generative probabilistic models model the joint distribution $p(x, y)$ or the full data distribution $p(x)$. This enables not just prediction but density estimation, anomaly detection, and sampling.
Formal Description
Naive Bayes Classifier
Generative model: $p(x, y) = p(y) \prod_{j=1}^{d} p(x_j \mid y)$ — assumes features are conditionally independent given the class.
Prediction: apply Bayes' theorem:

$\hat{y} = \arg\max_{y} \; p(y) \prod_{j=1}^{d} p(x_j \mid y)$
Variants by feature type:
| Variant | $p(x_j \mid y)$ model | Use case |
|---|---|---|
| Gaussian NB | $\mathcal{N}(x_j;\, \mu_{jy}, \sigma_{jy}^2)$ | Continuous features |
| Multinomial NB | Multinomial with $\theta_{jy}$ | Text (word counts) |
| Bernoulli NB | Bernoulli with $p_{jy}$ | Binary features |
| Complement NB | Complement classes' statistics | Text (better for imbalanced classes) |
Laplace smoothing prevents zero probabilities: $\hat{\theta}_{jy} = \frac{N_{jy} + \alpha}{N_y + \alpha d}$, with $\alpha = 1$ for classic Laplace smoothing.
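A quick numeric check of the smoothed estimate, using hypothetical counts:

```python
# Hypothetical counts: word j never appears in class y (N_jy = 0),
# class y has N_y = 100 word occurrences, vocabulary size d = 50, alpha = 1
N_jy, N_y, d, alpha = 0, 100, 50, 1

theta = (N_jy + alpha) / (N_y + alpha * d)
print(theta)  # 1/150 ≈ 0.00667 instead of an impossible hard zero
```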
Strength: extremely fast, works well on high-dimensional sparse text; robust to irrelevant features. Limitation: conditional independence is almost always violated, producing poor probability calibration (needs Platt scaling or isotonic regression).
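A minimal sketch of multinomial NB on toy word counts with scikit-learn; the count matrix and labels below are made up for illustration:

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Toy word-count matrix: 4 documents x 3 vocabulary words
X = np.array([[3, 0, 1],
              [2, 0, 0],
              [0, 4, 2],
              [0, 3, 3]])
y = np.array([0, 0, 1, 1])  # class labels (e.g. ham vs spam)

# alpha=1.0 is Laplace smoothing, preventing zero probabilities
clf = MultinomialNB(alpha=1.0).fit(X, y)

print(clf.predict([[1, 0, 0]]))        # -> [0]
print(clf.predict_proba([[0, 2, 1]]))  # class probabilities (often poorly calibrated)
```

Note that `predict_proba` outputs tend toward 0 or 1 because the independence assumption overcounts evidence, which is exactly the calibration issue mentioned above.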
Gaussian Mixture Model (GMM)
Models data as a mixture of $K$ Gaussians:

$p(x) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k), \qquad \pi_k \ge 0, \; \sum_{k=1}^{K} \pi_k = 1$
EM Algorithm for GMM:
E-step: compute responsibilities (posterior probability of each component given a data point):

$\gamma_{ik} = \frac{\pi_k \, \mathcal{N}(x_i \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j \, \mathcal{N}(x_i \mid \mu_j, \Sigma_j)}$
M-step: update parameters using responsibilities as soft assignments:

$N_k = \sum_{i=1}^{n} \gamma_{ik}, \qquad \pi_k = \frac{N_k}{n}, \qquad \mu_k = \frac{1}{N_k} \sum_{i=1}^{n} \gamma_{ik} x_i, \qquad \Sigma_k = \frac{1}{N_k} \sum_{i=1}^{n} \gamma_{ik} (x_i - \mu_k)(x_i - \mu_k)^\top$
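The E- and M-steps above can be sketched directly in NumPy. This is a bare-bones full-covariance EM loop (the initialisation scheme and the small regularisation constant are assumptions, not part of the algorithm's definition):

```python
import numpy as np

def em_gmm(X, K, n_iter=50, seed=0):
    """Minimal EM for a full-covariance GMM; a sketch, not production code."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    # Init: random data points as means, shared data covariance, uniform weights
    mu = X[rng.choice(n, K, replace=False)]
    Sigma = np.array([np.cov(X.T) + 1e-6 * np.eye(d)] * K)
    pi = np.full(K, 1.0 / K)
    for _ in range(n_iter):
        # E-step: weighted component densities, normalised into responsibilities
        dens = np.empty((n, K))
        for k in range(K):
            diff = X - mu[k]
            inv = np.linalg.inv(Sigma[k])
            norm = 1.0 / np.sqrt(((2 * np.pi) ** d) * np.linalg.det(Sigma[k]))
            dens[:, k] = pi[k] * norm * np.exp(-0.5 * np.sum(diff @ inv * diff, axis=1))
        gamma = dens / dens.sum(axis=1, keepdims=True)
        # M-step: weighted updates using the soft assignments gamma
        Nk = gamma.sum(axis=0)
        pi = Nk / n
        mu = (gamma.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mu[k]
            Sigma[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k] + 1e-6 * np.eye(d)
    return pi, mu, Sigma, gamma
```

In practice `sklearn.mixture.GaussianMixture` adds the numerical safeguards (log-space densities, multiple restarts) that this sketch omits.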
Covariance types (`covariance_type` in sklearn):

| Type | Parameters | Constraint |
|---|---|---|
| `full` | $K$ full $d \times d$ matrices | Most flexible, most parameters |
| `tied` | 1 shared full matrix | Shared shape across components |
| `diag` | $K$ diagonal matrices | Axis-aligned ellipsoids |
| `spherical` | $K$ scalars | Spherical components |
Model selection: use BIC to choose $K$: $\mathrm{BIC} = -2 \log \hat{L} + p \log n$, where $\hat{L}$ is the maximised likelihood and $p$ the number of free parameters; the $p \log n$ term penalises model complexity.
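A sketch of BIC-based selection of $K$ with scikit-learn; the synthetic two-blob data below is an assumption chosen so the answer is unambiguous:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Two well-separated blobs, so the BIC minimum should land at K = 2
X = np.vstack([rng.normal(0, 1, (150, 2)), rng.normal(8, 1, (150, 2))])

# Fit GMMs for a range of K and keep the one with the lowest BIC
bics = {k: GaussianMixture(n_components=k, n_init=3, random_state=0).fit(X).bic(X)
        for k in range(1, 6)}
best_k = min(bics, key=bics.get)
print(best_k)
```

Note that sklearn's `bic()` follows the lower-is-better convention, so the minimum over candidates is chosen.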
Relation to k-means: k-means is a special case of GMM EM with spherical, equal-variance components and hard assignments ($\gamma_{ik} \in \{0, 1\}$ instead of soft responsibilities).
Bayesian Linear Regression
Places a Gaussian prior over the weights: $p(w) = \mathcal{N}(w \mid 0, \alpha^{-1} I)$.
With a Gaussian likelihood $p(y \mid X, w) = \mathcal{N}(y \mid Xw, \beta^{-1} I)$, the posterior is Gaussian:

$p(w \mid X, y) = \mathcal{N}(w \mid \mu_N, \Sigma_N), \qquad \Sigma_N = (\alpha I + \beta X^\top X)^{-1}, \qquad \mu_N = \beta \Sigma_N X^\top y$
The MAP estimate is identical to ridge regression with $\lambda = \alpha / \beta$. The full posterior gives predictive uncertainty that increases away from training data — richer than point estimates.
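The closed-form posterior can be sketched in a few lines of NumPy; the precisions `alpha` and `beta` are assumed known here rather than estimated:

```python
import numpy as np

def blr_posterior(X, y, alpha=1.0, beta=25.0):
    """Gaussian posterior over weights: prior N(0, alpha^-1 I), noise precision beta."""
    d = X.shape[1]
    Sigma_N = np.linalg.inv(alpha * np.eye(d) + beta * X.T @ X)
    mu_N = beta * Sigma_N @ X.T @ y
    return mu_N, Sigma_N

def blr_predict(x_star, mu_N, Sigma_N, beta=25.0):
    """Predictive mean and variance at a new input row x_star."""
    mean = x_star @ mu_N
    # Total variance = observation noise + uncertainty in the weights
    var = 1.0 / beta + x_star @ Sigma_N @ x_star
    return mean, var
```

Evaluating `blr_predict` at inputs far from the training data makes the `x_star @ Sigma_N @ x_star` term, and hence the predictive variance, grow, which is the uncertainty behaviour described above.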
Applications
- Naive Bayes: spam filtering, document classification, text categorisation
- GMM: density estimation, soft clustering, anomaly detection, speaker diarisation
- Bayesian linear regression: uncertainty-aware prediction, active learning
Trade-offs
- Naive Bayes: strong independence assumption → poor calibration; but fast, interpretable, good baseline.
- GMM: sensitive to initialisation; EM can converge to local optima; `n_init > 1` is recommended.
- Bayesian regression: analytical, but scales as $O(d^3)$ for the covariance inversion; use sparse priors for high-dimensional problems.