# Density Estimation

## Core Idea
Density estimation constructs an estimate of the underlying data-generating distribution from observed samples. It is used for anomaly detection (low-density points are anomalous), generative modelling, and as a component of Bayesian classifiers.
## Mathematical Formulation

### Kernel Density Estimation (KDE)
Non-parametric: places a kernel centred at each training point and sums:

$$\hat{p}(x) = \frac{1}{n h^{d}} \sum_{i=1}^{n} K\!\left(\frac{x - x_i}{h}\right)$$

where $h$ is the bandwidth and $d$ is the dimensionality.
Common kernels: Gaussian ($K(u) \propto e^{-u^2/2}$), Epanechnikov (optimal in the mean-squared-error sense), tophat.
Bandwidth selection: Silverman's rule of thumb: $h = 0.9 \min(\hat{\sigma}, \mathrm{IQR}/1.34)\, n^{-1/5}$ (Gaussian kernel, univariate). Cross-validated bandwidth selection is preferred in practice.
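As a sketch, both selection strategies can be computed side by side; the synthetic univariate sample and the search grid below are assumptions for illustration:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KernelDensity

rng = np.random.default_rng(0)
x = rng.normal(size=(200, 1))  # synthetic univariate sample (illustrative)

# Silverman's rule of thumb (Gaussian kernel, univariate)
n = len(x)
sigma = x.std(ddof=1)
iqr = np.subtract(*np.percentile(x, [75, 25]))  # 75th minus 25th percentile
h_silverman = 0.9 * min(sigma, iqr / 1.34) * n ** (-1 / 5)

# Cross-validated bandwidth: pick the value that maximises the
# held-out log-likelihood (KernelDensity.score)
grid = GridSearchCV(KernelDensity(kernel='gaussian'),
                    {'bandwidth': np.linspace(0.1, 1.0, 19)}, cv=5)
grid.fit(x)
h_cv = grid.best_params_['bandwidth']
```

For a standard-normal sample both values land well below 1; on real data the cross-validated choice adapts to structure the rule of thumb cannot see.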
```python
from sklearn.neighbors import KernelDensity
import numpy as np

kde = KernelDensity(bandwidth=0.5, kernel='gaussian')
kde.fit(X_train)
log_density = kde.score_samples(X_test)  # log p(x) for each test point
anomaly_mask = log_density < np.percentile(log_density, 5)  # bottom 5%
```

### Gaussian Mixture Model (GMM)
Parametric: models the data as a mixture of $K$ Gaussian components:

$$p(x) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k)$$

Parameters: mixing weights $\pi_k$ with $\sum_{k} \pi_k = 1$; means $\mu_k$; covariances $\Sigma_k$.
EM algorithm alternates between:
- E-step: compute posterior responsibilities $\gamma_{ik} = \pi_k \, \mathcal{N}(x_i \mid \mu_k, \Sigma_k) \big/ \sum_{j} \pi_j \, \mathcal{N}(x_i \mid \mu_j, \Sigma_j)$.
- M-step: update $\pi_k$, $\mu_k$, $\Sigma_k$ from the responsibility-weighted sufficient statistics.
EM converges to a local maximum of the log-likelihood $\sum_{i=1}^{n} \log \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x_i \mid \mu_k, \Sigma_k)$.
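The two steps above can be sketched in plain NumPy for the univariate case; this is a minimal illustration, not a production implementation, and the quantile-based initialisation is an arbitrary choice:

```python
import numpy as np

def em_gmm_1d(x, k, n_iter=200):
    """Minimal EM for a univariate Gaussian mixture (illustrative sketch)."""
    n = len(x)
    pi = np.full(k, 1.0 / k)                       # uniform mixing weights
    mu = np.quantile(x, (np.arange(k) + 0.5) / k)  # spread initial means over the data
    var = np.full(k, x.var())                      # start with the overall variance
    for _ in range(n_iter):
        # E-step: responsibilities gamma[i, j] proportional to pi_j * N(x_i | mu_j, var_j)
        gamma = pi * np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
        gamma /= gamma.sum(axis=1, keepdims=True)
        # M-step: re-estimate parameters from weighted sufficient statistics
        nk = gamma.sum(axis=0)
        pi = nk / n
        mu = (gamma * x[:, None]).sum(axis=0) / nk
        var = (gamma * (x[:, None] - mu) ** 2).sum(axis=0) / nk
    return pi, mu, var
```

On two well-separated clusters this recovers the component means; on harder data it inherits EM's sensitivity to initialisation.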
```python
from sklearn.mixture import GaussianMixture

gmm = GaussianMixture(n_components=3, covariance_type='full', random_state=42)
gmm.fit(X_train)
log_probs = gmm.score_samples(X_test)  # log p(x)
labels = gmm.predict(X_test)           # most likely component for each point
```

Covariance types:
| Type | Constraint | Notes |
|---|---|---|
| `full` | Each component has its own $\Sigma_k$ | Most flexible |
| `tied` | All components share one $\Sigma$ | Fewer parameters |
| `diag` | Diagonal $\Sigma_k$ (uncorrelated features) | Fast, scalable |
| `spherical` | $\Sigma_k = \sigma_k^2 I$ | Simplest |
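The trade-off shows up directly in the shape of the fitted covariance arrays; a quick sketch on synthetic data (the data and component count are arbitrary):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))  # 500 points in d=4 dimensions (illustrative)

# The shape of covariances_ reflects each type's parameter count
shapes = {}
for ct in ('full', 'tied', 'diag', 'spherical'):
    gmm = GaussianMixture(n_components=3, covariance_type=ct, random_state=0).fit(X)
    shapes[ct] = gmm.covariances_.shape
# full -> (3, 4, 4), tied -> (4, 4), diag -> (3, 4), spherical -> (3,)
```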
## Inductive Bias
- KDE: makes no parametric assumption about the shape of the density; the estimate is always non-negative; smoothness is controlled by the bandwidth.
- GMM: assumes the distribution is a finite mixture of Gaussians; each component contributes a single ellipsoidal mode.
## Training Objective

- KDE: no training objective; the density estimate is computed directly from the data (only the bandwidth $h$ must be chosen).
- GMM: maximise the log-likelihood via EM; select the number of components $K$ using BIC or AIC.
## Model Selection for GMM
```python
from sklearn.mixture import GaussianMixture

bic_scores = []
for k in range(1, 11):
    gmm = GaussianMixture(n_components=k, covariance_type='full', random_state=42)
    gmm.fit(X_train)
    bic_scores.append(gmm.bic(X_train))
optimal_k = 1 + bic_scores.index(min(bic_scores))  # k with the lowest BIC
```

## Strengths
- KDE makes no parametric assumption and estimates any shape.
- GMM provides a fully probabilistic model with interpretable components.
- Both can be used for anomaly detection (low $\log p(x)$ → anomalous).
## Weaknesses
- KDE scales poorly with dimensionality (curse of dimensionality: the sample size needed for a given accuracy grows exponentially with $d$).
- GMM requires specifying the number of components $K$ and assumes Gaussian components.
- EM can converge to poor local optima; use multiple random initialisations.
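scikit-learn exposes the multiple-restart remedy directly through the `n_init` parameter; the synthetic two-cluster data below is an assumption for the sketch:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-3, 1, (100, 2)), rng.normal(3, 1, (100, 2))])

# n_init=10 runs EM from ten random initialisations and keeps the
# run with the highest log-likelihood
gmm = GaussianMixture(n_components=2, n_init=10, random_state=0).fit(X)
```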
## Variants
- Variational Bayes GMM (`BayesianGaussianMixture`): places priors on the mixture weights and automatically shrinks unused components, effectively selecting $K$.
- Normalising Flows: learned invertible transformations that map a simple base distribution (e.g. a Gaussian) to a complex one, giving exact density evaluation.
- Kernel Mixture Models: KDE with learnable bandwidths per component.
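The shrinkage behaviour of the variational-Bayes variant can be sketched with synthetic two-cluster data; the 0.05 weight threshold below is an arbitrary cut-off for "active" components:

```python
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-4, 1, (200, 2)), rng.normal(4, 1, (200, 2))])  # 2 true clusters

# Ask for far more components than needed; the Dirichlet prior on the
# mixing weights drives the weights of unused components towards zero
bgmm = BayesianGaussianMixture(n_components=10, weight_concentration_prior=0.01,
                               max_iter=500, random_state=0).fit(X)
n_active = int((bgmm.weights_ > 0.05).sum())  # components carrying real weight
```

With a small `weight_concentration_prior` the fit should concentrate mass on roughly the two true components rather than all ten requested.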
## References
- Silverman, B.W. (1986). Density Estimation for Statistics and Data Analysis. Chapman & Hall.
- Bishop, C. (2006). Pattern Recognition and Machine Learning. §9 (EM and GMMs).