# Density Estimation

## Core Idea
Density estimation constructs an estimate of the underlying data-generating distribution from observed samples. It is used for anomaly detection (low-density points are anomalous), generative modelling, and as a component of Bayesian classifiers.
## Mathematical Formulation

### Kernel Density Estimation (KDE)
Non-parametric: places a kernel centred at each training point and sums:

$$\hat{p}(x) = \frac{1}{n h^{d}} \sum_{i=1}^{n} K\!\left(\frac{x - x_i}{h}\right)$$

where $h$ is the bandwidth and $d$ is the dimensionality.
Common kernels: Gaussian ($K(u) \propto e^{-u^2/2}$), Epanechnikov (optimal in the mean-squared-error sense), tophat.
Bandwidth selection: Silverman's rule of thumb: $h = 0.9 \min(\hat{\sigma}, \mathrm{IQR}/1.34)\, n^{-1/5}$ (Gaussian kernel, univariate). Cross-validated bandwidth selection is preferred in practice.
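As a sketch, both selection strategies can be computed side by side; the synthetic univariate sample and the search grid below are assumptions for illustration:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KernelDensity

rng = np.random.default_rng(0)
x = rng.normal(size=(200, 1))  # synthetic univariate sample (illustrative)

# Silverman's rule of thumb (Gaussian kernel, univariate)
n = len(x)
sigma = x.std(ddof=1)
iqr = np.subtract(*np.percentile(x, [75, 25]))  # 75th minus 25th percentile
h_silverman = 0.9 * min(sigma, iqr / 1.34) * n ** (-1 / 5)

# Cross-validated bandwidth: pick the value that maximises the
# held-out log-likelihood (KernelDensity.score)
grid = GridSearchCV(KernelDensity(kernel='gaussian'),
                    {'bandwidth': np.linspace(0.1, 1.0, 19)}, cv=5)
grid.fit(x)
h_cv = grid.best_params_['bandwidth']
```

For a standard-normal sample both values land well below 1; on real data the cross-validated choice adapts to structure the rule of thumb cannot see.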
```python
from sklearn.neighbors import KernelDensity
import numpy as np

kde = KernelDensity(bandwidth=0.5, kernel='gaussian')
kde.fit(X_train)
log_density = kde.score_samples(X_test)  # log p(x) for each test point
anomaly_mask = log_density < np.percentile(log_density, 5)  # bottom 5%
```

### Gaussian Mixture Model (GMM)
Parametric: models the data as a mixture of $K$ Gaussian components:

$$p(x) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k)$$

Parameters: mixing weights $\pi_k$ with $\sum_{k} \pi_k = 1$; means $\mu_k$; covariances $\Sigma_k$.
EM algorithm alternates between:
- E-step: compute posterior responsibilities $\gamma_{ik} = \pi_k \, \mathcal{N}(x_i \mid \mu_k, \Sigma_k) \big/ \sum_{j} \pi_j \, \mathcal{N}(x_i \mid \mu_j, \Sigma_j)$.
- M-step: update $\pi_k$, $\mu_k$, $\Sigma_k$ from the responsibility-weighted sufficient statistics.
EM converges to a local maximum of the log-likelihood $\sum_{i=1}^{n} \log \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x_i \mid \mu_k, \Sigma_k)$.
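The two steps above can be sketched in plain NumPy for the univariate case; this is a minimal illustration, not a production implementation, and the quantile-based initialisation is an arbitrary choice:

```python
import numpy as np

def em_gmm_1d(x, k, n_iter=200):
    """Minimal EM for a univariate Gaussian mixture (illustrative sketch)."""
    n = len(x)
    pi = np.full(k, 1.0 / k)                       # uniform mixing weights
    mu = np.quantile(x, (np.arange(k) + 0.5) / k)  # spread initial means over the data
    var = np.full(k, x.var())                      # start with the overall variance
    for _ in range(n_iter):
        # E-step: responsibilities gamma[i, j] proportional to pi_j * N(x_i | mu_j, var_j)
        gamma = pi * np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
        gamma /= gamma.sum(axis=1, keepdims=True)
        # M-step: re-estimate parameters from weighted sufficient statistics
        nk = gamma.sum(axis=0)
        pi = nk / n
        mu = (gamma * x[:, None]).sum(axis=0) / nk
        var = (gamma * (x[:, None] - mu) ** 2).sum(axis=0) / nk
    return pi, mu, var
```

On two well-separated clusters this recovers the component means; on harder data it inherits EM's sensitivity to initialisation.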
```python
from sklearn.mixture import GaussianMixture

gmm = GaussianMixture(n_components=3, covariance_type='full', random_state=42)
gmm.fit(X_train)
log_probs = gmm.score_samples(X_test)  # log p(x)
labels = gmm.predict(X_test)           # most likely component for each point
```

Covariance types:
| Type | Constraint | Notes |
|---|---|---|
| `full` | Each component has its own $\Sigma_k$ | Most flexible |
| `tied` | All components share one $\Sigma$ | Fewer parameters |
| `diag` | Diagonal $\Sigma_k$ (uncorrelated features) | Fast, scalable |
| `spherical` | $\Sigma_k = \sigma_k^2 I$ | Simplest |
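The trade-off shows up directly in the shape of the fitted covariance arrays; a quick sketch on synthetic data (the data and component count are arbitrary):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))  # 500 points in d=4 dimensions (illustrative)

# The shape of covariances_ reflects each type's parameter count
shapes = {}
for ct in ('full', 'tied', 'diag', 'spherical'):
    gmm = GaussianMixture(n_components=3, covariance_type=ct, random_state=0).fit(X)
    shapes[ct] = gmm.covariances_.shape
# full -> (3, 4, 4), tied -> (4, 4), diag -> (3, 4), spherical -> (3,)
```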
## Inductive Bias
- KDE: makes no parametric assumption about the shape of the density; the estimate is always non-negative; smoothness is controlled by the bandwidth.
- GMM: assumes the distribution is a finite mixture of Gaussians; each component contributes a single ellipsoidal mode.
## Training Objective

- KDE: no training objective; the density estimate is computed directly from the data (only the bandwidth $h$ must be chosen).
- GMM: maximise the log-likelihood via EM; select the number of components $K$ using BIC or AIC.
## Model Selection for GMM
```python
from sklearn.mixture import GaussianMixture

bic_scores = []
for k in range(1, 11):
    gmm = GaussianMixture(n_components=k, covariance_type='full', random_state=42)
    gmm.fit(X_train)
    bic_scores.append(gmm.bic(X_train))
optimal_k = 1 + bic_scores.index(min(bic_scores))  # k with the lowest BIC
```

## Strengths
- KDE makes no parametric assumption and estimates any shape.
- GMM provides a fully probabilistic model with interpretable components.
- Both can be used for anomaly detection (low $\log p(x)$ → anomalous).
## Weaknesses
- KDE scales poorly with dimensionality (curse of dimensionality: the sample size needed for a given accuracy grows exponentially with $d$).
- GMM requires specifying the number of components $K$ and assumes Gaussian components.
- EM can converge to poor local optima; use multiple random initialisations.
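scikit-learn exposes the multiple-restart remedy directly through the `n_init` parameter; the synthetic two-cluster data below is an assumption for the sketch:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-3, 1, (100, 2)), rng.normal(3, 1, (100, 2))])

# n_init=10 runs EM from ten random initialisations and keeps the
# run with the highest log-likelihood
gmm = GaussianMixture(n_components=2, n_init=10, random_state=0).fit(X)
```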
## Variants
- Variational Bayes GMM (`BayesianGaussianMixture`): places priors on the mixture weights and automatically shrinks unused components, effectively selecting $K$.
- Normalising Flows: learned invertible transformations that map a simple base distribution (e.g. a Gaussian) to a complex one, giving exact density evaluation.
- Kernel Mixture Models: KDE with learnable bandwidths per component.
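The shrinkage behaviour of the variational-Bayes variant can be sketched with synthetic two-cluster data; the 0.05 weight threshold below is an arbitrary cut-off for "active" components:

```python
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-4, 1, (200, 2)), rng.normal(4, 1, (200, 2))])  # 2 true clusters

# Ask for far more components than needed; the Dirichlet prior on the
# mixing weights drives the weights of unused components towards zero
bgmm = BayesianGaussianMixture(n_components=10, weight_concentration_prior=0.01,
                               max_iter=500, random_state=0).fit(X)
n_active = int((bgmm.weights_ > 0.05).sum())  # components carrying real weight
```

With a small `weight_concentration_prior` the fit should concentrate mass on roughly the two true components rather than all ten requested.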
## References
- Silverman, B.W. (1986). Density Estimation for Statistics and Data Analysis. Chapman & Hall.
- Bishop, C. (2006). Pattern Recognition and Machine Learning. §9 (EM and GMMs).