Density Estimation

Core Idea

Density estimation constructs an estimate of the underlying data-generating distribution from observed samples. It is used for anomaly detection (low-density points are anomalous), generative modelling, and as a component of Bayesian classifiers.

Mathematical Formulation

Kernel Density Estimation (KDE)

Non-parametric: places a kernel centred at each training point and sums:

$$\hat{f}_h(x) = \frac{1}{n h^d} \sum_{i=1}^{n} K\!\left(\frac{x - x_i}{h}\right)$$

where $h$ is the bandwidth and $d$ is the dimensionality.

Common kernels: Gaussian ($K(u) \propto e^{-\|u\|^2/2}$), Epanechnikov (optimal in the MSE sense), tophat.

Bandwidth selection: Silverman’s rule of thumb $h = 0.9\,\min(\hat{\sigma},\, \mathrm{IQR}/1.34)\, n^{-1/5}$ (Gaussian kernel, univariate). Cross-validated bandwidth selection is preferred in practice.

from sklearn.neighbors import KernelDensity
import numpy as np
 
kde = KernelDensity(bandwidth=0.5, kernel='gaussian')
kde.fit(X_train)
 
log_density = kde.score_samples(X_test)  # log p(x) for each test point
anomaly_mask = log_density < np.percentile(log_density, 5)  # bottom 5%
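Cross-validated bandwidth selection can be sketched with a grid search over the KDE's out-of-fold log-likelihood (`KernelDensity.score`). The synthetic `X_train` and the bandwidth grid are illustrative assumptions, not values from the text.

```python
# Sketch: cross-validated bandwidth selection for KDE.
# GridSearchCV maximises the mean held-out log-likelihood over
# a log-spaced bandwidth grid (grid bounds are illustrative).
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KernelDensity

rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 2))  # synthetic stand-in for real data

grid = GridSearchCV(
    KernelDensity(kernel='gaussian'),
    {'bandwidth': np.logspace(-1, 1, 20)},
    cv=5,
)
grid.fit(X_train)

best_bandwidth = grid.best_params_['bandwidth']
kde = grid.best_estimator_  # already refit on the full training set
```

The selected `kde` can then be used exactly as in the snippet above (`score_samples`, percentile thresholding).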

Gaussian Mixture Model (GMM)

Parametric: models the data as a mixture of $K$ Gaussian components:

$$p(x) = \sum_{k=1}^{K} \pi_k\, \mathcal{N}(x \mid \mu_k, \Sigma_k)$$

Parameters: mixing weights $\pi_k$, with $\sum_{k=1}^{K} \pi_k = 1$; means $\mu_k$; covariances $\Sigma_k$.

EM algorithm alternates between:

  • E-step: compute posterior responsibilities $\gamma_{ik} = \pi_k\, \mathcal{N}(x_i \mid \mu_k, \Sigma_k) \,/\, \sum_{j} \pi_j\, \mathcal{N}(x_i \mid \mu_j, \Sigma_j)$.
  • M-step: update $\pi_k$, $\mu_k$, $\Sigma_k$ as responsibility-weighted sufficient statistics.

EM converges to a local maximum of the log-likelihood $\sum_{i=1}^{n} \log p(x_i)$.
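The E/M alternation above can be made concrete with a minimal NumPy sketch for a 1-D, two-component GMM. The toy data, initial values, and iteration count are assumptions for illustration; in practice use sklearn's `GaussianMixture` as below.

```python
# Sketch: EM for a 1-D, two-component GMM in plain NumPy.
import numpy as np

rng = np.random.default_rng(42)
# Toy data: two well-separated Gaussian clusters (illustrative).
x = np.concatenate([rng.normal(-2, 1, 300), rng.normal(3, 1, 300)])

# Initial weights, means, variances (arbitrary starting point).
pi = np.array([0.5, 0.5])
mu = np.array([-1.0, 1.0])
var = np.array([1.0, 1.0])

def normal_pdf(x, mu, var):
    return np.exp(-0.5 * (x - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)

for _ in range(50):
    # E-step: posterior responsibility of each component for each point.
    dens = pi * normal_pdf(x[:, None], mu, var)   # shape (n, 2)
    resp = dens / dens.sum(axis=1, keepdims=True)
    # M-step: re-estimate parameters from weighted sufficient statistics.
    nk = resp.sum(axis=0)
    pi = nk / len(x)
    mu = (resp * x[:, None]).sum(axis=0) / nk
    var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk

# Log-likelihood at the final E-step; non-decreasing across EM iterations.
log_lik = np.log(dens.sum(axis=1)).sum()
```

With this separation the estimated means land near the true cluster centres (-2 and 3), and the weights near 0.5 each.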

from sklearn.mixture import GaussianMixture
 
gmm = GaussianMixture(n_components=3, covariance_type='full', random_state=42)
gmm.fit(X_train)
 
log_probs = gmm.score_samples(X_test)    # log p(x)
labels    = gmm.predict(X_test)           # most likely component

Covariance types:

Type      | Constraint                                    | Notes
--------- | --------------------------------------------- | ---------------------------
full      | each component has its own full $\Sigma_k$    | most flexible
tied      | all components share one $\Sigma$             | fewer parameters
diag      | diagonal $\Sigma_k$ (uncorrelated features)   | fast, scalable
spherical | $\Sigma_k = \sigma_k^2 I$                     | simplest, fewest parameters

Inductive Bias

  • KDE: makes no parametric assumption about shape; density is always positive; smooth with sufficient bandwidth.
  • GMM: assumes the distribution is a finite mixture of Gaussians; unimodal within each component.

Training Objective

  • KDE: no training phase; the density estimate is computed directly from the stored data (only the bandwidth $h$ is tuned).
  • GMM: maximise the log-likelihood via EM; select the number of components $K$ using BIC or AIC.

Model Selection for GMM

bic_scores = []
for k in range(1, 11):
    gmm = GaussianMixture(n_components=k, covariance_type='full', random_state=42)
    gmm.fit(X_train)
    bic_scores.append(gmm.bic(X_train))
 
optimal_k = 1 + bic_scores.index(min(bic_scores))

Strengths

  • KDE makes no parametric assumption and estimates any shape.
  • GMM provides a fully probabilistic model with interpretable components.
  • Both can be used for anomaly detection (low $p(x)$ → anomalous).

Weaknesses

  • KDE scales poorly with dimensionality (curse of dimensionality: the number of samples needed for a given accuracy grows exponentially with $d$).
  • GMM requires specifying $K$ and assumes Gaussian components.
  • EM can converge to poor local optima; use multiple random initialisations.

Variants

  • Variational Bayes GMM (BayesianGaussianMixture): places priors on the mixture weights and automatically shrinks unused components, effectively selecting $K$.
  • Normalising Flows: learned invertible transformations that map a simple distribution (Gaussian) to a complex one, giving exact density.
  • Kernel Mixture Models: KDE with learnable bandwidths per component.
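The variational-Bayes variant can be sketched as follows: give `BayesianGaussianMixture` a deliberately generous `n_components` upper bound and let the Dirichlet prior on the weights shrink unused components toward zero. The synthetic data and prior strength are illustrative assumptions.

```python
# Sketch: BayesianGaussianMixture with a generous component budget;
# the weight-concentration prior shrinks unused components,
# so the effective K is read off from the fitted weights.
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

rng = np.random.default_rng(0)
# Toy data: two clear 1-D clusters (illustrative).
X = np.concatenate([rng.normal(-4, 1, (200, 1)), rng.normal(4, 1, (200, 1))])

bgmm = BayesianGaussianMixture(
    n_components=10,                   # upper bound, not the true K
    weight_concentration_prior=1e-2,   # small prior favours fewer components
    random_state=0,
)
bgmm.fit(X)

# Components carrying non-negligible mass (threshold is a judgement call).
effective_k = int(np.sum(bgmm.weights_ > 0.01))
```

Unlike the BIC loop above, this fits a single model and lets the prior do the component selection.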

References

  • Silverman, B.W. (1986). Density Estimation for Statistics and Data Analysis. Chapman & Hall.
  • Bishop, C. (2006). Pattern Recognition and Machine Learning. §9 (EM and GMMs).