Clustering

Definition

Clustering partitions an unlabelled dataset into groups (clusters) such that points within a cluster are more similar to each other than to points in other clusters. It is the primary task in unsupervised learning when the goal is to discover discrete structure in data.

Intuition

Given data points with no labels, clustering surfaces natural groupings. The notion of “similar” is algorithm-specific: K-Means uses Euclidean distance to centroid; DBSCAN uses local density; GMM uses probabilistic mixture components. The right algorithm depends on expected cluster shape, the need to handle noise, and whether the number of clusters $K$ is known a priori.

Formal Description

K-Means

Partitions $n$ points $\{x_1, \dots, x_n\}$ into $K$ clusters $C_1, \dots, C_K$ by minimising the within-cluster sum of squares (WCSS):

$$J = \sum_{k=1}^{K} \sum_{x_i \in C_k} \lVert x_i - \mu_k \rVert^2$$

where $\mu_k$ is the centroid of cluster $C_k$.

Algorithm (Lloyd’s):

  1. Initialise centroids — k-means++ samples the first centroid uniformly, then picks each subsequent centroid with probability proportional to $D(x)^2$, the squared distance from $x$ to the nearest centroid chosen so far, dramatically reducing the chance of a bad initialisation.
  2. Assign each point to its nearest centroid: $c_i = \arg\min_k \lVert x_i - \mu_k \rVert^2$.
  3. Update each centroid to the mean of its assigned points: $\mu_k = \frac{1}{|C_k|} \sum_{x_i \in C_k} x_i$.
  4. Repeat 2–3 until convergence.

Guaranteed to converge but only to a local minimum; set n_init ≥ 10 in sklearn to run multiple restarts and keep the best.
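A minimal sketch of the restart strategy above, using sklearn's KMeans on illustrative toy data (the blob parameters are assumptions, not from the text):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Toy data: 3 well-separated Gaussian blobs (illustrative values).
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.6, random_state=42)

# n_init=10 runs Lloyd's algorithm from 10 independent k-means++ seedings
# and keeps the model with the lowest inertia_ (WCSS).
km = KMeans(n_clusters=3, init="k-means++", n_init=10, random_state=42).fit(X)

print(km.inertia_)                # WCSS of the best of the 10 runs
print(km.cluster_centers_.shape)  # one centroid per cluster, one row each
```

`inertia_` is exactly the WCSS objective, so comparing it across restarts implements the "keep the best" step.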

Choosing K:

  • Elbow method: plot WCSS vs $K$; pick the elbow where marginal gains flatten.
  • Silhouette score: pick the $K$ maximising the mean silhouette; more principled than the elbow.
  • Gap statistic: compare within-cluster dispersion to that of a reference random distribution.

Silhouette coefficient for point $i$:

$$s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}}$$

where $a(i)$ = mean intra-cluster distance, $b(i)$ = mean distance to the nearest other cluster. Range: $[-1, 1]$; values near 1 indicate well-separated clusters.
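Silhouette-based selection of $K$ can be sketched as follows (the data-generation parameters are illustrative assumptions):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Toy data with an assumed "true" number of blobs.
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.7, random_state=0)

# Score each candidate K by the mean silhouette coefficient; higher is better.
scores = {}
for k in range(2, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print(best_k, round(scores[best_k], 3))
```

Unlike the elbow method, this requires no visual judgment: the argmax over candidate $K$ values is the selection rule.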

Limitations: assumes spherical, equal-variance clusters; sensitive to outliers and feature scaling; $K$ must be specified.

DBSCAN (Density-Based Spatial Clustering of Applications with Noise)

Groups points that are density-connected: reachable through chains of core points (points with min_samples neighbours within radius eps).

Point classification:

  • Core point: has min_samples points within eps.
  • Border point: within eps of a core point but below density threshold itself.
  • Noise point: not within eps of any core point; assigned label $-1$ (in sklearn).

Hyperparameter selection:

  • eps: use the k-distance graph — sort each point’s $k$-th nearest-neighbour distance (with $k =$ min_samples) and pick the knee of the sorted curve.
  • min_samples: rule of thumb $\text{min\_samples} \geq D + 1$ (commonly $2D$), where $D$ is the feature dimension; larger values suppress noise.
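The k-distance graph for choosing eps can be sketched as follows; the dataset and the percentile-based knee proxy are assumptions for illustration (in practice the knee is usually picked visually from the plotted curve):

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.neighbors import NearestNeighbors

X, _ = make_moons(n_samples=400, noise=0.05, random_state=0)

min_samples = 5  # assumed rule-of-thumb value for 2-D data

# Distance from each point to its min_samples-th nearest neighbour
# (the query point itself counts as the first neighbour).
nn = NearestNeighbors(n_neighbors=min_samples).fit(X)
dists, _ = nn.kneighbors(X)
k_dist = np.sort(dists[:, -1])  # sorted k-distance curve

# Crude stand-in for "pick the knee": a high quantile of the k-distances,
# just above where the curve starts rising sharply.
eps_candidate = float(np.quantile(k_dist, 0.95))
print(eps_candidate)
```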

Properties:

  • Discovers arbitrary-shaped clusters with no need to specify $K$.
  • Explicitly labels outliers — useful for anomaly detection.
  • Sensitive to varying densities across clusters; HDBSCAN addresses this by building a hierarchy over multiple eps values.
  • Struggles in high dimensions due to the curse of dimensionality making distance measures less discriminative.
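The first two properties can be seen on the classic two-moons dataset, a non-spherical shape that K-Means cannot separate (eps and min_samples values below are illustrative assumptions):

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Two interleaving half-moons: arbitrary-shaped, non-spherical clusters.
X, _ = make_moons(n_samples=400, noise=0.05, random_state=0)

db = DBSCAN(eps=0.2, min_samples=5).fit(X)
labels = db.labels_

# Noise points carry the label -1 in sklearn; exclude them from the count.
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = int(np.sum(labels == -1))
print(n_clusters, n_noise)
```

Both moons are recovered as separate clusters, and any isolated points are explicitly labelled -1 rather than forced into a cluster.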

HDBSCAN (Hierarchical DBSCAN)

Extends DBSCAN by building a cluster hierarchy across all density levels and extracting the most stable clusters. Key advantage: no need to choose eps; only min_cluster_size is required. Available in sklearn ≥ 1.3 as sklearn.cluster.HDBSCAN.

Hierarchical Clustering (Agglomerative)

Builds a nested sequence of partitions by merging the closest pair of clusters bottom-up. The result is a dendrogram — cut at a chosen height to obtain clusters.

Linkage criteria (determines “distance between clusters”):

Linkage  | Merges based on                 | Tendency
ward     | Minimise increase in total WCSS | Compact, equally-sized clusters
complete | Maximum pairwise distance       | Avoids elongated clusters
average  | Mean pairwise distance          | Balanced
single   | Minimum pairwise distance       | Chaining; sensitive to outliers

Ward linkage is the default in sklearn and usually gives the most interpretable results.

Complexity: $O(n^2 \log n)$ time and $O(n^2)$ memory in standard implementations; impractical for large $n$ (tens of thousands of points and beyond). Use n_clusters to cut the dendrogram, or distance_threshold to cut by height.
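Both ways of cutting the dendrogram can be sketched with sklearn's AgglomerativeClustering (the dataset and the threshold value are illustrative assumptions):

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=200, centers=3, cluster_std=0.5, random_state=1)

# Cut at a fixed number of clusters with Ward linkage (sklearn's default)...
agg_k = AgglomerativeClustering(n_clusters=3, linkage="ward").fit(X)

# ...or cut by merge height instead; n_clusters must then be None and
# the resulting count is reported in n_clusters_.
agg_h = AgglomerativeClustering(n_clusters=None, distance_threshold=10.0,
                                linkage="ward").fit(X)
print(len(set(agg_k.labels_)), agg_h.n_clusters_)
```

Cutting by height is useful when the natural number of groups is unknown but a meaningful dissimilarity scale is.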

GMM as Soft Clustering

A Gaussian Mixture Model assigns each point a soft probability of belonging to each of $K$ Gaussian components, rather than a hard label. K-Means is a degenerate GMM in the limit of shared spherical covariances $\sigma^2 I$ with $\sigma \to 0$, where the soft responsibilities collapse to hard assignments.

GMM handles ellipsoidal clusters (non-spherical covariance) and produces probabilistic assignments via predict_proba(). Select $K$ via BIC/AIC. Full theory is in Probabilistic Models.
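BIC-based selection of $K$ plus soft assignment can be sketched as follows (the data parameters and candidate range are illustrative assumptions):

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=500, centers=3, cluster_std=1.0, random_state=0)

# Fit a full-covariance GMM for each candidate K and keep the lowest BIC.
models = {k: GaussianMixture(n_components=k, covariance_type="full",
                             random_state=0).fit(X)
          for k in range(1, 7)}
best_k = min(models, key=lambda k: models[k].bic(X))

# Soft assignments: one probability per component, each row summing to 1.
probs = models[best_k].predict_proba(X)
print(best_k, probs.shape)
```

BIC penalises the extra parameters of larger $K$, so it avoids the overfitting that raw likelihood comparison would invite.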

Applications

  • K-Means: customer segmentation, image compression (vector quantisation), codebook generation for bag-of-words models.
  • DBSCAN / HDBSCAN: geospatial clustering of GPS traces, noise-robust document clustering, satellite imagery analysis.
  • Hierarchical: taxonomic/biological classification, document taxonomy, exploratory analysis where the full dendrogram is informative.
  • GMM: speaker identification, background modelling in computer vision, initialisation for EM in HMMs.

Trade-offs

Algorithm     | Cluster shape            | Handles noise  | $K$ required        | Scalability              | Soft assignments
K-Means       | Spherical                | No             | Yes                 | $O(nKd)$ per iteration   | No
DBSCAN        | Arbitrary                | Yes (explicit) | No                  | $O(n \log n)$ with index | No
HDBSCAN       | Arbitrary                | Yes (explicit) | No (min_size only)  | $O(n \log n)$ typical    | Soft available
Agglomerative | Arbitrary (linkage-dep.) | No             | Yes (cut height)    | $O(n^2)$                 | No
GMM           | Ellipsoidal              | No             | Yes                 | $O(nKd^2)$ per iteration | Yes

References

  • Arthur, D. & Vassilvitskii, S. (2007). “k-means++: The advantages of careful seeding.” SODA.
  • Ester, M. et al. (1996). “A density-based algorithm for discovering clusters in large spatial databases with noise.” KDD.
  • Campello, R.J.G.B. et al. (2013). “Density-based clustering based on hierarchical density estimates.” PAKDD (HDBSCAN).
  • sklearn clustering user guide: https://scikit-learn.org/stable/modules/clustering.html