Clustering
Definition
Clustering partitions an unlabelled dataset into groups (clusters) such that points within a cluster are more similar to each other than to points in other clusters. It is the primary task in unsupervised learning when the goal is to discover discrete structure in data.
Intuition
Given data points with no labels, clustering surfaces natural groupings. The notion of “similar” is algorithm-specific: K-Means uses Euclidean distance to centroid; DBSCAN uses local density; GMM uses probabilistic mixture components. The right algorithm depends on expected cluster shape, the need to handle noise, and whether $k$ is known a priori.
Formal Description
K-Means
Partitions points into $k$ clusters by minimising the within-cluster sum of squares (WCSS):

$$\mathrm{WCSS} = \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2$$

where $\mu_j$ is the centroid of cluster $C_j$.
Algorithm (Lloyd’s):
1. Initialise centroids — k-means++ samples the first centroid uniformly, then picks each subsequent centroid with probability proportional to $D(x)^2$, the squared distance to the nearest centroid chosen so far, dramatically reducing the chance of a bad initialisation.
2. Assign: $c_i = \arg\min_j \lVert x_i - \mu_j \rVert^2$.
3. Update: $\mu_j = \frac{1}{\lvert C_j \rvert} \sum_{x_i \in C_j} x_i$.
4. Repeat steps 2–3 until convergence.
Guaranteed to converge but only to a local minimum; set n_init ≥ 10 in sklearn to run multiple restarts and keep the best.
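A minimal sketch of this in sklearn (the blob data and parameter values are illustrative only): `n_init=10` runs ten restarts and keeps the fit with the lowest WCSS, which sklearn exposes as `inertia_`.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Toy data: three well-separated Gaussian blobs in 2-D.
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.8, random_state=0)

# k-means++ initialisation plus 10 restarts; the best run (lowest WCSS) is kept.
km = KMeans(n_clusters=3, init="k-means++", n_init=10, random_state=0).fit(X)

print(km.inertia_)                # final WCSS of the best run
print(km.cluster_centers_.shape)  # (3, 2): one centroid per cluster
```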
Choosing K:
- Elbow method: plot WCSS vs $k$; pick the elbow where marginal gains flatten.
- Silhouette score: pick the $k$ maximising the mean silhouette; more principled than the elbow.
- Gap statistic: compare within-cluster dispersion to that of a reference random distribution.
Silhouette coefficient for point $i$:

$$s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}}$$

where $a(i)$ = mean intra-cluster distance, $b(i)$ = mean distance to the nearest other cluster. Range: $[-1, 1]$; values near 1 indicate well-separated clusters.
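The silhouette-based selection above can be sketched with sklearn's `silhouette_score` (centres and candidate range are illustrative; the data has four well-separated blobs, so the score should peak at $k = 4$):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Four well-separated synthetic blobs (illustrative data only).
X, _ = make_blobs(n_samples=400, centers=[[0, 0], [5, 0], [0, 5], [5, 5]],
                  cluster_std=0.5, random_state=42)

# Score each candidate k; the mean silhouette should peak at the true k.
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print(best_k)  # 4
```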
Limitations: assumes spherical, equal-variance clusters; sensitive to outliers and feature scaling; $k$ must be specified.
DBSCAN (Density-Based Spatial Clustering of Applications with Noise)
Groups points that are density-connected: reachable through chains of core points (points with min_samples neighbours within radius eps).
Point classification:
- Core point: has at least min_samples points within eps.
- Border point: within eps of a core point but below the density threshold itself.
- Noise point: not within eps of any core point; assigned label $-1$.
Hyperparameter selection:
- eps: use the k-distance graph — sort each point's $k$-th nearest-neighbour distance and pick the knee.
- min_samples: rule of thumb $\text{min\_samples} \approx 2d$, where $d$ is the feature dimension; larger values suppress noise.
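A sketch of this selection procedure with sklearn (the half-moon data and the eps value read off the curve are illustrative): we build the k-distance curve with `NearestNeighbors`, then run DBSCAN with an eps chosen near its knee.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons
from sklearn.neighbors import NearestNeighbors

# Two interleaved half-moons: non-spherical clusters K-Means cannot separate.
X, _ = make_moons(n_samples=400, noise=0.05, random_state=0)

# k-distance graph: sorted distance to each point's k-th nearest neighbour
# (column 0 of the result is the point itself, so query k + 1 neighbours).
k = 5
dists, _ = NearestNeighbors(n_neighbors=k + 1).fit(X).kneighbors(X)
k_dist = np.sort(dists[:, k])  # plot this and pick the knee for eps

db = DBSCAN(eps=0.2, min_samples=k).fit(X)  # eps read off the curve by eye
n_clusters = len(set(db.labels_)) - (1 if -1 in db.labels_ else 0)
print(n_clusters)  # 2 — one cluster per moon
```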
Properties:
- Discovers arbitrary-shaped clusters with no need to specify $k$.
- Explicitly labels outliers — useful for anomaly detection.
- Sensitive to varying densities across clusters; HDBSCAN addresses this by building a hierarchy over multiple eps values.
- Struggles in high dimensions, where the curse of dimensionality makes distance measures less discriminative.
HDBSCAN (Hierarchical DBSCAN)
Extends DBSCAN by building a cluster hierarchy across all density levels and extracting the most stable clusters. Key advantage: no need to choose eps; only min_cluster_size is required. Available in sklearn ≥ 1.3 as sklearn.cluster.HDBSCAN.
Hierarchical Clustering (Agglomerative)
Builds a nested sequence of partitions by merging the closest pair of clusters bottom-up. The result is a dendrogram — cut at a chosen height to obtain clusters.
Linkage criteria (determines “distance between clusters”):
| Linkage | Merges based on | Tendency |
|---|---|---|
| ward | Minimise increase in total WCSS | Compact, equally-sized clusters |
| complete | Maximum pairwise distance | Avoids elongated clusters |
| average | Mean pairwise distance | Balanced |
| single | Minimum pairwise distance | Chaining — sensitive to outliers |
Ward linkage is the default in sklearn and usually gives the most interpretable results.
Complexity: $\mathcal{O}(n^3)$ time naively, $\mathcal{O}(n^2 \log n)$ with a priority queue, and $\mathcal{O}(n^2)$ memory; impractical for $n \gtrsim 10^4$. Use n_clusters to cut the dendrogram, or distance_threshold to cut by height.
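Both ways of cutting the dendrogram can be sketched with sklearn's `AgglomerativeClustering` (data and the threshold value are illustrative; note that cutting by height requires `n_clusters=None`):

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs

# Toy data: three well-separated blobs.
X, _ = make_blobs(n_samples=150, centers=3, cluster_std=0.6, random_state=1)

# Cut the dendrogram by cluster count with Ward linkage (the sklearn default)...
agg = AgglomerativeClustering(n_clusters=3, linkage="ward").fit(X)
print(len(set(agg.labels_)))  # 3

# ...or cut by height instead; n_clusters must then be None.
agg_h = AgglomerativeClustering(n_clusters=None, distance_threshold=10.0,
                                linkage="ward").fit(X)
print(agg_h.n_clusters_)  # cluster count implied by the chosen height
```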
GMM as Soft Clustering
A Gaussian Mixture Model assigns each point a soft probability of belonging to each of $k$ Gaussian components, rather than a hard label. K-Means is the degenerate limit of a GMM with shared spherical covariances $\sigma^2 I$ as $\sigma \to 0$.
GMM handles ellipsoidal clusters (non-spherical covariance) and produces probabilistic assignments via predict_proba(). Select $k$ via BIC/AIC. Full theory is in Probabilistic Models.
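A minimal sketch of BIC-based model selection and soft assignment with sklearn's `GaussianMixture` (the three-blob data and candidate range are illustrative):

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

# Toy data: three well-separated blobs with unequal spreads.
X, _ = make_blobs(n_samples=400, centers=[[0, 0], [6, 0], [3, 5]],
                  cluster_std=[0.5, 0.9, 0.7], random_state=0)

# Fit each candidate component count and keep the one with the lowest BIC.
bics = {k: GaussianMixture(n_components=k, covariance_type="full",
                           random_state=0).fit(X).bic(X)
        for k in range(1, 6)}
best_k = min(bics, key=bics.get)
print(best_k)  # 3

gmm = GaussianMixture(n_components=best_k, random_state=0).fit(X)
probs = gmm.predict_proba(X)  # soft assignments: one row per point, rows sum to 1
print(probs.shape)            # (400, 3)
```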
Applications
- K-Means: customer segmentation, image compression (vector quantisation), codebook generation for bag-of-words models.
- DBSCAN / HDBSCAN: geospatial clustering of GPS traces, noise-robust document clustering, satellite imagery analysis.
- Hierarchical: taxonomic/biological classification, document taxonomy, exploratory analysis where the full dendrogram is informative.
- GMM: speaker identification, background modelling in computer vision, initialisation for EM in HMMs.
Trade-offs
| Algorithm | Cluster shape | Handles noise | $k$ required | Scalability | Soft assignments |
|---|---|---|---|---|---|
| K-Means | Spherical | No | Yes | ✅ | No |
| DBSCAN | Arbitrary | Yes (explicit) | No | ✅ (with spatial index) | No |
| HDBSCAN | Arbitrary | Yes (explicit) | No (min_cluster_size only) | ✅ | Soft available |
| Agglomerative | Arbitrary (linkage-dep.) | No | Yes (or cut height) | ❌ | No |
| GMM | Ellipsoidal | No | Yes | ✅ | Yes |
References
- Arthur, D. & Vassilvitskii, S. (2007). “k-means++: The advantages of careful seeding.” SODA.
- Ester, M. et al. (1996). “A density-based algorithm for discovering clusters in large spatial databases with noise.” KDD.
- Campello, R.J.G.B. et al. (2013). “Density-based clustering based on hierarchical density estimates.” PAKDD (HDBSCAN).
- sklearn clustering user guide: https://scikit-learn.org/stable/modules/clustering.html