Instance-Based Methods
Core Idea
Instance-based (or memory-based) methods make predictions by finding the training examples most similar to a query point, rather than fitting a global parametric model. The hypothesis is “local”: nearby training examples are most relevant for predicting any given test point.
Mathematical Formulation
K-Nearest Neighbours (KNN)
Given a query point $x$ and training set $\{(x_i, y_i)\}_{i=1}^{n}$, identify the $k$ training points closest to $x$ under a distance metric $d$:

$$N_k(x) = \{\, i : d(x, x_i) \text{ is among the } k \text{ smallest} \,\}$$

Classification: majority vote over the $k$ neighbours:

$$\hat{y} = \arg\max_{c} \sum_{i \in N_k(x)} \mathbb{1}[y_i = c]$$

Regression: mean (or distance-weighted mean) over the $k$ neighbours:

$$\hat{y} = \frac{1}{k} \sum_{i \in N_k(x)} y_i \qquad \text{or} \qquad \hat{y} = \frac{\sum_{i \in N_k(x)} w_i\, y_i}{\sum_{i \in N_k(x)} w_i}, \quad w_i = \frac{1}{d(x, x_i)}$$
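The classification and regression rules above can be sketched as a brute-force implementation in NumPy (`knn_predict` is an illustrative name, not a library function):

```python
import numpy as np

def knn_predict(X_train, y_train, x_query, k=3, task="classification"):
    """Brute-force KNN: rank all training points by Euclidean distance to the query."""
    dists = np.linalg.norm(X_train - x_query, axis=1)  # d(x, x_i) for every i
    nearest = np.argsort(dists)[:k]                    # indices of the k closest
    if task == "classification":
        labels, counts = np.unique(y_train[nearest], return_counts=True)
        return labels[np.argmax(counts)]               # majority vote
    return y_train[nearest].mean()                     # mean of neighbour targets

# Tiny toy data: two clusters of two points each.
X = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
y = np.array([0, 0, 1, 1])
print(knn_predict(X, y, np.array([0.05, 0.1]), k=3))  # → 0 (query sits in the first cluster)
```

Note there are no fitted parameters anywhere: the training arrays are passed to every prediction call, which is exactly the "memory-based" character described above.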
Distance Metrics
| Metric | Formula | Use case |
|---|---|---|
| Euclidean ($\ell_2$) | $\sqrt{\sum_j (x_j - x'_j)^2}$ | Continuous, homogeneous features |
| Manhattan ($\ell_1$) | $\sum_j \lvert x_j - x'_j \rvert$ | Robust to outliers; grid-like features |
| Minkowski | $\left(\sum_j \lvert x_j - x'_j \rvert^p\right)^{1/p}$ | Generalises $\ell_1$ ($p=1$) and $\ell_2$ ($p=2$) |
| Cosine | $1 - \frac{x \cdot x'}{\lVert x \rVert\, \lVert x' \rVert}$ | High-dimensional text/embeddings |
| Hamming | $\frac{1}{d} \sum_j \mathbb{1}[x_j \neq x'_j]$ (proportion of differing components) | Binary or categorical features |
Standardise features before computing Euclidean distance — otherwise high-variance features dominate.
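A small toy example of why standardisation matters (the income/age values are illustrative, chosen so the two features differ in scale by three orders of magnitude):

```python
import numpy as np

# Income (scale ~1e4) and age (scale ~1e1): income dominates raw distances.
X = np.array([[50_000.0, 30.0],   # query point
              [52_000.0, 75.0],   # close income, very different age
              [55_000.0, 31.0],   # further income, very similar age
              [49_000.0, 29.0]])  # similar in both features

q = X[0]
raw_order = np.argsort(np.linalg.norm(X[1:] - q, axis=1))

# Standardise each feature to zero mean and unit variance, then re-rank.
Xs = (X - X.mean(axis=0)) / X.std(axis=0)
std_order = np.argsort(np.linalg.norm(Xs[1:] - Xs[0], axis=1))

print(raw_order.tolist())  # → [2, 0, 1]: close-income row ranks ahead of close-age row
print(std_order.tolist())  # → [2, 1, 0]: ranking flips once both features contribute
```

The 45-year age gap in the second row is invisible to raw Euclidean distance next to income differences of thousands; after standardisation both features contribute on the same scale and the neighbour ranking changes.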
Inductive Bias
KNN assumes that nearby points in feature space have similar labels. This is the smoothness (local constancy) assumption; a related but stronger idea is the manifold hypothesis, which posits that data concentrate near a low-dimensional manifold where local distances are meaningful. The assumption holds well for locally smooth target functions and fails when the decision boundary depends on global structure or when the feature space is very high-dimensional.
Training Objective
KNN has no training phase — the training set is the model. Prediction requires searching over all training points, making each brute-force query $O(nd)$ for $n$ points in $d$ dimensions. Practical implementations use:
- KD-trees: $O(\log n)$ average query time for low-dimensional data (roughly $d \lesssim 20$).
- Ball trees: better than KD-trees for moderate-to-high $d$.
- Approximate nearest neighbours (FAISS, HNSW): sub-linear search for large-scale retrieval.
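To make the tree-based speed-up concrete, here is a minimal KD-tree nearest-neighbour search in plain NumPy — an illustrative sketch (dict-based nodes, exact search with plane pruning), not a production index:

```python
import numpy as np

def build_kdtree(points, indices=None, depth=0):
    """Recursively split on the median of one coordinate per level."""
    if indices is None:
        indices = np.arange(len(points))
    if len(indices) == 0:
        return None
    axis = depth % points.shape[1]
    order = indices[np.argsort(points[indices, axis])]
    mid = len(order) // 2
    return {"index": order[mid], "axis": axis,
            "left":  build_kdtree(points, order[:mid], depth + 1),
            "right": build_kdtree(points, order[mid + 1:], depth + 1)}

def nearest(node, points, q, best=None):
    """Descend toward q, then backtrack into the far subtree only when the
    splitting plane is closer than the best distance found so far (pruning)."""
    if node is None:
        return best
    p = points[node["index"]]
    d = np.linalg.norm(q - p)
    if best is None or d < best[1]:
        best = (node["index"], d)
    gap = q[node["axis"]] - p[node["axis"]]
    near, far = ("left", "right") if gap < 0 else ("right", "left")
    best = nearest(node[near], points, q, best)
    if abs(gap) < best[1]:  # the far side could still hold a closer point
        best = nearest(node[far], points, q, best)
    return best

rng = np.random.default_rng(0)
pts = rng.random((200, 3))
tree = build_kdtree(pts)
q = np.array([0.5, 0.5, 0.5])
idx, dist = nearest(tree, pts, q)
assert idx == np.argmin(np.linalg.norm(pts - q, axis=1))  # agrees with brute force
```

The pruning step is why average query time drops to $O(\log n)$ in low dimensions: in high dimensions the splitting-plane test almost never prunes, and the search degenerates back toward a full scan.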
Strengths
- No training time: simply stores the data.
- Non-parametric: adapts to arbitrary decision boundary shapes.
- Naturally handles multi-class: majority vote scales to any number of classes.
- Interpretable locally: the prediction is explained by the neighbours.
Weaknesses
- Slow prediction: $O(nd)$ per query (brute force); mitigated by ANN indices.
- High memory: must store entire training set.
- Curse of dimensionality: in high dimensions, distances concentrate and neighbourhoods become meaningless. Performance degrades rapidly once $d$ reaches tens of dimensions.
- No feature learning: cannot learn task-relevant representations — sensitive to irrelevant features.
- Sensitive to scale: requires feature standardisation.
Variants
- Radius-based neighbours: use all points within a radius $r$ instead of a fixed $k$.
- Locally weighted regression (LOESS): fit a local polynomial to neighbours weighted by distance; used for smooth curve estimation.
- Condensed KNN: removes redundant training points to reduce storage.
- Weighted KNN: distance-weighted voting typically outperforms uniform KNN.
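Distance-weighted voting can be sketched as follows (a minimal illustration using inverse-distance weights; `weighted_knn_classify` is a hypothetical helper name):

```python
import numpy as np

def weighted_knn_classify(X_train, y_train, x_query, k=3, eps=1e-12):
    """Each of the k neighbours votes with weight 1/d instead of uniformly."""
    dists = np.linalg.norm(X_train - x_query, axis=1)
    nearest = np.argsort(dists)[:k]
    weights = 1.0 / (dists[nearest] + eps)  # eps guards against d = 0
    classes = np.unique(y_train)
    scores = [weights[y_train[nearest] == c].sum() for c in classes]
    return classes[np.argmax(scores)]

# One very close class-1 point outvotes two distant class-0 points at k=3;
# a uniform majority vote over the same three neighbours would return 0.
X = np.array([[0.0, 0.0], [3.0, 0.0], [3.1, 0.0], [0.1, 0.0]])
y = np.array([0, 0, 0, 1])
print(weighted_knn_classify(X, y, np.array([0.2, 0.0]), k=3))  # → 1
```

This is the typical failure mode weighting fixes: with uniform votes, two far-away neighbours can overrule one neighbour that is an order of magnitude closer.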
References
- Cover, T. & Hart, P. (1967). “Nearest neighbor pattern classification.” IEEE Transactions on Information Theory.
- Hastie, T., Tibshirani, R., & Friedman, J. (2009). The Elements of Statistical Learning. §13.3.