Kernel Methods and SVMs
Definition
Kernel methods use a kernel function to implicitly map inputs into a high-dimensional (possibly infinite) feature space, allowing linear algorithms to learn non-linear decision boundaries. Support Vector Machines (SVMs) are the most prominent kernel method.
Intuition
In the original feature space, the data may not be linearly separable. Mapping to a higher-dimensional space can make it separable. The kernel trick avoids explicitly computing this mapping — it computes inner products in the high-dimensional space directly from the original inputs. SVMs then find the maximum-margin hyperplane in that space.
Formal Description
Hard-Margin SVM (Linearly Separable)
Find the hyperplane with maximum margin:

$$\min_{\mathbf{w},\,b}\ \frac{1}{2}\|\mathbf{w}\|^2 \quad \text{s.t.}\quad y_i(\mathbf{w}^\top\mathbf{x}_i + b) \ge 1 \quad \forall i$$

Dual problem (via Lagrangian):

$$\max_{\boldsymbol{\alpha}}\ \sum_i \alpha_i - \frac{1}{2}\sum_{i,j} \alpha_i \alpha_j y_i y_j\, \mathbf{x}_i^\top\mathbf{x}_j \quad \text{s.t.}\quad \alpha_i \ge 0,\ \ \sum_i \alpha_i y_i = 0$$

Support vectors: training points with $\alpha_i > 0$ (on the margin boundaries, where $y_i(\mathbf{w}^\top\mathbf{x}_i + b) = 1$). All other points have $\alpha_i = 0$.

Prediction:

$$f(\mathbf{x}) = \operatorname{sign}\!\left(\sum_i \alpha_i y_i\, \mathbf{x}_i^\top\mathbf{x} + b\right)$$
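The dual-form prediction can be checked numerically. A minimal sketch, assuming scikit-learn and NumPy are available (the data is synthetic): `SVC.dual_coef_` stores $\alpha_i y_i$ for the support vectors, so the decision function can be rebuilt from the support vectors alone.

```python
import numpy as np
from sklearn.svm import SVC

# Synthetic, well-separated two-class data (hypothetical example)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 0.5, (20, 2)), rng.normal(2, 0.5, (20, 2))])
y = np.array([-1] * 20 + [1] * 20)

# A very large C approximates the hard-margin SVM
clf = SVC(kernel="linear", C=1e6).fit(X, y)

# f(x) = sum over support vectors of alpha_i * y_i * <x_i, x> + b
w = (clf.dual_coef_ @ clf.support_vectors_).ravel()
manual = X @ w + clf.intercept_[0]
assert np.allclose(manual, clf.decision_function(X))
```

Only the support vectors contribute to the sum, which is why the fitted model stores just `support_vectors_` rather than the whole training set.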
Soft-Margin SVM (C-SVM)
Introduces slack variables $\xi_i \ge 0$ to allow margin violations:

$$\min_{\mathbf{w},\,b,\,\boldsymbol{\xi}}\ \frac{1}{2}\|\mathbf{w}\|^2 + C\sum_i \xi_i \quad \text{s.t.}\quad y_i(\mathbf{w}^\top\mathbf{x}_i + b) \ge 1 - \xi_i,\ \ \xi_i \ge 0$$

$C$ controls the trade-off: large $C$ → low tolerance for misclassifications → narrower margin; small $C$ → wider margin, more misclassifications.

Dual identical to the hard-margin case, but with box constraints: $0 \le \alpha_i \le C$.
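The effect of $C$ can be seen directly on the margin width. A sketch assuming scikit-learn (synthetic overlapping data): since the margin is $2/\|\mathbf{w}\|$, a larger $C$ yields a larger $\|\mathbf{w}\|$ and hence a narrower margin.

```python
import numpy as np
from sklearn.svm import SVC

# Overlapping classes, so slack variables actually matter
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 1.0, (50, 2)), rng.normal(1, 1.0, (50, 2))])
y = np.array([-1] * 50 + [1] * 50)

loose = SVC(kernel="linear", C=0.01).fit(X, y)    # tolerant: wide margin
strict = SVC(kernel="linear", C=100.0).fit(X, y)  # strict: narrow margin

# Margin width is 2 / ||w||, so larger C => larger ||w|| => narrower margin
assert np.linalg.norm(strict.coef_) >= np.linalg.norm(loose.coef_)
```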
The Kernel Trick
Replace every inner product $\mathbf{x}_i^\top\mathbf{x}_j$ with a kernel function $K(\mathbf{x}_i, \mathbf{x}_j) = \varphi(\mathbf{x}_i)^\top\varphi(\mathbf{x}_j)$, where $\varphi$ is the (implicit) feature map — the mapping itself is never computed.
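The identity $K(\mathbf{x}, \mathbf{z}) = \varphi(\mathbf{x})^\top\varphi(\mathbf{z})$ can be verified by hand for a small case. A NumPy sketch for the homogeneous degree-2 polynomial kernel on 2-D inputs, whose explicit feature map is $\varphi(\mathbf{x}) = (x_1^2, \sqrt{2}\,x_1 x_2, x_2^2)$:

```python
import numpy as np

def phi(x):
    # Explicit degree-2 feature map for 2-D input
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

x = np.array([1.0, 2.0])
z = np.array([3.0, 0.5])

kernel_value = (x @ z) ** 2        # kernel trick: work in the original space
explicit_value = phi(x) @ phi(z)   # explicit mapping: work in feature space
assert np.isclose(kernel_value, explicit_value)  # both give 16.0
```

For degree $d$ on $p$-dimensional inputs the explicit map has $O(p^d)$ coordinates, while the kernel evaluation stays $O(p)$ — this gap is the point of the trick.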
Common kernels:
| Kernel | Formula | Properties |
|---|---|---|
| Linear | $K(\mathbf{x}, \mathbf{z}) = \mathbf{x}^\top\mathbf{z}$ | No mapping; fast |
| Polynomial | $K(\mathbf{x}, \mathbf{z}) = (\mathbf{x}^\top\mathbf{z} + c)^d$ | Degree-$d$ interactions |
| RBF (Gaussian) | $K(\mathbf{x}, \mathbf{z}) = \exp(-\gamma\|\mathbf{x} - \mathbf{z}\|^2)$ | Infinite-dimensional; most popular |
| Sigmoid | $K(\mathbf{x}, \mathbf{z}) = \tanh(\kappa\,\mathbf{x}^\top\mathbf{z} + c)$ | Similar to single-layer NN |
Mercer’s condition: a symmetric function $K(\mathbf{x}, \mathbf{z})$ is a valid kernel iff the Gram matrix $K_{ij} = K(\mathbf{x}_i, \mathbf{x}_j)$ is positive semidefinite for any finite sample.
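The PSD condition is easy to spot-check numerically. A NumPy sketch that builds an RBF Gram matrix on random points and confirms its eigenvalues are non-negative (up to floating-point noise):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(10, 3))
gamma = 0.5

# RBF Gram matrix: K[i, j] = exp(-gamma * ||x_i - x_j||^2)
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-gamma * sq_dists)

eigvals = np.linalg.eigvalsh(K)  # eigvalsh: for symmetric matrices
assert eigvals.min() >= -1e-10   # PSD up to numerical round-off
```

A spot-check on one sample cannot *prove* validity (Mercer quantifies over all finite samples), but a negative eigenvalue on any sample immediately disqualifies a candidate kernel.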
RBF kernel: the parameter $\gamma$ (inverse bandwidth) controls the smoothness of the decision boundary. Large $\gamma$ → narrow Gaussians → complex boundary; small $\gamma$ → smooth boundary.
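A sketch of the $\gamma$ trade-off, assuming scikit-learn (synthetic overlapping data): a very large $\gamma$ makes each Gaussian so narrow that the model can effectively memorise the training set, while a small $\gamma$ gives a smooth, near-linear boundary.

```python
import numpy as np
from sklearn.svm import SVC

# Noisy, overlapping classes (synthetic)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 1.0, (50, 2)), rng.normal(1, 1.0, (50, 2))])
y = np.array([0] * 50 + [1] * 50)

smooth = SVC(kernel="rbf", gamma=0.01).fit(X, y)   # wide Gaussians, smooth boundary
wiggly = SVC(kernel="rbf", gamma=100.0).fit(X, y)  # narrow Gaussians, memorises

# The high-gamma model fits the training data at least as well --
# which on overlapping classes is a symptom of overfitting, not quality
assert wiggly.score(X, y) >= smooth.score(X, y)
```

Training accuracy alone would therefore favour large $\gamma$; in practice $\gamma$ (and $C$) are chosen by cross-validation on held-out data.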
SVR (Support Vector Regression)
$\varepsilon$-insensitive loss: residuals with $|y_i - f(\mathbf{x}_i)| \le \varepsilon$ incur no penalty; larger residuals are penalised linearly. The dual formulation is analogous to the classification SVM.
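Because points strictly inside the $\varepsilon$-tube get zero dual coefficients, widening the tube reduces the number of support vectors. A scikit-learn sketch on synthetic 1-D data:

```python
import numpy as np
from sklearn.svm import SVR

# Noisy 1-D regression data (synthetic)
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, (80, 1))
y = np.sin(X).ravel() + 0.1 * rng.normal(size=80)

narrow = SVR(kernel="rbf", epsilon=0.01).fit(X, y)  # tight tube: many SVs
wide = SVR(kernel="rbf", epsilon=0.5).fit(X, y)     # loose tube: most residuals ignored

# Residuals inside the tube cost nothing, so a wider tube leaves fewer SVs
assert len(wide.support_) <= len(narrow.support_)
```

This sparsity is the practical payoff of the $\varepsilon$-insensitive loss: prediction cost scales with the number of support vectors, not the training-set size.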
Kernel Ridge Regression
Closed-form kernel regression: $f(\mathbf{x}) = \sum_i \alpha_i K(\mathbf{x}_i, \mathbf{x})$, where $\boldsymbol{\alpha} = (K + \lambda I)^{-1}\mathbf{y}$ and $K_{ij} = K(\mathbf{x}_i, \mathbf{x}_j)$. Equivalent to ridge regression in the RKHS.
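The closed form fits in a few lines of NumPy. A sketch with an RBF kernel on synthetic data ($\gamma$ and $\lambda$ chosen arbitrarily for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.uniform(-3, 3, (30, 1))
y = np.sin(X).ravel() + 0.1 * rng.normal(size=30)

gamma, lam = 0.5, 1e-2

# RBF Gram matrix on the training inputs
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-gamma * sq_dists)

# alpha = (K + lambda * I)^-1 y  -- the closed-form solution above
alpha = np.linalg.solve(K + lam * np.eye(len(X)), y)

def predict(x_new):
    # f(x) = sum_i alpha_i * K(x_i, x)
    k = np.exp(-gamma * ((X - x_new) ** 2).sum(-1))
    return k @ alpha

preds = np.array([predict(x) for x in X])
assert np.mean((preds - y) ** 2) < 0.05  # fits the noisy training data closely
```

Unlike the SVM dual, every $\alpha_i$ is generally nonzero here, so kernel ridge regression trades the SVM's sparsity for a direct linear solve.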
Applications
- SVMs were dominant in text classification, image recognition, and bioinformatics before deep learning.
- Kernel methods remain useful for small datasets or specialised kernels (e.g., string kernels for sequences, graph kernels).
- SVM classifiers are a good baseline when the dataset fits in memory and has < ~10k samples.
Trade-offs
| Aspect | SVM | Neural Network |
|---|---|---|
| Small data | ✅ Works well | ❌ May overfit |
| Large data | ❌ $O(n^2)$ kernel matrix | ✅ Scalable with SGD |
| Interpretability | Moderate (support vectors) | Low |
| Kernel choice | Requires domain knowledge | Learned end-to-end |