Kernel Methods and SVMs

Definition

Kernel methods use a kernel function to implicitly map inputs into a high-dimensional (possibly infinite) feature space, allowing linear algorithms to learn non-linear decision boundaries. Support Vector Machines (SVMs) are the most prominent kernel method.

Intuition

In the original feature space, the data may not be linearly separable. Mapping to a higher-dimensional space can make it separable. The kernel trick avoids explicitly computing this mapping — it computes inner products in the high-dimensional space directly from the original inputs. SVMs then find the maximum-margin hyperplane in that space.
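This intuition can be made concrete with a toy sketch (the data, the map $\phi(x) = (x, x^2)$, and the threshold are illustrative, not from the text): 1-D points that no single threshold separates become separable after mapping, and the feature-space inner product can be computed directly from the inputs.

```python
# Toy illustration: 1-D labels that no threshold separates, made separable
# by the (assumed) feature map phi(x) = (x, x^2).

def phi(x):
    """Explicit feature map into 2-D."""
    return (x, x * x)

def kernel(x, z):
    """Inner product in feature space, computed directly from the inputs:
    phi(x) . phi(z) = x*z + (x*z)**2 -- no explicit mapping required."""
    return x * z + (x * z) ** 2

points = [(-2, +1), (-1, -1), (1, -1), (2, +1)]  # (input, label)

# In feature space, the linear rule "second coordinate > 2.5" separates the classes.
for x, y in points:
    _, x_sq = phi(x)
    pred = +1 if x_sq > 2.5 else -1
    assert pred == y

# The kernel trick: kernel(x, z) agrees with the explicit inner product.
x, z = -2, 1
explicit = phi(x)[0] * phi(z)[0] + phi(x)[1] * phi(z)[1]
assert kernel(x, z) == explicit == 2
```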

Formal Description

Hard-Margin SVM (Linearly Separable)

Find the hyperplane with maximum margin:

$$\min_{w,b}\ \tfrac{1}{2}\lVert w\rVert^2 \quad \text{s.t.}\quad y_i(w^\top x_i + b) \ge 1,\quad i = 1,\dots,n$$

Dual problem (via Lagrangian):

$$\max_{\alpha}\ \sum_{i=1}^{n}\alpha_i - \tfrac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n}\alpha_i\alpha_j y_i y_j\, x_i^\top x_j \quad \text{s.t.}\quad \alpha_i \ge 0,\ \sum_{i=1}^{n}\alpha_i y_i = 0$$

Support vectors: training points with $\alpha_i > 0$, which lie on the margin boundaries where $y_i(w^\top x_i + b) = 1$. All other points have $\alpha_i = 0$.

Prediction:

$$f(x) = \operatorname{sign}\!\left(\sum_{i=1}^{n}\alpha_i y_i\, x_i^\top x + b\right)$$
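The dual prediction rule can be sketched on a two-point toy set whose solution is easy to work out by hand. The multipliers below are assumed (hand-derived for this symmetric data), not computed by a QP solver:

```python
# Sketch of f(x) = sign(sum_i alpha_i y_i <x_i, x> + b) on a hand-solvable toy set.
X = [(1.0, 1.0), (-1.0, -1.0)]   # training points (here both are support vectors)
Y = [+1, -1]
alpha = [0.25, 0.25]             # assumed multipliers; satisfy sum_i alpha_i y_i = 0

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# Recover b from any support vector x_s: b = y_s - sum_i alpha_i y_i <x_i, x_s>
b = Y[0] - sum(a * y * dot(x, X[0]) for a, y, x in zip(alpha, Y, X))

def predict(x):
    score = sum(a * y * dot(xi, x) for a, y, xi in zip(alpha, Y, X)) + b
    return +1 if score >= 0 else -1

assert b == 0.0                       # symmetric data: hyperplane passes the origin
assert predict((2.0, 0.5)) == +1
assert predict((-0.5, -3.0)) == -1
```

Note that prediction only sums over the support vectors, since every other training point contributes $\alpha_i = 0$.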

Soft-Margin SVM (C-SVM)

Introduces slack variables $\xi_i \ge 0$ to allow margin violations:

$$\min_{w,b,\xi}\ \tfrac{1}{2}\lVert w\rVert^2 + C\sum_{i=1}^{n}\xi_i \quad \text{s.t.}\quad y_i(w^\top x_i + b) \ge 1 - \xi_i,\ \xi_i \ge 0$$

$C$ controls the trade-off: large $C$ → low tolerance for misclassifications → narrower margin; small $C$ → wider margin, more misclassifications tolerated.

Dual with box constraints: $0 \le \alpha_i \le C$.
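The soft-margin objective can also be minimised directly in the primal (equivalently, $\tfrac{1}{2}\lVert w\rVert^2 + C\sum_i \max(0, 1 - y_i(w^\top x_i + b))$ with the hinge loss). A minimal subgradient-descent sketch for a linear SVM, with illustrative data and hyperparameters, not a production QP solver:

```python
def train_soft_margin(points, C=1.0, lr=0.01, epochs=200):
    """Minimise 1/2 ||w||^2 + C * sum_i max(0, 1 - y_i (w.x_i + b))
    by plain subgradient descent on 2-D inputs (illustrative sketch)."""
    w = [0.0, 0.0]
    b = 0.0
    for _ in range(epochs):
        gw = list(w)              # gradient of the 1/2 ||w||^2 term
        gb = 0.0
        for (x1, x2), y in points:
            margin = y * (w[0] * x1 + w[1] * x2 + b)
            if margin < 1:        # inside the margin or misclassified: hinge active
                gw[0] -= C * y * x1
                gw[1] -= C * y * x2
                gb -= C * y
        w = [w[0] - lr * gw[0], w[1] - lr * gw[1]]
        b -= lr * gb
    return w, b

toy = [((2.0, 2.0), +1), ((3.0, 1.5), +1), ((-2.0, -1.0), -1), ((-1.5, -2.5), -1)]
w, b = train_soft_margin(toy, C=1.0)
preds = [+1 if w[0] * x1 + w[1] * x2 + b >= 0 else -1 for (x1, x2), _ in toy]
assert preds == [y for _, y in toy]   # separable toy data: all points classified
```

Raising `C` makes the hinge term dominate the regulariser, which is the "low tolerance, narrower margin" regime described above.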

The Kernel Trick

Replace the inner product $x_i^\top x_j$ with a kernel function $K(x_i, x_j) = \phi(x_i)^\top \phi(x_j)$, where $\phi$ is the (implicit) feature map:

$$\max_{\alpha}\ \sum_{i=1}^{n}\alpha_i - \tfrac{1}{2}\sum_{i,j}\alpha_i\alpha_j y_i y_j\, K(x_i, x_j)$$

Common kernels:

| Kernel | Formula | Properties |
|---|---|---|
| Linear | $K(x, z) = x^\top z$ | No mapping; fast |
| Polynomial | $K(x, z) = (\gamma\, x^\top z + r)^d$ | Degree-$d$ interactions |
| RBF (Gaussian) | $K(x, z) = \exp(-\gamma \lVert x - z\rVert^2)$ | Infinite-dimensional; most popular |
| Sigmoid | $K(x, z) = \tanh(\gamma\, x^\top z + r)$ | Similar to single-layer NN |
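The common kernels are a few lines each in plain Python (the `gamma` and `d` defaults are illustrative). For the homogeneous degree-2 polynomial kernel in two dimensions, the implicit map is known in closed form, so the kernel trick can be checked directly:

```python
import math

def linear_kernel(x, z):
    return sum(a * b for a, b in zip(x, z))

def poly_kernel(x, z, d=2):
    return linear_kernel(x, z) ** d          # homogeneous polynomial kernel

def rbf_kernel(x, z, gamma=0.5):
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)

# For d = 2 in two dimensions the explicit map is
# phi(x) = (x1^2, sqrt(2) x1 x2, x2^2), so poly_kernel(x, z) = phi(x) . phi(z).
def phi(x):
    x1, x2 = x
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)

x, z = (1.0, 2.0), (3.0, -1.0)
explicit = sum(a * b for a, b in zip(phi(x), phi(z)))
assert abs(poly_kernel(x, z) - explicit) < 1e-9
assert rbf_kernel(x, x) == 1.0               # every point is maximally similar to itself
```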

Mercer’s condition: a symmetric function $K(x, z)$ is a valid kernel iff its Gram matrix $[K(x_i, x_j)]_{ij}$ is positive semidefinite for any finite sample.

RBF kernel: the parameter $\gamma$ (an inverse bandwidth, $\gamma = 1/(2\sigma^2)$) controls the smoothness of the decision boundary. Large $\gamma$ → narrow Gaussians → complex boundary; small $\gamma$ → smooth boundary.
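The effect is easy to see numerically (a toy sketch; the specific $\gamma$ values are illustrative). Two points one unit apart look dissimilar under a large $\gamma$, so the boundary is free to wiggle between them, and nearly identical under a small $\gamma$, which forces a smooth boundary:

```python
import math

def rbf(x, z, gamma):
    """1-D RBF kernel value between scalars x and z."""
    return math.exp(-gamma * (x - z) ** 2)

# Similarity of two points one unit apart at increasing gamma.
for gamma in (0.01, 1.0, 100.0):
    print(f"gamma={gamma:>6}: K(0, 1) = {rbf(0.0, 1.0, gamma):.6f}")

# Large gamma: near-zero similarity; small gamma: near-maximal similarity.
assert rbf(0.0, 1.0, 100.0) < 1e-6 < rbf(0.0, 1.0, 0.01)
```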

SVR (Support Vector Regression)

$\varepsilon$-insensitive loss: ignore residuals with $\lvert y - f(x)\rvert \le \varepsilon$; penalise larger residuals linearly, i.e. $L_\varepsilon(r) = \max(0, \lvert r\rvert - \varepsilon)$. The dual formulation is analogous to the SVM's.
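The loss itself is one line (the `eps` default is illustrative): residuals inside the $\varepsilon$-tube cost nothing, residuals outside it grow linearly.

```python
def eps_insensitive_loss(residual, eps=0.25):
    """SVR loss: zero inside the eps-tube, linear outside it."""
    return max(0.0, abs(residual) - eps)

assert eps_insensitive_loss(0.125) == 0.0    # inside the tube: ignored
assert eps_insensitive_loss(-0.125) == 0.0   # symmetric in the residual
assert eps_insensitive_loss(0.75) == 0.5     # outside: penalised linearly
```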

Kernel Ridge Regression

Closed-form kernel regression: $\hat{\alpha} = (K + \lambda I)^{-1} y$, where $K_{ij} = k(x_i, x_j)$ and $\hat{f}(x) = \sum_i \hat{\alpha}_i\, k(x_i, x)$. Equivalent to ridge regression in the RKHS.
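A minimal sketch of the closed form on two 1-D points, so the matrix inverse reduces to the 2×2 formula (the data, kernel width, and $\lambda$ are illustrative):

```python
import math

def rbf(x, z, gamma=1.0):
    return math.exp(-gamma * (x - z) ** 2)

# Kernel ridge regression: alpha = (K + lambda*I)^{-1} y,  f(x) = sum_i alpha_i k(x_i, x).
xs = [0.0, 1.0]
ys = [1.0, 3.0]
lam = 0.01

K = [[rbf(a, b) for b in xs] for a in xs]
a11, a12 = K[0][0] + lam, K[0][1]
a21, a22 = K[1][0], K[1][1] + lam
det = a11 * a22 - a12 * a21                  # 2x2 inverse via the adjugate formula
alpha = [(a22 * ys[0] - a12 * ys[1]) / det,
         (-a21 * ys[0] + a11 * ys[1]) / det]

def f(x):
    return sum(ai * rbf(xi, x) for ai, xi in zip(alpha, xs))

# With a small lambda the fit passes close to the training targets.
assert abs(f(0.0) - 1.0) < 0.1
assert abs(f(1.0) - 3.0) < 0.1
```

Larger $\lambda$ shrinks $\hat{\alpha}$ and pulls the fit away from the targets, exactly as in ordinary ridge regression.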

Applications

  • SVMs were dominant in text classification, image recognition, and bioinformatics before deep learning.
  • Kernel methods remain useful for small datasets or specialised kernels (e.g., string kernels for sequences, graph kernels).
  • SVM classifiers are a good baseline when the dataset fits in memory and has fewer than roughly 10k samples.

Trade-offs

| Aspect | SVM | Neural Network |
|---|---|---|
| Small data | ✅ Works well | ❌ May overfit |
| Large data | ❌ $O(n^2)$ kernel matrix | ✅ Scalable with SGD |
| Interpretability | Moderate (support vectors) | Low |
| Kernel choice | Requires domain knowledge | Learned end-to-end |