Kernel Methods and SVMs
Definition
Kernel methods use a kernel function to implicitly map inputs into a high-dimensional (possibly infinite) feature space, allowing linear algorithms to learn non-linear decision boundaries. Support Vector Machines (SVMs) are the most prominent kernel method.
Intuition
In the original feature space, the data may not be linearly separable. Mapping to a higher-dimensional space can make it separable. The kernel trick avoids explicitly computing this mapping — it computes inner products in the high-dimensional space directly from the original inputs. SVMs then find the maximum-margin hyperplane in that space.
Formal Description
Hard-Margin SVM (Linearly Separable)
Find the hyperplane with maximum margin:

$$\min_{\mathbf{w},\,b}\ \frac{1}{2}\|\mathbf{w}\|^2 \quad \text{s.t.}\quad y_i(\mathbf{w}^\top\mathbf{x}_i + b) \ge 1 \quad \forall i$$

Dual problem (via Lagrangian):

$$\max_{\boldsymbol{\alpha}}\ \sum_i \alpha_i - \frac{1}{2}\sum_{i,j} \alpha_i \alpha_j y_i y_j\, \mathbf{x}_i^\top\mathbf{x}_j \quad \text{s.t.}\quad \alpha_i \ge 0,\ \ \sum_i \alpha_i y_i = 0$$

Support vectors: training points with $\alpha_i > 0$ (on the margin boundaries, where $y_i(\mathbf{w}^\top\mathbf{x}_i + b) = 1$). All other points have $\alpha_i = 0$.

Prediction:

$$f(\mathbf{x}) = \operatorname{sign}\!\left(\sum_i \alpha_i y_i\, \mathbf{x}_i^\top\mathbf{x} + b\right)$$
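The dual-form prediction can be checked numerically. A minimal sketch, assuming scikit-learn and NumPy are available (the data is synthetic): `SVC.dual_coef_` stores $\alpha_i y_i$ for the support vectors, so the decision function can be rebuilt from the support vectors alone.

```python
import numpy as np
from sklearn.svm import SVC

# Synthetic, well-separated two-class data (hypothetical example)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 0.5, (20, 2)), rng.normal(2, 0.5, (20, 2))])
y = np.array([-1] * 20 + [1] * 20)

# A very large C approximates the hard-margin SVM
clf = SVC(kernel="linear", C=1e6).fit(X, y)

# f(x) = sum over support vectors of alpha_i * y_i * <x_i, x> + b
w = (clf.dual_coef_ @ clf.support_vectors_).ravel()
manual = X @ w + clf.intercept_[0]
assert np.allclose(manual, clf.decision_function(X))
```

Only the support vectors contribute to the sum, which is why the fitted model stores just `support_vectors_` rather than the whole training set.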
Soft-Margin SVM (C-SVM)
Introduces slack variables $\xi_i \ge 0$ to allow margin violations:

$$\min_{\mathbf{w},\,b,\,\boldsymbol{\xi}}\ \frac{1}{2}\|\mathbf{w}\|^2 + C\sum_i \xi_i \quad \text{s.t.}\quad y_i(\mathbf{w}^\top\mathbf{x}_i + b) \ge 1 - \xi_i,\ \ \xi_i \ge 0$$

$C$ controls the trade-off: large $C$ → low tolerance for misclassifications → narrower margin; small $C$ → wider margin, more misclassifications.

Dual identical to the hard-margin case, but with box constraints: $0 \le \alpha_i \le C$.
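The effect of $C$ can be seen directly on the margin width. A sketch assuming scikit-learn (synthetic overlapping data): since the margin is $2/\|\mathbf{w}\|$, a larger $C$ yields a larger $\|\mathbf{w}\|$ and hence a narrower margin.

```python
import numpy as np
from sklearn.svm import SVC

# Overlapping classes, so slack variables actually matter
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 1.0, (50, 2)), rng.normal(1, 1.0, (50, 2))])
y = np.array([-1] * 50 + [1] * 50)

loose = SVC(kernel="linear", C=0.01).fit(X, y)    # tolerant: wide margin
strict = SVC(kernel="linear", C=100.0).fit(X, y)  # strict: narrow margin

# Margin width is 2 / ||w||, so larger C => larger ||w|| => narrower margin
assert np.linalg.norm(strict.coef_) >= np.linalg.norm(loose.coef_)
```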
The Kernel Trick
Replace every inner product $\mathbf{x}_i^\top\mathbf{x}_j$ with a kernel function $K(\mathbf{x}_i, \mathbf{x}_j) = \varphi(\mathbf{x}_i)^\top\varphi(\mathbf{x}_j)$, where $\varphi$ is the (implicit) feature map — the mapping itself is never computed.
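The identity $K(\mathbf{x}, \mathbf{z}) = \varphi(\mathbf{x})^\top\varphi(\mathbf{z})$ can be verified by hand for a small case. A NumPy sketch for the homogeneous degree-2 polynomial kernel on 2-D inputs, whose explicit feature map is $\varphi(\mathbf{x}) = (x_1^2, \sqrt{2}\,x_1 x_2, x_2^2)$:

```python
import numpy as np

def phi(x):
    # Explicit degree-2 feature map for 2-D input
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

x = np.array([1.0, 2.0])
z = np.array([3.0, 0.5])

kernel_value = (x @ z) ** 2        # kernel trick: work in the original space
explicit_value = phi(x) @ phi(z)   # explicit mapping: work in feature space
assert np.isclose(kernel_value, explicit_value)  # both give 16.0
```

For degree $d$ on $p$-dimensional inputs the explicit map has $O(p^d)$ coordinates, while the kernel evaluation stays $O(p)$ — this gap is the point of the trick.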
Common kernels:
| Kernel | Formula | Properties |
|---|---|---|
| Linear | $K(\mathbf{x}, \mathbf{z}) = \mathbf{x}^\top\mathbf{z}$ | No mapping; fast |
| Polynomial | $K(\mathbf{x}, \mathbf{z}) = (\mathbf{x}^\top\mathbf{z} + c)^d$ | Degree-$d$ interactions |
| RBF (Gaussian) | $K(\mathbf{x}, \mathbf{z}) = \exp(-\gamma\|\mathbf{x} - \mathbf{z}\|^2)$ | Infinite-dimensional; most popular |
| Sigmoid | $K(\mathbf{x}, \mathbf{z}) = \tanh(\kappa\,\mathbf{x}^\top\mathbf{z} + c)$ | Similar to single-layer NN |
Mercer’s condition: a symmetric function $K(\mathbf{x}, \mathbf{z})$ is a valid kernel iff the Gram matrix $K_{ij} = K(\mathbf{x}_i, \mathbf{x}_j)$ is positive semidefinite for any finite sample.
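The PSD condition is easy to spot-check numerically. A NumPy sketch that builds an RBF Gram matrix on random points and confirms its eigenvalues are non-negative (up to floating-point noise):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(10, 3))
gamma = 0.5

# RBF Gram matrix: K[i, j] = exp(-gamma * ||x_i - x_j||^2)
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-gamma * sq_dists)

eigvals = np.linalg.eigvalsh(K)  # eigvalsh: for symmetric matrices
assert eigvals.min() >= -1e-10   # PSD up to numerical round-off
```

A spot-check on one sample cannot *prove* validity (Mercer quantifies over all finite samples), but a negative eigenvalue on any sample immediately disqualifies a candidate kernel.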
RBF kernel: the parameter $\gamma$ (inverse bandwidth) controls the smoothness of the decision boundary. Large $\gamma$ → narrow Gaussians → complex boundary; small $\gamma$ → smooth boundary.
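A sketch of the $\gamma$ trade-off, assuming scikit-learn (synthetic overlapping data): a very large $\gamma$ makes each Gaussian so narrow that the model can effectively memorise the training set, while a small $\gamma$ gives a smooth, near-linear boundary.

```python
import numpy as np
from sklearn.svm import SVC

# Noisy, overlapping classes (synthetic)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 1.0, (50, 2)), rng.normal(1, 1.0, (50, 2))])
y = np.array([0] * 50 + [1] * 50)

smooth = SVC(kernel="rbf", gamma=0.01).fit(X, y)   # wide Gaussians, smooth boundary
wiggly = SVC(kernel="rbf", gamma=100.0).fit(X, y)  # narrow Gaussians, memorises

# The high-gamma model fits the training data at least as well --
# which on overlapping classes is a symptom of overfitting, not quality
assert wiggly.score(X, y) >= smooth.score(X, y)
```

Training accuracy alone would therefore favour large $\gamma$; in practice $\gamma$ (and $C$) are chosen by cross-validation on held-out data.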
SVR (Support Vector Regression)
$\varepsilon$-insensitive loss: residuals with $|y_i - f(\mathbf{x}_i)| \le \varepsilon$ incur no penalty; larger residuals are penalised linearly. The dual formulation is analogous to the classification SVM.
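Because points strictly inside the $\varepsilon$-tube get zero dual coefficients, widening the tube reduces the number of support vectors. A scikit-learn sketch on synthetic 1-D data:

```python
import numpy as np
from sklearn.svm import SVR

# Noisy 1-D regression data (synthetic)
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, (80, 1))
y = np.sin(X).ravel() + 0.1 * rng.normal(size=80)

narrow = SVR(kernel="rbf", epsilon=0.01).fit(X, y)  # tight tube: many SVs
wide = SVR(kernel="rbf", epsilon=0.5).fit(X, y)     # loose tube: most residuals ignored

# Residuals inside the tube cost nothing, so a wider tube leaves fewer SVs
assert len(wide.support_) <= len(narrow.support_)
```

This sparsity is the practical payoff of the $\varepsilon$-insensitive loss: prediction cost scales with the number of support vectors, not the training-set size.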
Kernel Ridge Regression
Closed-form kernel regression: $f(\mathbf{x}) = \sum_i \alpha_i K(\mathbf{x}_i, \mathbf{x})$, where $\boldsymbol{\alpha} = (K + \lambda I)^{-1}\mathbf{y}$ and $K_{ij} = K(\mathbf{x}_i, \mathbf{x}_j)$. Equivalent to ridge regression in the RKHS.
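The closed form fits in a few lines of NumPy. A sketch with an RBF kernel on synthetic data ($\gamma$ and $\lambda$ chosen arbitrarily for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.uniform(-3, 3, (30, 1))
y = np.sin(X).ravel() + 0.1 * rng.normal(size=30)

gamma, lam = 0.5, 1e-2

# RBF Gram matrix on the training inputs
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-gamma * sq_dists)

# alpha = (K + lambda * I)^-1 y  -- the closed-form solution above
alpha = np.linalg.solve(K + lam * np.eye(len(X)), y)

def predict(x_new):
    # f(x) = sum_i alpha_i * K(x_i, x)
    k = np.exp(-gamma * ((X - x_new) ** 2).sum(-1))
    return k @ alpha

preds = np.array([predict(x) for x in X])
assert np.mean((preds - y) ** 2) < 0.05  # fits the noisy training data closely
```

Unlike the SVM dual, every $\alpha_i$ is generally nonzero here, so kernel ridge regression trades the SVM's sparsity for a direct linear solve.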
Applications
- SVMs were dominant in text classification, image recognition, and bioinformatics before deep learning.
- Kernel methods remain useful for small datasets or specialised kernels (e.g., string kernels for sequences, graph kernels).
- SVM classifiers are a good baseline when the dataset fits in memory and has < ~10k samples.
Trade-offs
| Aspect | SVM | Neural Network |
|---|---|---|
| Small data | ✅ Works well | ❌ May overfit |
| Large data | ❌ $O(n^2)$ kernel matrix | ✅ Scalable with SGD |
| Interpretability | Moderate (support vectors) | Low |
| Kernel choice | Requires domain knowledge | Learned end-to-end |