SVM Mathematical Foundations¶
This note presents SVM from a mathematics-first perspective: constrained optimization, KKT conditions, primal-dual formulation, and kernelized extension.
SECTION 1: THEORY CONCEPTS - DEFINITIONS & RELATIONSHIPS¶
1.1 SVM as Quadratic Programming¶
SVM is a constrained quadratic optimization problem:
- the objective is quadratic in \(w\),
- the constraints are linear in \((w, b)\).
1.2 Decision Boundary and Margin¶
Decision boundary: \(w^\top x + b = 0\)
Supporting hyperplanes: \(w^\top x + b = +1\) and \(w^\top x + b = -1\)
Margin width: \(\dfrac{2}{\|w\|}\)
So maximizing margin equals minimizing \(\|w\|\).
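As a quick numeric sanity check of the margin formula, the sketch below computes \(2/\|w\|\) for a hypothetical 2D weight vector (the values are illustrative, not from the text):

```python
import math

# Hypothetical weight vector and bias for a 2D boundary w.x + b = 0.
w = [3.0, 4.0]   # ||w|| = 5
b = -2.0

norm_w = math.sqrt(sum(wi * wi for wi in w))
margin_width = 2.0 / norm_w   # distance between the two supporting hyperplanes

print(norm_w)         # 5.0
print(margin_width)   # 0.4
```

Shrinking \(\|w\|\) directly widens the margin, which is exactly why the objective minimizes \(\|w\|\).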
1.2A Margin Geometry¶
Interpretation:
- the middle black line is the decision boundary,
- the dashed green lines are the supporting hyperplanes,
- the circled points are support vectors,
- maximizing the margin means maximizing the distance to the support vectors.
1.3 Hard vs Soft Margin¶
- Hard margin: zero classification violations (linearly separable data).
- Soft margin: allows violations using slack variables for realistic data.
1.4 Role of KKT in SVM¶
KKT conditions provide optimality checks for SVM’s constrained problem:
1. stationarity
2. primal feasibility
3. dual feasibility
4. complementary slackness
1.5 Why Dual Form Is Important¶
The dual form:
- introduces Lagrange multipliers \(\alpha_i\),
- uses only dot products between samples,
- enables the kernel trick,
- yields a sparse solution (support vectors only).
1.6 Kernel Trick Intuition¶
Nonlinear class boundaries in input space can become linear in transformed feature space. Kernel functions compute transformed-space dot products directly, avoiding explicit mapping.
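A concrete instance of this identity: for 2D inputs, the degree-2 polynomial kernel \(K(x,z) = (x^\top z)^2\) equals the dot product under the explicit map \(\phi(x) = (x_1^2,\ \sqrt{2}\,x_1 x_2,\ x_2^2)\). The sketch below verifies this on hypothetical points:

```python
import math

def phi(x):
    # Explicit degree-2 feature map: (x1^2, sqrt(2)*x1*x2, x2^2)
    x1, x2 = x
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)

def poly_kernel(x, z):
    # K(x, z) = (x . z)^2 gives the same value without building phi
    dot = x[0] * z[0] + x[1] * z[1]
    return dot * dot

x, z = (1.0, 2.0), (3.0, -1.0)
explicit = sum(a * b for a, b in zip(phi(x), phi(z)))
implicit = poly_kernel(x, z)
print(explicit, implicit)  # both 1.0
```

The kernel evaluates the 3D dot product at the cost of a 2D one; for higher degrees or the RBF kernel, the implicit feature space is far larger (or infinite-dimensional), so the savings are decisive.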
1.6A Kernel Mapping Geometry¶
Left panel: classes are not linearly separable in original 2D space.
Right panel: after feature mapping, a linear separator exists.
SECTION 2: FORMULAE¶
2.1 Hard-Margin Primal¶
\[\min_{w,b}\ \frac{1}{2}\|w\|^2\]
subject to
\[y_i(w^\top x_i + b) \ge 1, \quad i = 1, \dots, n.\]
2.2 Soft-Margin Primal¶
\[\min_{w,b,\xi}\ \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{n}\xi_i\]
subject to
\[y_i(w^\top x_i + b) \ge 1 - \xi_i, \qquad \xi_i \ge 0, \quad i = 1, \dots, n.\]
2.3 Lagrangian (Soft Margin)¶
\[L(w, b, \xi, \alpha, \mu) = \frac{1}{2}\|w\|^2 + C\sum_i \xi_i - \sum_i \alpha_i\big(y_i(w^\top x_i + b) - 1 + \xi_i\big) - \sum_i \mu_i \xi_i\]
with \(\alpha_i \ge 0\) and \(\mu_i \ge 0\).
2.4 Dual Optimization¶
\[\max_{\alpha}\ \sum_i \alpha_i - \frac{1}{2}\sum_{i,j} \alpha_i \alpha_j y_i y_j\, x_i^\top x_j\]
subject to
\[0 \le \alpha_i \le C, \qquad \sum_i \alpha_i y_i = 0.\]
2.5 Prediction Function¶
\[f(x) = \operatorname{sign}\Big(\sum_{i \in SV} \alpha_i y_i K(x_i, x) + b\Big)\]
2.6 KKT Conditions (SVM Form)¶
Primal feasibility: \(y_i(w^\top x_i + b) \ge 1 - \xi_i\), \(\xi_i \ge 0\)
Dual feasibility: \(\alpha_i \ge 0\), \(\mu_i \ge 0\)
Complementary slackness: \(\alpha_i\big(y_i(w^\top x_i + b) - 1 + \xi_i\big) = 0\), \(\mu_i \xi_i = 0\)
Stationarity gives: \(w = \sum_i \alpha_i y_i x_i\), \(\sum_i \alpha_i y_i = 0\), \(\alpha_i = C - \mu_i\)
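The KKT conditions can be checked numerically on a tiny hand-derived example. For the two 1-D points \(x=+1\) (label \(+1\)) and \(x=-1\) (label \(-1\)), the hard-margin solution is \(w=1\), \(b=0\) with \(\alpha_1=\alpha_2=0.5\); the sketch below verifies each condition (the data and multipliers are assumptions chosen for illustration):

```python
# Toy 1-D KKT check: two points x = +1 (y = +1) and x = -1 (y = -1).
X = [1.0, -1.0]
y = [1, -1]
alpha = [0.5, 0.5]
b = 0.0

# Stationarity: w = sum_i alpha_i y_i x_i
w = sum(a * yi * xi for a, yi, xi in zip(alpha, y, X))

# Primal feasibility: y_i (w x_i + b) >= 1 (hard margin, xi_i = 0)
margins = [yi * (w * xi + b) for xi, yi in zip(X, y)]

# Dual feasibility / equality constraint: sum_i alpha_i y_i = 0
balance = sum(a * yi for a, yi in zip(alpha, y))

# Complementary slackness: alpha_i * (y_i (w x_i + b) - 1) = 0
slackness = [a * (m - 1) for a, m in zip(alpha, margins)]

print(w, margins, balance, slackness)  # 1.0 [1.0, 1.0] 0.0 [0.0, 0.0]
```

Both points sit exactly on the margin, so complementary slackness holds with nonzero \(\alpha_i\): they are support vectors.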
SECTION 3: SOLVED / DERIVATION-STYLE PROBLEMS¶
Problem 1: Why maximizing margin becomes minimizing \(\|w\|^2\)¶
Since margin \(= 2/\|w\|\), maximizing the margin is equivalent to minimizing \(\|w\|\); for differentiability and convenience we instead minimize
\[\frac{1}{2}\|w\|^2.\]
Problem 2: Build dual constraints from primal (soft margin)¶
From stationarity with respect to \(\xi_i\): \(C - \alpha_i - \mu_i = 0\).
With \(\alpha_i \ge 0\) and \(\mu_i \ge 0\), this yields: \(0 \le \alpha_i \le C\).
From stationarity with respect to \(b\): \(\sum_i \alpha_i y_i = 0\).
These are the standard dual constraints.
Problem 3: Linear vs Nonlinear separability logic¶
If no linear separator exists in the original space, use a kernelized SVM:
\[K(x_i, x_j) = \phi(x_i)^\top \phi(x_j)\]
The model remains linear in transformed space and nonlinear in input space.
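To make this concrete, the sketch below trains a kernel *perceptron* (a simpler stand-in for a full SVM solver, used here only to illustrate the kernelization logic) on XOR-style data, which no line in 2D can separate. With a degree-2 polynomial kernel the model is linear in the implicit feature space and classifies all four points:

```python
# Kernel perceptron on XOR-style data: a sketch of "linear in feature
# space, nonlinear in input space". Not a full SVM solver.
def kernel(x, z):
    # Degree-2 polynomial kernel K(x, z) = (1 + x.z)^2
    dot = x[0] * z[0] + x[1] * z[1]
    return (1.0 + dot) ** 2

X = [(1, 1), (-1, -1), (1, -1), (-1, 1)]   # XOR layout
y = [1, 1, -1, -1]                         # not linearly separable in 2D
alpha = [0.0] * 4

def f(x):
    # Dual-form decision value: sum_i alpha_i y_i K(x_i, x)
    return sum(a * yi * kernel(xi, x) for a, yi, xi in zip(alpha, y, X))

for _ in range(10):                        # a few perceptron passes
    mistakes = 0
    for j in range(4):
        if y[j] * f(X[j]) <= 0:            # misclassified: bump its multiplier
            alpha[j] += 1.0
            mistakes += 1
    if mistakes == 0:
        break

preds = [1 if f(x) > 0 else -1 for x in X]
print(preds)  # [1, 1, -1, -1]
```

Note the update only ever touches the dual coefficients \(\alpha_j\) and evaluates kernels; the feature map \(\phi\) is never constructed, which is the same structural property the SVM dual exploits.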
SECTION 4: TRICKS & MENTAL MODELS¶
4.1 Two-Hyperplane Model¶
Think in three objects, not one:
1. the class \(+1\) supporting hyperplane,
2. the class \(-1\) supporting hyperplane,
3. the middle decision hyperplane.
SVM maximizes distance between class hyperplanes.
4.2 Support Vector Insight¶
Only the near-boundary points matter: many training points can move without changing the boundary, as long as the support vectors stay fixed.
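This sparsity is visible directly in the dual decision function: points with \(\alpha_i = 0\) contribute nothing to the sum. The sketch below uses a hypothetical 1-D configuration with hand-picked multipliers where only the two inner points are support vectors:

```python
# Sketch: points with alpha_i = 0 never enter the decision function,
# so dropping them leaves every prediction unchanged.
X = [(-3.0,), (-1.0,), (1.0,), (3.0,)]
y = [-1, -1, 1, 1]
alpha = [0.0, 0.5, 0.5, 0.0]   # only the inner points are support vectors
b = 0.0

def decide(points, labels, alphas, x):
    # Linear-kernel dual decision: sign(sum_i alpha_i y_i (x_i . x) + b)
    s = sum(a * yi * (xi[0] * x[0]) for a, yi, xi in zip(alphas, labels, points))
    return 1 if s + b > 0 else -1

full = [decide(X, y, alpha, x) for x in X]
sv_only = [decide(X[1:3], y[1:3], alpha[1:3], x) for x in X]
print(full == sv_only)  # True: non-support points never enter the sum
```

The outer points at \(\pm 3\) could move anywhere outside the margin without affecting the fitted boundary.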
4.3 Optimization Shortcut¶
When deriving, do not solve primal directly first. Write Lagrangian and move to dual quickly; kernelization becomes immediate.
4.4 Parameter Intuition¶
- high \(C\): heavy penalty on margin violations, weaker regularization, narrower margin
- low \(C\): stronger regularization, wider margin, more violations tolerated
- high \(\gamma\) (RBF): very local boundaries
- low \(\gamma\): smoother global boundaries
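The \(\gamma\) intuition follows directly from the RBF kernel formula \(K(x,z) = \exp(-\gamma \|x-z\|^2)\): a larger \(\gamma\) makes a point's influence decay faster with distance. A small numeric sketch (with assumed values):

```python
import math

def rbf(x, z, gamma):
    # RBF kernel: K(x, z) = exp(-gamma * ||x - z||^2)
    d2 = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * d2)

x, z = (0.0,), (2.0,)   # two points at distance 2, so ||x - z||^2 = 4
low = rbf(x, z, gamma=0.1)    # ~0.67: points still "see" each other
high = rbf(x, z, gamma=10.0)  # ~4e-18: influence is essentially local
print(low, high)
```

With high \(\gamma\) each training point only influences its immediate neighborhood, which is why the resulting boundaries can become very wiggly and prone to overfitting.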
4.5 Practical Separation Check¶
If linear logistic regression performs strongly and boundary looks approximately linear, linear SVM is usually a strong baseline.
SECTION 5: IMPORTANT EXAM POINTS¶
- SVM is a constrained quadratic optimization problem.
- Hard margin assumes strict separability; soft margin handles violations via \(\xi_i\).
- Objective \(\frac12\|w\|^2\) is equivalent to maximizing geometric margin.
- KKT conditions are central to SVM derivation.
- Dual constraints are: \(0\le\alpha_i\le C\), \(\sum_i\alpha_i y_i=0\).
- Support vectors are points with nonzero \(\alpha_i\).
- Kernel trick replaces dot product with \(K(x_i,x_j)\).
- SVM remains linear in transformed space.
- Decision function uses only support vectors.
- Scaling features before SVM is practically mandatory.
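Since SVM objectives and RBF distances are scale-sensitive, features on large raw scales dominate unless standardized. A minimal plain-Python sketch of per-feature standardization (in practice one would typically reach for a library scaler such as scikit-learn's `StandardScaler`):

```python
# Standardize one feature column to zero mean and unit variance.
def standardize(column):
    n = len(column)
    mean = sum(column) / n
    var = sum((v - mean) ** 2 for v in column) / n
    std = var ** 0.5 or 1.0   # guard against zero-variance features
    return [(v - mean) / std for v in column]

raw = [10.0, 20.0, 30.0, 40.0]   # hypothetical feature on a large scale
scaled = standardize(raw)
print(scaled)  # symmetric around 0, unit variance
```

Whatever scaler is used, it must be fit on the training data only and then applied unchanged to test data, or the margin geometry seen at training time will not match inference.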