SVM Mathematical Foundations¶
This note presents SVM from a mathematics-first perspective: constrained optimization, KKT conditions, primal-dual formulation, and kernelized extension.
SECTION 1: THEORY CONCEPTS - DEFINITIONS & RELATIONSHIPS¶
1.1 SVM as Quadratic Programming¶
SVM is a constrained quadratic optimization problem:
- the objective is quadratic in \(w\),
- the constraints are linear in \((w, b)\).
1.2 Decision Boundary and Margin¶
Decision boundary: \(w^\top x + b = 0\)
Supporting hyperplanes: \(w^\top x + b = +1\) and \(w^\top x + b = -1\)
Margin width: \(\dfrac{2}{\|w\|}\)
So maximizing margin equals minimizing \(\|w\|\).
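As a quick numeric sanity check of the margin formula, the sketch below computes \(2/\|w\|\) for a hypothetical 2D weight vector (the values are illustrative, not from the text):

```python
import math

# Hypothetical weight vector and bias for a 2D boundary w.x + b = 0.
w = [3.0, 4.0]   # ||w|| = 5
b = -2.0

norm_w = math.sqrt(sum(wi * wi for wi in w))
margin_width = 2.0 / norm_w   # distance between the two supporting hyperplanes

print(norm_w)         # 5.0
print(margin_width)   # 0.4
```

Shrinking \(\|w\|\) directly widens the margin, which is exactly why the objective minimizes \(\|w\|\).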
1.2A Margin Geometry¶
Interpretation:
- the middle black line is the decision boundary,
- the dashed green lines are the supporting hyperplanes,
- the circled points are support vectors,
- maximizing the margin means maximizing the distance to the support vectors.
1.3 Hard vs Soft Margin¶
- Hard margin: zero classification violations (linearly separable data).
- Soft margin: allows violations using slack variables for realistic data.
1.4 Role of KKT in SVM¶
KKT conditions provide optimality checks for SVM’s constrained problem:
1. stationarity
2. primal feasibility
3. dual feasibility
4. complementary slackness
1.5 Why Dual Form Is Important¶
The dual form:
- introduces Lagrange multipliers \(\alpha_i\),
- uses only dot products between samples,
- enables the kernel trick,
- yields a sparse solution (support vectors only).
1.6 Kernel Trick Intuition¶
Nonlinear class boundaries in input space can become linear in transformed feature space. Kernel functions compute transformed-space dot products directly, avoiding explicit mapping.
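A concrete instance of this identity: for 2D inputs, the degree-2 polynomial kernel \(K(x,z) = (x^\top z)^2\) equals the dot product under the explicit map \(\phi(x) = (x_1^2,\ \sqrt{2}\,x_1 x_2,\ x_2^2)\). The sketch below verifies this on hypothetical points:

```python
import math

def phi(x):
    # Explicit degree-2 feature map: (x1^2, sqrt(2)*x1*x2, x2^2)
    x1, x2 = x
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)

def poly_kernel(x, z):
    # K(x, z) = (x . z)^2 gives the same value without building phi
    dot = x[0] * z[0] + x[1] * z[1]
    return dot * dot

x, z = (1.0, 2.0), (3.0, -1.0)
explicit = sum(a * b for a, b in zip(phi(x), phi(z)))
implicit = poly_kernel(x, z)
print(explicit, implicit)  # both 1.0
```

The kernel evaluates the 3D dot product at the cost of a 2D one; for higher degrees or the RBF kernel, the implicit feature space is far larger (or infinite-dimensional), so the savings are decisive.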
1.6A Kernel Mapping Geometry¶
Left panel: classes are not linearly separable in original 2D space.
Right panel: after feature mapping, a linear separator exists.
SECTION 2: FORMULAE¶
2.1 Hard-Margin Primal¶
\[\min_{w,b}\ \frac{1}{2}\|w\|^2\]
subject to
\[y_i(w^\top x_i + b) \ge 1, \quad i = 1, \dots, n.\]
2.2 Soft-Margin Primal¶
\[\min_{w,b,\xi}\ \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{n}\xi_i\]
subject to
\[y_i(w^\top x_i + b) \ge 1 - \xi_i, \qquad \xi_i \ge 0, \quad i = 1, \dots, n.\]
2.3 Lagrangian (Soft Margin)¶
\[L(w, b, \xi, \alpha, \mu) = \frac{1}{2}\|w\|^2 + C\sum_i \xi_i - \sum_i \alpha_i\big(y_i(w^\top x_i + b) - 1 + \xi_i\big) - \sum_i \mu_i \xi_i\]
with \(\alpha_i \ge 0\) and \(\mu_i \ge 0\).
2.4 Dual Optimization¶
\[\max_{\alpha}\ \sum_i \alpha_i - \frac{1}{2}\sum_{i,j} \alpha_i \alpha_j y_i y_j\, x_i^\top x_j\]
subject to
\[0 \le \alpha_i \le C, \qquad \sum_i \alpha_i y_i = 0.\]
2.5 Prediction Function¶
\[f(x) = \operatorname{sign}\Big(\sum_{i \in SV} \alpha_i y_i K(x_i, x) + b\Big)\]
2.6 KKT Conditions (SVM Form)¶
Primal feasibility: \(y_i(w^\top x_i + b) \ge 1 - \xi_i\), \(\xi_i \ge 0\)
Dual feasibility: \(\alpha_i \ge 0\), \(\mu_i \ge 0\)
Complementary slackness: \(\alpha_i\big(y_i(w^\top x_i + b) - 1 + \xi_i\big) = 0\), \(\mu_i \xi_i = 0\)
Stationarity gives: \(w = \sum_i \alpha_i y_i x_i\), \(\sum_i \alpha_i y_i = 0\), \(\alpha_i = C - \mu_i\)
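The KKT conditions can be checked numerically on a tiny hand-derived example. For the two 1-D points \(x=+1\) (label \(+1\)) and \(x=-1\) (label \(-1\)), the hard-margin solution is \(w=1\), \(b=0\) with \(\alpha_1=\alpha_2=0.5\); the sketch below verifies each condition (the data and multipliers are assumptions chosen for illustration):

```python
# Toy 1-D KKT check: two points x = +1 (y = +1) and x = -1 (y = -1).
X = [1.0, -1.0]
y = [1, -1]
alpha = [0.5, 0.5]
b = 0.0

# Stationarity: w = sum_i alpha_i y_i x_i
w = sum(a * yi * xi for a, yi, xi in zip(alpha, y, X))

# Primal feasibility: y_i (w x_i + b) >= 1 (hard margin, xi_i = 0)
margins = [yi * (w * xi + b) for xi, yi in zip(X, y)]

# Dual feasibility / equality constraint: sum_i alpha_i y_i = 0
balance = sum(a * yi for a, yi in zip(alpha, y))

# Complementary slackness: alpha_i * (y_i (w x_i + b) - 1) = 0
slackness = [a * (m - 1) for a, m in zip(alpha, margins)]

print(w, margins, balance, slackness)  # 1.0 [1.0, 1.0] 0.0 [0.0, 0.0]
```

Both points sit exactly on the margin, so complementary slackness holds with nonzero \(\alpha_i\): they are support vectors.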
SECTION 3: SOLVED / DERIVATION-STYLE PROBLEMS¶
Problem 1: Why maximizing margin becomes minimizing \(\|w\|^2\)¶
Since margin \(= 2/\|w\|\), maximizing the margin is equivalent to minimizing \(\|w\|\); for differentiability and convenience we instead minimize
\[\frac{1}{2}\|w\|^2.\]
Problem 2: Build dual constraints from primal (soft margin)¶
From stationarity with respect to \(\xi_i\): \(C - \alpha_i - \mu_i = 0\).
With \(\alpha_i \ge 0\) and \(\mu_i \ge 0\), this yields: \(0 \le \alpha_i \le C\).
From stationarity with respect to \(b\): \(\sum_i \alpha_i y_i = 0\).
These are the standard dual constraints.
Problem 3: Linear vs Nonlinear separability logic¶
If no linear separator exists in the original space, use a kernelized SVM:
\[K(x_i, x_j) = \phi(x_i)^\top \phi(x_j)\]
The model remains linear in transformed space and nonlinear in input space.
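To make this concrete, the sketch below trains a kernel *perceptron* (a simpler stand-in for a full SVM solver, used here only to illustrate the kernelization logic) on XOR-style data, which no line in 2D can separate. With a degree-2 polynomial kernel the model is linear in the implicit feature space and classifies all four points:

```python
# Kernel perceptron on XOR-style data: a sketch of "linear in feature
# space, nonlinear in input space". Not a full SVM solver.
def kernel(x, z):
    # Degree-2 polynomial kernel K(x, z) = (1 + x.z)^2
    dot = x[0] * z[0] + x[1] * z[1]
    return (1.0 + dot) ** 2

X = [(1, 1), (-1, -1), (1, -1), (-1, 1)]   # XOR layout
y = [1, 1, -1, -1]                         # not linearly separable in 2D
alpha = [0.0] * 4

def f(x):
    # Dual-form decision value: sum_i alpha_i y_i K(x_i, x)
    return sum(a * yi * kernel(xi, x) for a, yi, xi in zip(alpha, y, X))

for _ in range(10):                        # a few perceptron passes
    mistakes = 0
    for j in range(4):
        if y[j] * f(X[j]) <= 0:            # misclassified: bump its multiplier
            alpha[j] += 1.0
            mistakes += 1
    if mistakes == 0:
        break

preds = [1 if f(x) > 0 else -1 for x in X]
print(preds)  # [1, 1, -1, -1]
```

Note the update only ever touches the dual coefficients \(\alpha_j\) and evaluates kernels; the feature map \(\phi\) is never constructed, which is the same structural property the SVM dual exploits.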
SECTION 4: TRICKS & MENTAL MODELS¶
4.1 Two-Hyperplane Model¶
Think in three objects, not one:
1. the class \(+1\) supporting hyperplane,
2. the class \(-1\) supporting hyperplane,
3. the middle decision hyperplane.
SVM maximizes distance between class hyperplanes.
4.2 Support Vector Insight¶
Only the near-boundary points matter: many training points can move without changing the boundary, as long as the support vectors stay fixed.
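This sparsity is visible directly in the dual decision function: points with \(\alpha_i = 0\) contribute nothing to the sum. The sketch below uses a hypothetical 1-D configuration with hand-picked multipliers where only the two inner points are support vectors:

```python
# Sketch: points with alpha_i = 0 never enter the decision function,
# so dropping them leaves every prediction unchanged.
X = [(-3.0,), (-1.0,), (1.0,), (3.0,)]
y = [-1, -1, 1, 1]
alpha = [0.0, 0.5, 0.5, 0.0]   # only the inner points are support vectors
b = 0.0

def decide(points, labels, alphas, x):
    # Linear-kernel dual decision: sign(sum_i alpha_i y_i (x_i . x) + b)
    s = sum(a * yi * (xi[0] * x[0]) for a, yi, xi in zip(alphas, labels, points))
    return 1 if s + b > 0 else -1

full = [decide(X, y, alpha, x) for x in X]
sv_only = [decide(X[1:3], y[1:3], alpha[1:3], x) for x in X]
print(full == sv_only)  # True: non-support points never enter the sum
```

The outer points at \(\pm 3\) could move anywhere outside the margin without affecting the fitted boundary.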
4.3 Optimization Shortcut¶
When deriving, do not solve primal directly first. Write Lagrangian and move to dual quickly; kernelization becomes immediate.
4.4 Parameter Intuition¶
- high \(C\): heavy penalty on margin violations, weaker regularization, narrower margin
- low \(C\): stronger regularization, wider margin, more violations tolerated
- high \(\gamma\) (RBF): very local boundaries
- low \(\gamma\): smoother global boundaries
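The \(\gamma\) intuition follows directly from the RBF kernel formula \(K(x,z) = \exp(-\gamma \|x-z\|^2)\): a larger \(\gamma\) makes a point's influence decay faster with distance. A small numeric sketch (with assumed values):

```python
import math

def rbf(x, z, gamma):
    # RBF kernel: K(x, z) = exp(-gamma * ||x - z||^2)
    d2 = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * d2)

x, z = (0.0,), (2.0,)   # two points at distance 2, so ||x - z||^2 = 4
low = rbf(x, z, gamma=0.1)    # ~0.67: points still "see" each other
high = rbf(x, z, gamma=10.0)  # ~4e-18: influence is essentially local
print(low, high)
```

With high \(\gamma\) each training point only influences its immediate neighborhood, which is why the resulting boundaries can become very wiggly and prone to overfitting.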
4.5 Practical Separation Check¶
If linear logistic regression performs strongly and boundary looks approximately linear, linear SVM is usually a strong baseline.
SECTION 5: IMPORTANT EXAM POINTS¶
- SVM is a constrained quadratic optimization problem.
- Hard margin assumes strict separability; soft margin handles violations via \(\xi_i\).
- Objective \(\frac12\|w\|^2\) is equivalent to maximizing geometric margin.
- KKT conditions are central to SVM derivation.
- Dual constraints are: \(0\le\alpha_i\le C\), \(\sum_i\alpha_i y_i=0\).
- Support vectors are points with nonzero \(\alpha_i\).
- Kernel trick replaces dot product with \(K(x_i,x_j)\).
- SVM remains linear in transformed space.
- Decision function uses only support vectors.
- Scaling features before SVM is practically mandatory.
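Since SVM objectives and RBF distances are scale-sensitive, features on large raw scales dominate unless standardized. A minimal plain-Python sketch of per-feature standardization (in practice one would typically reach for a library scaler such as scikit-learn's `StandardScaler`):

```python
# Standardize one feature column to zero mean and unit variance.
def standardize(column):
    n = len(column)
    mean = sum(column) / n
    var = sum((v - mean) ** 2 for v in column) / n
    std = var ** 0.5 or 1.0   # guard against zero-variance features
    return [(v - mean) / std for v in column]

raw = [10.0, 20.0, 30.0, 40.0]   # hypothetical feature on a large scale
scaled = standardize(raw)
print(scaled)  # symmetric around 0, unit variance
```

Whatever scaler is used, it must be fit on the training data only and then applied unchanged to test data, or the margin geometry seen at training time will not match inference.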