SVM Mathematical Foundations

This note presents SVM from a mathematics-first perspective: constrained optimization, KKT conditions, primal-dual formulation, and kernelized extension.


SECTION 1: THEORY CONCEPTS - DEFINITIONS & RELATIONSHIPS

1.1 SVM as Quadratic Programming

SVM is a constrained quadratic optimization problem:

  • the objective is quadratic in \(w\),
  • the constraints are linear in \((w,b)\).

1.2 Decision Boundary and Margin

Decision boundary:

\[ w^Tx + b = 0 \]

Supporting hyperplanes:

\[ w^Tx + b = +1,\qquad w^Tx + b = -1 \]

Margin width:

\[ \text{margin} = \frac{2}{\|w\|} \]

So maximizing margin equals minimizing \(\|w\|\).
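As a minimal sketch of this relationship, one can fit a near-hard-margin linear SVM on hand-built separable data and read the margin off \(\|w\|\). This assumes scikit-learn is available; the data and the large `C` are illustrative only:

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical toy data: two linearly separable point clusters.
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -3.0]])
y = np.array([1, 1, -1, -1])

# A very large C approximates the hard-margin problem.
clf = SVC(kernel="linear", C=1e6).fit(X, y)

w = clf.coef_[0]
margin = 2.0 / np.linalg.norm(w)
print(margin)  # close to 4*sqrt(2) ~ 5.657 for this geometry
```

For this data the closest opposite-class points are \((2,2)\) and \((-2,-2)\), so the optimal margin is their separation along the normal direction, \(4\sqrt2\).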

1.2A Margin Geometry

[Figure: SVM margin geometry]

Interpretation:

  • the middle black line is the decision boundary,
  • the dashed green lines are the supporting hyperplanes,
  • the circled points are the support vectors,
  • maximizing the margin means maximizing distance to the support vectors.

1.3 Hard vs Soft Margin

  • Hard margin: zero classification violations (linearly separable data).
  • Soft margin: allows violations using slack variables for realistic data.
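A quick way to see the slack variables at work: fit a soft-margin SVM on data that is not linearly separable and compute \(\xi_i = \max(0,\,1 - y_i f(x_i))\) by hand. A sketch assuming scikit-learn, with deliberately inseparable toy data:

```python
import numpy as np
from sklearn.svm import SVC

# Collinear toy points with one label placed inside the other class,
# so no linear separator exists and some slack must be positive.
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -3.0], [-2.5, -2.5]])
y = np.array([1, 1, -1, -1, 1])   # last label conflicts with its location

clf = SVC(kernel="linear", C=1.0).fit(X, y)

# Slack xi_i = max(0, 1 - y_i f(x_i)); positive slack marks a margin violation.
xi = np.maximum(0.0, 1.0 - y * clf.decision_function(X))
print(xi)
```

The mislabeled point necessarily receives \(\xi_i > 0\); the cleanly classified points typically get \(\xi_i = 0\).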

1.4 Role of KKT in SVM

KKT conditions provide optimality checks for SVM’s constrained problem:

  1. stationarity
  2. primal feasibility
  3. dual feasibility
  4. complementary slackness

1.5 Why Dual Form Is Important

The dual form:

  • introduces Lagrange multipliers \(\alpha_i\),
  • involves the data only through dot products between samples,
  • enables the kernel trick,
  • yields a sparse solution (support vectors only).

1.6 Kernel Trick Intuition

Nonlinear class boundaries in input space can become linear in transformed feature space. Kernel functions compute transformed-space dot products directly, avoiding explicit mapping.
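This identity can be checked numerically. For the degree-2 polynomial kernel \(K(x,z) = (x^Tz)^2\) in 2D, the explicit feature map is \(\varphi(x) = (x_1^2,\ \sqrt2\,x_1x_2,\ x_2^2)\); the kernel value equals the dot product after mapping. A self-contained numpy sketch (the vectors are arbitrary examples):

```python
import numpy as np

def phi(x):
    # Explicit degree-2 feature map for 2-D input (no bias term).
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

k_direct = (x @ z) ** 2      # kernel computed directly in input space
k_mapped = phi(x) @ phi(z)   # dot product after explicit mapping

print(k_direct, k_mapped)    # both equal 1.0 here
```

The kernel evaluates a 3-dimensional dot product while only touching the 2-dimensional inputs; for higher degrees or the RBF kernel the saving becomes dramatic (the RBF feature space is infinite-dimensional).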

1.6A Kernel Mapping Geometry

[Figure: kernel trick geometry — input space vs transformed feature space]

Left panel: classes are not linearly separable in original 2D space.
Right panel: after feature mapping, a linear separator exists.


SECTION 2: FORMULAE

2.1 Hard-Margin Primal

\[ \min_{w,b}\;\frac12\|w\|^2 \]

subject to

\[ y_i(w^Tx_i+b)\ge1,\quad i=1,\dots,n \]

2.2 Soft-Margin Primal

\[ \min_{w,b,\xi}\;\frac12\|w\|^2 + C\sum_{i=1}^{n}\xi_i \]

subject to

\[ y_i(w^Tx_i+b)\ge1-\xi_i, \qquad \xi_i\ge0 \]

2.3 Lagrangian (Soft Margin)

\[ \mathcal{L}(w,b,\xi,\alpha,\mu)= \frac12\|w\|^2 + C\sum_i\xi_i -\sum_i\alpha_i[y_i(w^Tx_i+b)-1+\xi_i] -\sum_i\mu_i\xi_i \]

with

\[ \alpha_i\ge0,\;\mu_i\ge0 \]

2.4 Dual Optimization

\[ \max_{\alpha} \sum_{i=1}^{n}\alpha_i -\frac12\sum_{i,j=1}^{n}\alpha_i\alpha_j y_i y_j \langle x_i,x_j\rangle \]

subject to

\[ 0\le\alpha_i\le C, \qquad \sum_{i=1}^{n}\alpha_i y_i=0 \]
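The dual QP above is small enough to solve with a generic solver on toy data. A sketch assuming scipy is available; the four points, the large `C`, and SLSQP are all illustrative choices, not how production SVM solvers (e.g. SMO) work:

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical 4-point problem; a large C approximates the hard margin.
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -3.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
C = 1e3

G = (y[:, None] * X) @ (y[:, None] * X).T   # G_ij = y_i y_j <x_i, x_j>

def neg_dual(a):
    # Negated dual objective (minimize instead of maximize).
    return 0.5 * a @ G @ a - a.sum()

cons = {"type": "eq", "fun": lambda a: a @ y}   # sum_i alpha_i y_i = 0
bnds = [(0.0, C)] * len(y)                      # box constraint 0 <= alpha_i <= C

res = minimize(neg_dual, x0=np.zeros(len(y)), bounds=bnds,
               constraints=cons, method="SLSQP")
alpha = res.x

w = (alpha * y) @ X   # stationarity: w = sum_i alpha_i y_i x_i
print(w)
```

For this geometry the analytic optimum is \(w = (0.25, 0.25)\), carried entirely by the two closest points.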

2.5 Prediction Function

\[ f(x)=\sum_{i\in SV}\alpha_i y_i K(x_i,x)+b \]
\[ \hat y=\operatorname{sign}(f(x)) \]
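This expansion can be reconstructed from a fitted model. In scikit-learn (assumed available; data and hyperparameters illustrative), `dual_coef_` stores \(\alpha_i y_i\) for the support vectors, so the prediction function can be rebuilt by hand and compared to the library's own decision values:

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.svm import SVC

# Illustrative random data; the identity below holds for any fitted SVC.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(2, 1, (20, 2)), rng.normal(-2, 1, (20, 2))])
y = np.array([1] * 20 + [-1] * 20)

clf = SVC(kernel="rbf", gamma=0.5, C=1.0).fit(X, y)

# f(x) = sum_{i in SV} alpha_i y_i K(x_i, x) + b, using only support vectors.
K = rbf_kernel(X, clf.support_vectors_, gamma=0.5)
f_manual = K @ clf.dual_coef_[0] + clf.intercept_[0]

print(np.allclose(f_manual, clf.decision_function(X)))  # True
```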

2.6 KKT Conditions (SVM Form)

Primal feasibility:

\[ y_i(w^Tx_i+b)-1+\xi_i\ge0, \quad \xi_i\ge0 \]

Dual feasibility:

\[ \alpha_i\ge0, \quad \mu_i\ge0 \]

Complementary slackness:

\[ \alpha_i\,[y_i(w^Tx_i+b)-1+\xi_i]=0, \qquad \mu_i\xi_i=0 \]

Stationarity gives:

\[ w=\sum_i\alpha_i y_i x_i, \quad \sum_i\alpha_i y_i=0, \quad \alpha_i+\mu_i=C \]
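The stationarity condition \(w = \sum_i \alpha_i y_i x_i\) is directly checkable on a fitted linear SVM. A sketch assuming scikit-learn, with illustrative random data:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(2, 1, (20, 2)), rng.normal(-2, 1, (20, 2))])
y = np.array([1] * 20 + [-1] * 20)

clf = SVC(kernel="linear", C=1.0).fit(X, y)

# coef_ is w; dual_coef_ holds alpha_i * y_i over the support vectors.
w_from_dual = clf.dual_coef_[0] @ clf.support_vectors_
print(np.allclose(w_from_dual, clf.coef_[0]))  # True
```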

SECTION 3: SOLVED / DERIVATION-STYLE PROBLEMS

Problem 1: Why maximizing margin becomes minimizing \(\|w\|^2\)

Since margin = \(2/\|w\|\), maximizing margin is equivalent to minimizing \(\|w\|\), and for differentiability/convenience we minimize:

\[ \frac12\|w\|^2 \]

Problem 2: Build dual constraints from primal (soft margin)

From stationarity wrt \(\xi_i\):

\[ \frac{\partial \mathcal{L}}{\partial \xi_i}=C-\alpha_i-\mu_i=0 \;\Rightarrow\; \alpha_i = C-\mu_i \le C \quad(\text{since }\mu_i\ge0) \]

with \(\alpha_i\ge0\), this yields:

\[ 0\le\alpha_i\le C \]

From stationarity wrt \(b\):

\[ \sum_i\alpha_i y_i=0 \]

These are the standard dual constraints.
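Both constraints can be verified on a trained model. A sketch assuming scikit-learn (data illustrative): `dual_coef_` stores \(\alpha_i y_i\), so \(\alpha_i\) is its absolute value and \(\sum_i \alpha_i y_i\) is its plain sum:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(1, 1.5, (30, 2)), rng.normal(-1, 1.5, (30, 2))])
y = np.array([1] * 30 + [-1] * 30)
C = 1.0

clf = SVC(kernel="linear", C=C).fit(X, y)

signed_alpha = clf.dual_coef_[0]   # alpha_i * y_i
alpha = np.abs(signed_alpha)       # alpha_i itself (alpha_i >= 0)

print(np.all(alpha <= C + 1e-8))        # box constraint: 0 <= alpha_i <= C
print(abs(signed_alpha.sum()) < 1e-6)   # equality constraint: sum_i alpha_i y_i = 0
```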

Problem 3: Linear vs Nonlinear separability logic

If no linear separator exists in original space, use kernelized SVM:

\[ \langle x_i,x_j\rangle \rightarrow K(x_i,x_j) \]

The model remains linear in transformed space and nonlinear in input space.
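A concrete instance of this logic, assuming scikit-learn: concentric circles admit no linear separator in the input space, but an RBF-kernel SVM separates them easily (the dataset parameters and \(\gamma\) are illustrative):

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Concentric circles: no linear separator exists in the input space.
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

linear_acc = SVC(kernel="linear").fit(X, y).score(X, y)
rbf_acc = SVC(kernel="rbf", gamma=2.0).fit(X, y).score(X, y)

print(linear_acc, rbf_acc)  # rbf training accuracy should be near 1.0
```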


SECTION 4: TRICKS & MENTAL MODELS

4.1 Two-Hyperplane Model

Think in three objects, not one:

  1. the class +1 supporting hyperplane,
  2. the class -1 supporting hyperplane,
  3. the middle decision hyperplane.

SVM maximizes distance between class hyperplanes.

4.2 Support Vector Insight

Only near-boundary points matter. Many training points can move without changing the boundary, as long as the support vectors stay fixed.
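This insight can be demonstrated directly: refit after moving a non-support point further from the boundary and observe that \(w\) does not change. A sketch assuming scikit-learn; the toy data and the shifted point are illustrative:

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -3.0]])
y = np.array([1, 1, -1, -1])

clf1 = SVC(kernel="linear", C=1e6).fit(X, y)

# Move the non-support point (3, 3) further away; support vectors are unchanged.
X2 = X.copy()
X2[1] = [6.0, 7.0]
clf2 = SVC(kernel="linear", C=1e6).fit(X2, y)

print(np.allclose(clf1.coef_, clf2.coef_, atol=1e-3))  # True
```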

4.3 Optimization Shortcut

When deriving, do not attempt to solve the primal directly. Write the Lagrangian and move to the dual quickly; kernelization then becomes immediate.

4.4 Parameter Intuition

  • high \(C\): heavy penalty on margin violations, low training-error tolerance, weak regularization
  • low \(C\): strong regularization, wider margin, more violations tolerated
  • high \(\gamma\) (RBF): very local, wiggly boundaries
  • low \(\gamma\): smoother, more global boundaries
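One visible symptom of the \(C\) trade-off is the support-vector count: on overlapping data, a small \(C\) leaves many points at the box bound \(\alpha_i = C\), while a large \(C\) keeps only near-boundary points. A sketch assuming scikit-learn; the data and the two \(C\) values are illustrative:

```python
import numpy as np
from sklearn.svm import SVC

# Heavily overlapping classes, so the effect of C is clearly visible.
rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0.5, 1, (50, 2)), rng.normal(-0.5, 1, (50, 2))])
y = np.array([1] * 50 + [-1] * 50)

# Count support vectors for a small and a large C.
n_sv = {C: SVC(kernel="linear", C=C).fit(X, y).n_support_.sum()
        for C in (0.01, 100.0)}
print(n_sv)  # smaller C typically keeps more (bounded) support vectors
```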

4.5 Practical Separation Check

If logistic regression performs strongly and the boundary looks approximately linear, a linear SVM is usually a strong baseline.
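A sketch of this check, assuming scikit-learn, on near-linearly-separable illustrative data where both linear models should score highly:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC

rng = np.random.default_rng(4)
X = np.vstack([rng.normal(2, 1, (50, 2)), rng.normal(-2, 1, (50, 2))])
y = np.array([1] * 50 + [-1] * 50)

# Compare two linear classifiers; similar high scores suggest a
# near-linear boundary, where a linear SVM is a solid baseline.
log_acc = LogisticRegression().fit(X, y).score(X, y)
svm_acc = LinearSVC().fit(X, y).score(X, y)
print(log_acc, svm_acc)
```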


SECTION 5: IMPORTANT EXAM POINTS

  1. SVM is a constrained quadratic optimization problem.
  2. Hard margin assumes strict separability; soft margin handles violations via \(\xi_i\).
  3. Objective \(\frac12\|w\|^2\) is equivalent to maximizing geometric margin.
  4. KKT conditions are central to SVM derivation.
  5. Dual constraints are: \(0\le\alpha_i\le C\), \(\sum_i\alpha_i y_i=0\).
  6. Support vectors are points with nonzero \(\alpha_i\).
  7. Kernel trick replaces dot product with \(K(x_i,x_j)\).
  8. SVM remains linear in transformed space.
  9. Decision function uses only support vectors.
  10. Scaling features before SVM is practically mandatory.
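For point 10, the idiomatic pattern is to put the scaler and the SVM in one pipeline so the same scaling is applied at fit and predict time. A sketch assuming scikit-learn; the data, the inflated feature scale, and the kernel choice are illustrative:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# One feature on a vastly larger scale than the other.
rng = np.random.default_rng(5)
X = np.vstack([rng.normal(2, 1, (50, 2)), rng.normal(-2, 1, (50, 2))])
X[:, 1] *= 1000.0   # blow up the second feature's scale
y = np.array([1] * 50 + [-1] * 50)

# StandardScaler restores comparable feature influence before the RBF kernel.
model = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
acc = model.fit(X, y).score(X, y)
print(acc)
```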

Quick Revision Box

\[ \min_{w,b,\xi}\;\frac12\|w\|^2 + C\sum_i\xi_i \]
\[ y_i(w^Tx_i+b)\ge1-\xi_i,\;\xi_i\ge0 \]
\[ \max_{\alpha} \sum_i\alpha_i -\frac12\sum_{i,j}\alpha_i\alpha_jy_iy_jK(x_i,x_j) \]
\[ 0\le\alpha_i\le C, \quad \sum_i\alpha_i y_i=0 \]
\[ \hat y=\operatorname{sign}\left(\sum_{i\in SV}\alpha_i y_i K(x_i,x)+b\right) \]
