K-Means, Gaussian Mixture Models, and EM: Mathematical and Practical Notes¶
1. Why Clustering Matters¶
Clustering is unsupervised structure discovery. Given data points $$ X = \{x_1, x_2, \dots, x_N\}, \quad x_i \in \mathbb{R}^d $$ we seek groups such that points in the same group are similar and points in different groups are dissimilar.
In practice, clustering is used when labels are missing but decisions are still needed:
- Customer segment design
- Product catalog grouping
- Compression and prototype learning
- Preprocessing before supervised learning
- Candidate anomaly detection
2. K-Means: Core Idea and Geometry¶
K-means represents each cluster by one centroid (mean vector). Every point belongs to exactly one cluster.
2.1 Objective Function¶
K-means minimizes the within-cluster sum of squares (WCSS, also called inertia):
$$ J = \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2 $$
Equivalent assignment form:
$$ J = \sum_{i=1}^{N} \lVert x_i - \mu_{c_i} \rVert^2 $$
Where:
- \(c_i\): cluster index assigned to point \(x_i\)
- \(\mu_j\): centroid of cluster \(j\)
- \(C_j\): set of points assigned to cluster \(j\)
2.2 Geometric Interpretation¶
For fixed centroids, the space is partitioned into Voronoi cells: each point is mapped to its nearest centroid.
```mermaid
flowchart LR
    A["Data Space"] --> B["Choose k centroids"]
    B --> C["Nearest-centroid partition (Voronoi regions)"]
    C --> D["Update centroids to region means"]
    D --> C
```
Decision boundaries between centroids are perpendicular bisectors; therefore K-means naturally prefers roughly spherical/equal-variance groups in Euclidean space.
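The nearest-centroid partition can be computed directly with NumPy; a minimal sketch (not scikit-learn's implementation) with two fixed centroids on a toy 2D dataset of my choosing:

```python
# Assign each point to the nearest of two fixed centroids and compute the
# resulting within-cluster sum of squares (inertia).
import numpy as np

X = np.array([[1.0, 1.0], [1.0, 2.0], [2.0, 1.0],
              [8.0, 8.0], [9.0, 8.0], [8.0, 9.0]])
mu = np.array([[1.0, 1.0], [8.0, 8.0]])          # two fixed centroids

# Squared Euclidean distance from every point to every centroid: shape (6, 2)
d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
labels = d2.argmin(axis=1)                        # nearest-centroid partition
inertia = d2[np.arange(len(X)), labels].sum()     # within-cluster sum of squares

print(labels)    # [0 0 0 1 1 1]
print(inertia)   # 4.0
```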
3. K-Means Algorithm as Coordinate Descent¶
K-means is alternating minimization over two variable blocks:
- assignments \(c\)
- centroids \(\mu\)
3.1 Assignment Step¶
Holding centroids fixed, each point goes to its nearest centroid:
$$ c_i = \arg\min_{j} \lVert x_i - \mu_j \rVert^2 $$
3.2 Update Step¶
Holding assignments fixed, each centroid moves to the mean of its assigned points:
$$ \mu_j = \frac{1}{|C_j|} \sum_{i\,:\,c_i = j} x_i $$
3.3 Why the Mean Appears (Important Derivation)¶
For one cluster with points \(\{x_i\}_{i=1}^{n}\), minimize:
$$ f(\mu) = \sum_{i=1}^{n} \lVert x_i - \mu \rVert^2 $$
Set the gradient to zero:
$$ \nabla_\mu f = \sum_{i=1}^{n} 2(\mu - x_i) = 0 \quad\Rightarrow\quad \mu = \frac{1}{n}\sum_{i=1}^{n} x_i $$
So centroid update is the exact optimizer under squared distance.
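This optimality is easy to verify numerically; a small sketch on toy points of my choosing, perturbing the mean and checking that the distortion only goes up:

```python
# Numerical check that the mean minimizes the within-cluster squared distance.
import numpy as np

pts = np.array([[1.0, 1.0], [1.0, 2.0], [2.0, 1.0]])
mean = pts.mean(axis=0)                    # (4/3, 4/3)

def distortion(mu):
    return ((pts - mu) ** 2).sum()

# Moving away from the mean in any direction can only increase the distortion.
base = distortion(mean)
for shift in [[0.1, 0.0], [0.0, -0.1], [0.05, 0.05]]:
    assert distortion(mean + np.array(shift)) > base

print(base)   # 12/9 = 4/3
```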
3.4 Convergence Property¶
Each step never increases \(J\). Since \(J \ge 0\), the sequence converges to a local optimum (not guaranteed global optimum).
4. Detailed Numerical Example 1 (2D, Full Iterations)¶
Data:
$$ (1,1),\ (1,2),\ (2,1),\ (8,8),\ (9,8),\ (8,9) $$
Set \(k=2\); initialize with one seed in each group (e.g. \(\mu_1=(1,1),\ \mu_2=(8,8)\); the exact choice does not matter for this example).
Iteration 1: Assignment¶
Cluster 1: \((1,1),(1,2),(2,1)\)
Cluster 2: \((8,8),(9,8),(8,9)\)
Iteration 1: Centroid Update¶
$$ \mu_1 = \left(\tfrac{1+1+2}{3}, \tfrac{1+2+1}{3}\right) = \left(\tfrac{4}{3}, \tfrac{4}{3}\right), \quad \mu_2 = \left(\tfrac{8+9+8}{3}, \tfrac{8+8+9}{3}\right) = \left(\tfrac{25}{3}, \tfrac{25}{3}\right) $$
Assignments remain unchanged in the next pass, so the algorithm has converged.
Final Inertia Calculation¶
For cluster 1: $$ \lVert(1,1)-(4/3,4/3)\rVert^2=2/9, \quad \lVert(1,2)-(4/3,4/3)\rVert^2=5/9, \quad \lVert(2,1)-(4/3,4/3)\rVert^2=5/9 $$ Sum \(=12/9=4/3\).
For cluster 2, symmetric result \(=4/3\).
Total: $$ J^* = \frac{8}{3} $$
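The result can be reproduced with scikit-learn's `KMeans` (assuming scikit-learn is available):

```python
# Reproducing Example 1 with scikit-learn to confirm J* = 8/3.
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1, 1], [1, 2], [2, 1], [8, 8], [9, 8], [8, 9]], dtype=float)
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

print(sorted(km.cluster_centers_.tolist()))  # centroids (4/3, 4/3) and (25/3, 25/3)
print(km.inertia_)                           # 8/3 ≈ 2.6667
```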
5. Detailed Numerical Example 2 (1D, Bad Initialization Effect)¶
Data: $$ 1,2,3,10,11,12 $$ Set \(k=2\).
Case A (Good Initialization)¶
Initial centroids: \(\mu_1=2, \mu_2=11\)
Converges to:
- Cluster A: \(1,2,3\), centroid \(2\)
- Cluster B: \(10,11,12\), centroid \(11\)
Inertia: $$ (1-2)^2+(2-2)^2+(3-2)^2 + (10-11)^2+(11-11)^2+(12-11)^2 = 4 $$
Case B (Poor Initialization)¶
Initial centroids: \(\mu_1=1, \mu_2=3\)
Early steps can place both centroids near left points before one eventually jumps right; for more complex data this can trap the model in weaker local minima.
Key lesson: initialization quality strongly affects final result.
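Case A can be checked by passing the good initialization explicitly to scikit-learn's `KMeans`:

```python
# Example 2 with explicit initial centroids: the good initialization reaches
# the optimal split {1,2,3} / {10,11,12} with inertia 4.
import numpy as np
from sklearn.cluster import KMeans

X = np.array([1, 2, 3, 10, 11, 12], dtype=float).reshape(-1, 1)

good = np.array([[2.0], [11.0]])                 # mu_1 = 2, mu_2 = 11
km = KMeans(n_clusters=2, init=good, n_init=1).fit(X)

print(km.cluster_centers_.ravel())   # centroids 2 and 11
print(km.inertia_)                   # 4.0
```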
6. Initialization Strategy: Why k-means++ Helps¶
k-means++ picks each new initial centroid with probability proportional to its squared distance from the nearest already-chosen centroid, so initial centers tend to be spread apart.
Benefits:
- Better spread of initial centers
- Lower expected inertia
- Fewer poor local minima
- Faster convergence in practice
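A minimal sketch of the D² seeding rule (illustrative only; scikit-learn's implementation adds a greedy candidate-selection refinement):

```python
# k-means++ seeding: each new center is drawn with probability proportional
# to squared distance from the nearest already-chosen center.
import numpy as np

def kmeans_pp_init(X, k, rng):
    centers = [X[rng.integers(len(X))]]          # first center uniform at random
    for _ in range(k - 1):
        # Squared distance of every point to its nearest chosen center
        d2 = np.min([((X - c) ** 2).sum(axis=1) for c in centers], axis=0)
        probs = d2 / d2.sum()                    # D^2 weighting
        centers.append(X[rng.choice(len(X), p=probs)])
    return np.array(centers)

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [10, 10], [10, 11]], dtype=float)
print(kmeans_pp_init(X, 2, rng))                 # two well-separated seeds
```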
7. Choosing the Number of Clusters \(k\)¶
7.1 Elbow Method¶
Compute inertia for \(k=1,2,\dots,k_{\max}\) and choose the bend point where additional clusters stop yielding large inertia reductions.
```mermaid
flowchart LR
    A["k=1: very high inertia"] --> B["k increases: sharp drop"]
    B --> C["elbow region"]
    C --> D["after elbow: diminishing returns"]
```
7.2 Silhouette Score¶
For each point \(i\): $$ a_i = \text{mean distance to the other points in its own cluster}, \quad b_i = \text{smallest mean distance to the points of any other cluster} $$ $$ s_i = \frac{b_i-a_i}{\max(a_i,b_i)} $$
Interpretation:
- \(s_i\approx1\): well clustered
- \(s_i\approx0\): on a boundary between clusters
- \(s_i<0\): likely assigned to the wrong cluster
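A sketch of scanning candidate \(k\) with scikit-learn's `silhouette_score` on toy data of my choosing; a higher mean silhouette suggests a better \(k\):

```python
# Compare mean silhouette across candidate k; here the true structure has
# two tight groups, so k=2 should score higher than k=3.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X = np.array([[1, 1], [1, 2], [2, 1], [8, 8], [9, 8], [8, 9]], dtype=float)

scores = {}
for k in (2, 3):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)
    print(k, round(scores[k], 3))
```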
7.3 Model Selection in Practice¶
Use jointly:
- inertia curve
- silhouette trend
- cluster stability across random seeds
- domain interpretability
8. K-Means Edge Cases and Failure Modes¶
8.1 Non-Spherical Structure¶
If true clusters are moon-shaped, concentric, or elongated, K-means partitions incorrectly.
8.2 Unequal Density and Size¶
Small dense cluster near large sparse cluster is often absorbed into larger one.
8.3 Outlier Sensitivity¶
Single extreme point can pull centroid because means are non-robust.
8.4 Scale Sensitivity¶
Distance is scale dependent; always normalize features when units differ.
8.5 Empty Cluster¶
If a centroid gets no points:
- Reinitialize it at the farthest point
- Or split the highest-variance cluster
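A minimal sketch of the first repair strategy on toy data of my choosing: reseed the empty centroid at the point worst-served by the current centroids.

```python
# Detect an empty cluster and reseed its centroid at the farthest point.
import numpy as np

X = np.array([[0.0, 0.0], [1.0, 0.0], [10.0, 0.0]])
mu = np.array([[0.5, 0.0], [50.0, 0.0]])          # second centroid captures no points

d2 = ((X[:, None] - mu[None]) ** 2).sum(axis=2)   # (N, k) squared distances
labels = d2.argmin(axis=1)
for j in range(len(mu)):
    if not np.any(labels == j):                   # empty cluster detected
        farthest = d2.min(axis=1).argmax()        # point farthest from its nearest centroid
        mu[j] = X[farthest]

print(mu)   # second centroid reseeded at [10, 0]
```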
8.6 High-Dimensional Sparse Data¶
Euclidean distance concentration degrades cluster signal. Consider dimensionality reduction or cosine-based alternatives.
9. Practical K-Means Pipeline¶
- Clean missing values and outliers
- Scale numeric features
- Run K-means++ with multiple initializations
- Select \(k\) via elbow + silhouette + stability
- Interpret centroids in original feature space
- Validate with downstream KPI lift or business utility
9.1 Pseudocode¶
```text
Input: X, k, max_iter, tol

X  <- scale(X)
mu <- kmeans_plus_plus_init(X, k)
repeat
    assign each x_i to nearest mu_j
    recompute each mu_j as mean of assigned points
    compute centroid_shift
until no assignment change or centroid_shift < tol
return labels, centroids, inertia
```
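A runnable NumPy translation of the pseudocode above (a sketch: scaling and empty-cluster handling are omitted, and seeding is delegated to scikit-learn's `kmeans_plusplus` helper):

```python
import numpy as np
from sklearn.cluster import kmeans_plusplus

def kmeans(X, k, max_iter=100, tol=1e-6, seed=0):
    mu, _ = kmeans_plusplus(X, n_clusters=k, random_state=seed)
    labels = None
    for _ in range(max_iter):
        d2 = ((X[:, None] - mu[None]) ** 2).sum(axis=2)   # (N, k) squared distances
        new_labels = d2.argmin(axis=1)                    # assignment step
        if labels is not None and np.array_equal(new_labels, labels):
            break                                         # assignments stable: converged
        labels = new_labels
        new_mu = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        shift = np.linalg.norm(new_mu - mu)               # update step
        mu = new_mu
        if shift < tol:
            break
    # Final labels and inertia with respect to the converged centroids
    d2 = ((X[:, None] - mu[None]) ** 2).sum(axis=2)
    labels = d2.argmin(axis=1)
    inertia = d2[np.arange(len(X)), labels].sum()
    return labels, mu, inertia

X = np.array([[1, 1], [1, 2], [2, 1], [8, 8], [9, 8], [8, 9]], dtype=float)
labels, mu, inertia = kmeans(X, 2)
print(inertia)   # 8/3 on the Example 1 data
```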
10. Gaussian Mixture Model (GMM): Probabilistic Clustering¶
K-means gives hard labels. GMM gives soft assignment probabilities.
10.1 Mixture Density¶
$$ p(x) = \sum_{k=1}^{K} \pi_k\, \mathcal{N}(x \mid \mu_k, \Sigma_k), \quad \pi_k \ge 0, \quad \sum_{k=1}^{K} \pi_k = 1 $$
Interpretation:
- \(\pi_k\): prior probability (mixing weight) of component \(k\)
- \(\mu_k,\Sigma_k\): location and shape of Gaussian component \(k\)
10.2 Geometry vs K-Means¶
- K-means: spherical partition, hard boundaries
- GMM: ellipses/hyperellipsoids, soft boundaries
GMM handles overlap naturally: one point can belong partly to multiple clusters.
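Soft membership can be inspected via scikit-learn's `GaussianMixture.predict_proba` (toy 1D data of my choosing):

```python
# predict_proba returns per-component responsibilities instead of one hard label.
import numpy as np
from sklearn.mixture import GaussianMixture

X = np.array([1, 2, 3, 10, 11, 12], dtype=float).reshape(-1, 1)
gm = GaussianMixture(n_components=2, random_state=0).fit(X)

probs = gm.predict_proba(X)          # shape (6, 2), rows sum to 1
print(np.round(probs, 3))
```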
11. EM Algorithm for GMM: Full Intuition¶
Direct maximization of $$ \log p(X)=\sum_{i=1}^{N}\log\left(\sum_{k=1}^{K}\pi_k\mathcal{N}(x_i\mid\mu_k,\Sigma_k)\right) $$ is difficult because of the log-sum coupling.
EM introduces latent component indicator \(z_i\) and alternates:
```mermaid
flowchart LR
    A["Initialize pi, mu, Sigma"] --> B["E-step: compute responsibilities gamma_ik"]
    B --> C["M-step: update pi, mu, Sigma using gamma_ik"]
    C --> D["Compute log-likelihood"]
    D --> E{"Converged?"}
    E -- "No" --> B
    E -- "Yes" --> F["Return parameters"]
```
11.1 E-step¶
$$ \gamma_{ik} = \frac{\pi_k\, \mathcal{N}(x_i \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j\, \mathcal{N}(x_i \mid \mu_j, \Sigma_j)} $$
\(\gamma_{ik}\) is the soft cluster membership weight of point \(i\) for component \(k\).
11.2 M-step¶
With effective counts \(N_k = \sum_{i=1}^{N} \gamma_{ik}\):
$$ \mu_k = \frac{1}{N_k}\sum_{i=1}^{N} \gamma_{ik}\, x_i, \quad \Sigma_k = \frac{1}{N_k}\sum_{i=1}^{N} \gamma_{ik}\,(x_i-\mu_k)(x_i-\mu_k)^\top, \quad \pi_k = \frac{N_k}{N} $$
11.3 Convergence¶
EM never decreases the data log-likelihood from one iteration to the next and converges to a local optimum.
12. Numerical Example: One EM Responsibility Calculation¶
For one point \(x=2\), assume two 1D components: $$ \pi_1=0.6,\ \mu_1=1,\ \sigma_1=1 \quad\text{and}\quad \pi_2=0.4,\ \mu_2=4,\ \sigma_2=1 $$
Given: $$ \mathcal{N}(2\mid1,1)=0.242, \quad \mathcal{N}(2\mid4,1)=0.054 $$
Then: $$ \gamma_{1}= \frac{0.6\cdot0.242}{0.6\cdot0.242 + 0.4\cdot0.054} \approx 0.87, \quad \gamma_2 \approx 0.13 $$
Meaning: point \(x=2\) mostly belongs to component 1, but assignment is uncertain (soft).
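The same responsibility can be reproduced with `scipy.stats.norm` (assuming SciPy is available):

```python
# Recomputing the E-step responsibility for x = 2 with two unit-variance
# 1D components at means 1 and 4, priors 0.6 and 0.4.
from scipy.stats import norm

x = 2.0
p1 = 0.6 * norm.pdf(x, loc=1.0, scale=1.0)   # 0.6 * 0.242
p2 = 0.4 * norm.pdf(x, loc=4.0, scale=1.0)   # 0.4 * 0.054
gamma1 = p1 / (p1 + p2)

print(round(gamma1, 2))   # 0.87
```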
13. K-Means and GMM Relationship¶
K-means can be viewed as a limiting/special case of GMM under strong constraints:
- equal spherical covariances (\(\Sigma_k = \sigma^2 I\))
- hard assignment approximation
- equal or simplified priors in practice
This is why K-means is fast but less expressive.
14. K-Means vs GMM vs KNN (Clarification)¶
- K-means: unsupervised clustering, hard labels
- GMM: unsupervised clustering, probabilistic soft labels
- KNN: supervised nearest-neighbor prediction (classification/regression), not a clustering objective
If the task is unlabeled grouping, use K-means/GMM. If labels exist and nearest-neighbor decision boundaries are desired, use KNN.
15. Practical Use Cases with Model Choice¶
- Customer Segmentation
  - Start: K-means baseline (fast, interpretable centroids)
  - Upgrade: GMM when segments overlap
- Image Color Quantization
  - K-means is strong and efficient
- Fraud/Risk Bucketing
  - GMM useful when confidence of membership matters
- Speech/Signal Pattern Grouping
  - GMM handles varying variance patterns better than K-means
16. Troubleshooting and Diagnostics¶
16.1 Symptoms -> Likely Causes¶
- Unstable clusters across runs -> poor initialization or weak cluster structure
- One giant cluster + many tiny ones -> wrong \(k\), scaling issues, outliers
- Low silhouette for all \(k\) -> no strong cluster geometry in chosen features
- Poor business interpretability -> features not aligned to target behavior
16.2 Mitigation Playbook¶
- Scale features
- Remove or cap outliers
- Try PCA before clustering for noisy high dimensions
- Increase `n_init`
- Compare K-means against GMM and hierarchical baselines
- Use stability score (ARI/NMI across bootstrap samples)
17. Implementation Notes for ML Systems¶
- Track drift by monitoring centroid shift and cluster-size changes.
- Refit schedule can be periodic or trigger-based.
- For very large data, use mini-batch K-means.
- For GMM, choose covariance type (`full`, `diag`, `tied`, `spherical`) by validation objective and data geometry.
- Use BIC/AIC to compare GMM models with different \(k\) and covariance complexity.
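A sketch of BIC-based selection over \(k\) and covariance type with scikit-learn, on synthetic data of my choosing (two well-separated spherical Gaussians, so BIC should select two components):

```python
# Compare GMM complexity with BIC: lower BIC is better, and it penalizes
# extra components and richer covariance structures.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (100, 2)),    # cluster around (0, 0)
               rng.normal(6, 1, (100, 2))])   # cluster around (6, 6)

best = min(
    (GaussianMixture(n_components=k, covariance_type=cov, random_state=0).fit(X)
     for k in (1, 2, 3, 4) for cov in ("full", "diag", "spherical")),
    key=lambda g: g.bic(X),
)
print(best.n_components, best.covariance_type)
```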
18. Exam and Interview Quick Revision¶
- K-means minimizes WCSS using alternating assignment and mean update.
- Mean is optimal under squared-distance distortion.
- K-means converges monotonically to a local optimum.
- K-means is sensitive to initialization, scaling, and outliers.
- GMM defines data density as weighted Gaussian components.
- EM alternates responsibility estimation and parameter re-estimation.
- GMM provides soft membership and captures elliptical clusters.
- KNN is not a clustering algorithm.