
K-Means, Gaussian Mixture Models, and EM: Mathematical and Practical Notes

1. Why Clustering Matters

Clustering is unsupervised structure discovery. Given data points $$ X = \{x_1, x_2, \dots, x_N\}, \quad x_i \in \mathbb{R}^d $$ we seek groups such that points in the same group are similar and points in different groups are dissimilar.

In practice, clustering is used when labels are missing but decisions are still needed:

  • Customer segment design
  • Product catalog grouping
  • Compression and prototype learning
  • Preprocessing before supervised learning
  • Candidate anomaly detection


2. K-Means: Core Idea and Geometry

K-means represents each cluster by one centroid (mean vector). Every point belongs to exactly one cluster.

2.1 Objective Function

K-means minimizes within-cluster sum of squares (WCSS, also called inertia):

\[ \min_{\{S_j\}_{j=1}^{k}} \sum_{j=1}^{k} \sum_{x_i \in S_j} \|x_i - \mu_j\|_2^2 \]

Equivalent assignment form:

\[ J(c,\mu) = \sum_{i=1}^{N} \|x_i - \mu_{c_i}\|_2^2, \quad c_i \in \{1,\dots,k\} \]

Where:

  • \(c_i\): cluster index assigned to point \(x_i\)
  • \(\mu_j\): centroid of cluster \(j\)
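The objective \(J\) is straightforward to evaluate directly; a minimal NumPy sketch (function and variable names are illustrative):

```python
import numpy as np

def inertia(X, labels, centroids):
    """Within-cluster sum of squares J(c, mu) from the objective above."""
    # Each point contributes its squared distance to its assigned centroid.
    return float(((X - centroids[labels]) ** 2).sum())

# Toy check: two clusters, each point exactly 1 away from its centroid.
X = np.array([[0., 0.], [2., 0.], [10., 0.], [12., 0.]])
labels = np.array([0, 0, 1, 1])
centroids = np.array([[1., 0.], [11., 0.]])
print(inertia(X, labels, centroids))  # 4.0
```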

2.2 Geometric Interpretation

For fixed centroids, space is partitioned into Voronoi cells: each point is mapped to its nearest centroid.

flowchart LR
  A["Data Space"] --> B["Choose k centroids"]
  B --> C["Nearest-centroid partition (Voronoi regions)"]
  C --> D["Update centroids to region means"]
  D --> C

Decision boundaries between centroids are perpendicular bisectors; therefore K-means naturally prefers roughly spherical/equal-variance groups in Euclidean space.


3. K-Means Algorithm as Coordinate Descent

K-means is alternating minimization over two variable blocks:

  • assignments \(c\)
  • centroids \(\mu\)

3.1 Assignment Step

\[ c_i \leftarrow \arg\min_j \|x_i - \mu_j\|_2^2 \]

3.2 Update Step

\[ \mu_j \leftarrow \frac{1}{|S_j|} \sum_{x_i \in S_j} x_i \]

3.3 Why the Mean Appears (Important Derivation)

For one cluster with points \(\{x_i\}\), minimize:

\[ f(\mu) = \sum_i \|x_i - \mu\|_2^2 \]

Set gradient to zero:

\[ \nabla_\mu f(\mu) = -2\sum_i (x_i-\mu)=0 \Rightarrow \mu = \frac{1}{n}\sum_i x_i \]

So the centroid update is the exact minimizer under squared distance.
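Both steps fit in a few lines of NumPy; a sketch of one coordinate-descent pass (names illustrative):

```python
import numpy as np

def kmeans_step(X, centroids):
    """One coordinate-descent pass: assignment step, then mean-update step."""
    # Assignment: index of the nearest centroid under squared Euclidean distance.
    d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    labels = d2.argmin(axis=1)
    # Update: each centroid moves to the mean of its assigned points.
    new_centroids = np.array([X[labels == j].mean(axis=0)
                              for j in range(len(centroids))])
    return labels, new_centroids

# 1D sanity check (points as 1-column vectors): means of {1,2,3} and {10,11,12}.
X = np.array([[1.], [2.], [3.], [10.], [11.], [12.]])
labels, mu = kmeans_step(X, np.array([[2.], [11.]]))
print(mu.ravel())  # [ 2. 11.] — centroids already sit at the per-cluster means
```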

3.4 Convergence Property

Each step never increases \(J\). Since \(J \ge 0\), the sequence converges to a local optimum (not guaranteed global optimum).


4. Detailed Numerical Example 1 (2D, Full Iterations)

Data:

\[ (1,1),(1,2),(2,1),(8,8),(9,8),(8,9) \]

Set \(k=2\), initialize:

\[ \mu_1^{(0)}=(1,1), \quad \mu_2^{(0)}=(8,8) \]

Iteration 1: Assignment

Cluster 1: \((1,1),(1,2),(2,1)\)

Cluster 2: \((8,8),(9,8),(8,9)\)

Iteration 1: Centroid Update

\[ \mu_1^{(1)}=\left(\frac{4}{3},\frac{4}{3}\right), \quad \mu_2^{(1)}=\left(\frac{25}{3},\frac{25}{3}\right) \]

Assignments remain unchanged on the next pass, so the algorithm has converged.

Final Inertia Calculation

For cluster 1: $$ \|(1,1)-(4/3,4/3)\|^2=2/9 $$ $$ \|(1,2)-(4/3,4/3)\|^2=5/9 $$ $$ \|(2,1)-(4/3,4/3)\|^2=5/9 $$ Sum \(=12/9=4/3\).

For cluster 2, symmetric result \(=4/3\).

Total: $$ J^* = \frac{8}{3} $$
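The full iteration and the final inertia above can be verified numerically; a short NumPy sketch:

```python
import numpy as np

X = np.array([[1, 1], [1, 2], [2, 1], [8, 8], [9, 8], [8, 9]], dtype=float)
mu = np.array([[1, 1], [8, 8]], dtype=float)  # initial centroids mu^(0)

for _ in range(10):  # far more iterations than this data needs
    d2 = ((X[:, None] - mu[None, :]) ** 2).sum(axis=2)  # squared distances
    labels = d2.argmin(axis=1)                          # assignment step
    mu = np.array([X[labels == j].mean(axis=0) for j in range(2)])  # update step

J = ((X - mu[labels]) ** 2).sum()
print(mu)  # [[1.333... 1.333...] [8.333... 8.333...]]
print(J)   # 2.666... = 8/3
```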


5. Detailed Numerical Example 2 (1D, Bad Initialization Effect)

Data: $$ 1,2,3,10,11,12 $$ Set \(k=2\).

Case A (Good Initialization)

Initial centroids: \(\mu_1=2, \mu_2=11\)

Converges to:

  • Cluster A: \(1,2,3\), centroid \(2\)
  • Cluster B: \(10,11,12\), centroid \(11\)

Inertia: $$ (1-2)^2+(2-2)^2+(3-2)^2 + (10-11)^2+(11-11)^2+(12-11)^2 = 4 $$

Case B (Poor Initialization)

Initial centroids: \(\mu_1=1, \mu_2=3\)

Early steps can place both centroids near left points before one eventually jumps right; for more complex data this can trap the model in weaker local minima.

Key lesson: initialization quality strongly affects final result.


6. Initialization Strategy: Why k-means++ Helps

k-means++ spreads the initial centroids apart: each new centroid is sampled with probability proportional to its squared distance from the nearest centroid already chosen.

Benefits:

  • Better spread of initial centers
  • Lower expected inertia
  • Fewer poor local minima
  • Faster convergence in practice
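A minimal sketch of the seeding rule, assuming a NumPy `Generator` for randomness (function name illustrative):

```python
import numpy as np

def kmeans_pp_init(X, k, rng):
    """k-means++ seeding: first center uniform at random; each later center
    sampled with probability proportional to the squared distance to the
    nearest center chosen so far."""
    centers = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        d2 = np.min([((X - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.array(centers)

X = np.array([[1., 1.], [1., 2.], [2., 1.], [8., 8.], [9., 8.], [8., 9.]])
seeds = kmeans_pp_init(X, 2, np.random.default_rng(0))
print(seeds)  # two distinct data points, biased toward opposite groups
```

Because an already-chosen point has squared distance zero, the same point can never be picked twice.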


7. Choosing the Number of Clusters \(k\)

7.1 Elbow Method

Compute inertia for \(k=1,2,\dots,k_{\max}\) and choose the bend point where additional clusters stop yielding large inertia reductions.

flowchart LR
  A["k=1: very high inertia"] --> B["k increases: sharp drop"]
  B --> C["elbow region"]
  C --> D["after elbow: diminishing returns"]

7.2 Silhouette Score

For each point \(i\): $$ a_i = \text{mean distance to points in its own cluster}, \quad b_i = \text{smallest mean distance to points of any other cluster} $$ $$ s_i = \frac{b_i-a_i}{\max(a_i,b_i)} $$

Interpretation:

  • \(s_i\approx1\): well clustered
  • \(s_i\approx0\): on a boundary
  • \(s_i<0\): likely in the wrong cluster
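The silhouette formula can be computed directly; a small NumPy sketch for Euclidean distance (function name illustrative; assumes every cluster has at least two points):

```python
import numpy as np

def silhouette(X, labels):
    """Per-point silhouette s_i = (b_i - a_i) / max(a_i, b_i)."""
    n = len(X)
    D = np.sqrt(((X[:, None] - X[None, :]) ** 2).sum(axis=2))  # pairwise distances
    s = np.zeros(n)
    for i in range(n):
        own = (labels == labels[i]) & (np.arange(n) != i)
        a = D[i, own].mean()  # a_i: mean distance to the rest of its own cluster
        b = min(D[i, labels == c].mean()
                for c in set(labels.tolist()) - {labels[i]})  # b_i: best other cluster
        s[i] = (b - a) / max(a, b)
    return s

# Well-separated 1D data: silhouettes are close to 1.
X = np.array([[1.], [2.], [3.], [10.], [11.], [12.]])
s = silhouette(X, np.array([0, 0, 0, 1, 1, 1]))
print(s.round(3))  # all values above 0.8
```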

7.3 Model Selection in Practice

Use jointly:

  • inertia curve
  • silhouette trend
  • cluster stability across random seeds
  • domain interpretability


8. K-Means Edge Cases and Failure Modes

8.1 Non-Spherical Structure

If true clusters are moon-shaped, concentric, or elongated, K-means partitions incorrectly.

8.2 Unequal Density and Size

A small, dense cluster near a large, sparse one is often absorbed into the larger cluster.

8.3 Outlier Sensitivity

A single extreme point can pull a centroid far from the bulk of its cluster, because means are not robust to outliers.

8.4 Scale Sensitivity

Distance is scale dependent; always normalize features when units differ.

8.5 Empty Cluster

If a centroid receives no points:

  • Reinitialize it to the farthest point
  • Or split the highest-variance cluster

8.6 High-Dimensional Sparse Data

Euclidean distance concentration degrades cluster signal. Consider dimensionality reduction or cosine-based alternatives.


9. Practical K-Means Pipeline

  1. Clean missing values and outliers
  2. Scale numeric features
  3. Run K-means++ with multiple initializations
  4. Select \(k\) via elbow + silhouette + stability
  5. Interpret centroids in original feature space
  6. Validate with downstream KPI lift or business utility

9.1 Pseudocode

Input: X, k, max_iter, tol
X <- scale(X)
mu <- kmeans_plus_plus_init(X, k)
repeat
  assign each x_i to nearest mu_j
  recompute each mu_j as mean of assigned points
  compute centroid_shift
until no assignment change or centroid_shift < tol
return labels, centroids, inertia
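A minimal NumPy translation of the pseudocode, substituting random seeding for k-means++ to keep it short (a sketch, not production code):

```python
import numpy as np

def kmeans(X, k, max_iter=100, tol=1e-6, seed=0):
    """Lloyd's algorithm; random data points stand in for k-means++ init."""
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), size=k, replace=False)]
    labels = np.zeros(len(X), dtype=int)
    for _ in range(max_iter):
        d2 = ((X[:, None] - mu[None, :]) ** 2).sum(axis=2)   # assignment step
        labels = d2.argmin(axis=1)
        new_mu = np.array([X[labels == j].mean(axis=0) if (labels == j).any()
                           else mu[j] for j in range(k)])    # update step
        shift = float(np.linalg.norm(new_mu - mu))           # centroid_shift
        mu = new_mu
        if shift < tol:
            break
    return labels, mu, float(((X - mu[labels]) ** 2).sum())

X = np.array([[1., 1.], [1., 2.], [2., 1.], [8., 8.], [9., 8.], [8., 9.]])
labels, mu, J = kmeans(X, 2)
print(J)  # 8/3 ≈ 2.667 on the worked example from section 4
```

Empty clusters are handled by leaving their centroid in place, one of the simpler options from section 8.5.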

10. Gaussian Mixture Model (GMM): Probabilistic Clustering

K-means gives hard labels. GMM gives soft assignment probabilities.

10.1 Mixture Density

\[ p(x)=\sum_{k=1}^{K}\pi_k\,\mathcal{N}(x\mid\mu_k,\Sigma_k), \quad \pi_k\ge0, \quad \sum_{k=1}^{K}\pi_k=1 \]

Interpretation:

  • \(\pi_k\): prior probability of component \(k\)
  • \(\mu_k,\Sigma_k\): location and shape of component \(k\)
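The mixture density is easy to evaluate once the component densities are in hand; a 1D NumPy sketch (names illustrative):

```python
import numpy as np

def gauss_pdf(x, mu, var):
    """Univariate normal density N(x | mu, var)."""
    return np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

def mixture_pdf(x, pis, mus, vars_):
    """p(x) = sum_k pi_k * N(x | mu_k, var_k) for a 1D mixture."""
    return sum(p * gauss_pdf(x, m, v) for p, m, v in zip(pis, mus, vars_))

# Two-component example: weights must sum to 1 for a valid density.
print(mixture_pdf(2.0, [0.6, 0.4], [1.0, 4.0], [1.0, 1.0]))  # ≈ 0.1668
```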

10.2 Geometry vs K-Means

  • K-means: spherical partition, hard boundaries
  • GMM: ellipses/hyperellipsoids, soft boundaries

GMM handles overlap naturally: one point can belong partly to multiple clusters.


11. EM Algorithm for GMM: Full Intuition

Direct maximization of $$ \log p(X)=\sum_{i=1}^{N}\log\left(\sum_{k=1}^{K}\pi_k\mathcal{N}(x_i\mid\mu_k,\Sigma_k)\right) $$ is difficult because of the log-sum coupling.

EM introduces latent component indicator \(z_i\) and alternates:

flowchart LR
  A["Initialize pi, mu, Sigma"] --> B["E-step: compute responsibilities gamma_ik"]
  B --> C["M-step: update pi, mu, Sigma using gamma_ik"]
  C --> D["Compute log-likelihood"]
  D --> E{"Converged?"}
  E -- "No" --> B
  E -- "Yes" --> F["Return parameters"]

11.1 E-step

\[ \gamma_{ik}=P(z_i=k\mid x_i)= \frac{\pi_k\,\mathcal{N}(x_i\mid\mu_k,\Sigma_k)}{\sum_{j=1}^{K}\pi_j\,\mathcal{N}(x_i\mid\mu_j,\Sigma_j)} \]

\(\gamma_{ik}\) is the soft cluster membership weight.

11.2 M-step

\[ N_k = \sum_{i=1}^{N}\gamma_{ik}, \quad \mu_k = \frac{1}{N_k}\sum_{i=1}^{N}\gamma_{ik}x_i, \quad \Sigma_k = \frac{1}{N_k}\sum_{i=1}^{N}\gamma_{ik}(x_i-\mu_k)(x_i-\mu_k)^T, \quad \pi_k = \frac{N_k}{N} \]
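Both steps can be checked numerically; a 1D sketch of one full EM iteration (names illustrative; `vars_` holds the variances \(\sigma_k^2\)):

```python
import numpy as np

def em_step(X, pis, mus, vars_):
    """One EM iteration for a 1D Gaussian mixture."""
    # E-step: responsibilities gamma_ik, shape (N, K)
    dens = np.exp(-(X[:, None] - mus[None, :]) ** 2 / (2 * vars_)) \
           / np.sqrt(2 * np.pi * vars_)
    gamma = pis * dens
    gamma /= gamma.sum(axis=1, keepdims=True)
    # M-step: closed-form weighted updates
    Nk = gamma.sum(axis=0)
    mus = (gamma * X[:, None]).sum(axis=0) / Nk
    vars_ = (gamma * (X[:, None] - mus) ** 2).sum(axis=0) / Nk
    pis = Nk / len(X)
    return pis, mus, vars_

X = np.array([1., 2., 3., 10., 11., 12.])
pis, mus, vars_ = np.array([.5, .5]), np.array([2., 11.]), np.array([1., 1.])
for _ in range(20):
    pis, mus, vars_ = em_step(X, pis, mus, vars_)
print(mus, pis)  # means stay at [2, 11]; weights at [0.5, 0.5]
```

On well-separated data like this, the parameters settle after a handful of iterations.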

11.3 Convergence

EM monotonically increases data log-likelihood each iteration and converges to a local optimum.


12. Numerical Example: One EM Responsibility Calculation

For one point \(x=2\), assume two 1D components:

\[ \pi_1=0.6,\ \mu_1=1,\ \sigma_1^2=1, \quad \pi_2=0.4,\ \mu_2=4,\ \sigma_2^2=1 \]

Given: $$ \mathcal{N}(2\mid1,1)=0.242, \quad \mathcal{N}(2\mid4,1)=0.054 $$

Then: $$ \gamma_{1}= \frac{0.6\cdot0.242}{0.6\cdot0.242 + 0.4\cdot0.054} \approx 0.87, \quad \gamma_2 \approx 0.13 $$

Meaning: point \(x=2\) mostly belongs to component 1, but assignment is uncertain (soft).
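The responsibility arithmetic above is easy to verify (a minimal sketch):

```python
import numpy as np

def npdf(x, mu, var):
    """Univariate normal density N(x | mu, var)."""
    return np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

# Numerators pi_k * N(x | mu_k, sigma_k^2), then normalize.
w1, w2 = 0.6 * npdf(2, 1, 1), 0.4 * npdf(2, 4, 1)
gamma1 = w1 / (w1 + w2)
print(round(gamma1, 2), round(1 - gamma1, 2))  # 0.87 0.13
```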


13. K-Means and GMM Relationship

K-means can be viewed as a limiting special case of GMM under strong constraints:

  • equal spherical covariances
  • hard assignment approximation (responsibilities collapse to 0/1 as the shared variance \(\to 0\))
  • equal or simplified priors in practice

This is why K-means is fast but less expressive.


14. K-Means vs GMM vs KNN (Clarification)

  • K-means: unsupervised clustering, hard labels
  • GMM: unsupervised clustering, probabilistic soft labels
  • KNN: supervised nearest-neighbor prediction (classification/regression), not a clustering objective

If the task is unlabeled grouping, use K-means/GMM. If labels exist and nearest-neighbor decision boundaries are desired, use KNN.


15. Practical Use Cases with Model Choice

  1. Customer Segmentation: start with a K-means baseline (fast, interpretable centroids); upgrade to GMM when segments overlap
  2. Image Color Quantization: K-means is strong and efficient
  3. Fraud/Risk Bucketing: GMM is useful when confidence of membership matters
  4. Speech/Signal Pattern Grouping: GMM handles varying variance patterns better than K-means

16. Troubleshooting and Diagnostics

16.1 Symptoms -> Likely Causes

  • Unstable clusters across runs -> poor initialization or weak cluster structure
  • One giant cluster + many tiny ones -> wrong \(k\), scaling issues, outliers
  • Low silhouette for all \(k\) -> no strong cluster geometry in chosen features
  • Poor business interpretability -> features not aligned to target behavior

16.2 Mitigation Playbook

  • Scale features
  • Remove or cap outliers
  • Try PCA before clustering for noisy high dimensions
  • Increase n_init
  • Compare K-means against GMM and hierarchical baselines
  • Use stability score (ARI/NMI across bootstrap samples)

17. Implementation Notes for ML Systems

  • Track drift by monitoring centroid shift and cluster-size changes.
  • Refit schedule can be periodic or trigger-based.
  • For very large data, use mini-batch K-means.
  • For GMM, choose covariance type (full, diag, tied, spherical) by validation objective and data geometry.
  • Use BIC/AIC to compare GMM models with different \(k\) and covariance complexity.
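Comparing models with BIC requires the free-parameter count of each candidate; a sketch of the standard counts per covariance type (function names illustrative):

```python
import numpy as np

def gmm_bic(log_likelihood, n_params, n_samples):
    """BIC = -2 log L + p log N; lower is better."""
    return -2.0 * log_likelihood + n_params * np.log(n_samples)

def gmm_param_count(k, d, cov_type="full"):
    """Free parameters: (k-1) weights + k*d means + covariance parameters."""
    cov = {"full": k * d * (d + 1) // 2,   # k symmetric d x d matrices
           "diag": k * d,                  # k diagonal vectors
           "tied": d * (d + 1) // 2,       # one shared matrix
           "spherical": k}[cov_type]       # one variance per component
    return (k - 1) + k * d + cov

print(gmm_param_count(3, 2, "full"))  # 2 + 6 + 9 = 17
```

With equal fit, the simpler covariance type wins on BIC because of its smaller parameter penalty.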

18. Exam and Interview Quick Revision

  1. K-means minimizes WCSS using alternating assignment and mean update.
  2. Mean is optimal under squared-distance distortion.
  3. K-means converges monotonically to a local optimum.
  4. K-means is sensitive to initialization, scaling, and outliers.
  5. GMM defines data density as weighted Gaussian components.
  6. EM alternates responsibility estimation and parameter re-estimation.
  7. GMM provides soft membership and captures elliptical clusters.
  8. KNN is not a clustering algorithm.
