
K-Means, Gaussian Mixture Models, and EM: Mathematical and Practical Notes

1. Why Clustering Matters

Clustering is unsupervised structure discovery. Given data points $$ X = \{x_1, x_2, \dots, x_N\}, \quad x_i \in \mathbb{R}^d $$ we seek groups such that points in the same group are similar and points in different groups are dissimilar.

In practice, clustering is used when labels are missing but decisions are still needed:

  • Customer segment design
  • Product catalog grouping
  • Compression and prototype learning
  • Preprocessing before supervised learning
  • Candidate anomaly detection


2. K-Means: Core Idea and Geometry

K-means represents each cluster by one centroid (mean vector). Every point belongs to exactly one cluster.

2.1 Objective Function

K-means minimizes within-cluster sum of squares (WCSS, also called inertia):

\[ \min_{\{S_j\}_{j=1}^{k}} \sum_{j=1}^{k} \sum_{x_i \in S_j} \|x_i - \mu_j\|_2^2 \]

Equivalent assignment form:

\[ J(c,\mu) = \sum_{i=1}^{N} \|x_i - \mu_{c_i}\|_2^2, \quad c_i \in \{1,\dots,k\} \]

Where:

  • \(c_i\): cluster index assigned to point \(x_i\)
  • \(\mu_j\): centroid of cluster \(j\)
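The objective \(J\) is straightforward to evaluate directly; a minimal NumPy sketch (function and variable names are illustrative):

```python
import numpy as np

def inertia(X, labels, centroids):
    """Within-cluster sum of squares J(c, mu) from the objective above."""
    # Each point contributes its squared distance to its assigned centroid.
    return float(((X - centroids[labels]) ** 2).sum())

# Toy check: two clusters, each point exactly 1 away from its centroid.
X = np.array([[0., 0.], [2., 0.], [10., 0.], [12., 0.]])
labels = np.array([0, 0, 1, 1])
centroids = np.array([[1., 0.], [11., 0.]])
print(inertia(X, labels, centroids))  # 4.0
```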

2.2 Geometric Interpretation

For fixed centroids, space is partitioned into Voronoi cells: each point is mapped to its nearest centroid.

flowchart LR
  A["Data Space"] --> B["Choose k centroids"]
  B --> C["Nearest-centroid partition (Voronoi regions)"]
  C --> D["Update centroids to region means"]
  D --> C

Decision boundaries between centroids are perpendicular bisectors; therefore K-means naturally prefers roughly spherical/equal-variance groups in Euclidean space.


3. K-Means Algorithm as Coordinate Descent

K-means is alternating minimization over two variable blocks:

  • assignments \(c\)
  • centroids \(\mu\)

3.1 Assignment Step

\[ c_i \leftarrow \arg\min_j \|x_i - \mu_j\|_2^2 \]

3.2 Update Step

\[ \mu_j \leftarrow \frac{1}{|S_j|} \sum_{x_i \in S_j} x_i \]

3.3 Why the Mean Appears (Important Derivation)

For one cluster with points \(\{x_i\}\), minimize:

\[ f(\mu) = \sum_i \|x_i - \mu\|_2^2 \]

Set gradient to zero:

\[ \nabla_\mu f(\mu) = -2\sum_i (x_i-\mu)=0 \Rightarrow \mu = \frac{1}{n}\sum_i x_i \]

So the centroid update is the exact minimizer under squared distance.
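Both steps fit in a few lines of NumPy; a sketch of one coordinate-descent pass (names illustrative):

```python
import numpy as np

def kmeans_step(X, centroids):
    """One coordinate-descent pass: assignment step, then mean-update step."""
    # Assignment: index of the nearest centroid under squared Euclidean distance.
    d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    labels = d2.argmin(axis=1)
    # Update: each centroid moves to the mean of its assigned points.
    new_centroids = np.array([X[labels == j].mean(axis=0)
                              for j in range(len(centroids))])
    return labels, new_centroids

# 1D sanity check (points as 1-column vectors): means of {1,2,3} and {10,11,12}.
X = np.array([[1.], [2.], [3.], [10.], [11.], [12.]])
labels, mu = kmeans_step(X, np.array([[2.], [11.]]))
print(mu.ravel())  # [ 2. 11.] — centroids already sit at the per-cluster means
```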

3.4 Convergence Property

Each step never increases \(J\). Since \(J \ge 0\), the sequence converges to a local optimum (not guaranteed global optimum).


4. Detailed Numerical Example 1 (2D, Full Iterations)

Data:

\[ (1,1),(1,2),(2,1),(8,8),(9,8),(8,9) \]

Set \(k=2\), initialize:

\[ \mu_1^{(0)}=(1,1), \quad \mu_2^{(0)}=(8,8) \]

Iteration 1: Assignment

Cluster 1: \((1,1),(1,2),(2,1)\)

Cluster 2: \((8,8),(9,8),(8,9)\)

Iteration 1: Centroid Update

\[ \mu_1^{(1)}=\left(\frac{4}{3},\frac{4}{3}\right), \quad \mu_2^{(1)}=\left(\frac{25}{3},\frac{25}{3}\right) \]

Assignments remain unchanged on the next pass, so the algorithm has converged.

Final Inertia Calculation

For cluster 1: $$ \|(1,1)-(4/3,4/3)\|^2=2/9 $$ $$ \|(1,2)-(4/3,4/3)\|^2=5/9 $$ $$ \|(2,1)-(4/3,4/3)\|^2=5/9 $$ Sum \(=12/9=4/3\).

For cluster 2, symmetric result \(=4/3\).

Total: $$ J^* = \frac{8}{3} $$
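The full iteration and the final inertia above can be verified numerically; a short NumPy sketch:

```python
import numpy as np

X = np.array([[1, 1], [1, 2], [2, 1], [8, 8], [9, 8], [8, 9]], dtype=float)
mu = np.array([[1, 1], [8, 8]], dtype=float)  # initial centroids mu^(0)

for _ in range(10):  # far more iterations than this data needs
    d2 = ((X[:, None] - mu[None, :]) ** 2).sum(axis=2)  # squared distances
    labels = d2.argmin(axis=1)                          # assignment step
    mu = np.array([X[labels == j].mean(axis=0) for j in range(2)])  # update step

J = ((X - mu[labels]) ** 2).sum()
print(mu)  # [[1.333... 1.333...] [8.333... 8.333...]]
print(J)   # 2.666... = 8/3
```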


5. Detailed Numerical Example 2 (1D, Bad Initialization Effect)

Data: $$ 1,2,3,10,11,12 $$ Set \(k=2\).

Case A (Good Initialization)

Initial centroids: \(\mu_1=2, \mu_2=11\)

Converges to:

  • Cluster A: \(1,2,3\), centroid \(2\)
  • Cluster B: \(10,11,12\), centroid \(11\)

Inertia: $$ (1-2)^2+(2-2)^2+(3-2)^2 + (10-11)^2+(11-11)^2+(12-11)^2 = 4 $$

Case B (Poor Initialization)

Initial centroids: \(\mu_1=1, \mu_2=3\)

Early steps can place both centroids near left points before one eventually jumps right; for more complex data this can trap the model in weaker local minima.

Key lesson: initialization quality strongly affects final result.


6. Initialization Strategy: Why k-means++ Helps

k-means++ spreads the initial centroids apart: each new centroid is sampled with probability proportional to its squared distance from the nearest centroid already chosen.

Benefits:

  • Better spread of initial centers
  • Lower expected inertia
  • Fewer poor local minima
  • Faster convergence in practice
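A minimal sketch of the seeding rule, assuming a NumPy `Generator` for randomness (function name illustrative):

```python
import numpy as np

def kmeans_pp_init(X, k, rng):
    """k-means++ seeding: first center uniform at random; each later center
    sampled with probability proportional to the squared distance to the
    nearest center chosen so far."""
    centers = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        d2 = np.min([((X - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.array(centers)

X = np.array([[1., 1.], [1., 2.], [2., 1.], [8., 8.], [9., 8.], [8., 9.]])
seeds = kmeans_pp_init(X, 2, np.random.default_rng(0))
print(seeds)  # two distinct data points, biased toward opposite groups
```

Because an already-chosen point has squared distance zero, the same point can never be picked twice.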


7. Choosing the Number of Clusters \(k\)

7.1 Elbow Method

Compute inertia for \(k=1,2,\dots,k_{\max}\) and choose the bend point where additional clusters stop yielding large inertia reductions.

flowchart LR
  A["k=1: very high inertia"] --> B["k increases: sharp drop"]
  B --> C["elbow region"]
  C --> D["after elbow: diminishing returns"]

7.2 Silhouette Score

For each point \(i\): $$ a_i = \text{mean distance to points in its own cluster}, \quad b_i = \text{smallest mean distance to points of any other cluster} $$ $$ s_i = \frac{b_i-a_i}{\max(a_i,b_i)} $$

Interpretation:

  • \(s_i\approx1\): well clustered
  • \(s_i\approx0\): on a boundary
  • \(s_i<0\): likely in the wrong cluster
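The silhouette formula can be computed directly; a small NumPy sketch for Euclidean distance (function name illustrative; assumes every cluster has at least two points):

```python
import numpy as np

def silhouette(X, labels):
    """Per-point silhouette s_i = (b_i - a_i) / max(a_i, b_i)."""
    n = len(X)
    D = np.sqrt(((X[:, None] - X[None, :]) ** 2).sum(axis=2))  # pairwise distances
    s = np.zeros(n)
    for i in range(n):
        own = (labels == labels[i]) & (np.arange(n) != i)
        a = D[i, own].mean()  # a_i: mean distance to the rest of its own cluster
        b = min(D[i, labels == c].mean()
                for c in set(labels.tolist()) - {labels[i]})  # b_i: best other cluster
        s[i] = (b - a) / max(a, b)
    return s

# Well-separated 1D data: silhouettes are close to 1.
X = np.array([[1.], [2.], [3.], [10.], [11.], [12.]])
s = silhouette(X, np.array([0, 0, 0, 1, 1, 1]))
print(s.round(3))  # all values above 0.8
```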

7.3 Model Selection in Practice

Use jointly:

  • inertia curve
  • silhouette trend
  • cluster stability across random seeds
  • domain interpretability


8. K-Means Edge Cases and Failure Modes

8.1 Non-Spherical Structure

If true clusters are moon-shaped, concentric, or elongated, K-means partitions incorrectly.

8.2 Unequal Density and Size

A small, dense cluster near a large, sparse one is often absorbed into the larger cluster.

8.3 Outlier Sensitivity

A single extreme point can pull a centroid far from the bulk of its cluster, because means are not robust to outliers.

8.4 Scale Sensitivity

Distance is scale dependent; always normalize features when units differ.

8.5 Empty Cluster

If a centroid receives no points:

  • Reinitialize it to the farthest point
  • Or split the highest-variance cluster

8.6 High-Dimensional Sparse Data

Euclidean distance concentration degrades cluster signal. Consider dimensionality reduction or cosine-based alternatives.


9. Practical K-Means Pipeline

  1. Clean missing values and outliers
  2. Scale numeric features
  3. Run K-means++ with multiple initializations
  4. Select \(k\) via elbow + silhouette + stability
  5. Interpret centroids in original feature space
  6. Validate with downstream KPI lift or business utility

9.1 Pseudocode

Input: X, k, max_iter, tol
X <- scale(X)
mu <- kmeans_plus_plus_init(X, k)
repeat
  assign each x_i to nearest mu_j
  recompute each mu_j as mean of assigned points
  compute centroid_shift
until no assignment change or centroid_shift < tol
return labels, centroids, inertia
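A minimal NumPy translation of the pseudocode, substituting random seeding for k-means++ to keep it short (a sketch, not production code):

```python
import numpy as np

def kmeans(X, k, max_iter=100, tol=1e-6, seed=0):
    """Lloyd's algorithm; random data points stand in for k-means++ init."""
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), size=k, replace=False)]
    labels = np.zeros(len(X), dtype=int)
    for _ in range(max_iter):
        d2 = ((X[:, None] - mu[None, :]) ** 2).sum(axis=2)   # assignment step
        labels = d2.argmin(axis=1)
        new_mu = np.array([X[labels == j].mean(axis=0) if (labels == j).any()
                           else mu[j] for j in range(k)])    # update step
        shift = float(np.linalg.norm(new_mu - mu))           # centroid_shift
        mu = new_mu
        if shift < tol:
            break
    return labels, mu, float(((X - mu[labels]) ** 2).sum())

X = np.array([[1., 1.], [1., 2.], [2., 1.], [8., 8.], [9., 8.], [8., 9.]])
labels, mu, J = kmeans(X, 2)
print(J)  # 8/3 ≈ 2.667 on the worked example from section 4
```

Empty clusters are handled by leaving their centroid in place, one of the simpler options from section 8.5.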

10. Gaussian Mixture Model (GMM): Probabilistic Clustering

K-means gives hard labels. GMM gives soft assignment probabilities.

10.1 Mixture Density

\[ p(x)=\sum_{k=1}^{K}\pi_k\,\mathcal{N}(x\mid\mu_k,\Sigma_k), \quad \pi_k\ge0, \quad \sum_{k=1}^{K}\pi_k=1 \]

Interpretation:

  • \(\pi_k\): prior probability of component \(k\)
  • \(\mu_k,\Sigma_k\): location and shape of component \(k\)
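The mixture density is easy to evaluate once the component densities are in hand; a 1D NumPy sketch (names illustrative):

```python
import numpy as np

def gauss_pdf(x, mu, var):
    """Univariate normal density N(x | mu, var)."""
    return np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

def mixture_pdf(x, pis, mus, vars_):
    """p(x) = sum_k pi_k * N(x | mu_k, var_k) for a 1D mixture."""
    return sum(p * gauss_pdf(x, m, v) for p, m, v in zip(pis, mus, vars_))

# Two-component example: weights must sum to 1 for a valid density.
print(mixture_pdf(2.0, [0.6, 0.4], [1.0, 4.0], [1.0, 1.0]))  # ≈ 0.1668
```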

10.2 Geometry vs K-Means

  • K-means: spherical partition, hard boundaries
  • GMM: ellipses/hyperellipsoids, soft boundaries

GMM handles overlap naturally: one point can belong partly to multiple clusters.


11. EM Algorithm for GMM: Full Intuition

Direct maximization of $$ \log p(X)=\sum_{i=1}^{N}\log\left(\sum_{k=1}^{K}\pi_k\mathcal{N}(x_i\mid\mu_k,\Sigma_k)\right) $$ is difficult because of the log-sum coupling.

EM introduces latent component indicator \(z_i\) and alternates:

flowchart LR
  A["Initialize pi, mu, Sigma"] --> B["E-step: compute responsibilities gamma_ik"]
  B --> C["M-step: update pi, mu, Sigma using gamma_ik"]
  C --> D["Compute log-likelihood"]
  D --> E{"Converged?"}
  E -- "No" --> B
  E -- "Yes" --> F["Return parameters"]

11.1 E-step

\[ \gamma_{ik}=P(z_i=k\mid x_i)= \frac{\pi_k\,\mathcal{N}(x_i\mid\mu_k,\Sigma_k)}{\sum_{j=1}^{K}\pi_j\,\mathcal{N}(x_i\mid\mu_j,\Sigma_j)} \]

\(\gamma_{ik}\) is the soft cluster membership weight.

11.2 M-step

\[ N_k = \sum_{i=1}^{N}\gamma_{ik}, \quad \mu_k = \frac{1}{N_k}\sum_{i=1}^{N}\gamma_{ik}x_i, \quad \Sigma_k = \frac{1}{N_k}\sum_{i=1}^{N}\gamma_{ik}(x_i-\mu_k)(x_i-\mu_k)^T, \quad \pi_k = \frac{N_k}{N} \]
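Both steps can be checked numerically; a 1D sketch of one full EM iteration (names illustrative; `vars_` holds the variances \(\sigma_k^2\)):

```python
import numpy as np

def em_step(X, pis, mus, vars_):
    """One EM iteration for a 1D Gaussian mixture."""
    # E-step: responsibilities gamma_ik, shape (N, K)
    dens = np.exp(-(X[:, None] - mus[None, :]) ** 2 / (2 * vars_)) \
           / np.sqrt(2 * np.pi * vars_)
    gamma = pis * dens
    gamma /= gamma.sum(axis=1, keepdims=True)
    # M-step: closed-form weighted updates
    Nk = gamma.sum(axis=0)
    mus = (gamma * X[:, None]).sum(axis=0) / Nk
    vars_ = (gamma * (X[:, None] - mus) ** 2).sum(axis=0) / Nk
    pis = Nk / len(X)
    return pis, mus, vars_

X = np.array([1., 2., 3., 10., 11., 12.])
pis, mus, vars_ = np.array([.5, .5]), np.array([2., 11.]), np.array([1., 1.])
for _ in range(20):
    pis, mus, vars_ = em_step(X, pis, mus, vars_)
print(mus, pis)  # means stay at [2, 11]; weights at [0.5, 0.5]
```

On well-separated data like this, the parameters settle after a handful of iterations.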

11.3 Convergence

EM monotonically increases data log-likelihood each iteration and converges to a local optimum.


12. Numerical Example: One EM Responsibility Calculation

For one point \(x=2\), assume two 1D components:

\[ \pi_1=0.6,\ \mu_1=1,\ \sigma_1^2=1, \quad \pi_2=0.4,\ \mu_2=4,\ \sigma_2^2=1 \]

Given: $$ \mathcal{N}(2\mid1,1)=0.242, \quad \mathcal{N}(2\mid4,1)=0.054 $$

Then: $$ \gamma_{1}= \frac{0.6\cdot0.242}{0.6\cdot0.242 + 0.4\cdot0.054} \approx 0.87, \quad \gamma_2 \approx 0.13 $$

Meaning: point \(x=2\) mostly belongs to component 1, but assignment is uncertain (soft).
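The responsibility arithmetic above is easy to verify (a minimal sketch):

```python
import numpy as np

def npdf(x, mu, var):
    """Univariate normal density N(x | mu, var)."""
    return np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

# Numerators pi_k * N(x | mu_k, sigma_k^2), then normalize.
w1, w2 = 0.6 * npdf(2, 1, 1), 0.4 * npdf(2, 4, 1)
gamma1 = w1 / (w1 + w2)
print(round(gamma1, 2), round(1 - gamma1, 2))  # 0.87 0.13
```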


13. K-Means and GMM Relationship

K-means can be viewed as a limiting special case of GMM under strong constraints:

  • equal spherical covariances
  • hard assignment approximation (responsibilities collapse to 0/1 as the shared variance \(\to 0\))
  • equal or simplified priors in practice

This is why K-means is fast but less expressive.


14. K-Means vs GMM vs KNN (Clarification)

  • K-means: unsupervised clustering, hard labels
  • GMM: unsupervised clustering, probabilistic soft labels
  • KNN: supervised nearest-neighbor prediction (classification/regression), not a clustering objective

If the task is unlabeled grouping, use K-means/GMM. If labels exist and nearest-neighbor decision boundaries are desired, use KNN.


15. Practical Use Cases with Model Choice

  1. Customer Segmentation: start with a K-means baseline (fast, interpretable centroids); upgrade to GMM when segments overlap
  2. Image Color Quantization: K-means is strong and efficient
  3. Fraud/Risk Bucketing: GMM is useful when confidence of membership matters
  4. Speech/Signal Pattern Grouping: GMM handles varying variance patterns better than K-means

16. Troubleshooting and Diagnostics

16.1 Symptoms -> Likely Causes

  • Unstable clusters across runs -> poor initialization or weak cluster structure
  • One giant cluster + many tiny ones -> wrong \(k\), scaling issues, outliers
  • Low silhouette for all \(k\) -> no strong cluster geometry in chosen features
  • Poor business interpretability -> features not aligned to target behavior

16.2 Mitigation Playbook

  • Scale features
  • Remove or cap outliers
  • Try PCA before clustering for noisy high dimensions
  • Increase n_init
  • Compare K-means against GMM and hierarchical baselines
  • Use stability score (ARI/NMI across bootstrap samples)

17. Implementation Notes for ML Systems

  • Track drift by monitoring centroid shift and cluster-size changes.
  • Refit schedule can be periodic or trigger-based.
  • For very large data, use mini-batch K-means.
  • For GMM, choose covariance type (full, diag, tied, spherical) by validation objective and data geometry.
  • Use BIC/AIC to compare GMM models with different \(k\) and covariance complexity.
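Comparing models with BIC requires the free-parameter count of each candidate; a sketch of the standard counts per covariance type (function names illustrative):

```python
import numpy as np

def gmm_bic(log_likelihood, n_params, n_samples):
    """BIC = -2 log L + p log N; lower is better."""
    return -2.0 * log_likelihood + n_params * np.log(n_samples)

def gmm_param_count(k, d, cov_type="full"):
    """Free parameters: (k-1) weights + k*d means + covariance parameters."""
    cov = {"full": k * d * (d + 1) // 2,   # k symmetric d x d matrices
           "diag": k * d,                  # k diagonal vectors
           "tied": d * (d + 1) // 2,       # one shared matrix
           "spherical": k}[cov_type]       # one variance per component
    return (k - 1) + k * d + cov

print(gmm_param_count(3, 2, "full"))  # 2 + 6 + 9 = 17
```

With equal fit, the simpler covariance type wins on BIC because of its smaller parameter penalty.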

18. Exam and Interview Quick Revision

  1. K-means minimizes WCSS using alternating assignment and mean update.
  2. Mean is optimal under squared-distance distortion.
  3. K-means converges monotonically to a local optimum.
  4. K-means is sensitive to initialization, scaling, and outliers.
  5. GMM defines data density as weighted Gaussian components.
  6. EM alternates responsibility estimation and parameter re-estimation.
  7. GMM provides soft membership and captures elliptical clusters.
  8. KNN is not a clustering algorithm.
