
Principal Component Analysis (PCA)

PCA is a linear dimensionality reduction method that transforms correlated features into orthogonal components while preserving maximum possible variance.


1. Core Definitions

1.1 Principal Component

A principal component is a direction in feature space (a unit vector) along which projected data variance is extreme (maximum for PC1, next maximum for PC2, etc.).

1.2 Variance

Variance measures spread around the mean. In PCA, it quantifies how much information is present along a direction.

1.3 Covariance Matrix

For centered data \(X_c \in \mathbb{R}^{N\times D}\):

\[ C = \frac{X_c^T X_c}{N-1} \]
  • \(C\) is \(D \times D\), square and symmetric.
  • Diagonal entries are feature variances.
  • Off-diagonal entries are feature covariances.
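These three properties can be checked directly in a short numpy sketch (the data values below are illustrative only):

```python
import numpy as np

# illustrative 5x3 data matrix (N=5 samples, D=3 features)
X = np.array([[1.0, 2.0, 0.5],
              [2.0, 1.0, 1.5],
              [3.0, 4.0, 2.5],
              [4.0, 3.0, 3.5],
              [5.0, 5.0, 4.0]])

Xc = X - X.mean(axis=0)            # mean-center
C = Xc.T @ Xc / (X.shape[0] - 1)   # D x D covariance matrix

print(C.shape)                                           # (3, 3): square
print(np.allclose(C, C.T))                               # True: symmetric
print(np.allclose(np.diag(C), Xc.var(axis=0, ddof=1)))   # True: diagonal = feature variances
```

The same matrix is returned by `np.cov(X, rowvar=False)`, which is a convenient cross-check.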

1.4 Eigenvector and Eigenvalue

If \(CW = \lambda W\), then:
  • \(W\): eigenvector (direction)
  • \(\lambda\): eigenvalue (variance captured along that direction in PCA)

1.5 Projection Score

For sample \(x_i\), projection score on direction \(W\):

\[ z_i = x_i^T W \]

For all samples:

\[ Z = X_c W \]

These scores are coordinates of samples in the new PCA axis system.


2. Geometric Interpretation

2.1 Geometric Meaning of Variance

Variance along a direction is the average squared length of orthogonal shadows (projections) of points onto that direction.

  • Large variance: points are spread far apart along that axis.
  • Small variance: points are tightly packed along that axis.

PCA finds a rotation of axes where the first rotated axis sees maximum spread.

2.2 Geometric Meaning of Projection Scores

Each score \(z_i\) is a signed distance of point \(x_i\) from origin along the component direction.

  • Positive score: point lies in direction of \(W\)
  • Negative score: point lies opposite to \(W\)
  • Magnitude \(|z_i|\): how far along that axis

So PCA is a coordinate change from original axes to orthogonal component axes.


3. Mathematical Derivation of PCA

Let \(W\) be a unit direction:

\[ W^T W = 1 \]

Projected vector:

\[ Z = X_c W \]

Projected variance:

\[ \operatorname{Var}(Z) = \frac{1}{N-1} Z^T Z = \frac{1}{N-1}(X_cW)^T(X_cW) = W^T\left(\frac{X_c^T X_c}{N-1}\right)W = W^T C W \]

Optimization problem:

\[ \max_W \; W^T C W \quad \text{s.t.} \quad W^T W = 1 \]

Lagrangian:

\[ \mathcal{L}(W,\lambda)=W^T C W - \lambda(W^TW-1) \]

Stationarity:

\[ \frac{\partial \mathcal{L}}{\partial W}=2CW-2\lambda W=0 \Rightarrow CW=\lambda W \]

Hence principal directions are eigenvectors of \(C\).

Also,

\[ W^T C W = W^T(\lambda W)=\lambda(W^TW)=\lambda \]

So variance along a principal direction equals its eigenvalue.
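A quick numerical sanity check of this identity, using randomly generated data (a sketch, not part of the derivation):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
Xc = X - X.mean(axis=0)
C = Xc.T @ Xc / (Xc.shape[0] - 1)

# eigh returns eigenvalues in ascending order for the symmetric matrix C
eigvals, eigvecs = np.linalg.eigh(C)
w = eigvecs[:, -1]                 # top eigenvector (unit norm)
lam = eigvals[-1]

# variance of the projected scores equals the eigenvalue
z = Xc @ w
print(np.isclose(z.var(ddof=1), lam))   # True
print(np.isclose(w @ C @ w, lam))       # True
```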


4. Total Variance and Its Relation to Eigenvalues

4.1 Total Variance in Original Space

Total variance of a centered dataset equals sum of feature variances:

\[ \text{Total Variance} = \sum_{j=1}^{D} \operatorname{Var}(X_j) \]

Using covariance matrix:

\[ \text{Total Variance} = \operatorname{tr}(C) \]

where \(\operatorname{tr}(C)\) is the trace (sum of diagonal entries).

4.2 Relation to Eigenvalues

For symmetric \(C\), trace equals sum of eigenvalues:

\[ \operatorname{tr}(C)=\sum_{j=1}^{D}\lambda_j \]

Therefore:

\[ \text{Total Variance} = \sum_{j=1}^{D}\lambda_j \]

This is why explained variance ratio is:

\[ \text{PVE}_k = \frac{\lambda_k}{\sum_{j=1}^{D}\lambda_j} \]

and cumulative explained variance for first \(k\) components is:

\[ \text{CumVar}(k)=\frac{\sum_{i=1}^{k}\lambda_i}{\sum_{j=1}^{D}\lambda_j} \]
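A minimal sketch of these two formulas, using a hypothetical set of sorted eigenvalues:

```python
import numpy as np

# hypothetical sorted eigenvalues of a 4-feature covariance matrix
lams = np.array([5.0, 3.0, 1.5, 0.5])

pve = lams / lams.sum()            # per-component explained variance ratio
cumvar = np.cumsum(pve)            # cumulative explained variance

print(pve)       # [0.5  0.3  0.15 0.05]
print(cumvar)    # [0.5  0.8  0.95 1.  ]
```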

5. PCA Through Eigendecomposition (Classical Route)

  1. Mean-center: \(X_c = X-\bar{X}\)
  2. Compute covariance: \(C=X_c^TX_c/(N-1)\)
  3. Solve \(CW=\lambda W\)
  4. Sort eigenvalues descending
  5. Keep top \(k\) eigenvectors \(W_k\)
  6. Compute scores: \(Z=X_cW_k\)

5.1 What Each Step Is Technically Doing

  • Mean-centering removes location bias so PCA captures spread, not mean offset.
  • Covariance computation converts raw coordinates into pairwise second-order structure.
  • Eigendecomposition finds directions that diagonalize covariance (decorrelated axes).
  • Sorting eigenpairs ranks directions by information content (variance captured).
  • Truncation to \(k\) performs controlled compression.
  • Projection maps original samples into compact latent coordinates.
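The six steps of the classical route can be sketched as a small numpy function (random data used purely for demonstration):

```python
import numpy as np

def pca_eig(X, k):
    """PCA via eigendecomposition of the covariance matrix."""
    mu = X.mean(axis=0)
    Xc = X - mu                                  # 1. mean-center
    C = Xc.T @ Xc / (X.shape[0] - 1)             # 2. covariance
    eigvals, eigvecs = np.linalg.eigh(C)         # 3. solve CW = lambda W
    order = np.argsort(eigvals)[::-1]            # 4. sort eigenvalues descending
    Wk = eigvecs[:, order[:k]]                   # 5. keep top k eigenvectors
    Z = Xc @ Wk                                  # 6. scores
    return Z, Wk, eigvals[order]

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 5))
Z, Wk, lams = pca_eig(X, 2)
print(Z.shape)   # (100, 2)
```

Because the kept directions are eigenvectors, the sample covariance of `Z` is diagonal with the top eigenvalues on the diagonal, which is a convenient correctness check.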

6. PCA Through SVD

For centered matrix \(X_c\):

\[ X_c = U\Sigma V^T \]
  • Columns of \(V\): principal directions (same as covariance eigenvectors)
  • Singular values: \(\sigma_i\)
  • PCA eigenvalues:
\[ \lambda_i = \frac{\sigma_i^2}{N-1} \]
  • PCA scores (component coordinates):
\[ Z = X_c V = U\Sigma \]

So SVD gives both component directions and scores directly, and it is often more numerically stable than forming \(C\) explicitly.
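A short numpy check that the SVD route and the covariance route agree (random illustrative data):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(50, 3))
Xc = X - X.mean(axis=0)
N = Xc.shape[0]

# SVD route
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
lams_svd = S**2 / (N - 1)

# covariance route
C = Xc.T @ Xc / (N - 1)
lams_eig = np.sort(np.linalg.eigvalsh(C))[::-1]

print(np.allclose(lams_svd, lams_eig))   # True: lambda_i = sigma_i^2 / (N-1)
print(np.allclose(Xc @ Vt.T, U * S))     # True: Z = Xc V = U Sigma
```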


7. Numerical Example 1 (SVD, Rank-1 Data)

Given:

\[ X = \begin{bmatrix} 1 & 1 \\ 2 & 3 \\ 3 & 5 \end{bmatrix} \]

Centered:

\[ X_c=\begin{bmatrix} -1 & -2 \\ 0 & 0 \\ 1 & 2 \end{bmatrix} \]

Compute:

\[ X_c^T X_c = \begin{bmatrix}2&4\\4&8\end{bmatrix} \]

Eigenvalues of \(X_c^TX_c\): \(10, 0\)

So singular values are:

\[ \sigma_1=\sqrt{10},\quad \sigma_2=0 \]

With \(N=3\), covariance eigenvalues are:

\[ \lambda_1=\frac{10}{2}=5,\quad \lambda_2=\frac{0}{2}=0 \]

Explained variance:

\[ \text{PVE}_1=1,\quad \text{PVE}_2=0 \]

Interpretation: one principal axis captures all variance (data lies on a line).
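This example can be verified in a few lines of numpy (the second singular value comes back as a tiny floating-point number rather than exactly zero):

```python
import numpy as np

X = np.array([[1.0, 1.0],
              [2.0, 3.0],
              [3.0, 5.0]])
Xc = X - X.mean(axis=0)

S = np.linalg.svd(Xc, compute_uv=False)   # singular values of the centered data
lams = S**2 / (X.shape[0] - 1)            # covariance eigenvalues

print(np.round(S**2, 6))    # [10.  0.]
print(np.round(lams, 6))    # [5.  0.]
```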


8. Numerical Example 2 (SVD, Isotropic Spread)

Take centered data:

\[ X_c = \begin{bmatrix} 2 & 0 \\ 0 & 2 \\ -2 & 0 \\ 0 & -2 \end{bmatrix} \]

Here \(N=4\). Compute:

\[ X_c^T X_c = \begin{bmatrix}8&0\\0&8\end{bmatrix} \]

Singular values:

\[ \sigma_1=\sigma_2=\sqrt{8}=2\sqrt{2} \]

Covariance eigenvalues:

\[ \lambda_1=\lambda_2=\frac{8}{3} \]

Explained variance:

\[ \text{PVE}_1=\text{PVE}_2=\frac{1}{2} \]

Interpretation: variance is equally distributed across both axes; no strong single dominant component.


9. Projection Scores and Reconstruction

9.1 Scores for Top \(k\) Components

If \(W_k\in\mathbb{R}^{D\times k}\) contains top \(k\) component directions:

\[ Z_k = X_c W_k \]

Each row of \(Z_k\) is the new coordinate of a sample in reduced space.

9.2 Geometric Interpretation

  • PCA rotates axes to an orthogonal basis.
  • Scores are coordinates in rotated space.
  • Keeping top \(k\) means dropping low-variance axes.

9.3 Approximate Reconstruction

From reduced coordinates:

\[ \hat{X}_c = Z_k W_k^T \]

Add mean back:

\[ \hat{X} = \hat{X}_c + \bar{X} \]

Reconstruction error comes from discarded components.

9.4 Reconstruction Error and Discarded Eigenvalues

If all \(D\) components are used, reconstruction is exact (for centered data).
If only top \(k\) components are kept, the minimum mean-squared reconstruction error is governed by discarded variance:

\[ \text{Error} \propto \sum_{j=k+1}^{D}\lambda_j \]

So eigenvalues directly quantify information loss from dimensionality reduction.
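A numerical check of this relationship on random correlated data; when error is measured as total squared error divided by \(N-1\), the proportionality becomes exact equality:

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 5)) @ rng.normal(size=(5, 5))   # correlated features
Xc = X - X.mean(axis=0)
N = X.shape[0]

U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
lams = S**2 / (N - 1)

k = 2
Wk = Vt[:k].T                                 # top-k directions
Xc_hat = (Xc @ Wk) @ Wk.T                     # project, then reconstruct

err = ((Xc - Xc_hat) ** 2).sum() / (N - 1)    # total squared error / (N-1)
print(np.isclose(err, lams[k:].sum()))        # True: error = sum of discarded eigenvalues
```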


10. Practical Notes

10.1 PCA vs Pairwise Correlation Removal

Comparison of PCA and pairwise correlation removal:

  • Method: PCA applies a global orthogonal transform; pairwise removal drops one feature from each correlated pair.
  • Correlation handling: PCA treats all features jointly; pairwise removal handles pairs only.
  • Interpretability: lower for PCA; higher for pairwise removal.
  • Variance control: explicit via eigenvalues in PCA; not explicit in pairwise removal.

10.2 When PCA Helps

  • Strong multicollinearity
  • High-dimensional inputs
  • Need compact, decorrelated representation

10.3 When PCA Helps Less

  • Eigenvalues are nearly equal
  • Interpretability is critical

10.4 Important Property

PCA is unsupervised: it uses only input \(X\), not target \(Y\).

10.4A Clarification: PCA Transforms Features, It Does Not Pick Original Columns

After PCA, modeling is done on transformed components \((Z_1, Z_2, \dots)\), not directly on original features \((X_1, X_2, \dots)\).
This is a key conceptual point:
  • PCA is a projection to a new basis;
  • it is not simple column dropping.

10.4B Interpretation Trade-off in Business Terms

Without PCA, you can directly explain outcomes using original variables (for example, sales vs TV spend).
After PCA, model inputs become \(Z_1, Z_2, \dots\), which are mixtures of original variables.
Accuracy may improve, but direct business interpretability can decrease.

10.5 Standardization Before PCA (Important in Practice)

If features are on different scales (for example, age vs income), high-scale features can dominate covariance.
In such cases, apply standardization first:

\[ X_{\text{std},j}=\frac{X_j-\mu_j}{s_j} \]

Then run PCA on standardized data (this is equivalent to eigendecomposition of the correlation matrix).
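A small numpy check that standardization makes the sample covariance of the transformed data equal the correlation matrix of the raw data (the feature scales below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)
age = rng.normal(40, 10, size=200)              # small-scale feature
income = rng.normal(60000, 15000, size=200)     # large-scale feature
X = np.column_stack([age, income])

# standardize each column: zero mean, unit variance
X_std = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

C_std = X_std.T @ X_std / (X.shape[0] - 1)      # covariance of standardized data
R = np.corrcoef(X, rowvar=False)                # correlation matrix of raw data

print(np.allclose(C_std, R))                    # True
```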

10.6 Limitations and Failure Modes

  • Outlier sensitivity: variance is second-moment based, so outliers can rotate components.
  • Linear assumption: PCA captures linear structure; nonlinear manifolds may require kernel PCA or manifold methods.
  • No target awareness: high-variance directions are not always most predictive for \(Y\).
  • Interpretability loss: components are mixtures of original variables.

10.7 PCA in a Modeling Pipeline

Typical supervised pipeline:

  1. Split train/test first.
  2. Fit scaler on train only.
  3. Fit PCA on train only.
  4. Transform train and test using the same fitted objects.
  5. Train downstream model on PCA features.

This avoids data leakage and preserves fair evaluation.

10.8 Choosing Number of Components \(k\)

Use one of these practical rules:

  • Explained variance threshold: smallest \(k\) with cumulative variance \(\ge 95\%\) (or \(99\%\)).
  • Elbow method: pick \(k\) at the scree-plot bend.
  • Task-validated \(k\): tune \(k\) by the downstream validation metric.

Best practice is to combine explained variance with validation performance.
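The explained-variance threshold rule can be sketched in a few lines (the eigenvalues below are hypothetical):

```python
import numpy as np

# hypothetical sorted eigenvalues from a fitted PCA
lams = np.array([6.0, 2.0, 1.2, 0.5, 0.2, 0.1])

cumvar = np.cumsum(lams) / lams.sum()
k = int(np.searchsorted(cumvar, 0.95) + 1)   # smallest k with cumulative variance >= 95%

print(np.round(cumvar, 3))   # [0.6  0.8  0.92 0.97 0.99 1.  ]
print(k)                     # 4
```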

10.9 When PCA Reduction Will Be Weak

If eigenvalues are close to each other (no clear dominance), variance is distributed across many directions.
In this case, aggressive reduction can remove useful information and PCA may not reduce dimensions meaningfully.

10.10 Why Covariance Route Works Even When \(X\) Is Not Square

Original data matrix \(X\) is usually \(N \times D\) with \(N \gg D\), so \(X\) is not square.
PCA uses covariance:

\[ C = \frac{X_c^T X_c}{N-1} \]

which is always \(D \times D\), square and symmetric, so eigendecomposition is valid.

10.11 Correlation Filtering vs PCA (Practical Decision)

Pairwise correlation filtering:

  • removes one variable at a time,
  • is manual and combinatorial with many features,
  • preserves original feature names.

PCA:

  • handles all features jointly in one global transformation,
  • yields uncorrelated components by construction,
  • can reduce dimensions more systematically.

Use correlation filtering when interpretability dominates; use PCA when compactness/decorrelation dominates.

10.12 Apply PCA on Full Feature Set, Not Only “Known Correlated” Subset

If the goal is global decorrelation and optimal variance capture, run PCA on the full designed feature set.
Selective pre-subsetting can miss cross-feature structure.


11. Real-World Examples from Research

11.1 Face Recognition (Eigenfaces)

In Turk and Pentland (1991), faces are projected onto a PCA subspace ("eigenfaces").
Core idea:

  • high-dimensional image pixels are compressed to a small set of principal components,
  • recognition is done in the compressed score space,
  • this reduces computation while preserving discriminative variation.

11.2 Genomics / Gene Expression Analysis

Ringner (Nature Biotechnology, 2008) explains PCA in genome-wide expression studies where thousands of genes are measured per sample.
Core value:

  • reveals dominant biological variation,
  • helps detect structure (subtypes, batch effects, outliers),
  • provides low-dimensional representations for downstream modeling.

11.3 Historical Foundation

PCA formalization in Hotelling (1933) established the principal-component eigenvalue framework used in modern ML pipelines.


12. Practical Implementation in ML Models

12.1 Typical Supervised Pipeline with PCA

  1. Split data into train/test.
  2. Fit scaler on train.
  3. Fit PCA on scaled train.
  4. Transform train/test.
  5. Train classifier/regressor on PCA features.
  6. Tune \(k\) (number of components) by validation.

12.2 Example: PCA + Logistic Regression

The official scikit-learn example "Pipelining: chaining a PCA and a logistic regression" uses Pipeline + GridSearchCV to jointly tune model regularization and PCA dimensionality.

from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("pca", PCA()),
    ("logistic", LogisticRegression(max_iter=1000))
])

param_grid = {
    "pca__n_components": [10, 20, 40, 60],
    "logistic__C": [0.01, 0.1, 1, 10]
}

search = GridSearchCV(pipe, param_grid, cv=5, n_jobs=-1)
search.fit(X_train, y_train)

12.3 Model Types Where PCA Is Commonly Used

  • Linear models: logistic/linear regression with multicollinear features.
  • Distance-based models: KNN or clustering where noisy dimensions hurt distances.
  • Margin models: SVM on high-dimensional continuous features.

12.4 Practical Quality Checks

After fitting PCA in a model pipeline, always inspect:

  • cumulative explained variance,
  • the cross-validated task metric (accuracy/F1/RMSE),
  • stability across random splits,
  • whether PCA improves generalization vs a no-PCA baseline.


13. Exam-Oriented Summary

  1. Write objective: \(\max_W W^T C W\), \(W^TW=1\)
  2. Use Lagrangian and derive \(CW=\lambda W\)
  3. State: variance on PC \(=\lambda\)
  4. State: total variance \(=\operatorname{tr}(C)=\sum\lambda_i\)
  5. Use explained variance formulas for component selection
  6. Write projection scores formula \(Z=X_cW_k\)
  7. Mention SVD relation \(X_c=U\Sigma V^T\), \(\lambda_i=\sigma_i^2/(N-1)\)
  8. Mention orthogonality and interpretability trade-off
  9. Mention that covariance matrix is square even if original data matrix is rectangular.
  10. Mention that PCA is projection-based transformation, not raw feature deletion.

14. Formula Sheet

\[ X_c = X-\bar{X} \]
\[ C = \frac{X_c^T X_c}{N-1} \]
\[ \max_W W^TCW \;\text{s.t.}\; W^TW=1 \]
\[ CW=\lambda W \]
\[ \text{Total Variance}=\operatorname{tr}(C)=\sum_j\lambda_j \]
\[ \text{PVE}_k=\frac{\lambda_k}{\sum_j\lambda_j} \]
\[ X_c=U\Sigma V^T,\quad \lambda_i=\frac{\sigma_i^2}{N-1} \]
\[ Z_k=X_cW_k \]

15. Advanced Implementation, Pseudocode, and Smart Tricks

15.1 High-Value Implementation Points

  • PCA centers input by default but does not scale features. Scale first when feature units differ significantly.
  • n_components can be:
      • an integer (a fixed number of components),
      • a float in \((0,1)\) for an explained-variance target,
      • 'mle' (dimension estimated from data, with compatible solver settings).
  • whiten=True makes transformed components unit-variance and uncorrelated; this can help some downstream models but removes relative variance scale information.
  • Solver choice matters:
      • full for exact SVD,
      • randomized for large matrices / faster approximate decomposition,
      • arpack for truncated decomposition with strict component limits.
  • For sparse high-dimensional text-like data, TruncatedSVD is often preferred over dense PCA workflows.

15.2 Practical Modeling Insights

  • Use PCA inside a Pipeline with scaler and model so train/test transforms are consistent.
  • Choose n_components via cross-validation rather than fixing arbitrarily.
  • Compare baseline model (no PCA) vs PCA model using same CV protocol.
  • For production inference: transform new raw data using the same fitted scaler + PCA object before prediction.
  • Use low-dimensional PCA projections and scree plots as quick diagnostics before selecting final models.
  • PCA can already reduce redundancy-driven overfitting risk; still validate with regularization/hyperparameter tuning on the downstream model.

15.3 Mathematics-for-ML Perspective

From a mathematical view, PCA is the orthogonal projection of data onto a lower-dimensional principal subspace that maximizes retained variance.
Equivalent viewpoints:

  • eigendecomposition of the covariance matrix,
  • SVD of the centered data matrix.

This equivalence is the bridge between linear algebra theory and practical ML implementation.

15.4 Pseudocode: PCA from Scratch (Matrix Route)

Input: X (N x D), target components k
1. Compute feature means mu (1 x D)
2. Center data: Xc = X - mu
3. Covariance: C = (Xc^T Xc) / (N - 1)
4. Eigendecompose C -> (lambda_i, w_i)
5. Sort eigenpairs by lambda_i descending
6. Keep first k vectors: Wk = [w_1 ... w_k]
7. Project: Z = Xc Wk
Output: Z, Wk, lambda_1...lambda_k, mu

15.5 Pseudocode: PCA in an ML Pipeline

Input: train data (X_train, y_train), test data X_test
1. Fit scaler on X_train
2. Transform X_train, X_test with same scaler
3. Fit PCA on scaled X_train
4. Transform scaled X_train, X_test using PCA
5. Fit model on transformed X_train
6. Evaluate on transformed X_test
7. Tune k using cross-validation
Output: tuned pipeline and evaluation metrics

15.6 Smart Technical Tricks

  • Quick component count rule: choose smallest \(k\) such that cumulative explained variance \(\ge 0.95\).
  • 2x2 exam shortcut: for covariance \(C=\begin{bmatrix}a&b\\b&d\end{bmatrix}\), eigenvalues are the roots of \(\lambda^2-(a+d)\lambda+(ad-b^2)=0\). Use this to compute PC variance quickly.
  • Redundancy detection: if one centered feature is scalar multiple of another, one eigenvalue becomes 0.
  • Numerical stability trick: use SVD route for large/ill-conditioned data.
  • Interpretability trick: inspect loadings (component coefficients) and sign/magnitude patterns before discarding original-space interpretation entirely.
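The 2x2 shortcut can be verified numerically (the entries a, b, d below are chosen arbitrarily):

```python
import numpy as np

a, b, d = 3.0, 1.0, 2.0
C = np.array([[a, b], [b, d]])

# roots of lambda^2 - (a+d) lambda + (ad - b^2) = 0, via the quadratic formula
disc = np.sqrt((a + d) ** 2 - 4 * (a * d - b * b))
roots = np.array([(a + d - disc) / 2, (a + d + disc) / 2])

print(np.allclose(roots, np.linalg.eigvalsh(C)))   # True
```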

15.7 Additional Practical Examples

Example A: Streaming / large batches with Incremental PCA

from sklearn.decomposition import IncrementalPCA

ipca = IncrementalPCA(n_components=50, batch_size=512)
for X_batch in stream_train_batches():
    ipca.partial_fit(X_batch)

X_train_pca = ipca.transform(X_train)
X_test_pca = ipca.transform(X_test)

Example B: Sparse high-dimensional features

from sklearn.decomposition import TruncatedSVD

svd = TruncatedSVD(n_components=300, random_state=42)
X_train_reduced = svd.fit_transform(X_train_sparse)
X_test_reduced = svd.transform(X_test_sparse)

15.8 What to Check Before Declaring PCA “Successful”

  1. Is train/test leakage avoided?
  2. Did validation metric improve over baseline?
  3. Is chosen \(k\) stable across folds?
  4. Is variance retained sufficient for task requirements?
  5. Is interpretability loss acceptable for the use case?

16. PCA Summary: Why, What It Solves, and Next Steps

16.1 Why We Do PCA

PCA transforms correlated, high-dimensional features into a compact orthogonal representation that preserves major variation.

16.2 What It Solves

PCA reduces:

  • multicollinearity and redundant features,
  • noise in low-variance directions,
  • computational burden in high-dimensional spaces.

Formally, it computes the best low-rank linear approximation of centered data under squared reconstruction error.

16.3 What It Improves

  • faster training,
  • better numerical conditioning,
  • often improved generalization in high-dimensional settings,
  • easier visualization (2D/3D score plots).

16.4 What It Does Not Solve

  • nonlinear manifolds (without kernel/nonlinear extensions),
  • target-aware feature selection (PCA is unsupervised),
  • interpretability of original variables.
  • all overfitting causes by itself (model capacity and data size still matter).

16.5 Next Steps After PCA

  1. Compare baseline model vs PCA model on validation/test metrics.
  2. Choose \(k\) by both explained variance and downstream metric.
  3. Inspect loadings to understand dominant feature mixtures.
  4. If linear PCA is insufficient, evaluate kernel PCA or autoencoders.
