Principal Component Analysis (PCA)¶
PCA is a linear dimensionality reduction method that transforms correlated features into orthogonal components while preserving maximum possible variance.
1. Core Definitions¶
1.1 Principal Component¶
A principal component is a direction in feature space (a unit vector) along which projected data variance is extreme (maximum for PC1, next maximum for PC2, etc.).
1.2 Variance¶
Variance measures spread around the mean. In PCA, it quantifies how much information is present along a direction.
1.3 Covariance Matrix¶
For centered data \(X_c \in \mathbb{R}^{N\times D}\), the covariance matrix is \(C = X_c^TX_c/(N-1)\):
- \(C\) is \(D \times D\), square and symmetric.
- Diagonal entries are feature variances.
- Off-diagonal entries are feature covariances.
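As a quick check, these properties of \(C\) can be verified with NumPy (the data values below are illustrative, not from the text):

```python
import numpy as np

# Toy data: 5 samples, 3 features (illustrative values)
X = np.array([[2.0, 4.0, 1.0],
              [3.0, 6.0, 0.0],
              [4.0, 8.0, 2.0],
              [5.0, 10.0, 1.0],
              [6.0, 12.0, 3.0]])

Xc = X - X.mean(axis=0)           # mean-center
C = Xc.T @ Xc / (len(X) - 1)      # D x D covariance matrix

print(C.shape)                    # (3, 3): square
print(np.allclose(C, C.T))        # True: symmetric
print(np.allclose(np.diag(C), Xc.var(axis=0, ddof=1)))  # True: diagonal = feature variances
```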
1.4 Eigenvector and Eigenvalue¶
If \(CW = \lambda W\), then:
- \(W\): eigenvector (direction)
- \(\lambda\): eigenvalue (variance captured along that direction in PCA)
1.5 Projection Score¶
For sample \(x_i\), the projection score on direction \(W\) is \(z_i = x_i^TW\).
For all samples: \(Z = X_cW\)
These scores are coordinates of samples in the new PCA axis system.
2. Geometric Interpretation¶
2.1 Geometric Meaning of Variance¶
Variance along a direction is the average squared length of orthogonal shadows (projections) of points onto that direction.
- Large variance: points are spread far apart along that axis.
- Small variance: points are tightly packed along that axis.
PCA finds a rotation of the axes such that the first rotated axis captures the maximum spread.
2.2 Geometric Meaning of Projection Scores¶
Each score \(z_i\) is a signed distance of point \(x_i\) from origin along the component direction.
- Positive score: point lies in direction of \(W\)
- Negative score: point lies opposite to \(W\)
- Magnitude \(|z_i|\): how far along that axis
So PCA is a coordinate change from original axes to orthogonal component axes.
3. Mathematical Derivation of PCA¶
Let \(W\) be a unit direction: \(W^TW = 1\)
Projection of sample \(x_i\) onto \(W\): \(z_i = x_i^TW\)
Projected variance: \(\operatorname{Var}(z) = \frac{1}{N-1}\sum_{i=1}^{N}(x_i^TW)^2 = W^TCW\)
Optimization problem: \(\max_W W^TCW\) subject to \(W^TW = 1\)
Lagrangian: \(\mathcal{L}(W,\lambda) = W^TCW - \lambda(W^TW - 1)\)
Stationarity: \(\nabla_W \mathcal{L} = 2CW - 2\lambda W = 0 \;\Rightarrow\; CW = \lambda W\)
Hence principal directions are eigenvectors of \(C\).
Also, \(W^TCW = W^T(\lambda W) = \lambda W^TW = \lambda\)
So variance along a principal direction equals its eigenvalue.
4. Total Variance and Its Relation to Eigenvalues¶
4.1 Total Variance in Original Space¶
Total variance of a centered dataset equals the sum of feature variances: \(\text{TotalVar} = \sum_{j=1}^{D}\operatorname{Var}(x_{\cdot j})\)
Using the covariance matrix: \(\text{TotalVar} = \operatorname{tr}(C)\)
where \(\operatorname{tr}(C)\) is the trace (sum of diagonal entries).
4.2 Relation to Eigenvalues¶
For symmetric \(C\), the trace equals the sum of eigenvalues: \(\operatorname{tr}(C) = \sum_{i=1}^{D}\lambda_i\)
Therefore: \(\text{TotalVar} = \sum_{i=1}^{D}\lambda_i\)
This is why the explained variance ratio is: \(\text{EVR}_i = \dfrac{\lambda_i}{\sum_{j=1}^{D}\lambda_j}\)
and cumulative explained variance for the first \(k\) components is: \(\dfrac{\sum_{i=1}^{k}\lambda_i}{\sum_{j=1}^{D}\lambda_j}\)
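These ratios are easy to compute from a sorted eigenvalue vector; the eigenvalues below are hypothetical:

```python
import numpy as np

# Hypothetical eigenvalues, sorted descending (not from the text)
lams = np.array([4.0, 2.0, 1.0, 0.5, 0.5])

evr = lams / lams.sum()    # explained variance ratio per component
cum = np.cumsum(evr)       # cumulative explained variance

print(evr)   # [0.5 0.25 0.125 0.0625 0.0625]
print(cum)   # [0.5 0.75 0.875 0.9375 1.0]
```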
5. PCA Through Eigendecomposition (Classical Route)¶
- Mean-center: \(X_c = X-\bar{X}\)
- Compute covariance: \(C=X_c^TX_c/(N-1)\)
- Solve \(CW=\lambda W\)
- Sort eigenvalues descending
- Keep top \(k\) eigenvectors \(W_k\)
- Compute scores: \(Z=X_cW_k\)
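The six steps above can be sketched in NumPy (random data assumed for illustration):

```python
import numpy as np

def pca_eig(X, k):
    """PCA via eigendecomposition of the covariance matrix."""
    mu = X.mean(axis=0)
    Xc = X - mu                                # 1) mean-center
    C = Xc.T @ Xc / (len(X) - 1)               # 2) covariance (D x D)
    lams, W = np.linalg.eigh(C)                # 3) eigendecompose (ascending order)
    order = np.argsort(lams)[::-1]             # 4) sort eigenpairs descending
    lams, W = lams[order], W[:, order]
    Wk = W[:, :k]                              # 5) keep top-k eigenvectors
    Z = Xc @ Wk                                # 6) projection scores
    return Z, Wk, lams, mu

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
Z, Wk, lams, mu = pca_eig(X, k=2)

Czz = np.cov(Z.T)                  # sample covariance of the scores
print(Z.shape)                     # (100, 2)
print(abs(Czz[0, 1]) < 1e-8)       # True: scores are decorrelated
print(np.allclose(Czz.diagonal(), lams[:2]))  # True: score variance = eigenvalue
```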
5.1 What Each Step Is Technically Doing¶
- Mean-centering removes location bias so PCA captures spread, not mean offset.
- Covariance computation converts raw coordinates into pairwise second-order structure.
- Eigendecomposition finds directions that diagonalize covariance (decorrelated axes).
- Sorting eigenpairs ranks directions by information content (variance captured).
- Truncation to \(k\) performs controlled compression.
- Projection maps original samples into compact latent coordinates.
6. PCA Through SVD¶
For centered matrix \(X_c\), the singular value decomposition is \(X_c = U\Sigma V^T\):
- Columns of \(V\): principal directions (same as covariance eigenvectors)
- Singular values: \(\sigma_i\) (diagonal of \(\Sigma\))
- PCA eigenvalues: \(\lambda_i = \sigma_i^2/(N-1)\)
- PCA scores (component coordinates): \(Z = X_cV = U\Sigma\)
So SVD gives both component directions and scores directly, often more numerically stable than forming \(C\) explicitly.
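A minimal sketch, assuming random test data, confirming that the SVD route matches the covariance route:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 4))
Xc = X - X.mean(axis=0)
N = len(Xc)

# SVD route: directions, eigenvalues, and scores without forming C
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
lams_svd = s**2 / (N - 1)          # PCA eigenvalues from singular values
Z = U * s                          # scores U @ Sigma, equal to Xc @ V

# Covariance route, for comparison
C = Xc.T @ Xc / (N - 1)
lams_eig = np.sort(np.linalg.eigvalsh(C))[::-1]

print(np.allclose(lams_svd, lams_eig))   # True
print(np.allclose(Z, Xc @ Vt.T))         # True
```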
7. Numerical Example 1 (SVD, Rank-1 Data)¶
Given centered data (the original values were lost; reconstructed here as one rank-1 dataset consistent with the eigenvalues below):
\(X_c = \begin{bmatrix}-1 & -2\\ 0 & 0\\ 1 & 2\end{bmatrix}\)
Compute:
\(X_c^TX_c = \begin{bmatrix}2 & 4\\ 4 & 8\end{bmatrix}\)
Eigenvalues of \(X_c^TX_c\): \(10, 0\)
So singular values are: \(\sigma_1 = \sqrt{10},\ \sigma_2 = 0\)
With \(N=3\), covariance eigenvalues are: \(\lambda_1 = 10/(3-1) = 5,\ \lambda_2 = 0\)
Explained variance: \(\text{EVR}_1 = 5/(5+0) = 100\%,\ \text{EVR}_2 = 0\%\)
Interpretation: one principal axis captures all variance (data lies on a line).
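This example can be verified numerically; the centered matrix below is one rank-1 dataset consistent with the stated eigenvalues \(10, 0\) (the original data values are assumed):

```python
import numpy as np

# Rank-1 centered data: second column is 2x the first (assumed example)
Xc = np.array([[-1.0, -2.0],
               [ 0.0,  0.0],
               [ 1.0,  2.0]])

G = Xc.T @ Xc
print(np.sort(np.linalg.eigvalsh(G)))        # [0, 10]
s = np.linalg.svd(Xc, compute_uv=False)
print(s**2)                                  # [10, 0]
lams = np.sort(np.linalg.eigvalsh(G))[::-1] / (3 - 1)
print(lams)                                  # [5, 0]: PC1 explains 100% of variance
```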
8. Numerical Example 2 (SVD, Isotropic Spread)¶
Take centered data (the original values were lost; reconstructed here as one isotropic example):
\(X_c = \begin{bmatrix}1 & 0\\ -1 & 0\\ 0 & 1\\ 0 & -1\end{bmatrix}\)
Here \(N=4\). Compute:
\(X_c^TX_c = \begin{bmatrix}2 & 0\\ 0 & 2\end{bmatrix}\)
Singular values: \(\sigma_1 = \sigma_2 = \sqrt{2}\)
Covariance eigenvalues: \(\lambda_1 = \lambda_2 = 2/(4-1) = 2/3\)
Explained variance: \(50\%\) per component
Interpretation: variance is equally distributed across both axes; no strong single dominant component.
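A numerical check, assuming one isotropic centered dataset of this kind:

```python
import numpy as np

# Isotropic centered data (assumed example): equal spread along both axes
Xc = np.array([[ 1.0,  0.0],
               [-1.0,  0.0],
               [ 0.0,  1.0],
               [ 0.0, -1.0]])

s = np.linalg.svd(Xc, compute_uv=False)
lams = s**2 / (len(Xc) - 1)
print(lams)                  # [2/3, 2/3]: equal eigenvalues
print(lams / lams.sum())     # [0.5, 0.5]: 50% explained variance per axis
```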
9. Projection Scores and Reconstruction¶
9.1 Scores for Top \(k\) Components¶
If \(W_k\in\mathbb{R}^{D\times k}\) contains the top \(k\) component directions, the scores are \(Z_k = X_cW_k\)
Each row of \(Z_k\) is the new coordinate of a sample in reduced space.
9.2 Geometric Interpretation¶
- PCA rotates axes to an orthogonal basis.
- Scores are coordinates in rotated space.
- Keeping top \(k\) means dropping low-variance axes.
9.3 Approximate Reconstruction¶
From reduced coordinates: \(\hat{X}_c = Z_kW_k^T\)
Add mean back: \(\hat{X} = Z_kW_k^T + \bar{X}\)
Reconstruction error comes from discarded components.
9.4 Reconstruction Error and Discarded Eigenvalues¶
If all \(D\) components are used, reconstruction is exact (for centered data).
If only top \(k\) components are kept, the minimum mean-squared reconstruction error equals the discarded variance: \(\text{MSE}_k = \sum_{i=k+1}^{D}\lambda_i\)
So eigenvalues directly quantify information loss from dimensionality reduction.
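This identity can be verified numerically (random correlated data assumed; the error uses the same \(1/(N-1)\) scaling as the covariance):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 6)) @ rng.normal(size=(6, 6))  # correlated features
Xc = X - X.mean(axis=0)
N, D = Xc.shape

U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
lams = s**2 / (N - 1)              # eigenvalues, sorted descending

k = 3
Wk = Vt[:k].T                      # top-k directions, D x k
Xc_hat = (Xc @ Wk) @ Wk.T          # project to k dims, then reconstruct

mse = ((Xc - Xc_hat) ** 2).sum() / (N - 1)
print(np.isclose(mse, lams[k:].sum()))   # True: error = sum of discarded eigenvalues
```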
10. Practical Notes¶
10.1 PCA vs Pairwise Correlation Removal¶
| Aspect | PCA | Pairwise Correlation Removal |
|---|---|---|
| Method | Global orthogonal transform | Drops one feature in correlated pairs |
| Correlation handling | Across all features jointly | Pairwise only |
| Interpretability | Lower | Higher |
| Variance control | Explicit via eigenvalues | Not explicit |
10.2 When PCA Helps¶
- Strong multicollinearity
- High-dimensional inputs
- Need compact, decorrelated representation
10.3 When PCA Helps Less¶
- Eigenvalues are nearly equal
- Interpretability is critical
10.4 Important Property¶
PCA is unsupervised: it uses only input \(X\), not target \(Y\).
10.4A Clarification: PCA Transforms Features, It Does Not Pick Original Columns¶
After PCA, modeling is done on transformed components \((Z_1, Z_2, \dots)\), not directly on original features \((X_1, X_2, \dots)\).
This is a key conceptual point:
- PCA is projection to a new basis,
- it is not simple column dropping.
10.4B Interpretation Trade-off in Business Terms¶
Without PCA, you can directly explain outcomes using original variables (for example, sales vs TV spend).
After PCA, model inputs become \(Z_1, Z_2, \dots\), which are mixtures of original variables.
Accuracy may improve, but direct business interpretability can decrease.
10.5 Standardization Before PCA (Important in Practice)¶
If features are on different scales (for example, age vs income), high-scale features can dominate covariance.
In such cases, standardize each feature first: \(x' = (x - \mu)/\sigma\)
Then run PCA on standardized data (equivalent to PCA on correlation structure).
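A minimal sketch with scikit-learn, using hypothetical age/income data, showing why scaling matters:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Features on very different scales (hypothetical): age vs income
rng = np.random.default_rng(3)
age = rng.normal(40, 10, size=(500, 1))
income = rng.normal(60_000, 15_000, size=(500, 1))
X = np.hstack([age, income])

# Without scaling, income's large variance dominates the first component
raw = PCA().fit(X)
print(raw.explained_variance_ratio_)   # first PC captures nearly everything

# After standardization, both features contribute on equal footing
Xs = StandardScaler().fit_transform(X)
std = PCA().fit(Xs)
print(std.explained_variance_ratio_)   # roughly [0.5, 0.5] for independent features
```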
10.6 Limitations and Failure Modes¶
- Outlier sensitivity: variance is second-moment based, so outliers can rotate components.
- Linear assumption: PCA captures linear structure; nonlinear manifolds may require kernel PCA or manifold methods.
- No target awareness: high-variance directions are not always most predictive for \(Y\).
- Interpretability loss: components are mixtures of original variables.
10.7 PCA in a Modeling Pipeline¶
Typical supervised pipeline:
1. Split train/test first.
2. Fit scaler on train only.
3. Fit PCA on train only.
4. Transform train and test using same fitted objects.
5. Train downstream model on PCA features.
This avoids data leakage and preserves fair evaluation.
10.8 Choosing Number of Components \(k\)¶
Use one of these practical rules:
- Explained variance threshold: smallest \(k\) with cumulative variance \(\ge 95\%\) (or \(99\%\)).
- Elbow method: pick \(k\) at scree-plot bend.
- Task-validated \(k\): tune \(k\) by downstream validation metric.
Best practice is to combine explained variance with validation performance.
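In scikit-learn, the explained-variance-threshold rule is built in: passing a float as `n_components` keeps the smallest \(k\) meeting that cumulative threshold (random data assumed):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(4)
X = rng.normal(size=(300, 20)) @ rng.normal(size=(20, 20))  # correlated features

# Float n_components: keep smallest k with cumulative explained variance >= 0.95
pca = PCA(n_components=0.95).fit(X)
print(pca.n_components_)                              # chosen k
print(pca.explained_variance_ratio_.sum() >= 0.95)    # True
```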
10.9 When PCA Reduction Will Be Weak¶
If eigenvalues are close to each other (no clear dominance), variance is distributed across many directions.
In this case, aggressive reduction can remove useful information and PCA may not reduce dimensions meaningfully.
10.10 Why Covariance Route Works Even When \(X\) Is Not Square¶
Original data matrix \(X\) is usually \(N \times D\) with \(N \gg D\), so \(X\) is not square.
PCA uses the covariance matrix: \(C = X_c^TX_c/(N-1)\)
which is always \(D \times D\), square and symmetric, so eigendecomposition is valid.
10.11 Correlation Filtering vs PCA (Practical Decision)¶
Pairwise correlation filtering:
- removes one variable at a time,
- is manual and combinatorial with many features,
- preserves original feature names.
PCA:
- handles all features jointly in one global transformation,
- yields uncorrelated components by construction,
- can reduce dimensions more systematically.
Use correlation filtering when interpretability dominates; use PCA when compactness/decorrelation dominates.
10.12 Apply PCA on Full Feature Set, Not Only “Known Correlated” Subset¶
If the goal is global decorrelation and optimal variance capture, run PCA on the full designed feature set.
Selective pre-subsetting can miss cross-feature structure.
11. Real-World Examples from Research¶
11.1 Face Recognition (Eigenfaces)¶
In Turk and Pentland (1991), faces are projected onto a PCA subspace ("eigenfaces").
Core idea:
- high-dimensional image pixels are compressed to a small set of principal components,
- recognition is done in the compressed score space,
- this reduces computation while preserving discriminative variation.
11.2 Genomics / Gene Expression Analysis¶
Ringnér (Nature Biotechnology, 2008) explains PCA in genome-wide expression studies where thousands of genes are measured per sample.
Core value:
- reveals dominant biological variation,
- helps detect structure (subtypes, batch effects, outliers),
- provides low-dimensional representations for downstream modeling.
11.3 Historical Foundation¶
PCA formalization in Hotelling (1933) established the principal-component eigenvalue framework used in modern ML pipelines.
12. Practical Implementation in ML Models¶
12.1 Typical Supervised Pipeline with PCA¶
- Split data into train/test.
- Fit scaler on train.
- Fit PCA on scaled train.
- Transform train/test.
- Train classifier/regressor on PCA features.
- Tune \(k\) (number of components) by validation.
12.2 Example: PCA + Logistic Regression¶
The official scikit-learn example "Pipelining: chaining a PCA and a logistic regression" uses Pipeline + GridSearchCV to jointly tune model regularization and PCA dimensionality.
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("pca", PCA()),
    ("logistic", LogisticRegression(max_iter=1000)),
])

param_grid = {
    "pca__n_components": [10, 20, 40, 60],
    "logistic__C": [0.01, 0.1, 1, 10],
}

# X_train, y_train assumed defined upstream
search = GridSearchCV(pipe, param_grid, cv=5, n_jobs=-1)
search.fit(X_train, y_train)
12.3 Model Types Where PCA Is Commonly Used¶
- Linear models: logistic/linear regression with multicollinear features.
- Distance-based models: KNN or clustering where noisy dimensions hurt distances.
- Margin models: SVM on high-dimensional continuous features.
12.4 Practical Quality Checks¶
After fitting PCA in a model pipeline, always inspect:
- cumulative explained variance,
- cross-validated task metric (accuracy/F1/RMSE),
- stability across random splits,
- whether PCA improves generalization vs no-PCA baseline.
13. Exam-Oriented Summary¶
- Write objective: \(\max_W W^T C W\), \(W^TW=1\)
- Use Lagrangian and derive \(CW=\lambda W\)
- State: variance on PC \(=\lambda\)
- State: total variance \(=\operatorname{tr}(C)=\sum\lambda_i\)
- Use explained variance formulas for component selection
- Write projection scores formula \(Z=X_cW_k\)
- Mention SVD relation \(X_c=U\Sigma V^T\), \(\lambda_i=\sigma_i^2/(N-1)\)
- Mention orthogonality and interpretability trade-off
- Mention that covariance matrix is square even if original data matrix is rectangular.
- Mention that PCA is projection-based transformation, not raw feature deletion.
14. Formula Sheet¶
- Centering: \(X_c = X - \bar{X}\)
- Covariance: \(C = X_c^TX_c/(N-1)\)
- Eigenproblem: \(CW = \lambda W\)
- Objective: \(\max_W W^TCW\) subject to \(W^TW = 1\)
- Total variance: \(\operatorname{tr}(C) = \sum_i \lambda_i\)
- Explained variance ratio: \(\lambda_i / \sum_j \lambda_j\)
- Scores: \(Z = X_cW_k\)
- Reconstruction: \(\hat{X} = Z_kW_k^T + \bar{X}\)
- SVD: \(X_c = U\Sigma V^T\), \(\lambda_i = \sigma_i^2/(N-1)\), \(Z = U\Sigma\)
- Reconstruction error (top \(k\)): \(\sum_{i=k+1}^{D}\lambda_i\)
15. Advanced Implementation, Pseudocode, and Smart Tricks¶
15.1 High-Value Implementation Points¶
- `PCA` centers input by default but does not scale features. Scale first when feature units differ significantly.
- `n_components` can be:
  - an integer (fixed number of components),
  - a float in \((0,1)\) for an explained-variance target,
  - `'mle'` (dimension estimated from data, with compatible solver settings).
- `whiten=True` makes transformed components unit-variance and uncorrelated; this can help some downstream models but removes relative variance scale information.
- Solver choice matters:
  - `full` for exact SVD,
  - `randomized` for large matrices / faster approximate decomposition,
  - `arpack` for truncated decomposition with strict component limits.
- For sparse high-dimensional text-like data, `TruncatedSVD` is often preferred over dense PCA workflows.
15.2 Practical Modeling Insights¶
- Use PCA inside a `Pipeline` with scaler and model so train/test transforms are consistent.
- Choose `n_components` via cross-validation rather than fixing it arbitrarily.
- Compare a baseline model (no PCA) vs the PCA model using the same CV protocol.
- For production inference: transform new raw data using the same fitted scaler + PCA object before prediction.
- Use low-dimensional PCA projections and scree plots as quick diagnostics before selecting final models.
- PCA can already reduce redundancy-driven overfitting risk; still validate with regularization/hyperparameter tuning on the downstream model.
15.3 Mathematics-for-ML Perspective¶
From a mathematical view, PCA is the orthogonal projection of data onto a lower-dimensional principal subspace that maximizes retained variance.
Equivalent viewpoints:
- eigendecomposition of covariance matrix,
- SVD of centered data matrix.
This equivalence is the bridge between linear algebra theory and practical ML implementation.
15.4 Pseudocode: PCA from Scratch (Matrix Route)¶
Input: X (N x D), target components k
1. Compute feature means mu (1 x D)
2. Center data: Xc = X - mu
3. Covariance: C = (Xc^T Xc) / (N - 1)
4. Eigendecompose C -> (lambda_i, w_i)
5. Sort eigenpairs by lambda_i descending
6. Keep first k vectors: Wk = [w_1 ... w_k]
7. Project: Z = Xc Wk
Output: Z, Wk, lambda_1...lambda_k, mu
15.5 Pseudocode: PCA in an ML Pipeline¶
Input: train data (X_train, y_train), test data X_test
1. Fit scaler on X_train
2. Transform X_train, X_test with same scaler
3. Fit PCA on scaled X_train
4. Transform scaled X_train, X_test using PCA
5. Fit model on transformed X_train
6. Evaluate on transformed X_test
7. Tune k using cross-validation
Output: tuned pipeline and evaluation metrics
15.6 Smart Technical Tricks¶
- Quick component count rule: choose smallest \(k\) such that cumulative explained variance \(\ge 0.95\).
- 2x2 exam shortcut: for covariance \(C=\begin{bmatrix}a&b\\b&d\end{bmatrix}\), eigenvalues are roots of \(\lambda^2-(a+d)\lambda+(ad-b^2)=0\). Use this to compute PC variance quickly.
- Redundancy detection: if one centered feature is scalar multiple of another, one eigenvalue becomes 0.
- Numerical stability trick: use SVD route for large/ill-conditioned data.
- Interpretability trick: inspect loadings (component coefficients) and sign/magnitude patterns before discarding original-space interpretation entirely.
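The 2x2 shortcut above can be checked against a numerical eigensolver (the coefficient values below are hypothetical):

```python
import numpy as np

# 2x2 covariance matrix (hypothetical values)
a, b, d = 3.0, 1.0, 2.0
C = np.array([[a, b], [b, d]])

# Roots of lambda^2 - (a+d)*lambda + (a*d - b^2) = 0 via the quadratic formula
disc = np.sqrt((a + d) ** 2 - 4 * (a * d - b ** 2))
roots = np.array([(a + d + disc) / 2, (a + d - disc) / 2])

print(np.allclose(np.sort(roots), np.sort(np.linalg.eigvalsh(C))))  # True
```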
15.7 Additional Practical Examples¶
Example A: Streaming / large batches with Incremental PCA
from sklearn.decomposition import IncrementalPCA

ipca = IncrementalPCA(n_components=50, batch_size=512)
for X_batch in stream_train_batches():
    ipca.partial_fit(X_batch)

X_train_pca = ipca.transform(X_train)
X_test_pca = ipca.transform(X_test)
Example B: Sparse high-dimensional features
from sklearn.decomposition import TruncatedSVD
svd = TruncatedSVD(n_components=300, random_state=42)
X_train_reduced = svd.fit_transform(X_train_sparse)
X_test_reduced = svd.transform(X_test_sparse)
15.8 What to Check Before Declaring PCA “Successful”¶
- Is train/test leakage avoided?
- Did validation metric improve over baseline?
- Is chosen \(k\) stable across folds?
- Is variance retained sufficient for task requirements?
- Is interpretability loss acceptable for the use case?
16. PCA Summary: Why, What It Solves, and Next Steps¶
16.1 Why We Do PCA¶
PCA transforms correlated, high-dimensional features into a compact orthogonal representation that preserves major variation.
16.2 What It Solves¶
PCA reduces:
- multicollinearity and redundant features,
- noise in low-variance directions,
- computational burden in high-dimensional spaces.
Formally, it computes the best low-rank linear approximation of centered data under squared reconstruction error.
16.3 What It Improves¶
- faster training,
- better numerical conditioning,
- often improved generalization in high-dimensional settings,
- easier visualization (2D/3D score plots).
16.4 What It Does Not Solve¶
- nonlinear manifolds (without kernel/nonlinear extensions),
- target-aware feature selection (PCA is unsupervised),
- interpretability of original variables,
- all overfitting causes by itself (model capacity and data size still matter).
16.5 Next Steps After PCA¶
- Compare baseline model vs PCA model on validation/test metrics.
- Choose \(k\) by both explained variance and downstream metric.
- Inspect loadings to understand dominant feature mixtures.
- If linear PCA is insufficient, evaluate kernel PCA or autoencoders.
17. References for Added Practical Points¶
- scikit-learn PCA API
- scikit-learn decomposition module
- scikit-learn example: Pipeline + PCA + Logistic Regression
- MachineLearningMastery: PCA for Dimensionality Reduction in Python
- MachineLearningMastery: PCA for Visualization
- MachineLearningMastery: PCA from Scratch in Python
- Mathematics for Machine Learning (book companion)