
1. Linear Regression

Hypothesis

(Here n is the number of features and m the number of training examples.)

\[ h_\theta(x) = \theta_0 + \theta_1 x_1 + \cdots + \theta_n x_n \]

Cost Function

\[ J(\theta)=\frac{1}{2m}\sum_{i=1}^{m}(h_\theta(x^{(i)}) - y^{(i)})^2 \]

Gradient Descent Update

\[ \theta_j := \theta_j - \alpha \cdot \frac{1}{m}\sum_{i=1}^{m}(h_\theta(x^{(i)}) - y^{(i)})x_j^{(i)} \]
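A minimal NumPy sketch of this update (illustrative only: `X` is assumed to already contain a leading column of ones for θ₀, and `alpha`/`num_iters` are example values):

```python
import numpy as np

def gradient_descent(X, y, alpha=0.01, num_iters=1000):
    """Batch gradient descent for linear regression.
    X: (m, n+1) design matrix with a leading column of ones.
    y: (m,) targets."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(num_iters):
        errors = X @ theta - y        # h_theta(x) - y for every example
        grad = (X.T @ errors) / m     # (1/m) * sum(error * x_j) for each j
        theta -= alpha * grad         # simultaneous update of all theta_j
    return theta
```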

2. Ridge Regression (L2 Regularization)

Cost Function

\[ J(\theta)=\frac{1}{2m}\sum_{i=1}^{m}(h_\theta(x^{(i)}) - y^{(i)})^2 + \frac{\lambda}{2m}\sum_{j=1}^{n}\theta_j^2 \]

Update Rule

\[ \theta_j := \theta_j \left(1 - \alpha \frac{\lambda}{m}\right) - \alpha\cdot\frac{1}{m}\sum_{i=1}^{m}(h_\theta(x^{(i)}) - y^{(i)})x_j^{(i)} \]
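The same sketch adapted for one ridge step (illustrative; `lam` is an example λ, and the intercept `theta[0]` is conventionally left unregularized):

```python
import numpy as np

def ridge_gd_step(theta, X, y, alpha=0.01, lam=1.0):
    """One ridge (L2) gradient step; theta[0] (bias) is not regularized."""
    m = X.shape[0]
    errors = X @ theta - y
    grad = (X.T @ errors) / m
    reg = (lam / m) * theta
    reg[0] = 0.0                      # do not penalize the intercept
    return theta - alpha * (grad + reg)
```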

3. Lasso Regression (L1 Regularization)

Cost Function

\[ J(\theta)=\frac{1}{2m}\sum_{i=1}^{m}(h_\theta(x^{(i)}) - y^{(i)})^2 + \frac{\lambda}{m}\sum_{j=1}^{n} |\theta_j| \]

Update Rule

\[ \theta_j := \theta_j - \alpha \left( \frac{1}{m}\sum_{i=1}^{m}(h_\theta(x^{(i)}) - y^{(i)})x_j^{(i)} + \frac{\lambda}{m}\,\text{sign}(\theta_j) \right) \]
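A hedged sketch of this literal subgradient update (in practice lasso is usually solved with coordinate descent or proximal/soft-thresholding steps; `lam` and `alpha` are illustrative):

```python
import numpy as np

def lasso_gd_step(theta, X, y, alpha=0.01, lam=1.0):
    """One (sub)gradient step for lasso; theta[0] (bias) is not regularized."""
    m = X.shape[0]
    grad = (X.T @ (X @ theta - y)) / m
    reg = (lam / m) * np.sign(theta)  # subgradient of |theta_j|
    reg[0] = 0.0
    return theta - alpha * (grad + reg)
```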

4. Elastic Net (L1 + L2)

Cost Function

\[ J(\theta)=\frac{1}{2m}\sum_{i=1}^{m}(h_\theta(x^{(i)}) - y^{(i)})^2 + \frac{\lambda_1}{m}\sum_{j=1}^{n} |\theta_j| + \frac{\lambda_2}{2m}\sum_{j=1}^{n}\theta_j^2 \]

Update Rule

\[ \theta_j := \theta_j\left(1 - \alpha\frac{\lambda_2}{m}\right) - \alpha\left( \frac{1}{m}\sum_{i=1}^{m}(h_\theta(x^{(i)}) - y^{(i)})x_j^{(i)} + \frac{\lambda_1}{m}\,\text{sign}(\theta_j) \right) \]
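A sketch of one elastic-net step combining both penalty gradients (`lam1`, `lam2`, and `alpha` are illustrative values):

```python
import numpy as np

def elastic_net_gd_step(theta, X, y, alpha=0.01, lam1=0.5, lam2=0.5):
    """One (sub)gradient step for elastic net; theta[0] (bias) is not regularized."""
    m = X.shape[0]
    grad = (X.T @ (X @ theta - y)) / m
    reg = (lam1 / m) * np.sign(theta) + (lam2 / m) * theta  # L1 + L2 terms
    reg[0] = 0.0
    return theta - alpha * (grad + reg)
```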

5. Logistic Regression

Hypothesis

\[ h_\theta(x) = \sigma(\theta^T x) \]

Cost Function

\[ J(\theta) = -\frac{1}{m}\sum_{i=1}^{m} \left[ y^{(i)}\log(h_\theta(x^{(i)})) + (1-y^{(i)})\log(1-h_\theta(x^{(i)})) \right] \]

Gradient Descent Update

\[ \theta_j := \theta_j - \alpha \cdot \frac{1}{m}\sum_{i=1}^{m}(h_\theta(x^{(i)}) - y^{(i)})x_j^{(i)} \]
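A minimal NumPy sketch of this update (assumes `X` includes a bias column and `y` ∈ {0, 1}; `alpha` and `num_iters` are example values):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_gd(X, y, alpha=0.1, num_iters=1000):
    """Batch gradient descent for logistic regression."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(num_iters):
        h = sigmoid(X @ theta)                 # h_theta(x) for every example
        theta -= alpha * (X.T @ (h - y)) / m   # same gradient form as linear regression
    return theta
```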

6. Regularized Logistic Regression

Add L1 and L2 terms exactly like linear regression:

  • L2 (Ridge): $$ \frac{\lambda}{2m}\sum_{j=1}^{n}\theta_j^2 $$

  • L1 (Lasso): $$ \frac{\lambda}{m}\sum_{j=1}^{n}|\theta_j| $$

============================================================

LINEAR & LOGISTIC REGRESSION – EXAM Q&A KIT (BITS STYLE)

============================================================

This file contains:
- Conceptual Q&A
- Ridge, Lasso, Elastic Net
- Bias–Variance
- GD vs Normal Equation
- Mini-Batch / SGD / Batch
- Logistic Regression gradients
- Ready-to-use exam answers

------------------------------------------------------------

Q1. GD vs Normal Equation for Large m

------------------------------------------------------------

Question:
Why is gradient descent preferred over the normal equation when the number of features m is extremely large (e.g., m = 300,000)?

Answer:
The normal equation computes:

θ = (XᵀX)⁻¹ Xᵀy

This requires:
- Forming XᵀX → O(n m²)
- Inverting an m×m matrix → O(m³)
- Memory to store XᵀX → m² entries → huge

For m = 300,000:
- m² = 9×10¹⁰ entries ≈ 720 GB of memory (at 8 bytes per entry)
- m³ operations are infeasible to compute

Gradient Descent:
- Complexity O(nm) per iteration
- Can use mini-batch GD
- Scales to millions of features
- No matrix inversion required

GD is preferred.
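A quick sanity check of the 720 GB figure (assuming 8-byte float64 entries, an assumption rather than something stated in the question):

```python
m = 300_000                      # number of features, using this question's convention
entries = m * m                  # X^T X is m x m -> 9e10 entries
bytes_needed = entries * 8       # float64 -> 8 bytes per entry
print(bytes_needed / 1e9, "GB")  # -> 720.0 GB
```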

------------------------------------------------------------

Q2. Training error ≈ Validation error and both high

------------------------------------------------------------

Question:
If training error and validation error are both high and nearly equal, does the model have high bias or high variance? What should you do with λ in Ridge Regression?

Answer:
- Training error high → the model is underfitting
- Validation ≈ training error → the gap is small, so variance is not the issue

Diagnosis: High Bias

Cause in Ridge: λ too large → weights overly penalized.

Fix: Reduce λ
This allows the model to fit better and reduces bias.

------------------------------------------------------------

Q3. Gradient Descent Update for Multivariate Linear Regression

------------------------------------------------------------

Given:
hθ(x) = θᵀx
J(θ) = (1/2m) Σ (hθ(x^(i)) − y^(i))²

Derivative:
∂J/∂θj = (1/m) Σ (hθ(x^(i)) − y^(i)) x_j^(i)

Gradient Descent update:
θj := θj − α · (1/m) Σ (hθ(x^(i)) − y^(i)) x_j^(i)

------------------------------------------------------------

Q4. Why Feature Scaling Helps GD

------------------------------------------------------------

When features have very different ranges, contours of J(θ) become elongated. GD oscillates along steep directions → slow convergence.

Scaling features to similar ranges (0–1 normalization or standardization):
- Makes the contours of J(θ) more circular
- Lets one learning rate work well across all parameter directions
- Gives faster convergence
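A small standardization sketch (illustrative helper, not part of the notes; remember to reuse the same mu/sigma on new data at prediction time):

```python
import numpy as np

def standardize(X):
    """Scale each feature to zero mean and unit variance before running GD."""
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    sigma[sigma == 0] = 1.0        # guard against constant features
    return (X - mu) / sigma, mu, sigma
```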

------------------------------------------------------------

Q5. Ridge vs Lasso – Compare

------------------------------------------------------------

| Aspect              | Ridge (L2)                 | Lasso (L1)                     |
|---------------------|----------------------------|--------------------------------|
| Penalty             | λ Σ θj²                    | λ Σ \|θj\|                     |
| Effect on weights   | Shrinks all                | Sets many to zero (sparse)     |
| Feature selection   | No                         | Yes                            |
| Correlated features | Stable                     | Unstable with correlated vars  |
| Use case            | Smooth, small coefficients | Many features, need sparsity   |

------------------------------------------------------------

Q6. Ridge Regression GD Update (Derivation)

------------------------------------------------------------

Cost: J(θ) = (1/2m) Σ e^(i)² + (λ/2m) Σ θj², where e^(i) = hθ(x^(i)) − y^(i)

Derivative: ∂J/∂θj = (1/m) Σ e^(i) x_j^(i) + (λ/m) θj

Update: θj := θj(1 − αλ/m) − α(1/m) Σ e^(i) x_j^(i)

Bias θ₀ is not regularized.

------------------------------------------------------------

Q7. Why Lasso Sets Weights to Zero but Ridge Doesn’t

------------------------------------------------------------

Lasso penalty = |θj|

Derivative (subgradient):

d|θj|/dθj =
  +1                   if θj > 0
  −1                   if θj < 0
  anything in [−1, 1]  if θj = 0

This creates a constant shrinking force toward zero, driving many weights exactly to zero.

Ridge penalty θ² is smooth → shrinks but never exactly zero.
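This zeroing behaviour is usually implemented with the soft-thresholding (proximal) operator; a small illustrative sketch, not taken from the notes:

```python
import numpy as np

def soft_threshold(theta, t):
    """Proximal step for the L1 penalty: shrink each coefficient by t and clip at zero.
    Coefficients whose magnitude is below t land exactly at zero."""
    return np.sign(theta) * np.maximum(np.abs(theta) - t, 0.0)

# Example: small coefficients are zeroed, larger ones shrink toward zero.
print(soft_threshold(np.array([0.3, -0.05, 1.2]), 0.1))  # [ 0.2 -0.   1.1]
```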

------------------------------------------------------------

Q8. Logistic Regression: Cost Function

------------------------------------------------------------

Hypothesis:

hθ(x) = σ(θᵀx), where σ(z) = 1 / (1 + e^(−z))

Cost: J(θ) = −(1/m) Σ [ y log hθ + (1−y) log(1−hθ) ]

Why not MSE?
- MSE with a sigmoid hypothesis gives a non-convex cost → GD can converge poorly
- Log-loss gives a convex optimization problem with a single global minimum
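A small log-loss helper matching the cost above (the `eps` clipping is a practical guard against log(0), an implementation detail not in the notes):

```python
import numpy as np

def log_loss(theta, X, y, eps=1e-12):
    """Cross-entropy cost for logistic regression (y in {0, 1})."""
    h = 1.0 / (1.0 + np.exp(-(X @ theta)))
    h = np.clip(h, eps, 1 - eps)          # avoid log(0)
    return -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))
```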

------------------------------------------------------------

Q9. Regularized Logistic Regression GD Update

------------------------------------------------------------

Cost with L2: J(θ) = log-loss + (λ/2m) Σ θj²

Gradient: ∂J/∂θj = (1/m) Σ (hθ − y)xj + (λ/m)θj

Update: θj := θj − α[ (1/m) Σ (hθ − y)xj + (λ/m)θj ]
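A sketch of one such regularized step (illustrative; the bias term `theta[0]` is not penalized):

```python
import numpy as np

def reg_logistic_gd_step(theta, X, y, alpha=0.1, lam=1.0):
    """One L2-regularized gradient step for logistic regression."""
    m = X.shape[0]
    h = 1.0 / (1.0 + np.exp(-(X @ theta)))
    grad = (X.T @ (h - y)) / m
    grad[1:] += (lam / m) * theta[1:]     # regularize every theta_j except theta_0
    return theta - alpha * grad
```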

------------------------------------------------------------

Q10. Validation error increases after increasing λ

------------------------------------------------------------

If validation error increases (and training error also rises) after increasing λ:

  • λ shrinks weights too much
  • Model becomes overly simple
    ➡ Underfitting / High Bias

Fix: Reduce λ

------------------------------------------------------------

Q11. Training error low but validation error high

------------------------------------------------------------

This is high variance (overfitting).

Fix:
- Increase λ
- Reduce number of features
- Early stopping
- More training data
- Use Lasso for sparsity

------------------------------------------------------------

Q12. Compare Batch GD, Mini-Batch GD, and SGD

------------------------------------------------------------

Batch GD
- Uses all m examples
- Stable but slow
- O(nm) per iteration

SGD
- Updates per example
- Very fast
- Noisy updates

Mini-Batch (best)
- Batch sizes like 32/64/128
- Stable + fast
- Best for large datasets
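An illustrative mini-batch loop for linear regression (batch size, learning rate, and number of epochs are example values):

```python
import numpy as np

def minibatch_gd(X, y, alpha=0.01, batch_size=64, epochs=10, seed=0):
    """Mini-batch gradient descent for linear regression (sketch)."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(epochs):
        idx = rng.permutation(m)                      # shuffle once per epoch
        for start in range(0, m, batch_size):
            batch = idx[start:start + batch_size]
            Xb, yb = X[batch], y[batch]
            grad = Xb.T @ (Xb @ theta - yb) / len(batch)
            theta -= alpha * grad
    return theta
```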

------------------------------------------------------------

Q13. For dataset with 2M samples and 300k features, which GD method?

------------------------------------------------------------

Use Mini-Batch Gradient Descent.

Reasons:
- Scales to large n and m
- Fits into GPU/CPU memory
- Faster convergence
- Works with regularization

------------------------------------------------------------

Q14. Derive Logistic Regression Gradient (Unregularized)

------------------------------------------------------------

Given: hθ(x) = σ(θᵀx), where σ(z)=1/(1+e^(−z))

Cost: J(θ) = −(1/m) Σ [ y log(hθ) + (1−y)log(1−hθ) ]

Derivative: ∂J/∂θj = (1/m) Σ (hθ(x^(i)) − y^(i)) x_j^(i)

GD update: θj := θj − α(1/m) Σ (hθ(x^(i)) − y^(i)) x_j^(i)

------------------------------------------------------------

Q15. Elastic Net Regression

------------------------------------------------------------

Penalty:
λ1 Σ |θj|  (L1)
λ2 Σ θj²  (L2)

Combines Lasso + Ridge:
- Encourages sparsity but with controlled coefficient shrinkage
- Stable even with correlated features
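A hedged scikit-learn sketch (assumes scikit-learn is available; the synthetic data, `alpha`, and `l1_ratio` values are illustrative, not from the notes):

```python
import numpy as np
from sklearn.linear_model import ElasticNet

# Tiny synthetic example: only the first feature actually matters.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
y = X[:, 0] * 3.0 + rng.normal(scale=0.1, size=100)

# l1_ratio mixes the penalties: 1.0 -> pure Lasso, 0.0 -> pure Ridge.
model = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)
print(model.coef_)   # first coefficient large, most others at or near zero
```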

------------------------------------------------------------

Q16. Bias-Variance Summary (Must Memorize)

------------------------------------------------------------

High Bias (Underfitting)
- Training error high
- Validation ≈ Training

Fix:
- Reduce λ
- Add features
- More complex model

High Variance (Overfitting)
- Training error low
- Validation error very high

Fix:
- Increase λ
- Reduce model complexity
- Lasso (feature selection)
- More data

------------------------------------------------------------

END OF MARKDOWN EXAM KIT

------------------------------------------------------------

Q1. For the table in the question, which λ would you choose and why?

A1. Likely λ = 0.001:
- It has the highest CV accuracy (0.95).
- Although it uses many features, the primary goal is classification performance.
- Depending on deployment constraints (model size), you might trade a bit of accuracy for sparsity (e.g., λ = 0.01).


Q2. Suppose for λ = 0.001, training accuracy = 0.99 and CV accuracy = 0.95. What does this indicate?

A2. Training much higher than CV → some overfitting (high variance). Regularization is relatively weak; you might slightly increase λ to reduce variance without losing much accuracy.


Q3. Suppose for λ = 10, training accuracy = 0.76 and CV accuracy = 0.75. What then?

A3. Training ≈ CV and both are low → high bias, strong underfitting. λ is too large; reduce λ to allow more features and better fit.


Q4. In text classification with 100,000 bag-of-words features and 5,000 samples, why is regularization especially important?

A4. Because:
- m >> n (many more features than samples).
- High risk of overfitting.
- Regularization (especially L1) helps by selecting a small subset of informative words.


Q5. Why is L1 often preferred over L2 in high-dimensional sparse problems like bag-of-words?

A5. L1:
- Produces sparse models (few non-zero weights).
- Performs implicit feature selection.
- Leads to smaller, faster models and can improve generalization when many features are irrelevant.
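A hedged bag-of-words sketch with L1-regularized logistic regression (assumes scikit-learn; the tiny corpus, `C`, and solver choice are illustrative):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

docs = ["cheap pills now", "meeting at noon", "win cheap prize", "lunch at noon"]
labels = [1, 0, 1, 0]

X = CountVectorizer().fit_transform(docs)          # sparse bag-of-words matrix
clf = LogisticRegression(penalty="l1", solver="liblinear", C=1.0).fit(X, labels)
print((clf.coef_ != 0).sum(), "non-zero weights")  # L1 keeps only a few informative words
```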

(a) How does λ control the bias–variance trade-off?

Regularization strength λ controls how much large weights are penalized.

Small λ (e.g., 0.001)

  • Weak regularization → flexible model
  • Low bias, but high variance
  • Uses many features (e.g., 95k) → complex model

Large λ (e.g., 10)

  • Strong regularization → many weights shrunk or zeroed
  • High bias, low variance
  • Very few features used (e.g., 2k) → overly simple model, may underfit

Conclusion:
- Increasing λ → increases bias, decreases variance
- Decreasing λ → decreases bias, increases variance

In the table, as λ increases, accuracy falls → bias increases and the model underfits.


(b) What type of regularization was likely used? Justify.

The “Selected Features” column shows:

  • λ = 0.001 → ~95,000 features
  • λ = 10 → ~2,000 features

A huge drop in feature count means many coefficients were driven exactly to zero as λ increased.
That behavior is characteristic of L1 regularization (Lasso):

  • L1 → produces sparse solutions (feature selection)
  • L2 → shrinks coefficients but rarely makes them exactly zero

Answer:
Likely L1 regularization (Lasso), because increasing λ drastically reduces the number of selected features, indicating many coefficients become exactly zero.


(c) Can regularization also cause underfitting? Justify with this example.

Yes.

When λ is very large (e.g., 10):

  • CV accuracy drops sharply (e.g., from 0.95 → 0.75)
  • Only ~2,000 out of 100,000 features are used
  • The model becomes too simple for the task

This is classic underfitting due to over-regularization:

  • High λ → too much penalty → coefficients overly shrunk
  • Model loses expressive power → high bias

Thus, regularization can cause underfitting when λ is too large.


(d) How do L1 and L2 handle highly correlated features differently?

L1 Regularization (Lasso)

  • Tends to pick one of the correlated features
  • Sets the others’ coefficients to zero
  • Produces a sparse model
  • Selection can be unstable (depends on data/noise)

L2 Regularization (Ridge)

  • Distributes weights across the correlated features
  • Keeps all features but with smaller coefficients
  • More stable when multicollinearity is present

Summary:
- L1 → chooses one correlated feature, drops the rest
- L2 → keeps all correlated features with shared reduced weights