
1. Linear Regression

Hypothesis

(Here n is the number of features and m the number of training examples.)

\[ h_\theta(x) = \theta_0 + \theta_1 x_1 + \cdots + \theta_n x_n \]

Cost Function

\[ J(\theta)=\frac{1}{2m}\sum_{i=1}^{m}(h_\theta(x^{(i)}) - y^{(i)})^2 \]

Gradient Descent Update

\[ \theta_j := \theta_j - \alpha \cdot \frac{1}{m}\sum_{i=1}^{m}(h_\theta(x^{(i)}) - y^{(i)})x_j^{(i)} \]
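A minimal NumPy sketch of this update (illustrative only: `X` is assumed to already contain a leading column of ones for θ₀, and `alpha`/`num_iters` are example values):

```python
import numpy as np

def gradient_descent(X, y, alpha=0.01, num_iters=1000):
    """Batch gradient descent for linear regression.
    X: (m, n+1) design matrix with a leading column of ones.
    y: (m,) targets."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(num_iters):
        errors = X @ theta - y        # h_theta(x) - y for every example
        grad = (X.T @ errors) / m     # (1/m) * sum(error * x_j) for each j
        theta -= alpha * grad         # simultaneous update of all theta_j
    return theta
```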

2. Ridge Regression (L2 Regularization)

Cost Function

\[ J(\theta)=\frac{1}{2m}\sum_{i=1}^{m}(h_\theta(x^{(i)}) - y^{(i)})^2 + \frac{\lambda}{2m}\sum_{j=1}^{n}\theta_j^2 \]

Update Rule

\[ \theta_j := \theta_j \left(1 - \alpha \frac{\lambda}{m}\right) - \alpha\cdot\frac{1}{m}\sum_{i=1}^{m}(h_\theta(x^{(i)}) - y^{(i)})x_j^{(i)} \]
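The same sketch adapted for one ridge step (illustrative; `lam` is an example λ, and the intercept `theta[0]` is conventionally left unregularized):

```python
import numpy as np

def ridge_gd_step(theta, X, y, alpha=0.01, lam=1.0):
    """One ridge (L2) gradient step; theta[0] (bias) is not regularized."""
    m = X.shape[0]
    errors = X @ theta - y
    grad = (X.T @ errors) / m
    reg = (lam / m) * theta
    reg[0] = 0.0                      # do not penalize the intercept
    return theta - alpha * (grad + reg)
```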

3. Lasso Regression (L1 Regularization)

Cost Function

\[ J(\theta)=\frac{1}{2m}\sum_{i=1}^{m}(h_\theta(x^{(i)}) - y^{(i)})^2 + \frac{\lambda}{m}\sum_{j=1}^{n} |\theta_j| \]

Update Rule

\[ \theta_j := \theta_j - \alpha \left( \frac{1}{m}\sum_{i=1}^{m}(h_\theta(x^{(i)}) - y^{(i)})x_j^{(i)} + \frac{\lambda}{m}\,\text{sign}(\theta_j) \right) \]
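A hedged sketch of this literal subgradient update (in practice lasso is usually solved with coordinate descent or proximal/soft-thresholding steps; `lam` and `alpha` are illustrative):

```python
import numpy as np

def lasso_gd_step(theta, X, y, alpha=0.01, lam=1.0):
    """One (sub)gradient step for lasso; theta[0] (bias) is not regularized."""
    m = X.shape[0]
    grad = (X.T @ (X @ theta - y)) / m
    reg = (lam / m) * np.sign(theta)  # subgradient of |theta_j|
    reg[0] = 0.0
    return theta - alpha * (grad + reg)
```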

4. Elastic Net (L1 + L2)

Cost Function

\[ J(\theta)=\frac{1}{2m}\sum_{i=1}^{m}(h_\theta(x^{(i)}) - y^{(i)})^2 + \frac{\lambda_1}{m}\sum_{j=1}^{n} |\theta_j| + \frac{\lambda_2}{2m}\sum_{j=1}^{n}\theta_j^2 \]

Update Rule

\[ \theta_j := \theta_j\left(1 - \alpha\frac{\lambda_2}{m}\right) - \alpha\left( \frac{1}{m}\sum_{i=1}^{m}(h_\theta(x^{(i)}) - y^{(i)})x_j^{(i)} + \frac{\lambda_1}{m}\,\text{sign}(\theta_j) \right) \]
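A sketch of one elastic-net step combining both penalty gradients (`lam1`, `lam2`, and `alpha` are illustrative values):

```python
import numpy as np

def elastic_net_gd_step(theta, X, y, alpha=0.01, lam1=0.5, lam2=0.5):
    """One (sub)gradient step for elastic net; theta[0] (bias) is not regularized."""
    m = X.shape[0]
    grad = (X.T @ (X @ theta - y)) / m
    reg = (lam1 / m) * np.sign(theta) + (lam2 / m) * theta  # L1 + L2 terms
    reg[0] = 0.0
    return theta - alpha * (grad + reg)
```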

5. Logistic Regression

Hypothesis

\[ h_\theta(x) = \sigma(\theta^T x) \]

Cost Function

\[ J(\theta) = -\frac{1}{m}\sum_{i=1}^{m} \left[ y^{(i)}\log(h_\theta(x^{(i)})) + (1-y^{(i)})\log(1-h_\theta(x^{(i)})) \right] \]

Gradient Descent Update

\[ \theta_j := \theta_j - \alpha \cdot \frac{1}{m}\sum_{i=1}^{m}(h_\theta(x^{(i)}) - y^{(i)})x_j^{(i)} \]
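A minimal NumPy sketch of this update (assumes `X` includes a bias column and `y` ∈ {0, 1}; `alpha` and `num_iters` are example values):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_gd(X, y, alpha=0.1, num_iters=1000):
    """Batch gradient descent for logistic regression."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(num_iters):
        h = sigmoid(X @ theta)                 # h_theta(x) for every example
        theta -= alpha * (X.T @ (h - y)) / m   # same gradient form as linear regression
    return theta
```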

6. Regularized Logistic Regression

Add L1 and L2 terms exactly like linear regression:

  • L2 (Ridge): $$ \frac{\lambda}{2m}\sum_{j=1}^{n}\theta_j^2 $$

  • L1 (Lasso): $$ \frac{\lambda}{m}\sum_{j=1}^{n}|\theta_j| $$

============================================================

LINEAR & LOGISTIC REGRESSION – EXAM Q&A KIT (BITS STYLE)

============================================================

This file contains:
- Conceptual Q&A
- Ridge, Lasso, Elastic Net
- Bias–Variance
- GD vs Normal Equation
- Mini-Batch / SGD / Batch
- Logistic Regression gradients
- Ready-to-use exam answers

------------------------------------------------------------

Q1. GD vs Normal Equation for Large m

------------------------------------------------------------

Question:
Why is gradient descent preferred over the normal equation when the number of features m is extremely large (e.g., m = 300,000)?

Answer:
The normal equation computes:

θ = (XᵀX)⁻¹ Xᵀy

This requires:
- Forming XᵀX → O(n m²)
- Inverting an m×m matrix → O(m³)
- Memory to store XᵀX → m² entries → huge

For m = 300,000:
- m² = 9×10¹⁰ entries ≈ 720 GB of memory (at 8 bytes per entry)
- m³ operations are infeasible to compute

Gradient Descent:
- Complexity O(nm) per iteration
- Can use mini-batch GD
- Scales to millions of features
- No matrix inversion required

GD is preferred.
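A quick sanity check of the 720 GB figure (assuming 8-byte float64 entries, an assumption rather than something stated in the question):

```python
m = 300_000                      # number of features, using this question's convention
entries = m * m                  # X^T X is m x m -> 9e10 entries
bytes_needed = entries * 8       # float64 -> 8 bytes per entry
print(bytes_needed / 1e9, "GB")  # -> 720.0 GB
```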

------------------------------------------------------------

Q2. Training error ≈ Validation error and both high

------------------------------------------------------------

Question:
If training error and validation error are both high and nearly equal, does the model have high bias or high variance? What should you do with λ in Ridge Regression?

Answer:
- Training error high → the model is underfitting
- Validation ≈ training error → the gap is small, so variance is not the issue

Diagnosis: High Bias

Cause in Ridge: λ too large → weights overly penalized.

Fix: Reduce λ
This allows the model to fit better and reduces bias.

------------------------------------------------------------

Q3. Gradient Descent Update for Multivariate Linear Regression

------------------------------------------------------------

Given:
hθ(x) = θᵀx
J(θ) = (1/2m) Σ (hθ(x^(i)) − y^(i))²

Derivative:
∂J/∂θj = (1/m) Σ (hθ(x^(i)) − y^(i)) x_j^(i)

Gradient Descent update:
θj := θj − α · (1/m) Σ (hθ(x^(i)) − y^(i)) x_j^(i)

------------------------------------------------------------

Q4. Why Feature Scaling Helps GD

------------------------------------------------------------

When features have very different ranges, contours of J(θ) become elongated. GD oscillates along steep directions → slow convergence.

Scaling features to similar ranges (0–1 normalization or standardization):
- Makes the contours of J(θ) more circular
- Lets one learning rate work well across all parameter directions
- Gives faster convergence
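A small standardization sketch (illustrative helper, not part of the notes; remember to reuse the same mu/sigma on new data at prediction time):

```python
import numpy as np

def standardize(X):
    """Scale each feature to zero mean and unit variance before running GD."""
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    sigma[sigma == 0] = 1.0        # guard against constant features
    return (X - mu) / sigma, mu, sigma
```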

------------------------------------------------------------

Q5. Ridge vs Lasso – Compare

------------------------------------------------------------

| Aspect              | Ridge (L2)                 | Lasso (L1)                     |
|---------------------|----------------------------|--------------------------------|
| Penalty             | λ Σ θj²                    | λ Σ \|θj\|                     |
| Effect on weights   | Shrinks all                | Sets many to zero (sparse)     |
| Feature selection   | No                         | Yes                            |
| Correlated features | Stable                     | Unstable with correlated vars  |
| Use case            | Smooth, small coefficients | Many features, need sparsity   |

------------------------------------------------------------

Q6. Ridge Regression GD Update (Derivation)

------------------------------------------------------------

Cost: J(θ) = (1/2m) Σ e^(i)² + (λ/2m) Σ θj², where e^(i) = hθ(x^(i)) − y^(i)

Derivative: ∂J/∂θj = (1/m) Σ e^(i) x_j^(i) + (λ/m) θj

Update: θj := θj(1 − αλ/m) − α(1/m) Σ e^(i) x_j^(i)

Bias θ₀ is not regularized.

------------------------------------------------------------

Q7. Why Lasso Sets Weights to Zero but Ridge Doesn’t

------------------------------------------------------------

Lasso penalty = |θj|

Derivative (subgradient):

d|θj|/dθj =
  +1                   if θj > 0
  −1                   if θj < 0
  anything in [−1, 1]  if θj = 0

This creates a constant shrinking force toward zero, driving many weights exactly to zero.

Ridge penalty θ² is smooth → shrinks but never exactly zero.
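This zeroing behaviour is usually implemented with the soft-thresholding (proximal) operator; a small illustrative sketch, not taken from the notes:

```python
import numpy as np

def soft_threshold(theta, t):
    """Proximal step for the L1 penalty: shrink each coefficient by t and clip at zero.
    Coefficients whose magnitude is below t land exactly at zero."""
    return np.sign(theta) * np.maximum(np.abs(theta) - t, 0.0)

# Example: small coefficients are zeroed, larger ones shrink toward zero.
print(soft_threshold(np.array([0.3, -0.05, 1.2]), 0.1))  # [ 0.2 -0.   1.1]
```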

------------------------------------------------------------

Q8. Logistic Regression: Cost Function

------------------------------------------------------------

Hypothesis:

hθ(x) = σ(θᵀx), where σ(z) = 1 / (1 + e^(−z))

Cost: J(θ) = −(1/m) Σ [ y log hθ + (1−y) log(1−hθ) ]

Why not MSE?
- MSE with a sigmoid hypothesis gives a non-convex cost → GD can converge poorly
- Log-loss gives a convex optimization problem with a single global minimum
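A small log-loss helper matching the cost above (the `eps` clipping is a practical guard against log(0), an implementation detail not in the notes):

```python
import numpy as np

def log_loss(theta, X, y, eps=1e-12):
    """Cross-entropy cost for logistic regression (y in {0, 1})."""
    h = 1.0 / (1.0 + np.exp(-(X @ theta)))
    h = np.clip(h, eps, 1 - eps)          # avoid log(0)
    return -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))
```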

------------------------------------------------------------

Q9. Regularized Logistic Regression GD Update

------------------------------------------------------------

Cost with L2: J(θ) = log-loss + (λ/2m) Σ θj²

Gradient: ∂J/∂θj = (1/m) Σ (hθ − y)xj + (λ/m)θj

Update: θj := θj − α[ (1/m) Σ (hθ − y)xj + (λ/m)θj ]
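A sketch of one such regularized step (illustrative; the bias term `theta[0]` is not penalized):

```python
import numpy as np

def reg_logistic_gd_step(theta, X, y, alpha=0.1, lam=1.0):
    """One L2-regularized gradient step for logistic regression."""
    m = X.shape[0]
    h = 1.0 / (1.0 + np.exp(-(X @ theta)))
    grad = (X.T @ (h - y)) / m
    grad[1:] += (lam / m) * theta[1:]     # regularize every theta_j except theta_0
    return theta - alpha * grad
```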

------------------------------------------------------------

Q10. Validation error increases after increasing λ

------------------------------------------------------------

If validation error increases (and training error also rises) after increasing λ:

  • λ shrinks weights too much
  • Model becomes overly simple
    ➡ Underfitting / High Bias

Fix: Reduce λ

------------------------------------------------------------

Q11. Training error low but validation error high

------------------------------------------------------------

This is high variance (overfitting).

Fix:
- Increase λ
- Reduce number of features
- Early stopping
- More training data
- Use Lasso for sparsity

------------------------------------------------------------

Q12. Compare Batch GD, Mini-Batch GD, and SGD

------------------------------------------------------------

Batch GD
- Uses all m examples
- Stable but slow
- O(nm) per iteration

SGD
- Updates per example
- Very fast
- Noisy updates

Mini-Batch (best)
- Batch sizes like 32/64/128
- Stable + fast
- Best for large datasets
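An illustrative mini-batch loop for linear regression (batch size, learning rate, and number of epochs are example values):

```python
import numpy as np

def minibatch_gd(X, y, alpha=0.01, batch_size=64, epochs=10, seed=0):
    """Mini-batch gradient descent for linear regression (sketch)."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(epochs):
        idx = rng.permutation(m)                      # shuffle once per epoch
        for start in range(0, m, batch_size):
            batch = idx[start:start + batch_size]
            Xb, yb = X[batch], y[batch]
            grad = Xb.T @ (Xb @ theta - yb) / len(batch)
            theta -= alpha * grad
    return theta
```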

------------------------------------------------------------

Q13. For dataset with 2M samples and 300k features, which GD method?

------------------------------------------------------------

Use Mini-Batch Gradient Descent.

Reasons:
- Scales to large n and m
- Fits into GPU/CPU memory
- Faster convergence
- Works with regularization

------------------------------------------------------------

Q14. Derive Logistic Regression Gradient (Unregularized)

------------------------------------------------------------

Given: hθ(x) = σ(θᵀx), where σ(z)=1/(1+e^(−z))

Cost: J(θ) = −(1/m) Σ [ y log(hθ) + (1−y)log(1−hθ) ]

Derivative: ∂J/∂θj = (1/m) Σ (hθ(x^(i)) − y^(i)) x_j^(i)

GD update: θj := θj − α(1/m) Σ (hθ(x^(i)) − y^(i)) x_j^(i)

------------------------------------------------------------

Q15. Elastic Net Regression

------------------------------------------------------------

Penalty:
λ1 Σ |θj|  (L1)
λ2 Σ θj²  (L2)

Combines Lasso + Ridge:
- Encourages sparsity but with controlled coefficient shrinkage
- Stable even with correlated features
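A hedged scikit-learn sketch (assumes scikit-learn is available; the synthetic data, `alpha`, and `l1_ratio` values are illustrative, not from the notes):

```python
import numpy as np
from sklearn.linear_model import ElasticNet

# Tiny synthetic example: only the first feature actually matters.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
y = X[:, 0] * 3.0 + rng.normal(scale=0.1, size=100)

# l1_ratio mixes the penalties: 1.0 -> pure Lasso, 0.0 -> pure Ridge.
model = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)
print(model.coef_)   # first coefficient large, most others at or near zero
```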

------------------------------------------------------------

Q16. Bias-Variance Summary (Must Memorize)

------------------------------------------------------------

High Bias (Underfitting)
- Training error high
- Validation ≈ Training

Fix:
- Reduce λ
- Add features
- More complex model

High Variance (Overfitting)
- Training error low
- Validation error very high

Fix:
- Increase λ
- Reduce model complexity
- Lasso (feature selection)
- More data

------------------------------------------------------------

END OF MARKDOWN EXAM KIT

------------------------------------------------------------

Q1. For the table in the question, which λ would you choose and why?

A1. Likely λ = 0.001:
- It has the highest CV accuracy (0.95).
- Although it uses many features, the primary goal is classification performance.
- Depending on deployment constraints (model size), you might trade a bit of accuracy for sparsity (e.g., λ = 0.01).


Q2. Suppose for λ = 0.001, training accuracy = 0.99 and CV accuracy = 0.95. What does this indicate?

A2. Training much higher than CV → some overfitting (high variance). Regularization is relatively weak; you might slightly increase λ to reduce variance without losing much accuracy.


Q3. Suppose for λ = 10, training accuracy = 0.76 and CV accuracy = 0.75. What then?

A3. Training ≈ CV and both are low → high bias, strong underfitting. λ is too large; reduce λ to allow more features and better fit.


Q4. In text classification with 100,000 bag-of-words features and 5,000 samples, why is regularization especially important?

A4. Because:
- m >> n (many more features than samples).
- High risk of overfitting.
- Regularization (especially L1) helps by selecting a small subset of informative words.


Q5. Why is L1 often preferred over L2 in high-dimensional sparse problems like bag-of-words?

A5. L1:
- Produces sparse models (few non-zero weights).
- Performs implicit feature selection.
- Leads to smaller, faster models and can improve generalization when many features are irrelevant.
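A hedged bag-of-words sketch with L1-regularized logistic regression (assumes scikit-learn; the tiny corpus, `C`, and solver choice are illustrative):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

docs = ["cheap pills now", "meeting at noon", "win cheap prize", "lunch at noon"]
labels = [1, 0, 1, 0]

X = CountVectorizer().fit_transform(docs)          # sparse bag-of-words matrix
clf = LogisticRegression(penalty="l1", solver="liblinear", C=1.0).fit(X, labels)
print((clf.coef_ != 0).sum(), "non-zero weights")  # L1 keeps only a few informative words
```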

(a) How does λ control the bias–variance trade-off?

Regularization strength λ controls how much large weights are penalized.

Small λ (e.g., 0.001)

  • Weak regularization → flexible model
  • Low bias, but high variance
  • Uses many features (e.g., 95k) → complex model

Large λ (e.g., 10)

  • Strong regularization → many weights shrunk or zeroed
  • High bias, low variance
  • Very few features used (e.g., 2k) → overly simple model, may underfit

Conclusion:
- Increasing λ → increases bias, decreases variance
- Decreasing λ → decreases bias, increases variance

In the table, as λ increases, accuracy falls → bias increases and the model underfits.


(b) What type of regularization was likely used? Justify.

The “Selected Features” column shows:

  • λ = 0.001 → ~95,000 features
  • λ = 10 → ~2,000 features

A huge drop in feature count means many coefficients were driven exactly to zero as λ increased.
That behavior is characteristic of L1 regularization (Lasso):

  • L1 → produces sparse solutions (feature selection)
  • L2 → shrinks coefficients but rarely makes them exactly zero

Answer:
Likely L1 regularization (Lasso), because increasing λ drastically reduces the number of selected features, indicating many coefficients become exactly zero.


(c) Can regularization also cause underfitting? Justify with this example.

Yes.

When λ is very large (e.g., 10):

  • CV accuracy drops sharply (e.g., from 0.95 → 0.75)
  • Only ~2,000 out of 100,000 features are used
  • The model becomes too simple for the task

This is classic underfitting due to over-regularization:

  • High λ → too much penalty → coefficients overly shrunk
  • Model loses expressive power → high bias

Thus, regularization can cause underfitting when λ is too large.


(d) How do L1 and L2 handle highly correlated features differently?

L1 Regularization (Lasso)

  • Tends to pick one of the correlated features
  • Sets the others’ coefficients to zero
  • Produces a sparse model
  • Selection can be unstable (depends on data/noise)

L2 Regularization (Ridge)

  • Distributes weights across the correlated features
  • Keeps all features but with smaller coefficients
  • More stable when multicollinearity is present

Summary:
- L1 → chooses one correlated feature, drops the rest
- L2 → keeps all correlated features with shared reduced weights