1. Linear Regression¶
Hypothesis¶
$$ h_\theta(x) = \theta^{T}x $$
Cost Function¶
$$ J(\theta) = \frac{1}{2m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2 $$
Gradient Descent Update¶
$$ \theta_j := \theta_j - \alpha\,\frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)x_j^{(i)} $$
2. Ridge Regression (L2 Regularization)¶
Cost Function¶
$$ J(\theta) = \frac{1}{2m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2 + \frac{\lambda}{2m}\sum_{j=1}^{n}\theta_j^2 $$
Update Rule¶
$$ \theta_j := \theta_j\left(1 - \frac{\alpha\lambda}{m}\right) - \alpha\,\frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)x_j^{(i)} \quad (\theta_0 \text{ is not regularized}) $$
3. Lasso Regression (L1 Regularization)¶
Cost Function¶
$$ J(\theta) = \frac{1}{2m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2 + \frac{\lambda}{m}\sum_{j=1}^{n}|\theta_j| $$
Update Rule¶
$$ \theta_j := \theta_j - \alpha\left[\frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)x_j^{(i)} + \frac{\lambda}{m}\,\mathrm{sign}(\theta_j)\right] $$
4. Elastic Net (L1 + L2)¶
Cost Function¶
$$ J(\theta) = \frac{1}{2m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2 + \frac{\lambda_1}{m}\sum_{j=1}^{n}|\theta_j| + \frac{\lambda_2}{2m}\sum_{j=1}^{n}\theta_j^2 $$
Update Rule¶
$$ \theta_j := \theta_j\left(1 - \frac{\alpha\lambda_2}{m}\right) - \alpha\left[\frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)x_j^{(i)} + \frac{\lambda_1}{m}\,\mathrm{sign}(\theta_j)\right] $$
5. Logistic Regression¶
Hypothesis¶
$$ h_\theta(x) = \sigma(\theta^{T}x), \qquad \sigma(z) = \frac{1}{1 + e^{-z}} $$
Cost Function¶
$$ J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\left[\,y^{(i)}\log h_\theta(x^{(i)}) + \left(1 - y^{(i)}\right)\log\left(1 - h_\theta(x^{(i)})\right)\right] $$
Gradient Descent Update¶
$$ \theta_j := \theta_j - \alpha\,\frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)x_j^{(i)} $$
6. Regularized Logistic Regression¶
Add L1 and L2 terms exactly like linear regression:
- L2 (Ridge): $$ \frac{\lambda}{2m}\sum_{j=1}^{n}\theta_j^2 $$
- L1 (Lasso): $$ \frac{\lambda}{m}\sum_{j=1}^{n}|\theta_j| $$
============================================================¶
LINEAR & LOGISTIC REGRESSION – EXAM Q&A KIT (BITS STYLE)¶
============================================================¶
This file contains:
- Conceptual Q&A
- Ridge, Lasso, Elastic Net
- Bias–Variance
- GD vs Normal Equation
- Mini-Batch / SGD / Batch
- Logistic Regression gradients
- Ready-to-use exam answers
------------------------------------------------------------¶
Q1. GD vs Normal Equation for Large m¶
------------------------------------------------------------¶
Question:
Why is gradient descent preferred over the normal equation when the number
of features m is extremely large (e.g., m = 300,000)?
Answer:
Normal equation computes:
θ = (XᵀX)⁻¹ Xᵀ y
This requires:
- Forming XᵀX → O(n m²)
- Inverting an m×m matrix → O(m³)
- Memory to store XᵀX → m² entries → huge
For m = 300,000:
- m² = 9×10¹⁰ entries ≈ 720 GB of memory (at 8 bytes per entry)
- m³ ≈ 2.7×10¹⁶ operations → computationally infeasible
Gradient Descent:
- Complexity O(n m) per iteration
- Can use mini-batch GD
- Scales to millions of features
- No matrix inversion required
➡ GD is preferred.
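To make the contrast concrete, here is a minimal NumPy sketch (not part of the original answer; the problem size, learning rate, and iteration count are arbitrary demo values) showing that both approaches reach the same least-squares solution:

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_features = 1000, 50              # small demo sizes (assumed)
X = rng.normal(size=(n_samples, n_features))
theta_true = rng.normal(size=n_features)
y = X @ theta_true + 0.1 * rng.normal(size=n_samples)

# Normal equation: solve (X^T X) theta = X^T y -> cubic cost in the feature count
theta_ne = np.linalg.solve(X.T @ X, X.T @ y)

# Batch gradient descent: linear in samples x features per iteration, no inversion
theta_gd = np.zeros(n_features)
alpha = 0.1
for _ in range(1000):
    grad = X.T @ (X @ theta_gd - y) / n_samples
    theta_gd -= alpha * grad

print(np.allclose(theta_ne, theta_gd, atol=1e-3))   # both reach the same solution
```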
------------------------------------------------------------¶
Q2. Training error ≈ Validation error and both high¶
------------------------------------------------------------¶
Question:
If training error and validation error are both high and nearly equal, does the
model have high bias or high variance? What should you do with λ in Ridge Regression?
Answer:
- Training error high → model is underfitting
- Validation ≈ training error → generalizes consistently
➡ High Bias
Cause in Ridge: λ too large → weights overly penalized.
Fix: Reduce λ
This allows the model to fit better and reduces bias.
------------------------------------------------------------¶
Q3. Gradient Descent Update for Multivariate Linear Regression¶
------------------------------------------------------------¶
Given:
hθ(x) = θᵀx
J(θ) = (1/2m) Σ (hθ(x(i)) − y(i))²
Derivative: ∂J/∂θj = (1/m) Σ (hθ(x(i)) − y(i)) xj(i)
Gradient Descent: θj := θj − α * (1/m) Σ (hθ(x(i)) − y(i)) xj(i)
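A direct NumPy translation of this update (an illustrative sketch; the toy data, α, and iteration count are made up), with an explicit intercept column so θ₀ is handled like the other parameters:

```python
import numpy as np

def batch_gd(X, y, alpha=0.5, iters=2000):
    """Multivariate linear regression via
    theta_j := theta_j - alpha * (1/m) * sum_i (h(x(i)) - y(i)) * xj(i)."""
    m = len(y)
    Xb = np.column_stack([np.ones(m), X])      # prepend x_0 = 1 so theta_0 is the intercept
    theta = np.zeros(Xb.shape[1])
    for _ in range(iters):
        errors = Xb @ theta - y                # h(x(i)) - y(i) for every example
        theta -= alpha * (Xb.T @ errors) / m   # vectorized form of the per-theta_j update
    return theta

# toy usage: recover y ≈ 2 + 3x
x = np.linspace(0, 1, 100)
print(batch_gd(x.reshape(-1, 1), 2 + 3 * x))   # approximately [2, 3]
```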
------------------------------------------------------------¶
Q4. Why Feature Scaling Helps GD¶
------------------------------------------------------------¶
When features have very different ranges, contours of J(θ) become elongated. GD oscillates along steep directions → slow convergence.
Scaling to similar ranges (0–1 or standardization):
- Makes contours circular
- Improves step efficiency
- Faster convergence
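A quick NumPy illustration (a sketch with made-up feature ranges): standardization shrinks the condition number of XᵀX, which is the numerical counterpart of "elongated contours":

```python
import numpy as np

rng = np.random.default_rng(0)
# Two features with very different ranges, e.g. house size vs number of rooms
size = rng.uniform(500, 3500, size=200)
rooms = rng.integers(1, 6, size=200).astype(float)
X = np.column_stack([size, rooms])

X_std = (X - X.mean(axis=0)) / X.std(axis=0)     # z-score standardization

print(np.linalg.cond(X.T @ X))                   # huge -> elongated contours, slow GD
print(np.linalg.cond(X_std.T @ X_std))           # near 1 -> nearly circular contours
```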
------------------------------------------------------------¶
Q5. Ridge vs Lasso – Compare¶
------------------------------------------------------------¶
| Aspect | Ridge (L2) | Lasso (L1) |
|---|---|---|
| Penalty | λ Σ θ² | λ Σ \|θ\| |
| Effect on Weights | Shrinks all | Sets many to zero (sparse) |
| Feature Selection | No | Yes |
| Correlated Features | Stable (spreads weight across them) | Can be unstable (tends to pick one) |
| Use Case | Shrink all coefficients smoothly | Many features, need sparsity |
------------------------------------------------------------¶
Q6. Ridge Regression GD Update (Derivation)¶
------------------------------------------------------------¶
Cost: J(θ) = (1/2m) Σ e(i)² + (λ/2m) Σ θj²,  where e(i) = hθ(x(i)) − y(i)
Derivative: ∂J/∂θj = (1/m) Σ e(i) xj(i) + (λ/m) θj
Update: θj := θj(1 - αλ/m) - α(1/m)Σ e(i)xj(i)
Bias θ₀ is not regularized.
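A sketch of this update in NumPy (illustrative; the synthetic data and hyperparameters are placeholders), keeping θ₀ out of the penalty as stated above:

```python
import numpy as np

def ridge_gd_step(theta, Xb, y, alpha, lam):
    """One ridge GD step; Xb is assumed to have a leading column of ones."""
    m = len(y)
    errors = Xb @ theta - y                   # e(i) = h(x(i)) - y(i)
    grad = (Xb.T @ errors) / m
    reg = (lam / m) * theta
    reg[0] = 0.0                              # theta_0 (bias) is not regularized
    return theta - alpha * (grad + reg)

# usage sketch on synthetic data
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=100)
Xb = np.column_stack([np.ones(100), X])
theta = np.zeros(4)
for _ in range(500):
    theta = ridge_gd_step(theta, Xb, y, alpha=0.1, lam=1.0)
print(theta)    # roughly [0, 1, -2, 0.5], mildly shrunk by the penalty
```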
------------------------------------------------------------¶
Q7. Why Lasso Sets Weights to Zero but Ridge Doesn’t¶
------------------------------------------------------------¶
Lasso penalty = |θj|
Derivative (subgradient):
d|θj|/dθj =
  +1 if θj > 0
  −1 if θj < 0
  any value in [−1, 1] if θj = 0
This creates a constant shrinking force toward zero, driving many weights exactly to zero.
Ridge penalty θ² is smooth → shrinks but never exactly zero.
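In practice this subgradient shows up as the soft-thresholding (proximal) step used by coordinate-descent Lasso solvers; a small sketch (the threshold 0.1 is arbitrary) contrasting it with Ridge-style shrinkage:

```python
import numpy as np

def soft_threshold(theta, t):
    """Proximal operator of t*|theta|: shrink toward zero and clip at zero."""
    return np.sign(theta) * np.maximum(np.abs(theta) - t, 0.0)

theta = np.array([0.80, 0.05, -0.30, -0.02])
print(soft_threshold(theta, 0.1))   # [0.7, 0.0, -0.2, 0.0] -> small weights become exactly 0
print(theta / (1.0 + 0.1))          # ridge-style shrinkage: smaller, but never exactly 0
```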
------------------------------------------------------------¶
Q8. Logistic Regression: Cost Function¶
------------------------------------------------------------¶
Hypothesis:
hθ(x) = σ(θᵀx), where σ(z) = 1 / (1 + e^(−z))
Cost: J(θ) = −(1/m) Σ [ y log hθ + (1−y) log(1−hθ) ]
Why not MSE?
- MSE with the sigmoid is non-convex → poor convergence
- Log-loss gives a convex optimization problem
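A NumPy sketch of the hypothesis and log-loss (illustrative; the clipping is a numerical-safety addition, not part of the formula):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_loss(theta, X, y, eps=1e-12):
    """J(theta) = -(1/m) * sum[ y*log(h) + (1-y)*log(1-h) ]."""
    h = np.clip(sigmoid(X @ theta), eps, 1 - eps)   # clip to avoid log(0)
    return -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))

# tiny usage check: a well-oriented separator gives a much lower loss than a flipped one
X = np.array([[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]])   # bias column + 1 feature
y = np.array([0.0, 0.0, 1.0, 1.0])
print(log_loss(np.array([0.0, 5.0]), X, y))    # small loss
print(log_loss(np.array([0.0, -5.0]), X, y))   # large loss
```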
------------------------------------------------------------¶
Q9. Regularized Logistic Regression GD Update¶
------------------------------------------------------------¶
Cost with L2: J(θ) = log-loss + (λ/2m) Σ θj²
Gradient: ∂J/∂θj = (1/m) Σ (hθ − y)xj + (λ/m)θj
Update: θj := θj − α[ (1/m) Σ (hθ − y)xj + (λ/m)θj ]   (for j ≥ 1; θ₀ is not regularized)
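The same update written as a NumPy step (a sketch; Xb, y, and the hyperparameters are placeholders, and θ₀ is excluded from the penalty):

```python
import numpy as np

def reg_logistic_gd_step(theta, Xb, y, alpha, lam):
    """One L2-regularized logistic-regression step; Xb has a leading ones column."""
    m = len(y)
    h = 1.0 / (1.0 + np.exp(-(Xb @ theta)))   # h_theta(x) = sigmoid(theta^T x)
    grad = (Xb.T @ (h - y)) / m               # (1/m) * sum (h - y) * xj
    reg = (lam / m) * theta
    reg[0] = 0.0                              # theta_0 is not regularized
    return theta - alpha * (grad + reg)
```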
------------------------------------------------------------¶
Q10. Validation error increases after increasing λ¶
------------------------------------------------------------¶
If both training error and validation error increase after increasing λ:
- λ shrinks weights too much
- Model becomes overly simple
➡ Underfitting / High Bias
Fix: Reduce λ
------------------------------------------------------------¶
Q11. Training error low but validation error high¶
------------------------------------------------------------¶
This is high variance (overfitting).
Fix:
- Increase λ
- Reduce number of features
- Early stopping
- More training data
- Use Lasso for sparsity
------------------------------------------------------------¶
Q12. Compare Batch GD, Mini-Batch GD, and SGD¶
------------------------------------------------------------¶
Batch GD
- Uses all m examples
- Stable but slow
- O(nm)
SGD
- Updates per example
- Very fast
- Noisy updates
Mini-Batch GD (usually the best trade-off)
- Batch sizes like 32/64/128
- Stable + fast
- Best for large datasets
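A generic mini-batch loop as a sketch (the gradient function, batch size, and learning rate are placeholders you would adapt; batch_size=1 recovers SGD and batch_size=m recovers batch GD):

```python
import numpy as np

def minibatch_gd(theta, X, y, grad_fn, alpha=0.05, batch_size=64, epochs=5, seed=0):
    """Shuffle each epoch, then update theta on one mini-batch at a time."""
    rng = np.random.default_rng(seed)
    m = len(y)
    for _ in range(epochs):
        order = rng.permutation(m)                       # reshuffle the data each epoch
        for start in range(0, m, batch_size):
            idx = order[start:start + batch_size]
            theta = theta - alpha * grad_fn(theta, X[idx], y[idx])
    return theta

# usage: plug in the linear-regression gradient as grad_fn
rng = np.random.default_rng(1)
X = rng.normal(size=(10_000, 5))
y = X @ np.array([1.0, 2.0, -1.0, 0.5, 0.0]) + 0.1 * rng.normal(size=10_000)
lin_grad = lambda th, Xb, yb: Xb.T @ (Xb @ th - yb) / len(yb)
print(minibatch_gd(np.zeros(5), X, y, lin_grad))   # approximately the true coefficients
```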
------------------------------------------------------------¶
Q13. For dataset with 2M samples and 300k features, which GD method?¶
------------------------------------------------------------¶
Use Mini-Batch Gradient Descent.
Reasons:
- Scales to large n and m
- Fits into GPU/CPU memory
- Faster convergence
- Works with regularization
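In scikit-learn, one common way to train at this scale is SGDClassifier with partial_fit over mini-batches; a sketch with synthetic stand-in batches (loss="log_loss" assumes scikit-learn ≥ 1.1, where older versions call it "log"):

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(0)
clf = SGDClassifier(loss="log_loss", penalty="l2", alpha=1e-4)   # logistic loss + L2
classes = np.array([0, 1])

# stand-in for streaming mini-batches of a dataset too large to fit in memory
for _ in range(100):
    X_batch = rng.normal(size=(128, 20))                 # 128 samples, 20 features
    y_batch = (X_batch[:, 0] + X_batch[:, 1] > 0).astype(int)
    clf.partial_fit(X_batch, y_batch, classes=classes)

X_test = rng.normal(size=(1000, 20))
y_test = (X_test[:, 0] + X_test[:, 1] > 0).astype(int)
print(clf.score(X_test, y_test))                         # well above chance
```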
------------------------------------------------------------¶
Q14. Derive Logistic Regression Gradient (Unregularized)¶
------------------------------------------------------------¶
Given: hθ(x) = σ(θᵀx), where σ(z)=1/(1+e^(−z))
Cost: J(θ) = −(1/m) Σ [ y log(hθ) + (1−y)log(1−hθ) ]
Derivative: ∂J/∂θj = (1/m) Σ (hθ(x(i)) − y(i)) xj(i)
GD update: θj := θj − α(1/m)Σ (hθ − y)xj
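One way to sanity-check this derived gradient numerically is a centered finite-difference gradient check; a sketch on toy data:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(theta, X, y):
    h = sigmoid(X @ theta)
    return -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))

def grad(theta, X, y):
    return X.T @ (sigmoid(X @ theta) - y) / len(y)   # the gradient derived above

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
y = (rng.random(50) < 0.5).astype(float)
theta = 0.1 * rng.normal(size=4)

eps = 1e-6
num_grad = np.array([(cost(theta + eps * e, X, y) - cost(theta - eps * e, X, y)) / (2 * eps)
                     for e in np.eye(4)])
print(np.allclose(num_grad, grad(theta, X, y), atol=1e-6))   # True: derivation checks out
```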
------------------------------------------------------------¶
Q15. Elastic Net Regression¶
------------------------------------------------------------¶
Penalty: λ₁ Σ |θj| (L1) + λ₂ Σ θj² (L2)
Combines Lasso + Ridge:
- Encourages sparsity while keeping coefficient shrinkage controlled
- Stable even with correlated features
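In scikit-learn this corresponds to ElasticNet, which uses an overall strength alpha and an L1/L2 mix l1_ratio instead of separate λ₁ and λ₂ (the exact mapping differs by a scaling convention); a sketch on synthetic data:

```python
import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))
coef = np.zeros(20)
coef[:3] = [2.0, -1.5, 1.0]                     # only 3 informative features
y = X @ coef + 0.1 * rng.normal(size=200)

# l1_ratio=0.5 weights the L1 and L2 penalties equally
model = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)
print(np.count_nonzero(model.coef_))            # sparse solution, irrelevant features zeroed
```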
------------------------------------------------------------¶
Q16. Bias-Variance Summary (Must Memorize)¶
------------------------------------------------------------¶
High Bias (Underfitting)
- Training error high
- Validation ≈ Training
Fix:
- Reduce λ
- Add features
- More complex model
High Variance (Overfitting)
- Training error low
- Validation error very high
Fix:
- Increase λ
- Reduce model complexity
- Lasso (feature selection)
- More data
------------------------------------------------------------¶
END OF MARKDOWN EXAM KIT¶
------------------------------------------------------------¶
Q1. For the table in the question, which λ would you choose and why?
A1. Likely λ = 0.001:
- It has the highest CV accuracy (0.95).
- Although it uses many features, the primary goal is classification performance.
- Depending on deployment constraints (model size), you might trade a bit of accuracy for sparsity (e.g., λ = 0.01).
Q2. Suppose for λ = 0.001, training accuracy = 0.99 and CV accuracy = 0.95. What does this indicate?
A2. Training much higher than CV → some overfitting (high variance). Regularization is relatively weak; you might slightly increase λ to reduce variance without losing much accuracy.
Q3. Suppose for λ = 10, training accuracy = 0.76 and CV accuracy = 0.75. What then?
A3. Training ≈ CV and both are low → high bias, strong underfitting. λ is too large; reduce λ to allow more features and better fit.
Q4. In text classification with 100,000 bag-of-words features and 5,000 samples, why is regularization especially important?
A4. Because:
- m >> n (many more features than samples).
- High risk of overfitting.
- Regularization (especially L1) helps by selecting a small subset of informative words.
Q5. Why is L1 often preferred over L2 in high-dimensional sparse problems like bag-of-words?
A5. L1:
- Produces sparse models (few non-zero weights).
- Performs implicit feature selection.
- Leads to smaller, faster models and can improve generalization when many features are irrelevant.
(a) How does λ control the bias–variance trade-off?¶
Regularization strength λ controls how much large weights are penalized.
Small λ (e.g., 0.001)¶
- Weak regularization → flexible model
- Low bias, but high variance
- Uses many features (e.g., 95k) → complex model
Large λ (e.g., 10)¶
- Strong regularization → many weights shrunk or zeroed
- High bias, low variance
- Very few features used (e.g., 2k) → overly simple model, may underfit
Conclusion:
- Increasing λ → increases bias, decreases variance
- Decreasing λ → decreases bias, increases variance
In the table, as λ increases, accuracy falls → bias increases and the model underfits.
(b) What type of regularization was likely used? Justify.¶
The “Selected Features” column shows:
- λ = 0.001 → ~95,000 features
- λ = 10 → ~2,000 features
A huge drop in feature count means many coefficients were driven exactly to zero as λ increased.
That behavior is characteristic of L1 regularization (Lasso):
- L1 → produces sparse solutions (feature selection)
- L2 → shrinks coefficients but rarely makes them exactly zero
Answer:
Likely L1 regularization (Lasso), because increasing λ drastically reduces the number of selected features, indicating many coefficients become exactly zero.
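This pattern is easy to reproduce; a sketch with synthetic data (in scikit-learn's LogisticRegression the regularization knob is C = 1/λ, so decreasing C plays the role of increasing λ):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 200))                 # many, mostly irrelevant, features
y = (X[:, 0] - X[:, 1] + 0.5 * rng.normal(size=500) > 0).astype(int)

for C in [100.0, 1.0, 0.01]:                    # decreasing C = increasing lambda
    clf = LogisticRegression(penalty="l1", solver="liblinear", C=C).fit(X, y)
    print(C, np.count_nonzero(clf.coef_))       # fewer non-zero weights as C decreases
```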
(c) Can regularization also cause underfitting? Justify with this example.¶
Yes.
When λ is very large (e.g., 10):
- CV accuracy drops sharply (e.g., from 0.95 → 0.75)
- Only ~2,000 out of 100,000 features are used
- The model becomes too simple for the task
This is classic underfitting due to over-regularization:
- High λ → too much penalty → coefficients overly shrunk
- Model loses expressive power → high bias
Thus, regularization can cause underfitting when λ is too large.
(d) How do L1 and L2 handle highly correlated features differently?¶
L1 Regularization (Lasso)¶
- Tends to pick one of the correlated features
- Sets the others’ coefficients to zero
- Produces a sparse model
- Selection can be unstable (depends on data/noise)
L2 Regularization (Ridge)¶
- Distributes weights across the correlated features
- Keeps all features but with smaller coefficients
- More stable when multicollinearity is present
Summary:
- L1 → chooses one correlated feature, drops the rest
- L2 → keeps all correlated features with shared reduced weights
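A small sketch illustrating the difference on two nearly identical features (the alphas are arbitrary):

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = x1 + 0.01 * rng.normal(size=200)        # x2 is almost a copy of x1
X = np.column_stack([x1, x2])
y = 3.0 * x1 + 0.1 * rng.normal(size=200)

print(Lasso(alpha=0.1).fit(X, y).coef_)      # typically puts weight on one feature, zeros the other
print(Ridge(alpha=1.0).fit(X, y).coef_)      # spreads similar weights across both features
```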