Unconstrained Optimization and Adaptive Optimizers for ML¶
This chapter covers unconstrained optimization in machine learning, from classical Gradient Descent to adaptive methods such as AdaGrad, RMSProp, and Adam.
1. What Is Unconstrained Optimization?¶
In unconstrained optimization, we minimize an objective without explicit feasibility constraints:

\[
\min_{\theta \in \mathbb{R}^d} J(\theta)
\]

where:
- \(\theta\) = model parameters,
- \(J(\theta)\) = loss (objective) function.
In ML, training typically means solving this optimization problem iteratively.
2. Gradient Descent (GD)¶
2.1 Definition¶
Gradient Descent follows the negative gradient direction to reduce loss.
Update rule:

\[
\theta_{t+1} = \theta_t - \eta\, \nabla J(\theta_t)
\]

where \(\eta\) is the learning rate.
2.2 Challenge of Gradient Descent¶
A key limitation is that GD uses the same learning rate for all parameters.
Implications:
- slow progress in flat directions,
- oscillation in steep directions,
- sensitive tuning of \(\eta\).
2.3 Geometric Intuition¶
Imagine descending a valley:
- too small \(\eta\): tiny, slow steps,
- too large \(\eta\): bouncing/overshoot,
- anisotropic curvature: zig-zag trajectories.
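The learning-rate trade-off above can be reproduced numerically. A minimal sketch, assuming the toy objective \(J(\theta) = \theta^2\) with gradient \(2\theta\) (this example is illustrative, not from the chapter):

```python
# Toy objective J(theta) = theta^2, gradient J'(theta) = 2*theta.
# GD contracts theta by the factor (1 - 2*eta) each step, so it
# converges for eta < 1 and diverges for eta > 1 on this problem.

def gd_1d(theta0, eta, steps):
    """Run gradient descent on J(theta) = theta^2; return the final theta."""
    theta = theta0
    for _ in range(steps):
        theta -= eta * 2 * theta  # theta <- theta - eta * J'(theta)
    return theta

small = gd_1d(theta0=5.0, eta=0.01, steps=50)  # tiny, slow steps
good = gd_1d(theta0=5.0, eta=0.4, steps=50)    # rapid convergence
big = gd_1d(theta0=5.0, eta=1.1, steps=50)     # overshoot and divergence
```

With \(\eta = 0.01\) the iterate is still far from the minimum after 50 steps, with \(\eta = 0.4\) it is essentially zero, and with \(\eta = 1.1\) it has diverged.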
3. AdaGrad¶
3.1 Definition¶
AdaGrad adapts step size per-parameter using accumulated squared gradients.
Per-parameter accumulator:

\[
G_{t,i} = G_{t-1,i} + g_{t,i}^2
\]

where \(g_{t,i}\) is the \(i\)-th component of the gradient \(\nabla J(\theta_t)\).

Update:

\[
\theta_{t+1,i} = \theta_{t,i} - \frac{\eta}{\sqrt{G_{t,i}} + \epsilon}\, g_{t,i}
\]
3.2 Why It Helps¶
- frequent/high-gradient parameters get smaller updates,
- infrequent/sparse parameters get relatively larger updates.
This is useful in sparse-feature settings (for example text features).
3.3 Challenge of AdaGrad¶
Because \(G_{t,i}\) keeps growing, effective learning rates can shrink too much over time, causing premature slow convergence.
4. RMSProp (Root Mean Squared Propagation)¶
4.1 Definition¶
RMSProp replaces AdaGrad's cumulative sum with an exponential moving average of squared gradients.
Second-moment tracker:

\[
s_t = \rho\, s_{t-1} + (1 - \rho)\, g_t^2
\]

Update:

\[
\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{s_t} + \epsilon}\, g_t
\]
4.2 Why RMSProp Solves AdaGrad’s Decay Issue¶
The moving average forgets very old gradients, so step sizes do not collapse as aggressively as they do under AdaGrad.
4.3 Use Cases¶
Commonly effective for non-stationary objectives and deep networks where gradient scales vary across parameters.
5. Adam (Adaptive Moment Estimation)¶
5.1 Definition¶
Adam combines:
- momentum-like first moment,
- RMSProp-like second moment,
- bias correction.

First moment:

\[
m_t = \beta_1 m_{t-1} + (1 - \beta_1)\, g_t
\]

Second moment:

\[
v_t = \beta_2 v_{t-1} + (1 - \beta_2)\, g_t^2
\]

Bias-corrected estimates:

\[
\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}
\]

Update:

\[
\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon}\, \hat{m}_t
\]

Typical defaults: \(\eta = 0.001\), \(\beta_1 = 0.9\), \(\beta_2 = 0.999\), \(\epsilon = 10^{-8}\).
5.2 Why Adam Is Popular¶
- robust default behavior,
- adaptive per-parameter updates,
- fast convergence in many deep learning workloads,
- handles noisy gradients reasonably well.
5.3 Practical Caveat¶
Adam is a strong default but not always best for generalization on every task; testing against SGD(+momentum) remains good practice.
6. Pseudocode (Professor-Style)¶
6.1 Gradient Descent¶
```
Initialize θ
for t = 1..T:
    g_t = ∇J(θ_t)
    θ_{t+1} = θ_t - η g_t
return θ_T
```
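The same loop in runnable NumPy form; the least-squares objective below is an illustrative assumption, not part of the chapter:

```python
import numpy as np

def grad_descent(grad, theta0, eta=0.1, T=200):
    """Plain gradient descent: theta_{t+1} = theta_t - eta * g_t."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(T):
        theta = theta - eta * grad(theta)
    return theta

# Illustrative objective: J(theta) = ||A @ theta - b||^2,
# with gradient 2 * A.T @ (A @ theta - b); the minimizer solves A @ theta = b.
A = np.array([[2.0, 0.0], [0.0, 1.0]])
b = np.array([4.0, 1.0])
grad = lambda th: 2 * A.T @ (A @ th - b)
theta_star = grad_descent(grad, theta0=[0.0, 0.0])
```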
6.2 AdaGrad¶
```
Initialize θ, G = 0
for t = 1..T:
    g_t = ∇J(θ_t)
    G = G + g_t ⊙ g_t
    θ_{t+1} = θ_t - η * g_t / (sqrt(G) + ε)
return θ_T
```
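A NumPy version of the AdaGrad loop, run on an assumed anisotropic quadratic (steep in one coordinate, flat in the other; the objective and hyperparameters are illustrative):

```python
import numpy as np

def adagrad(grad, theta0, eta=0.5, T=500, eps=1e-8):
    """AdaGrad: accumulate squared gradients; step is eta / (sqrt(G) + eps)."""
    theta = np.asarray(theta0, dtype=float)
    G = np.zeros_like(theta)
    for _ in range(T):
        g = grad(theta)
        G += g * g                             # cumulative squared gradients
        theta -= eta * g / (np.sqrt(G) + eps)  # per-parameter step size
    return theta

# Illustrative quadratic: gradient 50x larger in coordinate 0 than coordinate 1.
grad = lambda th: np.array([10.0 * th[0], 0.2 * th[1]])
theta_final = adagrad(grad, theta0=[3.0, 3.0])
```

Despite the 50x difference in gradient scale, both coordinates make comparable progress, because each is normalized by its own gradient history.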
6.3 RMSProp¶
```
Initialize θ, s = 0
for t = 1..T:
    g_t = ∇J(θ_t)
    s = ρ s + (1-ρ)(g_t ⊙ g_t)
    θ_{t+1} = θ_t - η * g_t / (sqrt(s) + ε)
return θ_T
```
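A NumPy version of the RMSProp loop, on the same kind of assumed anisotropic quadratic (objective and hyperparameters are illustrative):

```python
import numpy as np

def rmsprop(grad, theta0, eta=0.01, rho=0.9, T=2000, eps=1e-8):
    """RMSProp: an exponential moving average of squared gradients sets the scale."""
    theta = np.asarray(theta0, dtype=float)
    s = np.zeros_like(theta)
    for _ in range(T):
        g = grad(theta)
        s = rho * s + (1 - rho) * g * g        # EMA forgets old gradients
        theta -= eta * g / (np.sqrt(s) + eps)
    return theta

# Illustrative anisotropic quadratic.
grad = lambda th: np.array([10.0 * th[0], 0.2 * th[1]])
theta_final = rmsprop(grad, theta0=[3.0, 3.0])
```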
6.4 Adam¶
```
Initialize θ, m = 0, v = 0
for t = 1..T:
    g_t = ∇J(θ_t)
    m = β1 m + (1-β1) g_t
    v = β2 v + (1-β2) (g_t ⊙ g_t)
    m_hat = m / (1-β1^t)
    v_hat = v / (1-β2^t)
    θ_{t+1} = θ_t - η * m_hat / (sqrt(v_hat) + ε)
return θ_T
```
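A NumPy version of the Adam loop, again on an assumed anisotropic quadratic (objective and iteration count are illustrative):

```python
import numpy as np

def adam(grad, theta0, eta=0.001, b1=0.9, b2=0.999, T=5000, eps=1e-8):
    """Adam: bias-corrected first and second moments drive each update."""
    theta = np.asarray(theta0, dtype=float)
    m = np.zeros_like(theta)
    v = np.zeros_like(theta)
    for t in range(1, T + 1):
        g = grad(theta)
        m = b1 * m + (1 - b1) * g        # first moment (momentum)
        v = b2 * v + (1 - b2) * g * g    # second moment (scale)
        m_hat = m / (1 - b1 ** t)        # bias correction
        v_hat = v / (1 - b2 ** t)
        theta -= eta * m_hat / (np.sqrt(v_hat) + eps)
    return theta

# Illustrative anisotropic quadratic.
grad = lambda th: np.array([10.0 * th[0], 0.2 * th[1]])
theta_final = adam(grad, theta0=[3.0, 3.0])
```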
7. Real-Life ML Use Cases¶
7.1 Linear / Logistic Models¶
- GD/SGD variants train linear models at scale.
- Common in ad-click prediction, churn scoring, fraud filtering.
7.2 Deep Neural Networks¶
- Adam/RMSProp are widely used in vision, NLP, speech, recommendation.
- Practical flow: start with Adam baseline, then compare with SGD+momentum for final generalization.
7.3 Sparse Features¶
- AdaGrad historically effective in sparse settings (text and CTR features).
8. Geometric Intuition Across Optimizers¶
- GD: same step scale in every direction.
- AdaGrad: shrinking step size based on historical gradient energy.
- RMSProp: adaptive scale with forgetting (recent curvature emphasis).
- Adam: momentum + adaptive scaling; smoother trajectory through ravines/saddle regions.
Think of a ball descending a bowl:
- momentum helps pass shallow flat zones,
- adaptive scaling prevents excessive zig-zag in steep axes.
9. Measurement and Evaluation During Optimization¶
Track these while training:
1. training loss curve vs iterations/epochs,
2. validation loss and target metric (accuracy/F1/AUC/RMSE),
3. gradient norm \(\|\nabla J\|\) (convergence diagnostics),
4. update norm \(\|\Delta\theta\|\) (step stability),
5. wall-clock time to target metric,
6. robustness across random seeds.

Convergence signals:
- loss plateau,
- small gradient norms,
- stable validation metric.
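These diagnostics can be logged directly inside the training loop. A minimal sketch, using plain gradient descent and an assumed quadratic objective (function and field names are illustrative):

```python
import numpy as np

def train_with_diagnostics(grad, theta0, eta=0.1, T=100, tol=1e-6):
    """Gradient descent that records gradient and update norms each step."""
    theta = np.asarray(theta0, dtype=float)
    history = []
    for t in range(T):
        g = grad(theta)
        step = eta * g
        history.append({
            "iter": t,
            "grad_norm": float(np.linalg.norm(g)),      # convergence diagnostic
            "update_norm": float(np.linalg.norm(step)),  # step stability
        })
        theta -= step
        if history[-1]["grad_norm"] < tol:  # small gradient norm: stop early
            break
    return theta, history

# Illustrative objective: J(theta) = ||theta||^2, gradient 2*theta.
grad = lambda th: 2 * th
theta, hist = train_with_diagnostics(grad, theta0=[4.0, -2.0])
```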
10. Smart Technical Tricks¶
- Normalize/standardize inputs before optimization.
- Use mini-batch training for speed/stability trade-off.
- Apply learning-rate schedules (decay, cosine, warmup where needed).
- Use gradient clipping for unstable deep models.
- Compare at least two optimizers before final model selection.
- Prefer metric-driven selection, not only faster loss decrease.
- Use early stopping with validation to reduce overfitting risk.
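Two of these tricks reduce to a few lines. The sketch below shows norm-based gradient clipping and one common schedule form (inverse-time decay); the constants are illustrative choices, not prescriptions:

```python
import numpy as np

def clip_by_norm(g, max_norm=1.0):
    """Rescale the gradient if its norm exceeds max_norm (gradient clipping)."""
    norm = np.linalg.norm(g)
    return g * (max_norm / norm) if norm > max_norm else g

def decayed_lr(eta0, t, decay=0.01):
    """Inverse-time learning-rate decay: eta_t = eta0 / (1 + decay * t)."""
    return eta0 / (1.0 + decay * t)
```

In a training loop, one would apply `clip_by_norm` to each gradient before the update and use `decayed_lr(eta0, t)` in place of a fixed \(\eta\).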
11. Quick Comparison Table¶
| Optimizer | Core idea | Strength | Limitation |
|---|---|---|---|
| GD | Global fixed learning rate | Simple and stable on convex tasks | Slow; sensitive to LR; same step for all parameters |
| AdaGrad | Per-parameter scaling via cumulative squared gradients | Strong for sparse features | Learning rate can become too small |
| RMSProp | EMA of squared gradients | Fixes AdaGrad decay issue | Hyperparameter sensitivity remains |
| Adam | Momentum + RMSProp + bias correction | Strong default for deep learning | Not always best generalization on every task |
12. Next Practical Steps¶
- Implement all four optimizers on the same task and compare:
  - final validation metric,
  - time-to-threshold,
  - stability over seeds.
- Tune learning rate first, then momentum/beta terms.
- Add scheduler + early stopping and re-compare.
- Choose optimizer by validation outcomes, not popularity.
13. Emerging Research Topics from arXiv (Useful for Modern ML)¶
13.1 AdamW (Decoupled Weight Decay)¶
Key idea: separate weight decay from gradient-based Adam update instead of mixing it into gradient penalty.
Why useful:
- often improves generalization over naive L2-with-Adam implementations,
- decouples regularization tuning from learning-rate scaling more cleanly.
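A single-step sketch of the decoupled update, assuming Adam-style moments (function and hyperparameter names are illustrative):

```python
import numpy as np

def adamw_step(theta, g, m, v, t, eta=0.001, b1=0.9, b2=0.999,
               eps=1e-8, wd=0.01):
    """One AdamW step: weight decay (wd) acts directly on theta and is
    never folded into the gradient, so it bypasses the adaptive scaling."""
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g * g
    m_hat = m / (1 - b1 ** t)
    v_hat = v / (1 - b2 ** t)
    theta = theta - eta * (m_hat / (np.sqrt(v_hat) + eps) + wd * theta)
    return theta, m, v
```

With a zero gradient, the weights still shrink by the factor \(1 - \eta \cdot wd\), which is exactly the behavior that an L2 penalty mixed into the gradient would distort.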
13.2 AMSGrad (Convergence Fix for Adam Family)¶
Key idea: enforce a non-increasing effective second-moment denominator via max-tracking.
Why useful:
- addresses known convergence pathology of Adam in some settings,
- important when theoretical convergence guarantees matter.
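A single-step sketch of the max-tracking fix; bias correction is omitted here for brevity, as in the simplified AMSGrad presentation (names are illustrative):

```python
import numpy as np

def amsgrad_step(theta, g, m, v, v_max, t, eta=0.001, b1=0.9, b2=0.999, eps=1e-8):
    """One AMSGrad step: v_max holds the running maximum of the second
    moment, so the effective per-parameter step size never increases."""
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g * g
    v_max = np.maximum(v_max, v)               # the max-tracking fix
    theta = theta - eta * m / (np.sqrt(v_max) + eps)
    return theta, m, v, v_max
```

Even when gradients shrink and the EMA `v` decays, `v_max` stays at its peak, preventing the denominator from collapsing and the steps from growing again.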
13.3 Adafactor (Memory-Efficient Adaptive Optimization)¶
Key idea: reduce optimizer-state memory with factored second-moment estimates.
Why useful:
- large transformer-scale models where optimizer memory is a bottleneck,
- enables training larger models under fixed hardware constraints.
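The factored estimate can be illustrated in isolation. This simplified sketch (using mean statistics rather than Adafactor's exact sum-based normalization) shows how a full second-moment matrix is rebuilt from one row vector and one column vector:

```python
import numpy as np

def factored_second_moment(R, C, g, b2=0.999):
    """Adafactor-style factored estimate for a matrix parameter: keep only
    row and column EMAs of g^2 (O(n+m) memory instead of O(n*m)) and
    reconstruct the full matrix from their outer product."""
    R = b2 * R + (1 - b2) * (g * g).mean(axis=1)  # per-row statistics
    C = b2 * C + (1 - b2) * (g * g).mean(axis=0)  # per-column statistics
    V = np.outer(R, C) / R.mean()                 # rank-1 reconstruction
    return R, C, V
```

When the true squared-gradient matrix happens to be rank-1, this reconstruction is exact; otherwise it is a memory-cheap approximation.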
13.4 Lion (Sign-Based Optimizer Discovery)¶
Key idea: optimizer discovered via symbolic search using sign-based momentum updates.
Why useful:
- lower state overhead than Adam-like methods,
- competitive performance reported in several large-scale workloads.
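A single-step sketch of the sign-based update with one momentum buffer (the constants here are illustrative defaults, not tuned values):

```python
import numpy as np

def lion_step(theta, g, m, eta=1e-4, b1=0.9, b2=0.99, wd=0.0):
    """One Lion-style step: the update direction is the sign of an
    interpolated momentum, so only one state tensor (m) is kept."""
    update = np.sign(b1 * m + (1 - b1) * g)  # sign-based direction
    theta = theta - eta * (update + wd * theta)
    m = b2 * m + (1 - b2) * g                # momentum buffer update
    return theta, m
```

Because the direction is a sign vector, every parameter moves by at most \(\eta\) per step regardless of gradient magnitude, and no second-moment state is stored.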
13.5 Sharpness-Aware Minimization (SAM)¶
Key idea: optimize parameters that remain good under local perturbations, not just pointwise loss minima.
Why useful:
- often improves generalization by preferring flatter minima,
- practical when validation gap is a primary concern.
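A one-step SAM sketch: ascend to a nearby worst-case point, take the gradient there, then update the original weights (step sizes are illustrative):

```python
import numpy as np

def sam_step(grad, theta, eta=0.1, rho=0.05):
    """One SAM step: perturb theta in the gradient direction by radius rho,
    then descend using the gradient measured at the perturbed point."""
    g = grad(theta)
    perturb = rho * g / (np.linalg.norm(g) + 1e-12)  # adversarial perturbation
    g_sharp = grad(theta + perturb)                  # gradient at worst-case point
    return theta - eta * g_sharp
```

Using the gradient at the perturbed point penalizes parameters whose loss rises sharply under small perturbations, biasing the trajectory toward flatter minima.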
13.6 Practical Recommendation from Research¶
For modern deep-learning baselines, compare at least:
1. AdamW
2. SGD + momentum
3. one memory-efficient alternative (Adafactor/Lion where applicable)
4. an optional SAM overlay when generalization is weak
References¶
- MachineLearningMastery: Gradient Descent With Momentum from Scratch
- MachineLearningMastery: Gradient Descent With RMSProp from Scratch
- MachineLearningMastery: Code Adam Optimization Algorithm From Scratch
- Google for Developers: Gradient Descent (ML Crash Course)
- TensorFlow: `tf.keras.optimizers` module
- TensorFlow: `tf.keras.optimizers.RMSprop`
- IBM: What is Gradient Descent?
- IBM Research: Adaptive Subgradient Methods (AdaGrad), COLT 2010
- Adam paper (arXiv)
- PyTorch Optimizers Overview
- PyTorch Adam documentation
- Kaggle example notebook: Linear Regression from Scratch (Gradient Descent)
- Decoupled Weight Decay Regularization (AdamW), arXiv:1711.05101
- On the Convergence of Adam and Beyond (AMSGrad), arXiv:1904.09237
- Adafactor: Adaptive Learning Rates with Sublinear Memory Cost, arXiv:1804.04235
- Symbolic Discovery of Optimization Algorithms (Lion), arXiv:2302.06675
- Sharpness-Aware Minimization for Efficiently Improving Generalization, arXiv:2010.01412