
Unconstrained Optimization and Adaptive Optimizers for ML

This chapter covers unconstrained optimization in machine learning, from classical Gradient Descent to adaptive methods such as AdaGrad, RMSProp, and Adam.


1. What Is Unconstrained Optimization?

In unconstrained optimization, we minimize an objective without explicit feasibility constraints:

\[ \min_{\theta} J(\theta) \]

where:

  • \(\theta\) = model parameters,
  • \(J(\theta)\) = loss (objective) function.

In ML, training a model typically means solving this optimization problem iteratively.


2. Gradient Descent (GD)

2.1 Definition

Gradient Descent follows the negative gradient direction to reduce loss.

Update rule:

\[ \theta_{t+1} = \theta_t - \eta \nabla J(\theta_t) \]

where \(\eta\) is the learning rate.

2.2 Challenge of Gradient Descent

A key limitation is that GD uses the same learning rate for all parameters.

Implications:

  • slow progress in flat directions,
  • oscillation in steep directions,
  • sensitive tuning of \(\eta\).

2.3 Geometric Intuition

Imagine descending a valley:

  • too small \(\eta\): tiny, slow steps,
  • too large \(\eta\): bouncing/overshoot,
  • anisotropic curvature: zig-zag trajectories.
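The update rule above can be sketched in a few lines of NumPy. The quadratic objective and all names here are illustrative choices, not from this chapter; the matrix is deliberately ill-conditioned to show the slow-axis/steep-axis behavior:

```python
import numpy as np

# Gradient descent on J(theta) = 0.5 * theta^T A theta, whose gradient is A theta.
A = np.diag([1.0, 25.0])            # anisotropic curvature (toy example)
theta = np.array([1.0, 1.0])
eta = 0.03                          # must satisfy eta < 2/25 for stability here

for t in range(500):
    g = A @ theta                   # gradient at the current iterate
    theta = theta - eta * g         # GD update

print(theta)                        # approaches the minimizer [0, 0]
```

Raising \(\eta\) above \(2/25\) makes the steep axis diverge, which is exactly the learning-rate sensitivity described above.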


3. AdaGrad

3.1 Definition

AdaGrad adapts step size per-parameter using accumulated squared gradients.

Per-parameter accumulator:

\[ G_{t,i}=\sum_{\tau=1}^{t} g_{\tau,i}^2 \]

Update:

\[ \theta_{t+1,i}=\theta_{t,i}-\frac{\eta}{\sqrt{G_{t,i}}+\epsilon}g_{t,i} \]

3.2 Why It Helps

  • frequent/high-gradient parameters get smaller updates,
  • infrequent/sparse parameters get relatively larger updates.

This is useful in sparse-feature settings (for example text features).

3.3 Challenge of AdaGrad

Because \(G_{t,i}\) keeps growing, effective learning rates can shrink too much over time, causing premature slow convergence.
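A minimal NumPy sketch of the accumulator and update above, on an assumed toy quadratic (all names and values illustrative):

```python
import numpy as np

# AdaGrad on J(theta) = 0.5 * theta^T A theta (toy problem).
A = np.diag([1.0, 25.0])
theta = np.array([1.0, 1.0])
eta, eps = 0.5, 1e-8
G = np.zeros(2)                     # per-parameter accumulated squared gradients

for t in range(2000):
    g = A @ theta
    G += g * g                      # G only grows, so eta/sqrt(G) only shrinks
    theta = theta - eta * g / (np.sqrt(G) + eps)

print(eta / (np.sqrt(G) + eps))     # effective per-parameter step sizes
```

The printed effective step sizes differ across the two parameters (the high-gradient coordinate gets the smaller step), and both can only decrease over time, which is the decay issue noted in 3.3.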


4. RMSProp (Root Mean Squared Propagation)

4.1 Definition

RMSProp replaces AdaGrad's cumulative sum with an exponential moving average of squared gradients.

Second-moment tracker:

\[ s_t = \rho s_{t-1} + (1-\rho)g_t^2 \]

Update:

\[ \theta_{t+1}=\theta_t-\frac{\eta}{\sqrt{s_t}+\epsilon}g_t \]

4.2 Why RMSProp Solves AdaGrad’s Decay Issue

The moving average forgets very old gradients, so step sizes do not collapse as aggressively as they do under AdaGrad.

4.3 Use Cases

Commonly effective for non-stationary objectives and deep networks where gradient scales vary across parameters.
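The tracker and update above can be sketched in NumPy on the same style of toy quadratic (setup and constants are illustrative):

```python
import numpy as np

# RMSProp on J(theta) = 0.5 * theta^T A theta (toy problem).
A = np.diag([1.0, 25.0])
theta = np.array([1.0, 1.0])
eta, rho, eps = 0.05, 0.9, 1e-8
s = np.zeros(2)                     # EMA of squared gradients

for t in range(2000):
    g = A @ theta
    s = rho * s + (1 - rho) * g * g # forgets old gradients, unlike AdaGrad's G
    theta = theta - eta * g / (np.sqrt(s) + eps)

loss = 0.5 * theta @ A @ theta
```

One caveat visible in this sketch: with a fixed \(\eta\), RMSProp tends to chatter near the optimum at a scale proportional to \(\eta\), so it is often paired with a learning-rate decay schedule.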


5. Adam (Adaptive Moment Estimation)

5.1 Definition

Adam combines:

  • momentum-like first moment,
  • RMSProp-like second moment,
  • bias correction.

First moment:

\[ m_t = \beta_1 m_{t-1} + (1-\beta_1)g_t \]

Second moment:

\[ v_t = \beta_2 v_{t-1} + (1-\beta_2)g_t^2 \]

Bias-corrected estimates:

\[ \hat m_t = \frac{m_t}{1-\beta_1^t}, \qquad \hat v_t = \frac{v_t}{1-\beta_2^t} \]

Update:

\[ \theta_{t+1}=\theta_t-\eta\frac{\hat m_t}{\sqrt{\hat v_t}+\epsilon} \]

Typical defaults:

\[ \beta_1=0.9,\;\beta_2=0.999,\;\epsilon\approx10^{-8} \]

5.2 Strengths

  • robust default behavior,
  • adaptive per-parameter updates,
  • fast convergence in many deep learning workloads,
  • handles noisy gradients reasonably well.

5.3 Practical Caveat

Adam is a strong default but not always best for generalization on every task; testing against SGD(+momentum) remains good practice.
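Putting the four formulas above together on an assumed toy quadratic (names and constants are illustrative):

```python
import numpy as np

# Adam on J(theta) = 0.5 * theta^T A theta (toy problem), with bias correction.
A = np.diag([1.0, 25.0])
theta = np.array([1.0, 1.0])
eta, b1, b2, eps = 0.05, 0.9, 0.999, 1e-8
m, v = np.zeros(2), np.zeros(2)

for t in range(1, 2001):            # t starts at 1 so 1 - beta^t is nonzero
    g = A @ theta
    m = b1 * m + (1 - b1) * g       # first moment (momentum-like)
    v = b2 * v + (1 - b2) * g * g   # second moment (RMSProp-like)
    m_hat = m / (1 - b1 ** t)       # bias correction
    v_hat = v / (1 - b2 ** t)
    theta = theta - eta * m_hat / (np.sqrt(v_hat) + eps)

loss = 0.5 * theta @ A @ theta
```

The bias correction matters early: at \(t=1\), \(m\) and \(v\) are scaled-down versions of \(g\) and \(g^2\), and dividing by \(1-\beta^t\) undoes that initialization bias.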


6. Pseudocode (Professor-Style)

6.1 Gradient Descent

Initialize θ
for t = 1..T:
    g_t = ∇J(θ_t)
    θ_{t+1} = θ_t - η g_t
return θ_T

6.2 AdaGrad

Initialize θ, G = 0
for t = 1..T:
    g_t = ∇J(θ_t)
    G = G + g_t ⊙ g_t
    θ_{t+1} = θ_t - η * g_t / (sqrt(G) + ε)
return θ_T

6.3 RMSProp

Initialize θ, s = 0
for t = 1..T:
    g_t = ∇J(θ_t)
    s = ρ s + (1-ρ)(g_t ⊙ g_t)
    θ_{t+1} = θ_t - η * g_t / (sqrt(s) + ε)
return θ_T

6.4 Adam

Initialize θ, m = 0, v = 0
for t = 1..T:
    g_t = ∇J(θ_t)
    m = β1 m + (1-β1) g_t
    v = β2 v + (1-β2) (g_t ⊙ g_t)
    m_hat = m / (1-β1^t)
    v_hat = v / (1-β2^t)
    θ_{t+1} = θ_t - η * m_hat / (sqrt(v_hat) + ε)
return θ_T

7. Real-Life ML Use Cases

7.1 Linear / Logistic Models

  • GD/SGD variants train linear models at scale.
  • Common in ad-click prediction, churn scoring, fraud filtering.

7.2 Deep Neural Networks

  • Adam/RMSProp are widely used in vision, NLP, speech, recommendation.
  • Practical flow: start with Adam baseline, then compare with SGD+momentum for final generalization.

7.3 Sparse Features

  • AdaGrad historically effective in sparse settings (text and CTR features).

8. Geometric Intuition Across Optimizers

  • GD: same step scale in every direction.
  • AdaGrad: shrinking step size based on historical gradient energy.
  • RMSProp: adaptive scale with forgetting (recent curvature emphasis).
  • Adam: momentum + adaptive scaling; smoother trajectory through ravines/saddle regions.

Think of a ball descending a bowl:

  • momentum helps pass shallow flat zones,
  • adaptive scaling prevents excessive zig-zag in steep axes.


9. Measurement and Evaluation During Optimization

Track these while training:

  1. training loss curve vs iterations/epochs,
  2. validation loss and target metric (accuracy/F1/AUC/RMSE),
  3. gradient norm \(\|\nabla J\|\) (convergence diagnostics),
  4. update norm \(\|\Delta\theta\|\) (step stability),
  5. wall-clock time to target metric,
  6. robustness across random seeds.

Convergence signals:

  • loss plateau,
  • small gradient norms,
  • stable validation metric.
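These diagnostics can be logged from any training loop; here is a hypothetical sketch with plain gradient descent on a toy quadratic (objective and names are assumptions for illustration):

```python
import numpy as np

# Logging convergence diagnostics (loss, gradient norm, update norm) per step.
A = np.diag([1.0, 25.0])
theta = np.array([1.0, 1.0])
eta = 0.03
history = []

for t in range(200):
    g = A @ theta
    step = eta * g
    theta = theta - step
    history.append({
        "loss": float(0.5 * theta @ A @ theta),
        "grad_norm": float(np.linalg.norm(g)),       # small norm: near a stationary point
        "update_norm": float(np.linalg.norm(step)),  # large/spiky norm: step instability
    })
```

Plotting these three curves together usually makes plateaus and instabilities visible well before the validation metric reacts.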


10. Smart Technical Tricks

  1. Normalize/standardize inputs before optimization.
  2. Use mini-batch training for speed/stability trade-off.
  3. Apply learning-rate schedules (decay, cosine, warmup where needed).
  4. Use gradient clipping for unstable deep models.
  5. Compare at least two optimizers before final model selection.
  6. Prefer metric-driven selection, not only faster loss decrease.
  7. Use early stopping with validation to reduce overfitting risk.
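Tricks 3 and 4 can be sketched as small helpers. The function names, default values, and schedule shape below are illustrative assumptions, not recommendations from this chapter:

```python
import math
import numpy as np

def lr_at(step, base_lr=1e-3, warmup=100, total=1000):
    """Hypothetical linear-warmup + cosine-decay schedule (trick 3)."""
    if step < warmup:
        return base_lr * (step + 1) / warmup               # linear ramp up
    progress = (step - warmup) / max(1, total - warmup)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))  # cosine decay to ~0

def clip_by_norm(g, max_norm=1.0):
    """Global-norm gradient clipping (trick 4): rescale if the norm exceeds max_norm."""
    n = np.linalg.norm(g)
    return g if n <= max_norm else g * (max_norm / n)
```

Usage is a one-liner per step: look up `lr_at(t)` before the update, and pass the gradient through `clip_by_norm` before applying it.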

11. Quick Comparison Table

Optimizer | Core idea | Strength | Limitation
GD | Global fixed learning rate | Simple and stable on convex tasks | Slow; sensitive to LR; same step for all parameters
AdaGrad | Per-parameter scaling via cumulative squared gradients | Strong for sparse features | Learning rate can become too small
RMSProp | EMA of squared gradients | Fixes AdaGrad decay issue | Hyperparameter sensitivity remains
Adam | Momentum + RMSProp + bias correction | Strong default for deep learning | Not always best generalization on every task

12. Next Practical Steps

  1. Implement all four optimizers on the same task and compare:
     • final validation metric,
     • time-to-threshold,
     • stability over seeds.
  2. Tune learning rate first, then momentum/beta terms.
  3. Add scheduler + early stopping and re-compare.
  4. Choose optimizer by validation outcomes, not popularity.

13. Emerging Research Topics from arXiv (Useful for Modern ML)

13.1 AdamW (Decoupled Weight Decay)

Key idea: separate weight decay from gradient-based Adam update instead of mixing it into gradient penalty.

Why useful:

  • often improves generalization over naive L2-with-Adam implementations,
  • decouples regularization tuning from learning-rate scaling more cleanly.
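The decoupling can be seen in a single-step sketch. The function name and defaults below are illustrative assumptions; the key point is that the decay term multiplies \(\theta\) directly instead of being folded into the gradient:

```python
import numpy as np

def adamw_step(theta, g, m, v, t, eta=1e-3, b1=0.9, b2=0.999, eps=1e-8, wd=0.01):
    """One AdamW-style (decoupled weight decay) update; illustrative sketch."""
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g * g
    m_hat = m / (1 - b1 ** t)
    v_hat = v / (1 - b2 ** t)
    # Decay acts on the weights directly, not as an L2 term mixed into g.
    theta = theta - eta * (m_hat / (np.sqrt(v_hat) + eps) + wd * theta)
    return theta, m, v
```

With \(g = 0\) the update reduces to pure shrinkage \(\theta \leftarrow \theta(1 - \eta\,\mathrm{wd})\), untouched by the adaptive scaling; in naive L2-with-Adam the decay would instead be divided by \(\sqrt{\hat v_t}\).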

13.2 AMSGrad (Convergence Fix for Adam Family)

Key idea: enforce a non-increasing effective second-moment denominator via max-tracking.

Why useful:

  • addresses known convergence pathology of Adam in some settings,
  • important when theoretical convergence guarantees matter.

13.3 Adafactor (Memory-Efficient Adaptive Optimization)

Key idea: reduce optimizer-state memory with factored second-moment estimates.

Why useful:

  • large transformer-scale models where optimizer memory is a bottleneck,
  • enables training larger models under fixed hardware constraints.

13.4 Lion (Sign-Based Optimizer Discovery)

Key idea: optimizer discovered via symbolic search using sign-based momentum updates.

Why useful:

  • lower state overhead than Adam-like methods,
  • competitive performance reported in several large-scale workloads.

13.5 Sharpness-Aware Minimization (SAM)

Key idea: optimize parameters that remain good under local perturbations, not just pointwise loss minima.

Why useful:

  • often improves generalization by preferring flatter minima,
  • practical when the validation gap is a primary concern.
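SAM's first-order scheme takes two gradient evaluations per step: ascend to a nearby perturbed point, then descend using the gradient measured there. The sketch below is a simplified assumption (plain SGD as the base optimizer; `grad_fn`, `rho`, and the toy quadratic are illustrative choices):

```python
import numpy as np

def sam_step(theta, grad_fn, eta=0.01, rho=0.05):
    """One SAM-style step: perturb toward the ascent direction, update with that gradient."""
    g = grad_fn(theta)
    eps_hat = rho * g / (np.linalg.norm(g) + 1e-12)  # worst-case-direction perturbation
    g_sharp = grad_fn(theta + eps_hat)               # gradient at the perturbed weights
    return theta - eta * g_sharp                     # base step here is plain SGD

# Illustrative use on a toy quadratic J(theta) = 0.5 * theta^T A theta:
A = np.diag([1.0, 25.0])
theta = np.array([1.0, 1.0])
for _ in range(1000):
    theta = sam_step(theta, lambda th: A @ th)
```

The cost is roughly double that of the base optimizer per step, which is the usual trade-off when adding SAM as an overlay.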

13.6 Practical Recommendation from Research

For modern deep-learning baselines, compare at least:

  1. AdamW
  2. SGD + momentum
  3. One memory-efficient alternative (Adafactor/Lion where applicable)
  4. Optional SAM overlay when generalization is weak

