
Unconstrained Optimization and Adaptive Optimizers for ML

This chapter covers unconstrained optimization in machine learning, from classical Gradient Descent to adaptive methods such as AdaGrad, RMSProp, and Adam.


1. What Is Unconstrained Optimization?

In unconstrained optimization, we minimize an objective without explicit feasibility constraints:

\[ \min_{\theta} J(\theta) \]

where:

  • \(\theta\) = model parameters,
  • \(J(\theta)\) = loss (objective) function.

In ML, training a model typically means solving this optimization problem iteratively.


2. Gradient Descent (GD)

2.1 Definition

Gradient Descent follows the negative gradient direction to reduce loss.

Update rule:

\[ \theta_{t+1} = \theta_t - \eta \nabla J(\theta_t) \]

where \(\eta\) is the learning rate.

2.2 Challenge of Gradient Descent

A key limitation is that GD uses the same learning rate for all parameters.

Implications:

  • slow progress in flat directions,
  • oscillation in steep directions,
  • sensitive tuning of \(\eta\).

2.3 Geometric Intuition

Imagine descending a valley:

  • too small \(\eta\): tiny, slow steps,
  • too large \(\eta\): bouncing/overshoot,
  • anisotropic curvature: zig-zag trajectories.
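The update rule above can be sketched in a few lines of NumPy. The quadratic objective and all names here are illustrative choices, not from this chapter; the matrix is deliberately ill-conditioned to show the slow-axis/steep-axis behavior:

```python
import numpy as np

# Gradient descent on J(theta) = 0.5 * theta^T A theta, whose gradient is A theta.
A = np.diag([1.0, 25.0])            # anisotropic curvature (toy example)
theta = np.array([1.0, 1.0])
eta = 0.03                          # must satisfy eta < 2/25 for stability here

for t in range(500):
    g = A @ theta                   # gradient at the current iterate
    theta = theta - eta * g         # GD update

print(theta)                        # approaches the minimizer [0, 0]
```

Raising \(\eta\) above \(2/25\) makes the steep axis diverge, which is exactly the learning-rate sensitivity described above.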


3. AdaGrad

3.1 Definition

AdaGrad adapts step size per-parameter using accumulated squared gradients.

Per-parameter accumulator:

\[ G_{t,i}=\sum_{\tau=1}^{t} g_{\tau,i}^2 \]

Update:

\[ \theta_{t+1,i}=\theta_{t,i}-\frac{\eta}{\sqrt{G_{t,i}}+\epsilon}g_{t,i} \]

3.2 Why It Helps

  • frequent/high-gradient parameters get smaller updates,
  • infrequent/sparse parameters get relatively larger updates.

This is useful in sparse-feature settings (for example text features).

3.3 Challenge of AdaGrad

Because \(G_{t,i}\) keeps growing, effective learning rates can shrink too much over time, causing premature slow convergence.
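A minimal NumPy sketch of the accumulator and update above, on an assumed toy quadratic (all names and values illustrative):

```python
import numpy as np

# AdaGrad on J(theta) = 0.5 * theta^T A theta (toy problem).
A = np.diag([1.0, 25.0])
theta = np.array([1.0, 1.0])
eta, eps = 0.5, 1e-8
G = np.zeros(2)                     # per-parameter accumulated squared gradients

for t in range(2000):
    g = A @ theta
    G += g * g                      # G only grows, so eta/sqrt(G) only shrinks
    theta = theta - eta * g / (np.sqrt(G) + eps)

print(eta / (np.sqrt(G) + eps))     # effective per-parameter step sizes
```

The printed effective step sizes differ across the two parameters (the high-gradient coordinate gets the smaller step), and both can only decrease over time, which is the decay issue noted in 3.3.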


4. RMSProp (Root Mean Squared Propagation)

4.1 Definition

RMSProp replaces AdaGrad's cumulative sum with an exponential moving average of squared gradients.

Second-moment tracker:

\[ s_t = \rho s_{t-1} + (1-\rho)g_t^2 \]

Update:

\[ \theta_{t+1}=\theta_t-\frac{\eta}{\sqrt{s_t}+\epsilon}g_t \]

4.2 Why RMSProp Solves AdaGrad’s Decay Issue

The moving average forgets very old gradients, so step sizes do not collapse as aggressively as they do under AdaGrad.

4.3 Use Cases

Commonly effective for non-stationary objectives and deep networks where gradient scales vary across parameters.
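The tracker and update above can be sketched in NumPy on the same style of toy quadratic (setup and constants are illustrative):

```python
import numpy as np

# RMSProp on J(theta) = 0.5 * theta^T A theta (toy problem).
A = np.diag([1.0, 25.0])
theta = np.array([1.0, 1.0])
eta, rho, eps = 0.05, 0.9, 1e-8
s = np.zeros(2)                     # EMA of squared gradients

for t in range(2000):
    g = A @ theta
    s = rho * s + (1 - rho) * g * g # forgets old gradients, unlike AdaGrad's G
    theta = theta - eta * g / (np.sqrt(s) + eps)

loss = 0.5 * theta @ A @ theta
```

One caveat visible in this sketch: with a fixed \(\eta\), RMSProp tends to chatter near the optimum at a scale proportional to \(\eta\), so it is often paired with a learning-rate decay schedule.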


5. Adam (Adaptive Moment Estimation)

5.1 Definition

Adam combines:

  • momentum-like first moment,
  • RMSProp-like second moment,
  • bias correction.

First moment:

\[ m_t = \beta_1 m_{t-1} + (1-\beta_1)g_t \]

Second moment:

\[ v_t = \beta_2 v_{t-1} + (1-\beta_2)g_t^2 \]

Bias-corrected estimates:

\[ \hat m_t = \frac{m_t}{1-\beta_1^t}, \qquad \hat v_t = \frac{v_t}{1-\beta_2^t} \]

Update:

\[ \theta_{t+1}=\theta_t-\eta\frac{\hat m_t}{\sqrt{\hat v_t}+\epsilon} \]

Typical defaults:

\[ \beta_1=0.9,\;\beta_2=0.999,\;\epsilon\approx10^{-8} \]

5.2 Strengths

  • robust default behavior,
  • adaptive per-parameter updates,
  • fast convergence in many deep learning workloads,
  • handles noisy gradients reasonably well.

5.3 Practical Caveat

Adam is a strong default but not always best for generalization on every task; testing against SGD(+momentum) remains good practice.
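Putting the four formulas above together on an assumed toy quadratic (names and constants are illustrative):

```python
import numpy as np

# Adam on J(theta) = 0.5 * theta^T A theta (toy problem), with bias correction.
A = np.diag([1.0, 25.0])
theta = np.array([1.0, 1.0])
eta, b1, b2, eps = 0.05, 0.9, 0.999, 1e-8
m, v = np.zeros(2), np.zeros(2)

for t in range(1, 2001):            # t starts at 1 so 1 - beta^t is nonzero
    g = A @ theta
    m = b1 * m + (1 - b1) * g       # first moment (momentum-like)
    v = b2 * v + (1 - b2) * g * g   # second moment (RMSProp-like)
    m_hat = m / (1 - b1 ** t)       # bias correction
    v_hat = v / (1 - b2 ** t)
    theta = theta - eta * m_hat / (np.sqrt(v_hat) + eps)

loss = 0.5 * theta @ A @ theta
```

The bias correction matters early: at \(t=1\), \(m\) and \(v\) are scaled-down versions of \(g\) and \(g^2\), and dividing by \(1-\beta^t\) undoes that initialization bias.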


6. Pseudocode (Professor-Style)

6.1 Gradient Descent

Initialize θ
for t = 1..T:
    g_t = ∇J(θ_t)
    θ_{t+1} = θ_t - η g_t
return θ_T

6.2 AdaGrad

Initialize θ, G = 0
for t = 1..T:
    g_t = ∇J(θ_t)
    G = G + g_t ⊙ g_t
    θ_{t+1} = θ_t - η * g_t / (sqrt(G) + ε)
return θ_T

6.3 RMSProp

Initialize θ, s = 0
for t = 1..T:
    g_t = ∇J(θ_t)
    s = ρ s + (1-ρ)(g_t ⊙ g_t)
    θ_{t+1} = θ_t - η * g_t / (sqrt(s) + ε)
return θ_T

6.4 Adam

Initialize θ, m = 0, v = 0
for t = 1..T:
    g_t = ∇J(θ_t)
    m = β1 m + (1-β1) g_t
    v = β2 v + (1-β2) (g_t ⊙ g_t)
    m_hat = m / (1-β1^t)
    v_hat = v / (1-β2^t)
    θ_{t+1} = θ_t - η * m_hat / (sqrt(v_hat) + ε)
return θ_T

7. Real-Life ML Use Cases

7.1 Linear / Logistic Models

  • GD/SGD variants train linear models at scale.
  • Common in ad-click prediction, churn scoring, fraud filtering.

7.2 Deep Neural Networks

  • Adam/RMSProp are widely used in vision, NLP, speech, recommendation.
  • Practical flow: start with Adam baseline, then compare with SGD+momentum for final generalization.

7.3 Sparse Features

  • AdaGrad historically effective in sparse settings (text and CTR features).

8. Geometric Intuition Across Optimizers

  • GD: same step scale in every direction.
  • AdaGrad: shrinking step size based on historical gradient energy.
  • RMSProp: adaptive scale with forgetting (recent curvature emphasis).
  • Adam: momentum + adaptive scaling; smoother trajectory through ravines/saddle regions.

Think of a ball descending a bowl:

  • momentum helps pass shallow flat zones,
  • adaptive scaling prevents excessive zig-zag in steep axes.


9. Measurement and Evaluation During Optimization

Track these while training:

  1. training loss curve vs iterations/epochs,
  2. validation loss and target metric (accuracy/F1/AUC/RMSE),
  3. gradient norm \(\|\nabla J\|\) (convergence diagnostics),
  4. update norm \(\|\Delta\theta\|\) (step stability),
  5. wall-clock time to target metric,
  6. robustness across random seeds.

Convergence signals:

  • loss plateau,
  • small gradient norms,
  • stable validation metric.
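These diagnostics can be logged from any training loop; here is a hypothetical sketch with plain gradient descent on a toy quadratic (objective and names are assumptions for illustration):

```python
import numpy as np

# Logging convergence diagnostics (loss, gradient norm, update norm) per step.
A = np.diag([1.0, 25.0])
theta = np.array([1.0, 1.0])
eta = 0.03
history = []

for t in range(200):
    g = A @ theta
    step = eta * g
    theta = theta - step
    history.append({
        "loss": float(0.5 * theta @ A @ theta),
        "grad_norm": float(np.linalg.norm(g)),       # small norm: near a stationary point
        "update_norm": float(np.linalg.norm(step)),  # large/spiky norm: step instability
    })
```

Plotting these three curves together usually makes plateaus and instabilities visible well before the validation metric reacts.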


10. Smart Technical Tricks

  1. Normalize/standardize inputs before optimization.
  2. Use mini-batch training for speed/stability trade-off.
  3. Apply learning-rate schedules (decay, cosine, warmup where needed).
  4. Use gradient clipping for unstable deep models.
  5. Compare at least two optimizers before final model selection.
  6. Prefer metric-driven selection, not only faster loss decrease.
  7. Use early stopping with validation to reduce overfitting risk.
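Tricks 3 and 4 can be sketched as small helpers. The function names, default values, and schedule shape below are illustrative assumptions, not recommendations from this chapter:

```python
import math
import numpy as np

def lr_at(step, base_lr=1e-3, warmup=100, total=1000):
    """Hypothetical linear-warmup + cosine-decay schedule (trick 3)."""
    if step < warmup:
        return base_lr * (step + 1) / warmup               # linear ramp up
    progress = (step - warmup) / max(1, total - warmup)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))  # cosine decay to ~0

def clip_by_norm(g, max_norm=1.0):
    """Global-norm gradient clipping (trick 4): rescale if the norm exceeds max_norm."""
    n = np.linalg.norm(g)
    return g if n <= max_norm else g * (max_norm / n)
```

Usage is a one-liner per step: look up `lr_at(t)` before the update, and pass the gradient through `clip_by_norm` before applying it.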

11. Quick Comparison Table

Optimizer | Core idea | Strength | Limitation
GD | Global fixed learning rate | Simple and stable on convex tasks | Slow; sensitive to LR; same step for all parameters
AdaGrad | Per-parameter scaling via cumulative squared gradients | Strong for sparse features | Learning rate can become too small
RMSProp | EMA of squared gradients | Fixes AdaGrad decay issue | Hyperparameter sensitivity remains
Adam | Momentum + RMSProp + bias correction | Strong default for deep learning | Not always best generalization on every task

12. Next Practical Steps

  1. Implement all four optimizers on the same task and compare:
     • final validation metric,
     • time-to-threshold,
     • stability over seeds.
  2. Tune learning rate first, then momentum/beta terms.
  3. Add scheduler + early stopping and re-compare.
  4. Choose optimizer by validation outcomes, not popularity.

13. Emerging Research Topics from arXiv (Useful for Modern ML)

13.1 AdamW (Decoupled Weight Decay)

Key idea: separate weight decay from gradient-based Adam update instead of mixing it into gradient penalty.

Why useful:

  • often improves generalization over naive L2-with-Adam implementations,
  • decouples regularization tuning from learning-rate scaling more cleanly.
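The decoupling can be seen in a single-step sketch. The function name and defaults below are illustrative assumptions; the key point is that the decay term multiplies \(\theta\) directly instead of being folded into the gradient:

```python
import numpy as np

def adamw_step(theta, g, m, v, t, eta=1e-3, b1=0.9, b2=0.999, eps=1e-8, wd=0.01):
    """One AdamW-style (decoupled weight decay) update; illustrative sketch."""
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g * g
    m_hat = m / (1 - b1 ** t)
    v_hat = v / (1 - b2 ** t)
    # Decay acts on the weights directly, not as an L2 term mixed into g.
    theta = theta - eta * (m_hat / (np.sqrt(v_hat) + eps) + wd * theta)
    return theta, m, v
```

With \(g = 0\) the update reduces to pure shrinkage \(\theta \leftarrow \theta(1 - \eta\,\mathrm{wd})\), untouched by the adaptive scaling; in naive L2-with-Adam the decay would instead be divided by \(\sqrt{\hat v_t}\).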

13.2 AMSGrad (Convergence Fix for Adam Family)

Key idea: enforce a non-increasing effective second-moment denominator via max-tracking.

Why useful:

  • addresses known convergence pathology of Adam in some settings,
  • important when theoretical convergence guarantees matter.

13.3 Adafactor (Memory-Efficient Adaptive Optimization)

Key idea: reduce optimizer-state memory with factored second-moment estimates.

Why useful:

  • large transformer-scale models where optimizer memory is a bottleneck,
  • enables training larger models under fixed hardware constraints.

13.4 Lion (Sign-Based Optimizer Discovery)

Key idea: optimizer discovered via symbolic search using sign-based momentum updates.

Why useful:

  • lower state overhead than Adam-like methods,
  • competitive performance reported in several large-scale workloads.

13.5 Sharpness-Aware Minimization (SAM)

Key idea: optimize parameters that remain good under local perturbations, not just pointwise loss minima.

Why useful:

  • often improves generalization by preferring flatter minima,
  • practical when the validation gap is a primary concern.
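SAM's first-order scheme takes two gradient evaluations per step: ascend to a nearby perturbed point, then descend using the gradient measured there. The sketch below is a simplified assumption (plain SGD as the base optimizer; `grad_fn`, `rho`, and the toy quadratic are illustrative choices):

```python
import numpy as np

def sam_step(theta, grad_fn, eta=0.01, rho=0.05):
    """One SAM-style step: perturb toward the ascent direction, update with that gradient."""
    g = grad_fn(theta)
    eps_hat = rho * g / (np.linalg.norm(g) + 1e-12)  # worst-case-direction perturbation
    g_sharp = grad_fn(theta + eps_hat)               # gradient at the perturbed weights
    return theta - eta * g_sharp                     # base step here is plain SGD

# Illustrative use on a toy quadratic J(theta) = 0.5 * theta^T A theta:
A = np.diag([1.0, 25.0])
theta = np.array([1.0, 1.0])
for _ in range(1000):
    theta = sam_step(theta, lambda th: A @ th)
```

The cost is roughly double that of the base optimizer per step, which is the usual trade-off when adding SAM as an overlay.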

13.6 Practical Recommendation from Research

For modern deep-learning baselines, compare at least:

  1. AdamW
  2. SGD + momentum
  3. One memory-efficient alternative (Adafactor/Lion where applicable)
  4. Optional SAM overlay when generalization is weak

