Optimization Algorithms in Deep Neural Networks¶
Optimization is the process of updating model parameters to minimize a loss function. In deep learning, optimizer choice directly controls convergence speed, stability, and final generalization.
1. Optimization Objective¶
Given model parameters \(\theta\) and loss \(\mathcal{L}(\theta)\), the training goal is:
$$ \min_{\theta} \mathcal{L}(\theta) $$
The gradient \(\nabla_\theta \mathcal{L}(\theta)\) points in the direction of steepest increase, so updates move in the negative gradient direction.
2. Gradient Descent Foundations¶
2.1 Batch Gradient Descent¶
- Uses the full dataset each update.
- Stable but expensive for large datasets.
2.2 Stochastic Gradient Descent (SGD)¶
- Updates using one sample at a time.
- Fast per-step but noisy.
2.3 Mini-Batch Gradient Descent¶
- Uses a small batch \(B\) per update.
Gradient estimate:
$$ g_t = \frac{1}{|B|} \sum_{(x_i, y_i) \in B} \nabla_\theta \, \ell(\theta_t; x_i, y_i) $$
Parameter update:
$$ \theta_{t+1} = \theta_t - \eta \, g_t $$
Parameters:
- B: current mini-batch
- |B|: mini-batch size
- g_t: average gradient at step t
- eta: learning rate
- Standard default in modern deep learning.
- Balances computational efficiency and gradient quality.
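As a minimal sketch of the update above (the toy quadratic per-sample loss, function name, and hyperparameters here are illustrative assumptions, not from this section):

```python
import random

def sgd_minibatch_step(theta, batch, eta):
    """One mini-batch step for the toy loss l(theta; x) = 0.5 * (theta - x)**2."""
    # Per-sample gradient is (theta - x); g_t averages it over the mini-batch B.
    g = sum(theta - x for x in batch) / len(batch)
    # theta_{t+1} = theta_t - eta * g_t
    return theta - eta * g

# Usage: repeated noisy steps pull theta toward the data mean (about 2.0).
random.seed(0)
data = [2.0 + random.gauss(0.0, 0.1) for _ in range(256)]
theta = 10.0
for _ in range(300):
    batch = random.sample(data, 32)
    theta = sgd_minibatch_step(theta, batch, eta=0.1)
```

Each step uses only 32 of the 256 samples, so the gradient is a noisy but cheap estimate of the full-batch gradient.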
3. Core Challenge: Learning Rate¶
Learning rate \(\eta\) is the step size.
- If \(\eta\) is too large: oscillation or divergence.
- If \(\eta\) is too small: very slow training and stalled progress.
```mermaid
flowchart LR
A["Large Learning Rate"] --> B["Overshoot Minima"]
B --> C["Oscillation / Divergence"]
D["Small Learning Rate"] --> E["Tiny Progress"]
E --> F["Slow Convergence"]
```
4. Momentum-Based Methods¶
4.1 Momentum¶
Momentum accumulates a velocity vector to smooth updates:
$$ \begin{aligned} g_t &= \nabla_\theta \mathcal{L}(\theta_t) \\ v_t &= \beta v_{t-1} + g_t \\ \theta_{t+1} &= \theta_t - \eta v_t \end{aligned} $$
- \(\beta\) controls memory of past gradients (commonly around \(0.9\)).
- Reduces zig-zag behavior in narrow valleys.
- Improves speed on smooth directions.
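A minimal sketch of the momentum update (the toy objective \(f(\theta)=\theta^2\) and the chosen step size are illustrative assumptions):

```python
def momentum_step(theta, v, grad, eta, beta=0.9):
    # v_t = beta * v_{t-1} + g_t ; theta_{t+1} = theta_t - eta * v_t
    v = beta * v + grad
    return theta - eta * v, v

# Usage: minimize f(theta) = theta**2, whose gradient is 2 * theta.
theta, v = 5.0, 0.0
for _ in range(200):
    theta, v = momentum_step(theta, v, grad=2.0 * theta, eta=0.05)
```

The velocity keeps pushing in the recent average descent direction, which damps sign flips in the raw gradient.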
4.2 Nesterov Accelerated Gradient (NAG)¶
Nesterov computes the gradient at a look-ahead point:
$$ \begin{aligned} \tilde{\theta}_t &= \theta_t - \eta \beta v_{t-1} \\ v_t &= \beta v_{t-1} + \nabla_\theta \mathcal{L}(\tilde{\theta}_t) \\ \theta_{t+1} &= \theta_t - \eta v_t \end{aligned} $$
- Anticipates future position before updating.
- Often converges with better directional correction than classical momentum.
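The look-ahead step can be sketched as follows (the toy gradient function and hyperparameters are illustrative assumptions):

```python
def nag_step(theta, v, grad_fn, eta, beta=0.9):
    # Evaluate the gradient at the look-ahead point theta - eta * beta * v,
    # i.e. where momentum alone would carry the parameters next.
    lookahead = theta - eta * beta * v
    v = beta * v + grad_fn(lookahead)
    return theta - eta * v, v

# Usage: same toy objective f(theta) = theta**2 (gradient 2 * theta).
theta, v = 5.0, 0.0
for _ in range(200):
    theta, v = nag_step(theta, v, grad_fn=lambda t: 2.0 * t, eta=0.05)
```

Evaluating the gradient at the anticipated position corrects the velocity before overshoot happens, rather than after.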
5. Adaptive Learning-Rate Methods¶
5.1 AdaGrad¶
AdaGrad accumulates squared gradients per parameter.
Update equations:
$$ \begin{aligned} s_t &= s_{t-1} + g_t^2 \\ \theta_{t+1} &= \theta_t - \frac{\eta}{\sqrt{s_t}+\epsilon}\odot g_t \end{aligned} $$
Implementation form:
```
s = s + g*g
theta = theta - (eta / (sqrt(s) + eps)) * g
```
Parameters:
- \(g_t\): gradient at step \(t\)
- \(s_t\): accumulated squared-gradient vector
- \(\eta\): base learning rate
- \(\epsilon\): small stability constant
Key points:
- Larger historical gradients -> smaller future steps.
- Works well for sparse features.
- Limitation: \(s_t\) grows monotonically, so the effective learning rate can become too small.
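A runnable sketch of the AdaGrad update on a scalar parameter (the toy objective \(f(\theta)=\theta^2\) and the base learning rate are illustrative assumptions):

```python
import math

def adagrad_step(theta, s, grad, eta=0.5, eps=1e-8):
    # s accumulates squared gradients and never decays,
    # so the effective step eta / sqrt(s) shrinks monotonically.
    s = s + grad * grad
    theta = theta - (eta / (math.sqrt(s) + eps)) * grad
    return theta, s

# Usage: minimize f(theta) = theta**2 (gradient 2 * theta).
theta, s = 5.0, 0.0
for _ in range(500):
    theta, s = adagrad_step(theta, s, grad=2.0 * theta)
```

Because `s` only grows, long runs eventually take vanishingly small steps, which is exactly the limitation noted above.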
5.2 RMSProp¶
RMSProp uses an exponentially decaying average of squared gradients.
Update equations:
$$ \begin{aligned} s_t &= \gamma s_{t-1} + (1-\gamma)g_t^2 \\ \theta_{t+1} &= \theta_t - \frac{\eta}{\sqrt{s_t}+\epsilon}\odot g_t \end{aligned} $$
Implementation form:
```
s = gamma*s + (1-gamma)*(g*g)
theta = theta - (eta / (sqrt(s) + eps)) * g
```
Parameters:
- \(\gamma\): decay rate (typically near \(0.9\))
- Other symbols are as defined in AdaGrad.
Key points:
- Fixes AdaGrad's unbounded accumulation behavior.
- Keeps adaptive step sizes useful during long training.
- Useful in non-stationary and sequence training settings.
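The only change from AdaGrad is the decaying average; a sketch on the same illustrative toy objective:

```python
import math

def rmsprop_step(theta, s, grad, eta=0.01, gamma=0.9, eps=1e-8):
    # Exponentially decaying average of squared gradients: old history
    # fades with factor gamma instead of accumulating forever.
    s = gamma * s + (1 - gamma) * grad * grad
    theta = theta - (eta / (math.sqrt(s) + eps)) * grad
    return theta, s

# Usage: minimize f(theta) = theta**2 (gradient 2 * theta).
theta, s = 5.0, 0.0
for _ in range(1000):
    theta, s = rmsprop_step(theta, s, grad=2.0 * theta)
```

In steady state `s` tracks the recent squared-gradient magnitude, so the step size stays usable no matter how long training runs.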
5.3 Adam (Adaptive Moment Estimation)¶
Adam combines momentum (first moment) and RMS-style scaling (second moment).
Update equations:
$$ \begin{aligned} m_t &= \beta_1 m_{t-1} + (1-\beta_1)g_t \\ v_t &= \beta_2 v_{t-1} + (1-\beta_2)g_t^2 \\ \hat{m}_t &= \frac{m_t}{1-\beta_1^t}, \quad \hat{v}_t = \frac{v_t}{1-\beta_2^t} \\ \theta_{t+1} &= \theta_t - \eta\frac{\hat{m}_t}{\sqrt{\hat{v}_t}+\epsilon} \end{aligned} $$
Implementation form:
```
m = b1*m + (1-b1)*g
v = b2*v + (1-b2)*(g*g)
mhat = m/(1-b1^t)
vhat = v/(1-b2^t)
theta = theta - eta * mhat / (sqrt(vhat) + eps)
```
Parameters:
- \(\beta_1\): first-moment decay
- \(\beta_2\): second-moment decay
- \(m_t\): first moment estimate
- \(v_t\): second moment estimate
Typical defaults:
- \(\beta_1 = 0.9\)
- \(\beta_2 = 0.999\)
- \(\epsilon = 10^{-8}\)
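A scalar sketch of one Adam step with the defaults above (the toy objective, starting point, and learning rate are illustrative assumptions):

```python
import math

def adam_step(theta, m, v, grad, t, eta=0.1, b1=0.9, b2=0.999, eps=1e-8):
    m = b1 * m + (1 - b1) * grad           # first-moment (momentum) estimate
    v = b2 * v + (1 - b2) * grad * grad    # second-moment (scale) estimate
    m_hat = m / (1 - b1 ** t)              # bias correction for zero init
    v_hat = v / (1 - b2 ** t)
    return theta - eta * m_hat / (math.sqrt(v_hat) + eps), m, v

# Bias correction matters at t=1: from theta=5.0 with grad=10.0,
# m_hat=10 and sqrt(v_hat)=10, so the first step has magnitude eta.
theta, m, v = adam_step(5.0, 0.0, 0.0, grad=10.0, t=1)
first_step_theta = theta   # about 4.9

# Longer run on the toy objective f(theta) = theta**2 (gradient 2 * theta).
for t in range(2, 1001):
    theta, m, v = adam_step(theta, m, v, grad=2.0 * theta, t=t)
```

Without the bias-correction terms, the zero-initialized moments would make early steps far too small.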
5.4 AdamW¶
AdamW decouples weight decay from the adaptive gradient update.
Update equation: $$ \theta_{t+1} = \theta_t - \eta\frac{\hat{m}_t}{\sqrt{\hat{v}_t}+\epsilon} - \eta\lambda\theta_t $$
Implementation form:
```
theta = theta - eta * mhat / (sqrt(vhat) + eps) - eta * weight_decay * theta
```
Parameters:
- \(\lambda\): weight decay coefficient
- Other symbols are as defined in Adam.
Key points:
- Better regularization behavior than naive L2 mixed into Adam gradients.
- Common default for modern transformer training.
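The decoupling is easiest to see with a zero gradient: the adaptive term vanishes, but the decay term still shrinks the weight. A sketch (hyperparameter values are illustrative assumptions):

```python
import math

def adamw_step(theta, m, v, grad, t, eta=0.1, b1=0.9, b2=0.999,
               eps=1e-8, weight_decay=0.01):
    m = b1 * m + (1 - b1) * grad
    v = b2 * v + (1 - b2) * grad * grad
    m_hat = m / (1 - b1 ** t)
    v_hat = v / (1 - b2 ** t)
    # Decoupled decay: applied directly to theta, never mixed into the
    # gradient, so it is not rescaled by the adaptive denominator.
    adaptive = eta * m_hat / (math.sqrt(v_hat) + eps)
    return theta - adaptive - eta * weight_decay * theta, m, v

# With grad = 0 the adaptive term is 0, yet theta still decays:
# 1.0 - 0.1 * 0.01 * 1.0 = 0.999
theta, m, v = adamw_step(1.0, 0.0, 0.0, grad=0.0, t=1)
```

Mixing L2 into the gradient instead would let large `v_hat` values suppress the regularization, which is the behavior AdamW avoids.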
6. Geometric Interpretation of Optimizer Behavior¶
```mermaid
flowchart TD
A["Loss Surface"] --> B["SGD: noisy path"]
A --> C["Momentum: smoothed path"]
A --> D["AdaGrad/RMSProp/Adam: axis-wise scaled path"]
B --> E["May bounce in high-curvature directions"]
C --> F["Builds velocity along consistent descent"]
D --> G["Takes smaller steps on steep dimensions, larger on flat ones"]
```
- SGD follows local slope directly and can oscillate in steep ravines.
- Momentum damps oscillations and preserves direction.
- Adaptive methods rescale each parameter's step based on gradient history.
7. Learning-Rate Scheduling¶
7.1 Step Decay¶
Formula:
$$ \eta_t = \eta_0 \cdot \gamma^{\left\lfloor t / s \right\rfloor} $$
Parameters:
- \(\eta_t\): learning rate at step/epoch \(t\)
- \(\eta_0\): initial learning rate
- \(s\): decay step interval
- \(\gamma\): decay factor, where \(0 < \gamma < 1\)
Interpretation:
- The learning rate remains constant for each block of \(s\) steps and drops by factor \(\gamma\) at the boundary.
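A direct translation of the formula (the default values are illustrative assumptions):

```python
def step_decay(t, eta0=0.1, s=10, gamma=0.5):
    # eta_t = eta0 * gamma ** floor(t / s); integer division gives the floor.
    return eta0 * gamma ** (t // s)

# Usage: constant within each block of s steps, halved at each boundary.
rates = [step_decay(t) for t in (0, 9, 10, 25)]
```

Steps 0 through 9 share one rate, step 10 drops it by `gamma`, and step 25 has been through two drops.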
7.2 Exponential Decay¶
Formula:
$$ \eta_t = \eta_0 e^{-\lambda t} $$
Parameters:
- \(\eta_t\): learning rate at step/epoch \(t\)
- \(\eta_0\): initial learning rate
- \(\lambda\): decay constant (\(\lambda > 0\))
Interpretation:
- The learning rate decreases smoothly and continuously over time.
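The continuous decay in code (default values are illustrative assumptions):

```python
import math

def exp_decay(t, eta0=0.1, lam=0.01):
    # eta_t = eta0 * exp(-lambda * t): every step shrinks the rate slightly.
    return eta0 * math.exp(-lam * t)
```

Unlike step decay there are no plateaus; with `lam=0.01` the rate falls to about 37% of its initial value by step 100.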
7.3 Cosine Annealing¶
Formula:
$$ \eta_t = \eta_{\min} + \frac{1}{2}\left(\eta_{\max} - \eta_{\min}\right)\left(1 + \cos\frac{\pi t}{T}\right) $$
Parameters:
- \(\eta_t\): learning rate at step/epoch \(t\)
- \(\eta_{\max}\): maximum learning rate
- \(\eta_{\min}\): minimum learning rate
- \(T\): total schedule length (steps or epochs)
Interpretation:
- The learning rate follows a cosine curve from \(\eta_{\max}\) down to \(\eta_{\min}\), usually giving stable late-stage convergence.
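The schedule as a function (the default rates are illustrative assumptions):

```python
import math

def cosine_anneal(t, T, eta_max=0.1, eta_min=0.001):
    # Cosine curve from eta_max at t=0 down to eta_min at t=T.
    return eta_min + 0.5 * (eta_max - eta_min) * (1 + math.cos(math.pi * t / T))

# Usage: endpoints hit the extremes, the midpoint sits halfway between them.
start, mid, end = cosine_anneal(0, 100), cosine_anneal(50, 100), cosine_anneal(100, 100)
```

The flat start and flat finish of the cosine give gentle transitions at both ends of training, which is why it is popular for late-stage fine convergence.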
7.4 Warmup¶
Definition:
- Warmup starts from a small learning rate and increases it gradually during the first few steps/epochs.
Simple linear warmup formula:
$$ \eta_t = \eta_{\text{start}} + \frac{t}{T_w}\left(\eta_{\text{target}} - \eta_{\text{start}}\right), \quad 0 \le t \le T_w $$
Parameters:
- \(\eta_{\text{start}}\): initial warmup learning rate
- \(\eta_{\text{target}}\): learning rate reached at warmup end
- \(T_w\): warmup duration
Interpretation:
- Warmup prevents unstable jumps at the beginning of training and is especially useful with large models or large batch sizes.
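Linear warmup in code (holding the target rate after warmup ends is an added assumption for completeness):

```python
def linear_warmup(t, T_w, eta_start=1e-6, eta_target=0.1):
    # Ramp linearly from eta_start to eta_target over the first T_w steps,
    # then hold eta_target (a common convention, assumed here).
    if t >= T_w:
        return eta_target
    return eta_start + (t / T_w) * (eta_target - eta_start)
```

In practice this is chained with a decay schedule: warmup governs the first `T_w` steps and the decay schedule takes over afterwards.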
8. Gradient Clipping¶
Controls exploding gradients by bounding the gradient before the parameter update.
8.1 Value Clipping¶
Each gradient component is clipped independently to \([-c, c]\): \(g_i \leftarrow \max(-c, \min(c, g_i))\). This can change the gradient's direction.
8.2 Norm Clipping¶
If \(\lVert g \rVert_2 > c\), the whole gradient is rescaled: \(g \leftarrow c \, g / \lVert g \rVert_2\). Norm clipping is commonly preferred because it preserves direction.
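Both variants can be sketched on a plain list of gradient components (the function names are illustrative, not a specific library API):

```python
import math

def clip_by_value(grad, c):
    # Clip each component independently to [-c, c]; may change direction.
    return [max(-c, min(c, g)) for g in grad]

def clip_by_norm(grad, max_norm):
    # Rescale the whole vector when its L2 norm exceeds max_norm;
    # the direction is preserved exactly.
    norm = math.sqrt(sum(g * g for g in grad))
    if norm > max_norm:
        return [g * (max_norm / norm) for g in grad]
    return list(grad)

# Usage: gradient [3, 4] has norm 5. Norm clipping to 1 keeps the 3:4
# direction ([0.6, 0.8]); value clipping to 1 distorts it to [1, 1].
by_norm = clip_by_norm([3.0, 4.0], 1.0)
by_value = clip_by_value([3.0, 4.0], 1.0)
```

The example shows concretely why norm clipping is usually preferred: only the magnitude changes, not the descent direction.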
9. Batch Size and Optimization Dynamics¶
- Smaller batches:
- Noisier gradients.
- Better exploration and often better generalization.
- Lower memory requirement.
- Larger batches:
- Smoother gradients.
- Better hardware throughput.
- Usually require careful learning-rate scaling and warmup.
10. Practical Optimizer Selection¶
- Start with Adam/AdamW for rapid baseline development.
- Try SGD + Momentum for final performance tuning when stable training is achieved.
- Use RMSProp when adaptive behavior is needed and training is sensitive to non-stationary gradients.
No optimizer is universally best; validate empirically on your dataset and architecture.
11. Optimization Debugging Checklist¶
- Loss diverges quickly -> lower learning rate, apply gradient clipping.
- Loss oscillates -> reduce learning rate, add momentum or stronger smoothing.
- Loss decreases too slowly -> increase learning rate or tune schedule.
- Loss plateaus early -> use adaptive optimizer, scheduler, or better initialization.
- Train loss decreases but validation degrades -> apply regularization and early stopping.
12. Pseudocode¶
12.1 Generic Mini-Batch Training Loop¶
```
Initialize model parameters theta
Initialize optimizer state S
for epoch in 1..E:
    for mini-batch B in training_data:
        g = grad(loss(theta; B))
        g = clip_if_needed(g)
        theta, S = optimizer_update(theta, g, S)
        adjust_learning_rate_if_scheduler_used()
    evaluate_on_validation_set()
```
12.2 Adam Update Step¶
```
Input: theta, g_t, m_{t-1}, v_{t-1}, beta1, beta2, lr, eps, t
m_t   = beta1 * m_{t-1} + (1 - beta1) * g_t
v_t   = beta2 * v_{t-1} + (1 - beta2) * (g_t * g_t)
m_hat = m_t / (1 - beta1^t)
v_hat = v_t / (1 - beta2^t)
theta = theta - lr * m_hat / (sqrt(v_hat) + eps)
```
13. Real-World Use Cases¶
- Computer vision classification: AdamW + cosine schedule + warmup for stable deep-network training.
- NLP with sparse token features: adaptive optimizers improve per-parameter learning behavior.
- Time-series forecasting: RMSProp/Adam with gradient clipping often improves recurrent model stability.
14. Summary¶
- Optimization controls how efficiently a neural network learns from gradients.
- Learning rate is the most critical hyperparameter.
- Momentum improves direction stability; adaptive methods improve per-parameter step control.
- Scheduling, clipping, and batch-size strategy are essential for robust convergence.
- Use validation-driven comparisons to choose optimizer and hyperparameters for a specific task.