
Optimization Algorithms in Deep Neural Networks

Optimization is the process of updating model parameters to minimize a loss function. In deep learning, optimizer choice directly controls convergence speed, stability, and final generalization.


1. Optimization Objective

Given model parameters \(\theta\) and loss \(\mathcal{L}(\theta)\), the training goal is:

\[ \theta^* = \arg\min_{\theta}\mathcal{L}(\theta) \]

The gradient \(\nabla_\theta \mathcal{L}(\theta)\) points in the direction of steepest increase of the loss, so updates move in the negative gradient direction.


2. Gradient Descent Foundations

2.1 Batch Gradient Descent

\[ \theta_{t+1} = \theta_t - \eta \nabla_\theta \mathcal{L}(\theta_t) \]
  • Uses the full dataset each update.
  • Stable but expensive for large datasets.

2.2 Stochastic Gradient Descent (SGD)

  • Updates using one sample at a time.
  • Fast per-step but noisy.

2.3 Mini-Batch Gradient Descent

  • Uses a small batch B.

Gradient estimate:

\[ g_t = \frac{1}{|B|} \sum_{(x_i, y_i) \in B} \nabla_\theta\, \mathcal{L}(\theta_t; x_i, y_i) \]

Parameter update:

\[ \theta_{t+1} = \theta_t - \eta\, g_t \]

Parameters:
  • \(B\): current mini-batch
  • \(|B|\): mini-batch size
  • \(g_t\): average gradient at step \(t\)
  • \(\eta\): learning rate

  • Standard default in modern deep learning.
  • Balances computational efficiency and gradient quality.
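
The mini-batch update above can be sketched in plain Python for a 1-D least-squares fit of \(y = 2x\); all names, the dataset, and the learning rate are illustrative:

```python
def loss_grad(theta, x, y):
    # Gradient of the squared error 0.5 * (theta * x - y)^2 w.r.t. theta.
    return (theta * x - y) * x

def sgd_step(theta, batch, eta):
    # g_t = (1/|B|) * sum of per-example gradients over the mini-batch B.
    g = sum(loss_grad(theta, x, y) for x, y in batch) / len(batch)
    # theta_{t+1} = theta_t - eta * g_t
    return theta - eta * g

data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # y = 2x, so the optimum is theta = 2
theta = 0.0
for _ in range(100):
    theta = sgd_step(theta, data, eta=0.05)
```

Here the whole dataset fits in one batch; in practice each step would draw a different mini-batch from a larger dataset.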

3. Core Challenge: Learning Rate

Learning rate \(\eta\) is the step size.

  • If \(\eta\) is too large: oscillation or divergence.
  • If \(\eta\) is too small: very slow training and stalled progress.

```mermaid
flowchart LR
  A["Large Learning Rate"] --> B["Overshoot Minima"]
  B --> C["Oscillation / Divergence"]
  D["Small Learning Rate"] --> E["Tiny Progress"]
  E --> F["Slow Convergence"]
```

4. Momentum-Based Methods

4.1 Momentum

Momentum accumulates a velocity vector to smooth updates. Let \(g_t = \nabla_\theta \mathcal{L}(\theta_t)\):

\[ \begin{aligned} v_t &= \beta v_{t-1} + g_t \\ \theta_{t+1} &= \theta_t - \eta v_t \end{aligned} \]
  • \(\beta\) controls memory of past gradients (commonly around \(0.9\)).
  • Reduces zig-zag behavior in narrow valleys.
  • Improves speed on smooth directions.
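
The update above can be sketched in plain Python on the toy loss \(\mathcal{L}(\theta) = \tfrac{1}{2}\theta^2\) (so \(g = \theta\)); the names and constants are illustrative:

```python
def momentum_step(theta, v, g, eta=0.1, beta=0.9):
    v = beta * v + g           # v_t = beta * v_{t-1} + g_t
    return theta - eta * v, v  # theta_{t+1} = theta_t - eta * v_t

theta, v = 5.0, 0.0
for _ in range(400):
    theta, v = momentum_step(theta, v, g=theta)  # gradient of 0.5 * theta^2
```

The iterates overshoot and oscillate briefly because of the accumulated velocity, then decay toward the minimum at 0.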

4.2 Nesterov Accelerated Gradient (NAG)

Nesterov computes the gradient at a look-ahead point:

\[ \tilde{\theta}_t = \theta_t - \eta \beta v_{t-1} \]

\[ \begin{aligned} v_t &= \beta v_{t-1} + \nabla_\theta \mathcal{L}(\tilde{\theta}_t) \\ \theta_{t+1} &= \theta_t - \eta v_t \end{aligned} \]
  • Anticipates future position before updating.
  • Often converges with better directional correction than classical momentum.
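
The same toy loss \(\mathcal{L}(\theta) = \tfrac{1}{2}\theta^2\) can be used to sketch the Nesterov variant; names and constants are illustrative:

```python
def nag_step(theta, v, grad_fn, eta=0.1, beta=0.9):
    lookahead = theta - eta * beta * v  # tilde_theta_t, the look-ahead point
    v = beta * v + grad_fn(lookahead)   # v_t uses the look-ahead gradient
    return theta - eta * v, v           # theta_{t+1} = theta_t - eta * v_t

theta, v = 5.0, 0.0
for _ in range(400):
    theta, v = nag_step(theta, v, grad_fn=lambda x: x)  # gradient of 0.5 * x^2
```

The only change from classical momentum is where the gradient is evaluated, which is what provides the anticipatory correction.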

5. Adaptive Learning-Rate Methods

5.1 AdaGrad

AdaGrad accumulates squared gradients per parameter.

Update equations: $$ \begin{aligned} s_t &= s_{t-1} + g_t^2 \\ \theta_{t+1} &= \theta_t - \frac{\eta}{\sqrt{s_t}+\epsilon}\odot g_t \end{aligned} $$

Implementation form: s = s + g*g, theta = theta - (eta / (sqrt(s) + eps)) * g

Parameters:
  • \(g_t\): gradient at step \(t\)
  • \(s_t\): accumulated squared-gradient vector
  • \(\eta\): base learning rate
  • \(\epsilon\): small stability constant

Key points:
  • Larger historical gradients -> smaller future steps.
  • Works well for sparse features.
  • Limitation: \(s_t\) grows monotonically, so the effective learning rate can become too small.
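
To make the shrinking-step behavior concrete, here is a minimal plain-Python sketch on the toy loss \(\mathcal{L}(\theta) = \tfrac{1}{2}\theta^2\) (so \(g = \theta\)); names and constants are illustrative:

```python
import math

def adagrad_step(theta, s, g, eta=0.5, eps=1e-8):
    s = s + g * g  # s_t = s_{t-1} + g_t^2 (monotone accumulation)
    theta = theta - eta / (math.sqrt(s) + eps) * g
    return theta, s

theta, s = 5.0, 0.0
step_sizes = []
for _ in range(50):
    g = theta  # gradient of 0.5 * theta^2
    new_theta, s = adagrad_step(theta, s, g)
    step_sizes.append(abs(new_theta - theta))
    theta = new_theta
```

Because \(s\) only grows, the recorded step sizes shrink monotonically, which is exactly the limitation noted above.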

5.2 RMSProp

RMSProp uses an exponentially decaying average of squared gradients.

Update equations: $$ \begin{aligned} s_t &= \gamma s_{t-1} + (1-\gamma)g_t^2 \\ \theta_{t+1} &= \theta_t - \frac{\eta}{\sqrt{s_t}+\epsilon}\odot g_t \end{aligned} $$

Implementation form: s = gamma*s + (1-gamma)*(g*g), theta = theta - (eta / (sqrt(s) + eps)) * g

Parameters:
  • \(\gamma\): decay rate (typically near \(0.9\))
  • Other symbols are as defined in AdaGrad.

Key points:
  • Fixes AdaGrad's unbounded accumulation behavior.
  • Keeps adaptive step sizes useful during long training.
  • Useful in non-stationary and sequence training settings.
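
A minimal sketch of the same update in plain Python, again on the toy loss \(\mathcal{L}(\theta) = \tfrac{1}{2}\theta^2\); names and constants are illustrative:

```python
import math

def rmsprop_step(theta, s, g, eta=0.01, gamma=0.9, eps=1e-8):
    s = gamma * s + (1 - gamma) * g * g  # decaying average of squared gradients
    theta = theta - eta / (math.sqrt(s) + eps) * g
    return theta, s

theta, s = 5.0, 0.0
for _ in range(2000):
    theta, s = rmsprop_step(theta, s, g=theta)  # gradient of 0.5 * theta^2
```

Unlike AdaGrad, \(s\) tracks only recent gradient magnitudes, so the effective step stays close to \(\eta\) throughout training instead of decaying to zero.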

5.3 Adam (Adaptive Moment Estimation)

Adam combines momentum (first moment) and RMS-style scaling (second moment).

Update equations: $$ \begin{aligned} m_t &= \beta_1 m_{t-1} + (1-\beta_1)g_t \\ v_t &= \beta_2 v_{t-1} + (1-\beta_2)g_t^2 \\ \hat{m}_t &= \frac{m_t}{1-\beta_1^t}, \quad \hat{v}_t = \frac{v_t}{1-\beta_2^t} \\ \theta_{t+1} &= \theta_t - \eta\frac{\hat{m}_t}{\sqrt{\hat{v}_t}+\epsilon} \end{aligned} $$

Implementation form: m = b1*m + (1-b1)*g, v = b2*v + (1-b2)*(g*g),
mhat = m/(1-b1^t), vhat = v/(1-b2^t),
theta = theta - eta * mhat / (sqrt(vhat) + eps)

Parameters:
  • \(\beta_1\): first-moment decay
  • \(\beta_2\): second-moment decay
  • \(m_t\): first-moment estimate
  • \(v_t\): second-moment estimate

Typical defaults:
  • \(\beta_1 = 0.9\)
  • \(\beta_2 = 0.999\)
  • \(\epsilon = 10^{-8}\)
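
Putting the equations and defaults together, a plain-Python sketch on the toy loss \(\mathcal{L}(\theta) = \tfrac{1}{2}\theta^2\); the learning rate and step count here are illustrative:

```python
import math

def adam_step(theta, m, v, g, t, eta=0.05, b1=0.9, b2=0.999, eps=1e-8):
    m = b1 * m + (1 - b1) * g      # first moment (momentum)
    v = b2 * v + (1 - b2) * g * g  # second moment (RMS scaling)
    m_hat = m / (1 - b1 ** t)      # bias correction; t starts at 1
    v_hat = v / (1 - b2 ** t)
    theta = theta - eta * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v

theta, m, v = 5.0, 0.0, 0.0
for t in range(1, 1001):
    theta, m, v = adam_step(theta, m, v, g=theta, t=t)  # gradient of 0.5 * theta^2
```

The bias correction matters early on: at \(t = 1\), \(\hat{m}_1 = g_1\) exactly, so the very first step is already on the right scale.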

5.4 AdamW

AdamW decouples weight decay from the adaptive gradient update.

Update equation: $$ \theta_{t+1} = \theta_t - \eta\frac{\hat{m}_t}{\sqrt{\hat{v}_t}+\epsilon} - \eta\lambda\theta_t $$

Implementation form: theta = theta - eta * mhat / (sqrt(vhat) + eps) - eta * weight_decay * theta

Parameters:
  • \(\lambda\): weight decay coefficient
  • Other symbols are as defined in Adam.

Key points:
  • Better regularization behavior than naive L2 mixed into Adam gradients.
  • Common default for modern transformer training.
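
A sketch of the decoupled update in plain Python; with a zero gradient the adaptive term vanishes and only the decay term acts, shrinking \(\theta\) geometrically. All names and constants are illustrative:

```python
import math

def adamw_step(theta, m, v, g, t, eta=0.01, b1=0.9, b2=0.999,
               eps=1e-8, weight_decay=0.1):
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g * g
    m_hat = m / (1 - b1 ** t)
    v_hat = v / (1 - b2 ** t)
    # Adaptive gradient step plus a separate, decoupled weight-decay term.
    theta = theta - eta * m_hat / (math.sqrt(v_hat) + eps) - eta * weight_decay * theta
    return theta, m, v

theta, m, v = 1.0, 0.0, 0.0
for t in range(1, 101):
    theta, m, v = adamw_step(theta, m, v, g=0.0, t=t)  # zero gradient isolates the decay
```

After 100 steps, \(\theta \approx (1 - \eta\lambda)^{100}\), confirming that the decay is applied directly to the weights rather than scaled through \(\sqrt{\hat{v}_t}\).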


6. Geometric Interpretation of Optimizer Behavior

```mermaid
flowchart TD
  A["Loss Surface"] --> B["SGD: noisy path"]
  A --> C["Momentum: smoothed path"]
  A --> D["AdaGrad/RMSProp/Adam: axis-wise scaled path"]
  B --> E["May bounce in high-curvature directions"]
  C --> F["Builds velocity along consistent descent"]
  D --> G["Takes smaller steps on steep dimensions, larger on flat ones"]
```

  • SGD follows local slope directly and can oscillate in steep ravines.
  • Momentum damps oscillations and preserves direction.
  • Adaptive methods rescale each parameter's step based on gradient history.

7. Learning-Rate Scheduling

7.1 Step Decay

Formula:

\[ \eta_t = \eta_0 \cdot \gamma^{\left\lfloor t / s \right\rfloor} \]

Parameters:
  • \(\eta_t\): learning rate at step/epoch \(t\)
  • \(\eta_0\): initial learning rate
  • \(s\): decay step interval
  • \(\gamma\): decay factor, where \(0 < \gamma < 1\)

Interpretation: the learning rate remains constant for each block of \(s\) steps and drops by a factor of \(\gamma\) at each boundary.
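
For instance, a minimal helper implementing this schedule (the parameter values are illustrative):

```python
def step_decay(t, eta0=0.1, gamma=0.5, s=10):
    # eta_t = eta0 * gamma ** floor(t / s)
    return eta0 * gamma ** (t // s)
```

With these values, the rate holds at 0.1 for steps 0–9, halves to 0.05 at step 10, and halves again at each later boundary.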

7.2 Exponential Decay

Formula:

\[ \eta_t = \eta_0 e^{-\lambda t} \]

Parameters:
  • \(\eta_t\): learning rate at step/epoch \(t\)
  • \(\eta_0\): initial learning rate
  • \(\lambda\): decay constant (\(\lambda > 0\))

Interpretation: the learning rate decreases smoothly and continuously over time.
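
A one-line helper for this schedule (parameter values are illustrative):

```python
import math

def exp_decay(t, eta0=0.1, lam=0.01):
    # eta_t = eta0 * exp(-lambda * t)
    return eta0 * math.exp(-lam * t)
```

Unlike step decay, every step sees a slightly smaller rate; with \(\lambda = 0.01\) the rate falls to \(\eta_0 / e\) after 100 steps.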

7.3 Cosine Annealing

Formula:

\[ \eta_t = \eta_{\min} + \tfrac{1}{2}(\eta_{\max} - \eta_{\min})\left(1 + \cos\frac{\pi t}{T}\right) \]

Parameters:
  • \(\eta_t\): learning rate at step/epoch \(t\)
  • \(\eta_{\max}\): maximum learning rate
  • \(\eta_{\min}\): minimum learning rate
  • \(T\): total schedule length (steps or epochs)

Interpretation: the learning rate follows a cosine curve from \(\eta_{\max}\) down to \(\eta_{\min}\), usually giving stable late-stage convergence.
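
The schedule translates directly into code (parameter values are illustrative):

```python
import math

def cosine_anneal(t, T, eta_max=0.1, eta_min=0.001):
    # eta_t sweeps from eta_max (at t = 0) to eta_min (at t = T) along a cosine curve.
    return eta_min + 0.5 * (eta_max - eta_min) * (1 + math.cos(math.pi * t / T))
```

The curve is flat near both endpoints, so the rate changes slowly at the start and end of training and fastest in the middle.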

7.4 Warmup

Definition: warmup starts from a small learning rate and increases it gradually during the first few steps/epochs.

Simple linear warmup formula:

\[ \eta_t = \eta_{\text{start}} + \frac{t}{T_w}\left(\eta_{\text{target}} - \eta_{\text{start}}\right), \quad 0 \le t \le T_w \]

Parameters:
  • \(\eta_{\text{start}}\): initial warmup learning rate
  • \(\eta_{\text{target}}\): learning rate reached at warmup end
  • \(T_w\): warmup duration

Interpretation: warmup prevents unstable jumps at the beginning of training and is especially useful with large models or large batch sizes.
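
A minimal linear-warmup helper; holding the target rate after warmup is an illustrative choice (in practice warmup is usually chained into a decay schedule such as cosine annealing):

```python
def linear_warmup(t, T_w=100, eta_start=0.0, eta_target=0.1):
    # Ramp linearly for the first T_w steps, then hold eta_target.
    if t >= T_w:
        return eta_target
    return eta_start + (t / T_w) * (eta_target - eta_start)
```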


8. Gradient Clipping

Controls exploding gradients.

8.1 Value Clipping

\[ g_i \leftarrow \mathrm{clip}(g_i, -\tau, \tau) \]

8.2 Norm Clipping

\[ g \leftarrow g \cdot \min\left(1,\frac{\tau}{\|g\|_2}\right) \]

Norm clipping is commonly preferred because it preserves direction.
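
A sketch of the norm-clipping rule, treating the gradient as a plain list of floats; names are illustrative:

```python
import math

def clip_by_norm(g, tau):
    # Rescale g so its L2 norm is at most tau, preserving direction.
    norm = math.sqrt(sum(x * x for x in g))
    scale = min(1.0, tau / norm) if norm > 0 else 1.0
    return [x * scale for x in g]

clipped = clip_by_norm([3.0, 4.0], tau=1.0)  # norm 5.0, so every component is scaled by 0.2
```

Gradients already inside the threshold pass through unchanged, so only outlier updates are affected.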


9. Batch Size and Optimization Dynamics

  • Smaller batches:
      • Noisier gradients.
      • Better exploration and often better generalization.
      • Lower memory requirement.
  • Larger batches:
      • Smoother gradients.
      • Better hardware throughput.
      • Usually require careful learning-rate scaling and warmup.

10. Practical Optimizer Selection

  • Start with Adam/AdamW for rapid baseline development.
  • Try SGD + Momentum for final performance tuning when stable training is achieved.
  • Use RMSProp when adaptive behavior is needed and training is sensitive to non-stationary gradients.

No optimizer is universally best; validate empirically on your dataset and architecture.


11. Optimization Debugging Checklist

  • Loss diverges quickly -> lower learning rate, apply gradient clipping.
  • Loss oscillates -> reduce learning rate, add momentum or stronger smoothing.
  • Loss decreases too slowly -> increase learning rate or tune schedule.
  • Loss plateaus early -> use adaptive optimizer, scheduler, or better initialization.
  • Train loss decreases but validation degrades -> apply regularization and early stopping.

12. Pseudocode

12.1 Generic Mini-Batch Training Loop

Initialize model parameters theta
Initialize optimizer state S
for epoch in 1..E:
    for mini-batch B in training_data:
        g = grad(loss(theta; B))
        g = clip_if_needed(g)
        theta, S = optimizer_update(theta, g, S)
    adjust_learning_rate_if_scheduler_used()
    evaluate_on_validation_set()

12.2 Adam Update Step

Input: theta, g_t, m_{t-1}, v_{t-1}, beta1, beta2, lr, eps, t
m_t = beta1 * m_{t-1} + (1 - beta1) * g_t
v_t = beta2 * v_{t-1} + (1 - beta2) * (g_t * g_t)
m_hat = m_t / (1 - beta1^t)
v_hat = v_t / (1 - beta2^t)
theta = theta - lr * m_hat / (sqrt(v_hat) + eps)

13. Real-World Use Cases

  • Computer vision classification: AdamW + cosine schedule + warmup for stable deep-network training.
  • NLP with sparse token features: adaptive optimizers improve per-parameter learning behavior.
  • Time-series forecasting: RMSProp/Adam with gradient clipping often improves recurrent model stability.

14. Summary

  • Optimization controls how efficiently a neural network learns from gradients.
  • Learning rate is the most critical hyperparameter.
  • Momentum improves direction stability; adaptive methods improve per-parameter step control.
  • Scheduling, clipping, and batch-size strategy are essential for robust convergence.
  • Use validation-driven comparisons to choose optimizer and hyperparameters for a specific task.