Optimization Algorithms in Deep Neural Networks¶
Optimization is the process of updating model parameters to minimize a loss function. In deep learning, optimizer choice directly controls convergence speed, stability, and final generalization.
1. Optimization Objective¶
Given model parameters \(\theta\) and loss \(\mathcal{L}(\theta)\), the training goal is:
$$ \min_{\theta} \mathcal{L}(\theta) $$
The gradient \(\nabla_\theta \mathcal{L}(\theta)\) points in the direction of steepest increase, so updates move in the negative gradient direction.
2. Gradient Descent Foundations¶
2.1 Batch Gradient Descent¶
- Uses the full dataset each update.
- Stable but expensive for large datasets.
2.2 Stochastic Gradient Descent (SGD)¶
- Updates using one sample at a time.
- Fast per-step but noisy.
2.3 Mini-Batch Gradient Descent¶
- Uses a small batch \(B\) per update.
Gradient estimate:
$$ g_t = \frac{1}{|B|} \sum_{(x_i, y_i) \in B} \nabla_\theta \, \ell(\theta_t; x_i, y_i) $$
Parameter update:
$$ \theta_{t+1} = \theta_t - \eta \, g_t $$
Parameters:
- B: current mini-batch
- |B|: mini-batch size
- g_t: average gradient at step t
- eta: learning rate
- Standard default in modern deep learning.
- Balances computational efficiency and gradient quality.
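As a minimal sketch of the update above (the toy quadratic per-sample loss, function name, and hyperparameters here are illustrative assumptions, not from this section):

```python
import random

def sgd_minibatch_step(theta, batch, eta):
    """One mini-batch step for the toy loss l(theta; x) = 0.5 * (theta - x)**2."""
    # Per-sample gradient is (theta - x); g_t averages it over the mini-batch B.
    g = sum(theta - x for x in batch) / len(batch)
    # theta_{t+1} = theta_t - eta * g_t
    return theta - eta * g

# Usage: repeated noisy steps pull theta toward the data mean (about 2.0).
random.seed(0)
data = [2.0 + random.gauss(0.0, 0.1) for _ in range(256)]
theta = 10.0
for _ in range(300):
    batch = random.sample(data, 32)
    theta = sgd_minibatch_step(theta, batch, eta=0.1)
```

Each step uses only 32 of the 256 samples, so the gradient is a noisy but cheap estimate of the full-batch gradient.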
3. Core Challenge: Learning Rate¶
Learning rate \(\eta\) is the step size.
- If \(\eta\) is too large: oscillation or divergence.
- If \(\eta\) is too small: very slow training and stalled progress.
```mermaid
flowchart LR
A["Large Learning Rate"] --> B["Overshoot Minima"]
B --> C["Oscillation / Divergence"]
D["Small Learning Rate"] --> E["Tiny Progress"]
E --> F["Slow Convergence"]
```
4. Momentum-Based Methods¶
4.1 Momentum¶
Momentum accumulates a velocity vector to smooth updates:
$$ \begin{aligned} g_t &= \nabla_\theta \mathcal{L}(\theta_t) \\ v_t &= \beta v_{t-1} + g_t \\ \theta_{t+1} &= \theta_t - \eta v_t \end{aligned} $$
- \(\beta\) controls memory of past gradients (commonly around \(0.9\)).
- Reduces zig-zag behavior in narrow valleys.
- Improves speed on smooth directions.
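A minimal sketch of the momentum update (the toy objective \(f(\theta)=\theta^2\) and the chosen step size are illustrative assumptions):

```python
def momentum_step(theta, v, grad, eta, beta=0.9):
    # v_t = beta * v_{t-1} + g_t ; theta_{t+1} = theta_t - eta * v_t
    v = beta * v + grad
    return theta - eta * v, v

# Usage: minimize f(theta) = theta**2, whose gradient is 2 * theta.
theta, v = 5.0, 0.0
for _ in range(200):
    theta, v = momentum_step(theta, v, grad=2.0 * theta, eta=0.05)
```

The velocity keeps pushing in the recent average descent direction, which damps sign flips in the raw gradient.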
4.2 Nesterov Accelerated Gradient (NAG)¶
Nesterov computes the gradient at a look-ahead point:
$$ \begin{aligned} \tilde{\theta}_t &= \theta_t - \eta \beta v_{t-1} \\ v_t &= \beta v_{t-1} + \nabla_\theta \mathcal{L}(\tilde{\theta}_t) \\ \theta_{t+1} &= \theta_t - \eta v_t \end{aligned} $$
- Anticipates future position before updating.
- Often converges with better directional correction than classical momentum.
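The look-ahead step can be sketched as follows (the toy gradient function and hyperparameters are illustrative assumptions):

```python
def nag_step(theta, v, grad_fn, eta, beta=0.9):
    # Evaluate the gradient at the look-ahead point theta - eta * beta * v,
    # i.e. where momentum alone would carry the parameters next.
    lookahead = theta - eta * beta * v
    v = beta * v + grad_fn(lookahead)
    return theta - eta * v, v

# Usage: same toy objective f(theta) = theta**2 (gradient 2 * theta).
theta, v = 5.0, 0.0
for _ in range(200):
    theta, v = nag_step(theta, v, grad_fn=lambda t: 2.0 * t, eta=0.05)
```

Evaluating the gradient at the anticipated position corrects the velocity before overshoot happens, rather than after.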
5. Adaptive Learning-Rate Methods¶
5.1 AdaGrad¶
AdaGrad accumulates squared gradients per parameter.
Update equations:
$$ \begin{aligned} s_t &= s_{t-1} + g_t^2 \\ \theta_{t+1} &= \theta_t - \frac{\eta}{\sqrt{s_t}+\epsilon}\odot g_t \end{aligned} $$
Implementation form:
```
s = s + g*g
theta = theta - (eta / (sqrt(s) + eps)) * g
```
Parameters:
- \(g_t\): gradient at step \(t\)
- \(s_t\): accumulated squared-gradient vector
- \(\eta\): base learning rate
- \(\epsilon\): small stability constant
Key points:
- Larger historical gradients -> smaller future steps.
- Works well for sparse features.
- Limitation: \(s_t\) grows monotonically, so the effective learning rate can become too small.
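A runnable sketch of the AdaGrad update on a scalar parameter (the toy objective \(f(\theta)=\theta^2\) and the base learning rate are illustrative assumptions):

```python
import math

def adagrad_step(theta, s, grad, eta=0.5, eps=1e-8):
    # s accumulates squared gradients and never decays,
    # so the effective step eta / sqrt(s) shrinks monotonically.
    s = s + grad * grad
    theta = theta - (eta / (math.sqrt(s) + eps)) * grad
    return theta, s

# Usage: minimize f(theta) = theta**2 (gradient 2 * theta).
theta, s = 5.0, 0.0
for _ in range(500):
    theta, s = adagrad_step(theta, s, grad=2.0 * theta)
```

Because `s` only grows, long runs eventually take vanishingly small steps, which is exactly the limitation noted above.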
5.2 RMSProp¶
RMSProp uses an exponentially decaying average of squared gradients.
Update equations:
$$ \begin{aligned} s_t &= \gamma s_{t-1} + (1-\gamma)g_t^2 \\ \theta_{t+1} &= \theta_t - \frac{\eta}{\sqrt{s_t}+\epsilon}\odot g_t \end{aligned} $$
Implementation form:
```
s = gamma*s + (1-gamma)*(g*g)
theta = theta - (eta / (sqrt(s) + eps)) * g
```
Parameters:
- \(\gamma\): decay rate (typically near \(0.9\))
- Other symbols are as defined in AdaGrad.
Key points:
- Fixes AdaGrad's unbounded accumulation behavior.
- Keeps adaptive step sizes useful during long training.
- Useful in non-stationary and sequence training settings.
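The only change from AdaGrad is the decaying average; a sketch on the same illustrative toy objective:

```python
import math

def rmsprop_step(theta, s, grad, eta=0.01, gamma=0.9, eps=1e-8):
    # Exponentially decaying average of squared gradients: old history
    # fades with factor gamma instead of accumulating forever.
    s = gamma * s + (1 - gamma) * grad * grad
    theta = theta - (eta / (math.sqrt(s) + eps)) * grad
    return theta, s

# Usage: minimize f(theta) = theta**2 (gradient 2 * theta).
theta, s = 5.0, 0.0
for _ in range(1000):
    theta, s = rmsprop_step(theta, s, grad=2.0 * theta)
```

In steady state `s` tracks the recent squared-gradient magnitude, so the step size stays usable no matter how long training runs.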
5.3 Adam (Adaptive Moment Estimation)¶
Adam combines momentum (first moment) and RMS-style scaling (second moment).
Update equations:
$$ \begin{aligned} m_t &= \beta_1 m_{t-1} + (1-\beta_1)g_t \\ v_t &= \beta_2 v_{t-1} + (1-\beta_2)g_t^2 \\ \hat{m}_t &= \frac{m_t}{1-\beta_1^t}, \quad \hat{v}_t = \frac{v_t}{1-\beta_2^t} \\ \theta_{t+1} &= \theta_t - \eta\frac{\hat{m}_t}{\sqrt{\hat{v}_t}+\epsilon} \end{aligned} $$
Implementation form:
```
m = b1*m + (1-b1)*g
v = b2*v + (1-b2)*(g*g)
mhat = m/(1-b1^t)
vhat = v/(1-b2^t)
theta = theta - eta * mhat / (sqrt(vhat) + eps)
```
Parameters:
- \(\beta_1\): first-moment decay
- \(\beta_2\): second-moment decay
- \(m_t\): first moment estimate
- \(v_t\): second moment estimate
Typical defaults:
- \(\beta_1 = 0.9\)
- \(\beta_2 = 0.999\)
- \(\epsilon = 10^{-8}\)
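A scalar sketch of one Adam step with the defaults above (the toy objective, starting point, and learning rate are illustrative assumptions):

```python
import math

def adam_step(theta, m, v, grad, t, eta=0.1, b1=0.9, b2=0.999, eps=1e-8):
    m = b1 * m + (1 - b1) * grad           # first-moment (momentum) estimate
    v = b2 * v + (1 - b2) * grad * grad    # second-moment (scale) estimate
    m_hat = m / (1 - b1 ** t)              # bias correction for zero init
    v_hat = v / (1 - b2 ** t)
    return theta - eta * m_hat / (math.sqrt(v_hat) + eps), m, v

# Bias correction matters at t=1: from theta=5.0 with grad=10.0,
# m_hat=10 and sqrt(v_hat)=10, so the first step has magnitude eta.
theta, m, v = adam_step(5.0, 0.0, 0.0, grad=10.0, t=1)
first_step_theta = theta   # about 4.9

# Longer run on the toy objective f(theta) = theta**2 (gradient 2 * theta).
for t in range(2, 1001):
    theta, m, v = adam_step(theta, m, v, grad=2.0 * theta, t=t)
```

Without the bias-correction terms, the zero-initialized moments would make early steps far too small.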
5.4 AdamW¶
AdamW decouples weight decay from the adaptive gradient update.
Update equation: $$ \theta_{t+1} = \theta_t - \eta\frac{\hat{m}_t}{\sqrt{\hat{v}_t}+\epsilon} - \eta\lambda\theta_t $$
Implementation form:
```
theta = theta - eta * mhat / (sqrt(vhat) + eps) - eta * weight_decay * theta
```
Parameters:
- \(\lambda\): weight decay coefficient
- Other symbols are as defined in Adam.
Key points:
- Better regularization behavior than naive L2 mixed into Adam gradients.
- Common default for modern transformer training.
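The decoupling is easiest to see with a zero gradient: the adaptive term vanishes, but the decay term still shrinks the weight. A sketch (hyperparameter values are illustrative assumptions):

```python
import math

def adamw_step(theta, m, v, grad, t, eta=0.1, b1=0.9, b2=0.999,
               eps=1e-8, weight_decay=0.01):
    m = b1 * m + (1 - b1) * grad
    v = b2 * v + (1 - b2) * grad * grad
    m_hat = m / (1 - b1 ** t)
    v_hat = v / (1 - b2 ** t)
    # Decoupled decay: applied directly to theta, never mixed into the
    # gradient, so it is not rescaled by the adaptive denominator.
    adaptive = eta * m_hat / (math.sqrt(v_hat) + eps)
    return theta - adaptive - eta * weight_decay * theta, m, v

# With grad = 0 the adaptive term is 0, yet theta still decays:
# 1.0 - 0.1 * 0.01 * 1.0 = 0.999
theta, m, v = adamw_step(1.0, 0.0, 0.0, grad=0.0, t=1)
```

Mixing L2 into the gradient instead would let large `v_hat` values suppress the regularization, which is the behavior AdamW avoids.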
6. Geometric Interpretation of Optimizer Behavior¶
```mermaid
flowchart TD
A["Loss Surface"] --> B["SGD: noisy path"]
A --> C["Momentum: smoothed path"]
A --> D["AdaGrad/RMSProp/Adam: axis-wise scaled path"]
B --> E["May bounce in high-curvature directions"]
C --> F["Builds velocity along consistent descent"]
D --> G["Takes smaller steps on steep dimensions, larger on flat ones"]
```
- SGD follows local slope directly and can oscillate in steep ravines.
- Momentum damps oscillations and preserves direction.
- Adaptive methods rescale each parameter's step based on gradient history.
7. Learning-Rate Scheduling¶
7.1 Step Decay¶
Formula:
$$ \eta_t = \eta_0 \cdot \gamma^{\left\lfloor t / s \right\rfloor} $$
Parameters:
- \(\eta_t\): learning rate at step/epoch \(t\)
- \(\eta_0\): initial learning rate
- \(s\): decay step interval
- \(\gamma\): decay factor, where \(0 < \gamma < 1\)
Interpretation:
- The learning rate remains constant for each block of \(s\) steps and drops by factor \(\gamma\) at the boundary.
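A direct translation of the formula (the default values are illustrative assumptions):

```python
def step_decay(t, eta0=0.1, s=10, gamma=0.5):
    # eta_t = eta0 * gamma ** floor(t / s); integer division gives the floor.
    return eta0 * gamma ** (t // s)

# Usage: constant within each block of s steps, halved at each boundary.
rates = [step_decay(t) for t in (0, 9, 10, 25)]
```

Steps 0 through 9 share one rate, step 10 drops it by `gamma`, and step 25 has been through two drops.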
7.2 Exponential Decay¶
Formula:
$$ \eta_t = \eta_0 e^{-\lambda t} $$
Parameters:
- \(\eta_t\): learning rate at step/epoch \(t\)
- \(\eta_0\): initial learning rate
- \(\lambda\): decay constant (\(\lambda > 0\))
Interpretation:
- The learning rate decreases smoothly and continuously over time.
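The continuous decay in code (default values are illustrative assumptions):

```python
import math

def exp_decay(t, eta0=0.1, lam=0.01):
    # eta_t = eta0 * exp(-lambda * t): every step shrinks the rate slightly.
    return eta0 * math.exp(-lam * t)
```

Unlike step decay there are no plateaus; with `lam=0.01` the rate falls to about 37% of its initial value by step 100.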
7.3 Cosine Annealing¶
Formula:
$$ \eta_t = \eta_{\min} + \frac{1}{2}\left(\eta_{\max} - \eta_{\min}\right)\left(1 + \cos\frac{\pi t}{T}\right) $$
Parameters:
- \(\eta_t\): learning rate at step/epoch \(t\)
- \(\eta_{\max}\): maximum learning rate
- \(\eta_{\min}\): minimum learning rate
- \(T\): total schedule length (steps or epochs)
Interpretation:
- The learning rate follows a cosine curve from \(\eta_{\max}\) down to \(\eta_{\min}\), usually giving stable late-stage convergence.
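The schedule as a function (the default rates are illustrative assumptions):

```python
import math

def cosine_anneal(t, T, eta_max=0.1, eta_min=0.001):
    # Cosine curve from eta_max at t=0 down to eta_min at t=T.
    return eta_min + 0.5 * (eta_max - eta_min) * (1 + math.cos(math.pi * t / T))

# Usage: endpoints hit the extremes, the midpoint sits halfway between them.
start, mid, end = cosine_anneal(0, 100), cosine_anneal(50, 100), cosine_anneal(100, 100)
```

The flat start and flat finish of the cosine give gentle transitions at both ends of training, which is why it is popular for late-stage fine convergence.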
7.4 Warmup¶
Definition:
- Warmup starts from a small learning rate and increases it gradually during the first few steps/epochs.
Simple linear warmup formula:
$$ \eta_t = \eta_{\text{start}} + \frac{t}{T_w}\left(\eta_{\text{target}} - \eta_{\text{start}}\right), \quad 0 \le t \le T_w $$
Parameters:
- \(\eta_{\text{start}}\): initial warmup learning rate
- \(\eta_{\text{target}}\): learning rate reached at warmup end
- \(T_w\): warmup duration
Interpretation:
- Warmup prevents unstable jumps at the beginning of training and is especially useful with large models or large batch sizes.
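Linear warmup in code (holding the target rate after warmup ends is an added assumption for completeness):

```python
def linear_warmup(t, T_w, eta_start=1e-6, eta_target=0.1):
    # Ramp linearly from eta_start to eta_target over the first T_w steps,
    # then hold eta_target (a common convention, assumed here).
    if t >= T_w:
        return eta_target
    return eta_start + (t / T_w) * (eta_target - eta_start)
```

In practice this is chained with a decay schedule: warmup governs the first `T_w` steps and the decay schedule takes over afterwards.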
8. Gradient Clipping¶
Controls exploding gradients by bounding the gradient before the parameter update.
8.1 Value Clipping¶
Each gradient component is clipped independently to \([-c, c]\): \(g_i \leftarrow \max(-c, \min(c, g_i))\). This can change the gradient's direction.
8.2 Norm Clipping¶
If \(\lVert g \rVert_2 > c\), the whole gradient is rescaled: \(g \leftarrow c \, g / \lVert g \rVert_2\). Norm clipping is commonly preferred because it preserves direction.
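Both variants can be sketched on a plain list of gradient components (the function names are illustrative, not a specific library API):

```python
import math

def clip_by_value(grad, c):
    # Clip each component independently to [-c, c]; may change direction.
    return [max(-c, min(c, g)) for g in grad]

def clip_by_norm(grad, max_norm):
    # Rescale the whole vector when its L2 norm exceeds max_norm;
    # the direction is preserved exactly.
    norm = math.sqrt(sum(g * g for g in grad))
    if norm > max_norm:
        return [g * (max_norm / norm) for g in grad]
    return list(grad)

# Usage: gradient [3, 4] has norm 5. Norm clipping to 1 keeps the 3:4
# direction ([0.6, 0.8]); value clipping to 1 distorts it to [1, 1].
by_norm = clip_by_norm([3.0, 4.0], 1.0)
by_value = clip_by_value([3.0, 4.0], 1.0)
```

The example shows concretely why norm clipping is usually preferred: only the magnitude changes, not the descent direction.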
9. Batch Size and Optimization Dynamics¶
- Smaller batches:
- Noisier gradients.
- Better exploration and often better generalization.
- Lower memory requirement.
- Larger batches:
- Smoother gradients.
- Better hardware throughput.
- Usually require careful learning-rate scaling and warmup.
10. Practical Optimizer Selection¶
- Start with Adam/AdamW for rapid baseline development.
- Try SGD + Momentum for final performance tuning when stable training is achieved.
- Use RMSProp when adaptive behavior is needed and training is sensitive to non-stationary gradients.
No optimizer is universally best; validate empirically on your dataset and architecture.
11. Optimization Debugging Checklist¶
- Loss diverges quickly -> lower learning rate, apply gradient clipping.
- Loss oscillates -> reduce learning rate, add momentum or stronger smoothing.
- Loss decreases too slowly -> increase learning rate or tune schedule.
- Loss plateaus early -> use adaptive optimizer, scheduler, or better initialization.
- Train loss decreases but validation degrades -> apply regularization and early stopping.
12. Pseudocode¶
12.1 Generic Mini-Batch Training Loop¶
```
Initialize model parameters theta
Initialize optimizer state S
for epoch in 1..E:
    for mini-batch B in training_data:
        g = grad(loss(theta; B))
        g = clip_if_needed(g)
        theta, S = optimizer_update(theta, g, S)
        adjust_learning_rate_if_scheduler_used()
    evaluate_on_validation_set()
```
12.2 Adam Update Step¶
```
Input: theta, g_t, m_{t-1}, v_{t-1}, beta1, beta2, lr, eps, t
m_t   = beta1 * m_{t-1} + (1 - beta1) * g_t
v_t   = beta2 * v_{t-1} + (1 - beta2) * (g_t * g_t)
m_hat = m_t / (1 - beta1^t)
v_hat = v_t / (1 - beta2^t)
theta = theta - lr * m_hat / (sqrt(v_hat) + eps)
```
13. Real-World Use Cases¶
- Computer vision classification: AdamW + cosine schedule + warmup for stable deep-network training.
- NLP with sparse token features: adaptive optimizers improve per-parameter learning behavior.
- Time-series forecasting: RMSProp/Adam with gradient clipping often improves recurrent model stability.
14. Summary¶
- Optimization controls how efficiently a neural network learns from gradients.
- Learning rate is the most critical hyperparameter.
- Momentum improves direction stability; adaptive methods improve per-parameter step control.
- Scheduling, clipping, and batch-size strategy are essential for robust convergence.
- Use validation-driven comparisons to choose optimizer and hyperparameters for a specific task.