LSTM Networks: Mathematical and Intuitive Notes¶
Long Short-Term Memory (LSTM) networks are a specialized form of recurrent neural networks (RNNs) designed to model sequence data while handling long-range dependencies more effectively than vanilla RNNs.
1. Why LSTM Was Needed¶
Standard RNNs pass hidden state through time, but in practice they struggle when the relevant signal lies far in the past, because gradients shrink or blow up as they are propagated through many steps (the long-term dependency problem).
Typical examples:
- language modeling (earlier words affect later predictions),
- speech and audio sequences,
- time-series forecasting with delayed effects.
LSTM addresses this with controlled memory flow.
2. Core Idea: Cell State as a Memory Highway¶
The key design element is the cell state \(C_t\), which carries memory through time with mostly linear flow.
Intuition:
- the cell state is a long conveyor belt of information,
- gates decide what to forget, write, and expose at each time step.
This gate-based control makes long-term retention easier.
3. LSTM Step-by-Step (Single Time Step)¶
Given input \(x_t\), previous hidden \(h_{t-1}\), and previous cell \(C_{t-1}\):
3.1 Forget Gate¶
Decides what fraction of the old memory to keep:
\[
f_t = \sigma\left(W_f\,[h_{t-1}, x_t] + b_f\right)
\]
3.2 Input Gate and Candidate Memory¶
The input gate decides where (and how strongly) to write; the candidate proposes what to write:
\[
i_t = \sigma\left(W_i\,[h_{t-1}, x_t] + b_i\right), \qquad
\tilde C_t = \tanh\left(W_C\,[h_{t-1}, x_t] + b_C\right)
\]
3.3 Cell Update¶
Combine the retained old memory with the gated new candidate:
\[
C_t = f_t \odot C_{t-1} + i_t \odot \tilde C_t
\]
3.4 Output Gate and Hidden State¶
Expose a filtered view of the memory as the hidden output:
\[
o_t = \sigma\left(W_o\,[h_{t-1}, x_t] + b_o\right), \qquad
h_t = o_t \odot \tanh(C_t)
\]
Here:
- \(\sigma\) is the sigmoid function (values in \([0,1]\)),
- \(\odot\) is element-wise multiplication.
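To make the gating arithmetic concrete, here is a worked scalar example of the cell-update and output steps, with all gate values chosen by hand (the numbers are purely illustrative, not from any trained model):

```python
import math

# Worked scalar example of one LSTM step (hand-picked gate values).
C_prev = 2.0      # old memory
f_t = 0.9         # forget gate: keep 90% of the old memory
i_t = 0.5         # input gate: write half of the candidate
C_tilde = -1.0    # candidate content
o_t = 1.0         # output gate: expose everything

C_t = f_t * C_prev + i_t * C_tilde   # 0.9*2.0 + 0.5*(-1.0) = 1.3
h_t = o_t * math.tanh(C_t)           # tanh(1.3) ≈ 0.8617
print(C_t, round(h_t, 4))
```

Note how the new memory is a weighted blend of old content and the candidate, and the hidden output is a squashed, gated view of it.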
4. Gate Interpretation (Professor Intuition)¶
- Forget gate \(f_t\): “What memory should I erase?”
- Input gate \(i_t\): “Where should I write new memory?”
- Candidate \(\tilde C_t\): “What content should be written?”
- Output gate \(o_t\): “What part of memory should be revealed now?”
This is why LSTM is robust for sequences that require selective memory persistence.
5. Geometric/Signal Intuition¶
Without gates, repeated nonlinear transforms can dilute useful signal over long horizons.
LSTM introduces controlled linear-like memory transport in \(C_t\), reducing information decay and enabling better long-range credit assignment.
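A toy numeric illustration of this point (the weights and gate values below are arbitrary choices for demonstration): pushing a signal through repeated squashing nonlinearities erodes it, while a near-identity gated carry preserves it.

```python
import numpy as np

# Push the same scalar signal through 100 time steps two ways.
steps = 100
signal = 1.5

# (a) Vanilla RNN-style transport: repeated tanh with a sub-unit weight
#     progressively shrinks the signal toward zero.
x = signal
for _ in range(steps):
    x = np.tanh(0.9 * x)

# (b) LSTM-style cell transport: a forget gate near 1 applies an
#     almost-linear carry, so the stored value barely decays.
c = signal
f = 0.999  # forget gate close to 1 -> long retention
for _ in range(steps):
    c = f * c  # C_t = f_t * C_{t-1}, with no new writes

print(f"after {steps} steps: tanh chain = {x:.4f}, gated cell = {c:.4f}")
```

The tanh chain collapses toward zero, while the gated cell still holds most of the original value after 100 steps.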
6. Variants Commonly Discussed¶
6.1 Peephole LSTM¶
Gate layers can inspect cell state directly (peephole connections), improving timing-sensitive behaviors in some tasks.
6.2 Coupled Forget/Input Gates¶
Forget and input decisions are tied, reducing parameters.
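One standard way to couple the gates is to reuse the forget gate's complement as the write strength, so no separate input gate is learned:

\[
C_t = f_t \odot C_{t-1} + (1 - f_t) \odot \tilde C_t
\]

Whatever is forgotten is replaced by new content in the same proportion.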
6.3 GRU (Related Architecture)¶
Gated Recurrent Unit merges some LSTM mechanisms (fewer gates/states), often simpler and faster with competitive performance.
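For reference, one common GRU formulation (biases omitted; update-gate conventions vary between papers and libraries):

\[
z_t = \sigma\left(W_z\,[h_{t-1}, x_t]\right), \qquad
r_t = \sigma\left(W_r\,[h_{t-1}, x_t]\right)
\]
\[
\tilde h_t = \tanh\left(W\,[r_t \odot h_{t-1},\, x_t]\right), \qquad
h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde h_t
\]

The GRU has no separate cell state; the hidden state itself is the gated memory.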
7. Practical Implementation Notes¶
- Normalize/standardize sequence features where appropriate.
- Use sequence batching with padding + masking for variable-length sequences.
- Add dropout or recurrent dropout for regularization.
- Clip gradients in long sequences for training stability.
- Use early stopping on validation metric.
- Compare LSTM vs GRU baseline on your dataset.
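The gradient-clipping item above can be sketched as global-norm clipping; this is a minimal NumPy sketch (the function name and interface are illustrative, and frameworks ship built-in versions, e.g. PyTorch's `clip_grad_norm_`):

```python
import numpy as np

def clip_by_global_norm(grads, max_norm):
    """Rescale a list of gradient arrays so their joint L2 norm
    does not exceed max_norm (global-norm clipping)."""
    total_norm = float(np.sqrt(sum(np.sum(g ** 2) for g in grads)))
    if total_norm > max_norm:
        scale = max_norm / total_norm
        grads = [g * scale for g in grads]
    return grads, total_norm

# Example: a gradient of norm 5 clipped down to norm 1.
clipped, norm = clip_by_global_norm([np.array([3.0, 4.0])], max_norm=1.0)
```

Clipping the global norm (rather than each array independently) preserves the direction of the overall update.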
8. Pseudocode (Forward Pass for One Sequence)¶
```
Input: sequence x_1 ... x_T
Initialize h_0 = 0, C_0 = 0
for t = 1 .. T:
    f_t  = sigmoid(W_f [h_{t-1}, x_t] + b_f)   # forget gate
    i_t  = sigmoid(W_i [h_{t-1}, x_t] + b_i)   # input gate
    C~_t = tanh(W_c [h_{t-1}, x_t] + b_c)      # candidate memory
    C_t  = f_t * C_{t-1} + i_t * C~_t          # cell update
    o_t  = sigmoid(W_o [h_{t-1}, x_t] + b_o)   # output gate
    h_t  = o_t * tanh(C_t)                     # hidden state
Return h_1 ... h_T (or only the final h_T)
```
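The pseudocode translates directly into NumPy. The sketch below is illustrative (function name, argument layout, and weight shapes are assumptions, not a reference implementation); each `W_*` maps the concatenated `[h_{t-1}, x_t]` to `hidden_dim` pre-activations:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_forward(xs, W_f, b_f, W_i, b_i, W_c, b_c, W_o, b_o):
    """Forward pass over a sequence xs of shape (T, input_dim)."""
    H = W_f.shape[0]
    h = np.zeros(H)
    C = np.zeros(H)
    hs = []
    for x_t in xs:
        z = np.concatenate([h, x_t])        # [h_{t-1}, x_t]
        f_t = sigmoid(W_f @ z + b_f)        # forget gate
        i_t = sigmoid(W_i @ z + b_i)        # input gate
        C_tilde = np.tanh(W_c @ z + b_c)    # candidate memory
        C = f_t * C + i_t * C_tilde         # cell update
        o_t = sigmoid(W_o @ z + b_o)        # output gate
        h = o_t * np.tanh(C)                # hidden state
        hs.append(h)
    return np.stack(hs), C

# Smoke run with random weights (shapes are illustrative only).
rng = np.random.default_rng(0)
T, D, H = 4, 3, 5
W = lambda: 0.1 * rng.normal(size=(H, H + D))
hs, C_T = lstm_forward(rng.normal(size=(T, D)),
                       W(), np.zeros(H), W(), np.zeros(H),
                       W(), np.zeros(H), W(), np.zeros(H))
print(hs.shape)  # (4, 5)
```

Production code would instead stack the four weight matrices into one matmul per step and process batches, but the per-gate form above matches the pseudocode one-to-one.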
9. Real-World Use Cases¶
- language modeling and text generation,
- sentiment and sequence classification,
- speech recognition,
- multivariate time-series forecasting,
- anomaly detection in temporal logs/signals.
10. Evaluation Checklist¶
Track:
1. training/validation loss by epoch,
2. a sequence-specific metric (accuracy, F1, perplexity, or RMSE),
3. horizon-wise error for forecasting tasks,
4. the gap between training and validation performance (overfitting),
5. inference latency for long sequences.
11. Summary¶
LSTM improves on the vanilla RNN by introducing a gated memory cell that controls forgetting, writing, and reading at each time step. This architecture is particularly effective when sequence tasks require learning both short-term and long-term dependencies.