Transformer Networks: Concepts, Mathematics, and Workflow¶
Transformers are sequence models built around attention, designed to capture dependencies across tokens without strict recurrent processing.
1. Why Transformers Were Introduced¶
Classical RNN/LSTM sequence modeling has core bottlenecks:
- Sequential computation limits parallelism.
- Long-range dependencies are hard to preserve.
- Sequence-to-sequence tasks with different input/output lengths are difficult to optimize.
Transformers address this by replacing recurrence with attention-based token interactions.
2. From Encoder-Decoder to Dynamic Context¶
Early sequence-to-sequence encoder-decoder models used a single context vector from the encoder's final hidden state. That static bottleneck is weak for long/complex sequences.
Attention replaces static context with dynamic context:
- Each output position can focus on relevant input positions.
- Context is recomputed per token.
- Query-Key-Value (QKV) formalism makes this differentiable and learnable.
flowchart LR
A["Input Tokens"] --> B["Encoder States"]
B --> C["Attention (Dynamic Context)"]
C --> D["Decoder / Output Projection"]
3. Self-Attention Core¶
For input matrix X:
Q = X * W_Q
K = X * W_K
V = X * W_V
Scaled dot-product attention:
Scores = (Q * K^T) / sqrt(d_k)
Alpha = softmax(Scores)
Output = Alpha * V
Where:
- d_k is key dimension used for scaling.
- Alpha is attention-weight matrix (row-wise probabilities).
- Output is the context-aware representation.
Why scale by sqrt(d_k):
- Prevents large dot-product magnitudes.
- Stabilizes softmax gradients.
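The scaled dot-product formula above can be sketched directly in NumPy (a minimal illustration; the shapes, random seed, and helper name are arbitrary assumptions, not part of the original text):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Compute softmax(Q K^T / sqrt(d_k)) V with a numerically stable softmax."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)               # (seq_q, seq_k)
    scores = scores - scores.max(axis=-1, keepdims=True)  # stability shift
    alpha = np.exp(scores)
    alpha = alpha / alpha.sum(axis=-1, keepdims=True)     # row-wise probabilities
    return alpha @ V, alpha

# Toy self-attention: 4 tokens, d_k = d_v = 8, Q = K = V = X.
rng = np.random.default_rng(0)
X = rng.standard_normal((4, 8))
out, alpha = scaled_dot_product_attention(X, X, X)
print(out.shape)            # (4, 8)
print(alpha.sum(axis=-1))   # each row sums to 1
```

Note that each row of Alpha is a probability distribution over the input positions, which is exactly the "dynamic context" idea from the previous section.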
4. Why Single Self-Attention Is Not Enough¶
A single attention map is often insufficient to capture all linguistic patterns at once. One sentence can contain multiple simultaneous structures:
- subject-verb agreement
- noun modifier relations
- dependency/coreference relations
- named entities
- semantic/contextual relations
This motivates multi-head attention.
5. Multi-Head Attention (MHA)¶
Each head learns attention patterns in a different representation subspace.
Per-head computation:
head_i = Attention(Q_i, K_i, V_i)
Concatenate and project:
MHA(X) = Concat(head_1, head_2, ..., head_h) * W_O
Dimension rule:
d_k = d_v = d_model / h
Worked shape example:
- d_model = 12, h = 4
- per-head d_k = 12 / 4 = 3
- each head output is size 3
- concatenated output size is 4 * 3 = 12
- final projection by W_O maps back to the model dimension
flowchart LR
X["Input X"] --> H1["Head 1"]
X --> H2["Head 2"]
X --> H3["Head 3"]
X --> H4["Head 4"]
H1 --> C["Concatenate"]
H2 --> C
H3 --> C
H4 --> C
C --> WO["Linear Projection (W_O)"]
WO --> Y["MHA Output"]
Notes:
- Heads are parallelizable.
- The number of heads is a hyperparameter.
- More heads increase representational diversity but add computation.
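The worked shape example above (d_model = 12, h = 4, per-head d_k = 3) can be verified with a minimal NumPy sketch (illustrative only; the sequence length, seed, and splitting-by-slicing layout are assumptions):

```python
import numpy as np

d_model, h = 12, 4
d_k = d_model // h      # 3 per head, matching the worked example
seq_len = 5
rng = np.random.default_rng(1)

X = rng.standard_normal((seq_len, d_model))
W_Q, W_K, W_V, W_O = (rng.standard_normal((d_model, d_model)) for _ in range(4))

def attention(Q, K, V):
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    alpha = np.exp(scores - scores.max(-1, keepdims=True))
    alpha = alpha / alpha.sum(-1, keepdims=True)
    return alpha @ V

# Project once, then split the width into h heads of size d_k each.
Q, K, V = X @ W_Q, X @ W_K, X @ W_V
heads = [attention(Q[:, i*d_k:(i+1)*d_k],
                   K[:, i*d_k:(i+1)*d_k],
                   V[:, i*d_k:(i+1)*d_k]) for i in range(h)]
concat = np.concatenate(heads, axis=-1)   # (seq_len, 12): 4 heads of size 3
output = concat @ W_O                     # projected back to d_model
print(concat.shape, output.shape)
```

Each head attends over its own 3-dimensional slice, and the final W_O projection mixes information across heads.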
6. Order Problem and Positional Encoding¶
Attention alone is permutation-invariant; it does not inherently encode token order. So position information is added to token embeddings.
Input to the first Transformer block:
X_input = TokenEmbedding + PositionalEncoding
Standard sinusoidal positional encoding:
PE(pos, 2i) = sin(pos / 10000^(2i / d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))
Key properties:
- Depends only on position pos and channel index i.
- Independent of actual token identity.
- Allows model to distinguish sequences with same words but different order.
Example intuition:
- "cat chased mouse" vs "mouse chased cat"
- Same words, different order, different meaning.
- Positional encoding makes the internal representations different.
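The sinusoidal formula above can be sketched as follows (a minimal NumPy version; the function name and chosen dimensions are illustrative assumptions):

```python
import numpy as np

def positional_encoding(max_len, d_model):
    """PE(pos, 2i) = sin(pos / 10000^(2i/d_model)), cos on odd channels."""
    pos = np.arange(max_len)[:, None]        # (max_len, 1)
    two_i = np.arange(0, d_model, 2)[None, :]  # even channel indices 2i
    angles = pos / np.power(10000.0, two_i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)   # even channels
    pe[:, 1::2] = np.cos(angles)   # odd channels
    return pe

pe = positional_encoding(50, 16)
print(pe.shape)     # (50, 16)
print(pe[0, :4])    # position 0: sin(0)=0, cos(0)=1 -> [0. 1. 0. 1.]
```

Because the encoding depends only on pos and the channel index, two sequences with the same tokens in different orders receive different inputs to the first block.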
7. Transformer Block Structure¶
Encoder block:
Input
-> Multi-Head Self-Attention
-> Add & LayerNorm
-> Position-wise Feed Forward Network
-> Add & LayerNorm
-> Output
Decoder block adds:
- masked self-attention (causal mask)
- encoder-decoder cross-attention
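The causal mask used in decoder self-attention can be sketched as follows (a minimal NumPy illustration under assumed toy shapes; the additive -inf masking convention is one common implementation choice):

```python
import numpy as np

seq_len = 4
# Causal mask: position t may attend only to positions <= t,
# so all strictly-upper-triangular entries are blocked.
mask = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)

rng = np.random.default_rng(2)
scores = rng.standard_normal((seq_len, seq_len))
scores = np.where(mask, -np.inf, scores)   # block future positions

alpha = np.exp(scores - scores.max(-1, keepdims=True))
alpha = alpha / alpha.sum(-1, keepdims=True)
print(np.allclose(np.triu(alpha, k=1), 0.0))  # True: zero weight on future tokens
```

After the softmax, every attention row still sums to 1, but all weight on future positions is exactly zero, which is what allows autoregressive generation.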
flowchart TD
A["Token + Positional Embedding"] --> B["Multi-Head Self-Attention"]
B --> C["Add & LayerNorm"]
C --> D["Feed Forward Network"]
D --> E["Add & LayerNorm"]
E --> F["Block Output"]
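The Add & LayerNorm wiring of the encoder block can be sketched with stub sublayers (a shape-checking illustration only; the stub attention and FFN here are placeholders, not real trained sublayers, and this LayerNorm omits the learned scale/shift):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each token vector to zero mean, unit variance."""
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def encoder_block(H, self_attention, feed_forward):
    A = self_attention(H)
    H = layer_norm(H + A)      # Add & LayerNorm (residual 1)
    F = feed_forward(H)
    return layer_norm(H + F)   # Add & LayerNorm (residual 2)

# Stub sublayers just to check the wiring and shapes.
rng = np.random.default_rng(3)
H = rng.standard_normal((5, 12))
out = encoder_block(H,
                    lambda x: x @ rng.standard_normal((12, 12)),
                    lambda x: np.maximum(x @ rng.standard_normal((12, 12)), 0))
print(out.shape)  # (5, 12): residuals and norms preserve the model width
```

The residual connections mean each sublayer only has to learn a correction to its input, which helps gradients flow through deep stacks.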
8. Training vs Inference (Important Distinction)¶
Two stages must be separated conceptually:
- Training:
  - learn parameters (W_Q, W_K, W_V, W_O, FFN weights, embeddings)
  - learn task behavior from input-output supervision (or a pretraining objective)
- Inference:
- use learned parameters to generate/predict outputs
- no gradient updates in standard serving path
In practical systems, models may be periodically updated/retrained as data distribution shifts.
9. End-to-End NLP Pipeline (Transformer View)¶
Raw text
-> Tokenization
-> Token IDs
-> Token Embedding
-> Add Positional Encoding
-> Stacked Transformer Blocks
-> Task Head (classification / generation / tagging)
-> Output
For translation:
Source sentence
-> Encoder representation
-> Decoder (masked self-attn + cross-attn)
-> Target sentence generation
10. Practical Hyperparameters¶
- d_model: embedding/model width
- h: number of attention heads
- d_ff: FFN hidden width
- N: number of encoder/decoder layers
- dropout rates
- max sequence length
General trade-off:
- Higher capacity improves expressiveness.
- Cost rises in memory and latency.
11. Pseudocode¶
Input: token_ids
X = token_embedding(token_ids)
P = positional_encoding(length(token_ids), d_model)
H = X + P
for layer in 1..N:
    A = multi_head_attention(H)   # self-attention
    H = layer_norm(H + A)         # residual 1
    F = feed_forward(H)
    H = layer_norm(H + F)         # residual 2
Output = task_head(H)
12. Quick Revision¶
- Transformer replaces recurrent sequence processing with attention.
- Core operator: softmax(QK^T / sqrt(d_k)) V.
- Multi-head attention captures multiple feature patterns in parallel.
- Positional encoding is required to inject order information.
- Final architecture combines MHA, residuals, normalization, and FFN blocks.
- Training and inference should be treated as separate operational phases.