Transformer Networks: Concepts, Mathematics, and Workflow¶
Transformers are sequence models built around attention, designed to capture dependencies across tokens without strict recurrent processing.
1. Why Transformers Were Introduced¶
Classical RNN/LSTM sequence modeling has core bottlenecks:
- Sequential computation limits parallelism.
- Long-range dependencies are hard to preserve.
- Sequence-to-sequence tasks with different input/output lengths are difficult to optimize.
Transformers address this by replacing recurrence with attention-based token interactions.
2. From Encoder-Decoder to Dynamic Context¶
Early sequence-to-sequence encoder-decoder models used a single context vector from the encoder's final hidden state. That static bottleneck is weak for long/complex sequences.
Attention replaces static context with dynamic context:
- Each output position can focus on relevant input positions.
- Context is recomputed per token.
- Query-Key-Value (QKV) formalism makes this differentiable and learnable.
flowchart LR
A["Input Tokens"] --> B["Encoder States"]
B --> C["Attention (Dynamic Context)"]
C --> D["Decoder / Output Projection"]
3. Self-Attention Core¶
For input matrix X:
Q = X * W_Q
K = X * W_K
V = X * W_V
Scaled dot-product attention:
Scores = (Q * K^T) / sqrt(d_k)
Alpha = softmax(Scores)
Output = Alpha * V
Where:
- d_k is key dimension used for scaling.
- Alpha is attention-weight matrix (row-wise probabilities).
- Output is the context-aware representation.
Why scale by sqrt(d_k):
- Prevents large dot-product magnitudes.
- Stabilizes softmax gradients.
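The scaled dot-product formula above can be sketched directly in NumPy (a minimal illustration; the shapes, random seed, and helper name are arbitrary assumptions, not part of the original text):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Compute softmax(Q K^T / sqrt(d_k)) V with a numerically stable softmax."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)               # (seq_q, seq_k)
    scores = scores - scores.max(axis=-1, keepdims=True)  # stability shift
    alpha = np.exp(scores)
    alpha = alpha / alpha.sum(axis=-1, keepdims=True)     # row-wise probabilities
    return alpha @ V, alpha

# Toy self-attention: 4 tokens, d_k = d_v = 8, Q = K = V = X.
rng = np.random.default_rng(0)
X = rng.standard_normal((4, 8))
out, alpha = scaled_dot_product_attention(X, X, X)
print(out.shape)            # (4, 8)
print(alpha.sum(axis=-1))   # each row sums to 1
```

Note that each row of Alpha is a probability distribution over the input positions, which is exactly the "dynamic context" idea from the previous section.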
4. Why Single Self-Attention Is Not Enough¶
A single attention map is often insufficient to capture all linguistic patterns at once. One sentence can contain multiple simultaneous structures:
- subject-verb agreement
- noun modifier relations
- dependency/coreference relations
- named entities
- semantic/contextual relations
This motivates multi-head attention.
5. Multi-Head Attention (MHA)¶
Each head learns attention patterns in a different representation subspace.
Per-head computation:
head_i = Attention(Q_i, K_i, V_i)
Concatenate and project:
MHA(X) = Concat(head_1, head_2, ..., head_h) * W_O
Dimension rule:
d_k = d_v = d_model / h
Worked shape example:
- d_model = 12, h = 4
- per-head d_k = 12 / 4 = 3
- each head output is size 3
- concatenated output size is 4 * 3 = 12
- final projection by W_O maps back to the model dimension
flowchart LR
X["Input X"] --> H1["Head 1"]
X --> H2["Head 2"]
X --> H3["Head 3"]
X --> H4["Head 4"]
H1 --> C["Concatenate"]
H2 --> C
H3 --> C
H4 --> C
C --> WO["Linear Projection (W_O)"]
WO --> Y["MHA Output"]
Notes:
- Heads are parallelizable.
- The number of heads is a hyperparameter.
- More heads increase representational diversity but add computation.
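The worked shape example above (d_model = 12, h = 4, per-head d_k = 3) can be verified with a minimal NumPy sketch (illustrative only; the sequence length, seed, and splitting-by-slicing layout are assumptions):

```python
import numpy as np

d_model, h = 12, 4
d_k = d_model // h      # 3 per head, matching the worked example
seq_len = 5
rng = np.random.default_rng(1)

X = rng.standard_normal((seq_len, d_model))
W_Q, W_K, W_V, W_O = (rng.standard_normal((d_model, d_model)) for _ in range(4))

def attention(Q, K, V):
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    alpha = np.exp(scores - scores.max(-1, keepdims=True))
    alpha = alpha / alpha.sum(-1, keepdims=True)
    return alpha @ V

# Project once, then split the width into h heads of size d_k each.
Q, K, V = X @ W_Q, X @ W_K, X @ W_V
heads = [attention(Q[:, i*d_k:(i+1)*d_k],
                   K[:, i*d_k:(i+1)*d_k],
                   V[:, i*d_k:(i+1)*d_k]) for i in range(h)]
concat = np.concatenate(heads, axis=-1)   # (seq_len, 12): 4 heads of size 3
output = concat @ W_O                     # projected back to d_model
print(concat.shape, output.shape)
```

Each head attends over its own 3-dimensional slice, and the final W_O projection mixes information across heads.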
6. Order Problem and Positional Encoding¶
Attention alone is permutation-invariant; it does not inherently encode token order. So position information is added to token embeddings.
Input to the first Transformer block:
X_input = TokenEmbedding + PositionalEncoding
Standard sinusoidal positional encoding:
PE(pos, 2i) = sin(pos / 10000^(2i / d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))
Key properties:
- Depends only on position pos and channel index i.
- Independent of actual token identity.
- Allows model to distinguish sequences with same words but different order.
Example intuition:
- "cat chased mouse" vs "mouse chased cat"
- Same words, different order, different meaning.
- Positional encoding makes the internal representations different.
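The sinusoidal formula above can be sketched as follows (a minimal NumPy version; the function name and chosen dimensions are illustrative assumptions):

```python
import numpy as np

def positional_encoding(max_len, d_model):
    """PE(pos, 2i) = sin(pos / 10000^(2i/d_model)), cos on odd channels."""
    pos = np.arange(max_len)[:, None]        # (max_len, 1)
    two_i = np.arange(0, d_model, 2)[None, :]  # even channel indices 2i
    angles = pos / np.power(10000.0, two_i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)   # even channels
    pe[:, 1::2] = np.cos(angles)   # odd channels
    return pe

pe = positional_encoding(50, 16)
print(pe.shape)     # (50, 16)
print(pe[0, :4])    # position 0: sin(0)=0, cos(0)=1 -> [0. 1. 0. 1.]
```

Because the encoding depends only on pos and the channel index, two sequences with the same tokens in different orders receive different inputs to the first block.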
7. Transformer Block Structure¶
Encoder block:
Input
-> Multi-Head Self-Attention
-> Add & LayerNorm
-> Position-wise Feed Forward Network
-> Add & LayerNorm
-> Output
Decoder block adds:
- masked self-attention (causal mask)
- encoder-decoder cross-attention
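The causal mask used in decoder self-attention can be sketched as follows (a minimal NumPy illustration under assumed toy shapes; the additive -inf masking convention is one common implementation choice):

```python
import numpy as np

seq_len = 4
# Causal mask: position t may attend only to positions <= t,
# so all strictly-upper-triangular entries are blocked.
mask = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)

rng = np.random.default_rng(2)
scores = rng.standard_normal((seq_len, seq_len))
scores = np.where(mask, -np.inf, scores)   # block future positions

alpha = np.exp(scores - scores.max(-1, keepdims=True))
alpha = alpha / alpha.sum(-1, keepdims=True)
print(np.allclose(np.triu(alpha, k=1), 0.0))  # True: zero weight on future tokens
```

After the softmax, every attention row still sums to 1, but all weight on future positions is exactly zero, which is what allows autoregressive generation.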
flowchart TD
A["Token + Positional Embedding"] --> B["Multi-Head Self-Attention"]
B --> C["Add & LayerNorm"]
C --> D["Feed Forward Network"]
D --> E["Add & LayerNorm"]
E --> F["Block Output"]
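The Add & LayerNorm wiring of the encoder block can be sketched with stub sublayers (a shape-checking illustration only; the stub attention and FFN here are placeholders, not real trained sublayers, and this LayerNorm omits the learned scale/shift):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each token vector to zero mean, unit variance."""
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def encoder_block(H, self_attention, feed_forward):
    A = self_attention(H)
    H = layer_norm(H + A)      # Add & LayerNorm (residual 1)
    F = feed_forward(H)
    return layer_norm(H + F)   # Add & LayerNorm (residual 2)

# Stub sublayers just to check the wiring and shapes.
rng = np.random.default_rng(3)
H = rng.standard_normal((5, 12))
out = encoder_block(H,
                    lambda x: x @ rng.standard_normal((12, 12)),
                    lambda x: np.maximum(x @ rng.standard_normal((12, 12)), 0))
print(out.shape)  # (5, 12): residuals and norms preserve the model width
```

The residual connections mean each sublayer only has to learn a correction to its input, which helps gradients flow through deep stacks.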
8. Training vs Inference (Important Distinction)¶
Two stages must be separated conceptually:
- Training:
  - learn parameters (W_Q, W_K, W_V, W_O, FFN weights, embeddings)
  - learn task behavior from input-output supervision (or a pretraining objective)
- Inference:
- use learned parameters to generate/predict outputs
- no gradient updates in standard serving path
In practical systems, models may be periodically updated/retrained as data distribution shifts.
9. End-to-End NLP Pipeline (Transformer View)¶
Raw text
-> Tokenization
-> Token IDs
-> Token Embedding
-> Add Positional Encoding
-> Stacked Transformer Blocks
-> Task Head (classification / generation / tagging)
-> Output
For translation:
Source sentence
-> Encoder representation
-> Decoder (masked self-attn + cross-attn)
-> Target sentence generation
10. Practical Hyperparameters¶
- d_model: embedding/model width
- h: number of attention heads
- d_ff: FFN hidden width
- N: number of encoder/decoder layers
- dropout rates
- max sequence length
General trade-off:
- Higher capacity improves expressiveness.
- Cost rises in memory and latency.
11. Pseudocode¶
Input: token_ids
X = token_embedding(token_ids)
P = positional_encoding(length(token_ids), d_model)
H = X + P
for layer in 1..N:
    A = multi_head_attention(H)   # self-attention
    H = layer_norm(H + A)         # residual 1
    F = feed_forward(H)
    H = layer_norm(H + F)         # residual 2
Output = task_head(H)
12. Quick Revision¶
- Transformer replaces recurrent sequence processing with attention.
- Core operator: softmax(QK^T / sqrt(d_k)) V.
- Multi-head attention captures multiple feature patterns in parallel.
- Positional encoding is required to inject order information.
- Final architecture combines MHA, residuals, normalization, and FFN blocks.
- Training and inference should be treated as separate operational phases.