Transformer Networks: Concepts, Mathematics, and Workflow

Transformers are sequence models built around attention, designed to capture dependencies across tokens without strict recurrent processing.


1. Why Transformers Were Introduced

Classical RNN/LSTM sequence modeling has core bottlenecks:

  • Sequential computation limits parallelism.
  • Long-range dependencies are hard to preserve.
  • Sequence-to-sequence tasks with different input/output lengths are difficult to optimize.

Transformers address this by replacing recurrence with attention-based token interactions.


2. From Encoder-Decoder to Dynamic Context

Early sequence-to-sequence encoder-decoder models used a single context vector from the encoder's final hidden state. That static bottleneck is weak for long/complex sequences.

Attention replaces static context with dynamic context:

  • Each output position can focus on relevant input positions.
  • Context is recomputed per token.
  • Query-Key-Value (QKV) formalism makes this differentiable and learnable.

flowchart LR
  A["Input Tokens"] --> B["Encoder States"]
  B --> C["Attention (Dynamic Context)"]
  C --> D["Decoder / Output Projection"]

3. Self-Attention Core

For input matrix X:

Q = X * W_Q
K = X * W_K
V = X * W_V

Scaled dot-product attention:

Scores = (Q * K^T) / sqrt(d_k)
Alpha  = softmax(Scores)
Output = Alpha * V

Where:

  • d_k is the key dimension used for scaling.
  • Alpha is the attention-weight matrix (row-wise probabilities).
  • Output is the context-aware representation.

Why scale by sqrt(d_k):

  • Prevents large dot-product magnitudes.
  • Stabilizes softmax gradients.
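The QKV equations above can be sketched in NumPy (a minimal illustration; the shapes and random inputs are made up for the example, and the weights stand in for learned parameters):

```python
import numpy as np

def softmax(scores, axis=-1):
    # Subtract the row max before exponentiating for numerical stability.
    scores = scores - scores.max(axis=axis, keepdims=True)
    e = np.exp(scores)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    d_k = K.shape[-1]
    scores = (Q @ K.T) / np.sqrt(d_k)   # (seq_len, seq_len)
    alpha = softmax(scores, axis=-1)    # row-wise attention weights
    return alpha @ V, alpha             # context-aware output, weights

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))             # 5 tokens, d_model = 8
W_Q = rng.normal(size=(8, 8))
W_K = rng.normal(size=(8, 8))
W_V = rng.normal(size=(8, 8))

out, alpha = scaled_dot_product_attention(X @ W_Q, X @ W_K, X @ W_V)
print(out.shape)           # (5, 8)
print(alpha.sum(axis=-1))  # each row sums to 1
```

Each row of `alpha` is a probability distribution over input positions, which is exactly the "dynamic context" idea from section 2.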


4. Why Single Self-Attention Is Not Enough

A single attention map is often insufficient to capture all linguistic patterns at once. One sentence can contain multiple simultaneous structures:

  • subject-verb agreement
  • noun modifier relations
  • dependency/coreference relations
  • named entities
  • semantic/contextual relations

This motivates multi-head attention.


5. Multi-Head Attention (MHA)

Each head learns a different attention pattern subspace.

Per-head computation:

head_i = Attention(Q_i, K_i, V_i)

Concatenate and project:

MHA(X) = Concat(head_1, head_2, ..., head_h) * W_O

Dimension rule:

d_k = d_v = d_model / h

Worked shape example:

  • d_model = 12
  • h = 4
  • per-head d_k = 3
  • each head output is size 3
  • concatenated output size is 12
  • final projection by W_O maps back to model dimension

flowchart LR
  X["Input X"] --> H1["Head 1"]
  X --> H2["Head 2"]
  X --> H3["Head 3"]
  X --> H4["Head 4"]
  H1 --> C["Concatenate"]
  H2 --> C
  H3 --> C
  H4 --> C
  C --> WO["Linear Projection (W_O)"]
  WO --> Y["MHA Output"]

Notes:

  • Heads are parallelizable.
  • Number of heads is a hyperparameter.
  • More heads increase representational diversity but add computation.
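The worked shape example (d_model = 12, h = 4, per-head d_k = 3) can be checked directly in NumPy. This is a sketch, not an optimized implementation: each head simply takes its own slice of the projected Q, K, V, and the weights are random stand-ins for learned parameters:

```python
import numpy as np

def softmax(s):
    s = s - s.max(axis=-1, keepdims=True)
    e = np.exp(s)
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(X, W_Q, W_K, W_V, W_O, h):
    seq_len, d_model = X.shape
    d_k = d_model // h                      # dimension rule: d_k = d_model / h
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    heads = []
    for i in range(h):
        sl = slice(i * d_k, (i + 1) * d_k)  # this head's subspace
        scores = (Q[:, sl] @ K[:, sl].T) / np.sqrt(d_k)
        heads.append(softmax(scores) @ V[:, sl])   # (seq_len, d_k)
    # Concatenate heads back to (seq_len, d_model), then project with W_O.
    return np.concatenate(heads, axis=-1) @ W_O

rng = np.random.default_rng(0)
d_model, h, seq_len = 12, 4, 6
X = rng.normal(size=(seq_len, d_model))
W_Q = rng.normal(size=(d_model, d_model))
W_K = rng.normal(size=(d_model, d_model))
W_V = rng.normal(size=(d_model, d_model))
W_O = rng.normal(size=(d_model, d_model))

Y = multi_head_attention(X, W_Q, W_K, W_V, W_O, h)
print(Y.shape)  # (6, 12) — back to (seq_len, d_model)
```

In practice the per-head slicing is usually done by reshaping to (h, seq_len, d_k) and batching the matrix products, but the slice loop makes the head structure explicit.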


6. Order Problem and Positional Encoding

Attention alone is permutation-invariant; it does not inherently encode token order. So position information is added to token embeddings.

Input to the first Transformer block:

X_input = TokenEmbedding + PositionalEncoding

Standard sinusoidal positional encoding:

PE(pos, 2i)   = sin(pos / 10000^(2i / d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))

Key properties:

  • Depends only on position pos and channel index i.
  • Independent of actual token identity.
  • Allows the model to distinguish sequences with the same words but different order.

Example intuition:

  • "cat chased mouse" vs "mouse chased cat"
  • Same words, different order, different meaning.
  • Positional encoding makes the internal representations different.
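The sinusoidal formulas translate directly into a short NumPy function (shapes here are chosen only for illustration):

```python
import numpy as np

def positional_encoding(max_len, d_model):
    pos = np.arange(max_len)[:, None]        # positions 0..max_len-1, column vector
    i = np.arange(d_model // 2)[None, :]     # channel-pair index i
    angle = pos / (10000 ** (2 * i / d_model))
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angle)              # even channels: PE(pos, 2i)
    pe[:, 1::2] = np.cos(angle)              # odd channels:  PE(pos, 2i+1)
    return pe

pe = positional_encoding(50, 16)
print(pe.shape)                    # (50, 16)
# Different positions get different vectors, independent of token identity:
print(np.allclose(pe[3], pe[7]))   # False
```

Because the encoding depends only on position, adding it to the token embeddings is what lets the model tell "cat chased mouse" apart from "mouse chased cat".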


7. Transformer Block Structure

Encoder block:

Input
 -> Multi-Head Self-Attention
 -> Add & LayerNorm
 -> Position-wise Feed Forward Network
 -> Add & LayerNorm
 -> Output

Decoder block adds:

  • masked self-attention (causal mask)
  • encoder-decoder cross-attention

flowchart TD
  A["Token + Positional Embedding"] --> B["Multi-Head Self-Attention"]
  B --> C["Add & LayerNorm"]
  C --> D["Feed Forward Network"]
  D --> E["Add & LayerNorm"]
  E --> F["Block Output"]
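The causal mask in the decoder's masked self-attention can be sketched as follows: future positions (j > i) are set to -inf before the softmax, so they receive exactly zero attention weight. This is a minimal illustration with all-zero scores:

```python
import numpy as np

def causal_mask(seq_len):
    # Strict upper triangle (j > i) marks future tokens; -inf there
    # makes softmax assign them zero weight.
    return np.triu(np.full((seq_len, seq_len), -np.inf), k=1)

def masked_softmax_rows(scores):
    scores = scores + causal_mask(scores.shape[0])
    scores = scores - scores.max(axis=-1, keepdims=True)
    e = np.exp(scores)
    return e / e.sum(axis=-1, keepdims=True)

alpha = masked_softmax_rows(np.zeros((4, 4)))
print(alpha)
# With zero scores, row i attends uniformly over positions 0..i;
# all future positions get weight exactly 0.
```

During training this lets the decoder process the whole target sequence in parallel while still preventing each position from looking at tokens it has not yet generated.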

8. Training vs Inference (Important Distinction)

Two stages must be separated conceptually:

  • Training:
  • learn parameters (W_Q, W_K, W_V, W_O, FFN weights, embeddings)
  • learn task behavior from input-output supervision (or pretraining objective)
  • Inference:
  • use learned parameters to generate/predict outputs
  • no gradient updates in standard serving path

In practical systems, models may be periodically updated/retrained as data distribution shifts.


9. End-to-End NLP Pipeline (Transformer View)

Raw text
 -> Tokenization
 -> Token IDs
 -> Token Embedding
 -> Add Positional Encoding
 -> Stacked Transformer Blocks
 -> Task Head (classification / generation / tagging)
 -> Output
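The first three pipeline stages can be sketched with a toy whitespace tokenizer. The vocabulary mapping here is entirely made up for illustration; real systems use learned subword tokenizers such as BPE or WordPiece:

```python
# Hypothetical toy vocabulary; id 0 is reserved for unknown tokens.
vocab = {"<unk>": 0, "the": 1, "cat": 2, "sat": 3}

def tokenize(text):
    # Raw text -> lowercase whitespace tokens -> token IDs.
    return [vocab.get(tok, vocab["<unk>"]) for tok in text.lower().split()]

print(tokenize("The cat sat"))  # [1, 2, 3]
print(tokenize("dog"))          # [0]  (out-of-vocabulary -> <unk>)
```

The resulting ID list is what gets looked up in the embedding table and combined with positional encodings before entering the stacked blocks.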

For translation:

Source sentence
 -> Encoder representation
 -> Decoder (masked self-attn + cross-attn)
 -> Target sentence generation

10. Practical Hyperparameters

  • d_model: embedding/model width
  • h: number of attention heads
  • d_ff: FFN hidden width
  • N: number of encoder/decoder layers
  • dropout rates
  • max sequence length

General trade-off:

  • Higher capacity improves expressiveness.
  • Cost rises in memory/latency.
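These hyperparameters are often collected into a single config object. The defaults below mirror the base configuration from the original Transformer paper (d_model = 512, h = 8, d_ff = 2048, N = 6, dropout = 0.1); max_seq_len is an illustrative assumption, and real values vary widely by task:

```python
from dataclasses import dataclass

@dataclass
class TransformerConfig:
    d_model: int = 512       # embedding/model width
    h: int = 8               # number of attention heads
    d_ff: int = 2048         # FFN hidden width
    N: int = 6               # number of encoder/decoder layers
    dropout: float = 0.1
    max_seq_len: int = 512   # illustrative; task-dependent

cfg = TransformerConfig()
# Dimension rule from section 5: d_k = d_v = d_model / h must divide evenly.
assert cfg.d_model % cfg.h == 0
print(cfg.d_model // cfg.h)  # per-head d_k = 64
```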


11. Pseudocode

Input: token_ids
X = token_embedding(token_ids)
P = positional_encoding(length(token_ids), d_model)
H = X + P

for layer in 1..N:
    A = multi_head_attention(H)               # self-attention
    H = layer_norm(H + A)                     # residual 1
    F = feed_forward(H)
    H = layer_norm(H + F)                     # residual 2

Output = task_head(H)
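The pseudocode maps directly onto a runnable NumPy sketch. For brevity this uses single-head attention, random untrained weights, no masking, no dropout, and omits the task head; it demonstrates only the shape flow through the stacked blocks:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff, N, vocab_size = 16, 32, 2, 100

def softmax(s):
    s = s - s.max(axis=-1, keepdims=True)
    e = np.exp(s)
    return e / e.sum(axis=-1, keepdims=True)

def layer_norm(x, eps=1e-5):
    # Normalize each token's features to zero mean, unit variance.
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def self_attention(H, Wq, Wk, Wv):
    Q, K, V = H @ Wq, H @ Wk, H @ Wv
    return softmax((Q @ K.T) / np.sqrt(K.shape[-1])) @ V

def positional_encoding(n, d):
    pos = np.arange(n)[:, None]
    i = np.arange(d // 2)[None, :]
    angle = pos / (10000 ** (2 * i / d))
    pe = np.zeros((n, d))
    pe[:, 0::2], pe[:, 1::2] = np.sin(angle), np.cos(angle)
    return pe

# Random weights stand in for learned parameters.
token_embedding = rng.normal(size=(vocab_size, d_model))
layers = [
    {"Wq": rng.normal(size=(d_model, d_model)) * 0.1,
     "Wk": rng.normal(size=(d_model, d_model)) * 0.1,
     "Wv": rng.normal(size=(d_model, d_model)) * 0.1,
     "W1": rng.normal(size=(d_model, d_ff)) * 0.1,
     "W2": rng.normal(size=(d_ff, d_model)) * 0.1}
    for _ in range(N)
]

token_ids = [3, 14, 15, 9, 2]
H = token_embedding[token_ids] + positional_encoding(len(token_ids), d_model)
for p in layers:
    A = self_attention(H, p["Wq"], p["Wk"], p["Wv"])  # self-attention
    H = layer_norm(H + A)                             # residual 1
    F = np.maximum(0, H @ p["W1"]) @ p["W2"]          # ReLU feed-forward
    H = layer_norm(H + F)                             # residual 2
print(H.shape)  # (5, 16): one d_model vector per input token
```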

12. Quick Revision

  • Transformer replaces recurrent sequence processing with attention.
  • Core operator: softmax(QK^T / sqrt(d_k))V.
  • Multi-head attention captures multiple feature patterns in parallel.
  • Positional encoding is required to inject order information.
  • Final architecture combines MHA, residuals, normalization, and FFN blocks.
  • Training and inference should be treated as separate operational phases.