Transformer Architecture Deep Dive
Understand the architecture powering every major LLM. Self-attention, positional encoding, multi-head attention, and why transformers won.
The Architecture That Changed Everything
In 2017, Google published “Attention Is All You Need” — the paper that introduced the Transformer. It’s now the foundation of GPT, Claude, Gemini, Llama, and virtually every major language model.
Self-Attention: The Core Mechanism
The transformer’s breakthrough is self-attention: every token in the input can directly attend to every other token, in parallel.
How It Works
For each token, the model computes three vectors:
- Query (Q): “What information am I looking for?”
- Key (K): “What information do I contain?”
- Value (V): “Here’s my actual information”
The full attention computation, which produces an output vector for every token, is:
$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$
The $\sqrt{d_k}$ scaling keeps the dot products from growing with the dimension. Without it, large values would push the softmax toward near-one-hot distributions, where gradients vanish and training becomes unstable.
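The formula above can be sketched in a few lines of NumPy. This is a minimal single-sequence version for illustration, not a production implementation:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V for one sequence.

    Q, K: (seq_len, d_k); V: (seq_len, d_v).
    """
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # (seq_len, seq_len) logits
    scores -= scores.max(axis=-1, keepdims=True)    # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over keys
    return weights @ V, weights

# Toy example: 4 tokens, d_k = d_v = 8
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(4, 8)) for _ in range(3))
out, w = scaled_dot_product_attention(Q, K, V)
print(out.shape)        # (4, 8): one output vector per token
print(w.sum(axis=-1))   # each row of attention weights sums to 1
```

Note that each token's output is a weighted mix of all value vectors, which is exactly how "it" can pull in information from "cat" regardless of distance.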
Why It Matters
In the sentence “The cat sat on the mat because it was tired”:
- Self-attention helps the model figure out that “it” refers to “cat” (not “mat”)
- This happens by computing high attention scores between “it” and “cat”
Previous architectures (RNNs, LSTMs) processed tokens sequentially — by the time they reached “it”, the signal from “cat” might have faded. Self-attention solves this by processing all tokens simultaneously.
Positional Encoding
Since self-attention processes all tokens in parallel, it has no inherent notion of word order. “Dog bites man” and “Man bites dog” would look the same.
Positional encodings add position information using sinusoidal functions at different frequencies:
$$PE_{(pos, 2i)} = \sin(pos / 10000^{2i/d_{model}})$$
$$PE_{(pos, 2i+1)} = \cos(pos / 10000^{2i/d_{model}})$$
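These two equations translate directly into code. A minimal NumPy sketch of the sinusoidal encoding table:

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)); PE[pos, 2i+1] = cos(same angle)."""
    positions = np.arange(seq_len)[:, None]               # (seq_len, 1)
    i = np.arange(d_model // 2)[None, :]                  # (1, d_model/2)
    angles = positions / np.power(10000.0, 2 * i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)   # even dimensions
    pe[:, 1::2] = np.cos(angles)   # odd dimensions
    return pe

pe = sinusoidal_positional_encoding(seq_len=50, d_model=64)
print(pe.shape)     # (50, 64)
print(pe[0, :4])    # position 0: sin(0)=0, cos(0)=1 -> [0. 1. 0. 1.]
```

The resulting matrix is simply added to the token embeddings before the first transformer block, so the same word at different positions gets a slightly different input vector.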
Modern models often use Rotary Positional Embeddings (RoPE) instead, which encode relative positions more effectively and support longer context windows.
Multi-Head Attention
Instead of a single attention computation, transformers use multiple heads in parallel:
- Head 1 might learn syntactic dependencies (subject-verb agreement)
- Head 2 might learn semantic relationships (synonyms, related concepts)
- Head 3 might focus on local context (nearby words)
Each head has its own Q, K, V weight matrices. Their outputs are concatenated and projected back to the model dimension.
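The split-attend-concatenate-project pattern can be sketched as follows. This is a simplified single-sequence version (real implementations batch this and fuse the per-head projections into one matrix multiply):

```python
import numpy as np

def multi_head_attention(x, W_q, W_k, W_v, W_o, n_heads):
    """Split d_model into n_heads, attend per head, concatenate, project.

    x: (seq_len, d_model); all weight matrices: (d_model, d_model).
    """
    seq_len, d_model = x.shape
    d_head = d_model // n_heads
    Q, K, V = x @ W_q, x @ W_k, x @ W_v
    # reshape each to (n_heads, seq_len, d_head) so heads attend independently
    split = lambda m: m.reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)
    Qh, Kh, Vh = split(Q), split(K), split(V)
    scores = Qh @ Kh.transpose(0, 2, 1) / np.sqrt(d_head)  # per-head logits
    scores -= scores.max(axis=-1, keepdims=True)
    w = np.exp(scores)
    w /= w.sum(axis=-1, keepdims=True)                     # softmax per head
    heads = w @ Vh                                         # (n_heads, seq_len, d_head)
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ W_o                                    # final output projection

rng = np.random.default_rng(0)
d_model, seq_len, n_heads = 16, 5, 4
x = rng.normal(size=(seq_len, d_model))
Ws = [rng.normal(size=(d_model, d_model)) * 0.1 for _ in range(4)]
out = multi_head_attention(x, *Ws, n_heads=n_heads)
print(out.shape)  # (5, 16)
```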
The Full Transformer Block
A transformer block consists of:
- Multi-Head Self-Attention — token relationships
- Layer Normalisation — stabilises training
- Feed-Forward Network — two linear layers with activation (often SwiGLU in modern models)
- Residual Connections — add the input back to the output, enabling gradient flow
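Putting the four components together, one block looks roughly like this. The sketch uses the pre-norm ordering common in modern LLMs, ReLU instead of SwiGLU for brevity, and omits the learnable layer-norm scale and bias:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalise each token vector to zero mean, unit variance."""
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

def self_attention(x, Wq, Wk, Wv):
    Q, K, V = x @ Wq, x @ Wk, x @ Wv
    s = Q @ K.T / np.sqrt(Q.shape[-1])
    s -= s.max(-1, keepdims=True)
    w = np.exp(s)
    w /= w.sum(-1, keepdims=True)
    return w @ V

def transformer_block(x, Wq, Wk, Wv, W1, W2):
    """Pre-norm block: residual around attention, then residual around the FFN."""
    x = x + self_attention(layer_norm(x), Wq, Wk, Wv)   # attention sub-layer
    h = layer_norm(x)
    return x + np.maximum(0, h @ W1) @ W2               # FFN sub-layer (ReLU here)

rng = np.random.default_rng(0)
d, seq = 8, 4
x = rng.normal(size=(seq, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) * 0.1 for _ in range(3))
W1 = rng.normal(size=(d, 4 * d)) * 0.1   # FFN typically expands ~4x
W2 = rng.normal(size=(4 * d, d)) * 0.1
out = transformer_block(x, Wq, Wk, Wv, W1, W2)
print(out.shape)  # (4, 8): same shape in and out, so blocks stack cleanly
```

Because each block maps (seq_len, d_model) to (seq_len, d_model), dozens of them can be stacked, and the residual connections give gradients a direct path through the whole stack.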
Modern LLMs stack 32-128 of these blocks. GPT-4 is estimated to have ~120 layers with ~1.8 trillion parameters across a mixture-of-experts architecture.
Decoder-Only vs Encoder-Decoder
| Architecture | Models | Use Case |
|---|---|---|
| Encoder-only | BERT, RoBERTa | Classification, embeddings |
| Encoder-Decoder | T5, BART | Translation, summarisation |
| Decoder-only | GPT, Claude, Llama | Text generation (what most LLMs use now) |
Modern LLMs are almost exclusively decoder-only with causal masking — each token can only attend to previous tokens, not future ones. This enables autoregressive generation.
The Training Pipeline
- Pre-training: Predict next token on massive text corpora (trillions of tokens)
- Supervised Fine-tuning (SFT): Train on high-quality instruction/response pairs
- RLHF/DPO: Align with human preferences for helpfulness and safety
Key insight: Understanding the transformer architecture helps you reason about model capabilities and limitations — why context matters, why position affects attention, and why models struggle with certain tasks.
Quick Quiz
Test what you just learned. Pick the best answer for each question.
Q1 What is the core innovation of the Transformer architecture?
Q2 What do the Q, K, and V matrices represent in self-attention?
Q3 Why do transformers need positional encoding?
Q4 What is the purpose of multi-head attention?