Transformer Architecture Deep Dive
Understand the architecture powering every major LLM. Self-attention, positional encoding, multi-head attention, and why transformers won.
The Architecture That Changed Everything
In 2017, Google published “Attention Is All You Need” — the paper that introduced the Transformer. It’s now the foundation of GPT, Claude, Gemini, Llama, and virtually every major language model.
Self-Attention: The Core Mechanism
The transformer’s breakthrough is self-attention: every token in the input can directly attend to every other token, in parallel.
How It Works
For each token, the model computes three vectors:
- Query (Q): “What information am I looking for?”
- Key (K): “What information do I contain?”
- Value (V): “Here’s my actual information”
The full attention computation, which produces an output vector for every token, is:
$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$
The $\sqrt{d_k}$ scaling keeps the dot products from growing with the dimension. Without it, large values would push the softmax toward near-one-hot distributions, where gradients vanish and training becomes unstable.
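The formula above can be sketched in a few lines of NumPy. This is a minimal single-sequence version for illustration, not a production implementation:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V for one sequence.

    Q, K: (seq_len, d_k); V: (seq_len, d_v).
    """
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # (seq_len, seq_len) logits
    scores -= scores.max(axis=-1, keepdims=True)    # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over keys
    return weights @ V, weights

# Toy example: 4 tokens, d_k = d_v = 8
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(4, 8)) for _ in range(3))
out, w = scaled_dot_product_attention(Q, K, V)
print(out.shape)        # (4, 8): one output vector per token
print(w.sum(axis=-1))   # each row of attention weights sums to 1
```

Note that each token's output is a weighted mix of all value vectors, which is exactly how "it" can pull in information from "cat" regardless of distance.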
Why It Matters
In the sentence “The cat sat on the mat because it was tired”:
- Self-attention helps the model figure out that “it” refers to “cat” (not “mat”)
- This happens by computing high attention scores between “it” and “cat”
Previous architectures (RNNs, LSTMs) processed tokens sequentially — by the time they reached “it”, the signal from “cat” might have faded. Self-attention solves this by processing all tokens simultaneously.
Positional Encoding
Since self-attention processes all tokens in parallel, it has no inherent notion of word order. “Dog bites man” and “Man bites dog” would look the same.
Positional encodings add position information using sinusoidal functions at different frequencies:
$$PE_{(pos, 2i)} = \sin(pos / 10000^{2i/d_{model}})$$
$$PE_{(pos, 2i+1)} = \cos(pos / 10000^{2i/d_{model}})$$
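These two equations translate directly into code. A minimal NumPy sketch of the sinusoidal encoding table:

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)); PE[pos, 2i+1] = cos(same angle)."""
    positions = np.arange(seq_len)[:, None]               # (seq_len, 1)
    i = np.arange(d_model // 2)[None, :]                  # (1, d_model/2)
    angles = positions / np.power(10000.0, 2 * i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)   # even dimensions
    pe[:, 1::2] = np.cos(angles)   # odd dimensions
    return pe

pe = sinusoidal_positional_encoding(seq_len=50, d_model=64)
print(pe.shape)     # (50, 64)
print(pe[0, :4])    # position 0: sin(0)=0, cos(0)=1 -> [0. 1. 0. 1.]
```

The resulting matrix is simply added to the token embeddings before the first transformer block, so the same word at different positions gets a slightly different input vector.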
Modern models often use Rotary Positional Embeddings (RoPE) instead, which encode relative positions more effectively and support longer context windows.
Multi-Head Attention
Instead of a single attention computation, transformers use multiple heads in parallel:
- Head 1 might learn syntactic dependencies (subject-verb agreement)
- Head 2 might learn semantic relationships (synonyms, related concepts)
- Head 3 might focus on local context (nearby words)
Each head has its own Q, K, V weight matrices. Their outputs are concatenated and projected back to the model dimension.
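The split-attend-concatenate-project pattern can be sketched as follows. This is a simplified single-sequence version (real implementations batch this and fuse the per-head projections into one matrix multiply):

```python
import numpy as np

def multi_head_attention(x, W_q, W_k, W_v, W_o, n_heads):
    """Split d_model into n_heads, attend per head, concatenate, project.

    x: (seq_len, d_model); all weight matrices: (d_model, d_model).
    """
    seq_len, d_model = x.shape
    d_head = d_model // n_heads
    Q, K, V = x @ W_q, x @ W_k, x @ W_v
    # reshape each to (n_heads, seq_len, d_head) so heads attend independently
    split = lambda m: m.reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)
    Qh, Kh, Vh = split(Q), split(K), split(V)
    scores = Qh @ Kh.transpose(0, 2, 1) / np.sqrt(d_head)  # per-head logits
    scores -= scores.max(axis=-1, keepdims=True)
    w = np.exp(scores)
    w /= w.sum(axis=-1, keepdims=True)                     # softmax per head
    heads = w @ Vh                                         # (n_heads, seq_len, d_head)
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ W_o                                    # final output projection

rng = np.random.default_rng(0)
d_model, seq_len, n_heads = 16, 5, 4
x = rng.normal(size=(seq_len, d_model))
Ws = [rng.normal(size=(d_model, d_model)) * 0.1 for _ in range(4)]
out = multi_head_attention(x, *Ws, n_heads=n_heads)
print(out.shape)  # (5, 16)
```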
The Full Transformer Block
A transformer block consists of:
- Multi-Head Self-Attention — token relationships
- Layer Normalisation — stabilises training
- Feed-Forward Network — two linear layers with activation (often SwiGLU in modern models)
- Residual Connections — add the input back to the output, enabling gradient flow
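Putting the four components together, one block looks roughly like this. The sketch uses the pre-norm ordering common in modern LLMs, ReLU instead of SwiGLU for brevity, and omits the learnable layer-norm scale and bias:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalise each token vector to zero mean, unit variance."""
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

def self_attention(x, Wq, Wk, Wv):
    Q, K, V = x @ Wq, x @ Wk, x @ Wv
    s = Q @ K.T / np.sqrt(Q.shape[-1])
    s -= s.max(-1, keepdims=True)
    w = np.exp(s)
    w /= w.sum(-1, keepdims=True)
    return w @ V

def transformer_block(x, Wq, Wk, Wv, W1, W2):
    """Pre-norm block: residual around attention, then residual around the FFN."""
    x = x + self_attention(layer_norm(x), Wq, Wk, Wv)   # attention sub-layer
    h = layer_norm(x)
    return x + np.maximum(0, h @ W1) @ W2               # FFN sub-layer (ReLU here)

rng = np.random.default_rng(0)
d, seq = 8, 4
x = rng.normal(size=(seq, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) * 0.1 for _ in range(3))
W1 = rng.normal(size=(d, 4 * d)) * 0.1   # FFN typically expands ~4x
W2 = rng.normal(size=(4 * d, d)) * 0.1
out = transformer_block(x, Wq, Wk, Wv, W1, W2)
print(out.shape)  # (4, 8): same shape in and out, so blocks stack cleanly
```

Because each block maps (seq_len, d_model) to (seq_len, d_model), dozens of them can be stacked, and the residual connections give gradients a direct path through the whole stack.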
Modern LLMs stack 32-128 of these blocks. GPT-4 is estimated to have ~120 layers with ~1.8 trillion parameters across a mixture-of-experts architecture.
Decoder-Only vs Encoder-Decoder
| Architecture | Models | Use Case |
|---|---|---|
| Encoder-only | BERT, RoBERTa | Classification, embeddings |
| Encoder-Decoder | T5, BART | Translation, summarisation |
| Decoder-only | GPT, Claude, Llama | Text generation (what most LLMs use now) |
Modern LLMs are almost exclusively decoder-only with causal masking — each token can only attend to previous tokens, not future ones. This enables autoregressive generation.
The Training Pipeline
- Pre-training: Predict next token on massive text corpora (trillions of tokens)
- Supervised Fine-tuning (SFT): Train on high-quality instruction/response pairs
- RLHF/DPO: Align with human preferences for helpfulness and safety
Key insight: Understanding the transformer architecture helps you reason about model capabilities and limitations — why context matters, why position affects attention, and why models struggle with certain tasks.
Quick Quiz
Test what you just learned. Pick the best answer for each question.
Q1 What is the core innovation of the Transformer architecture?
Q2 What do the Q, K, and V matrices represent in self-attention?
Q3 Why do transformers need positional encoding?
Q4 What is the purpose of multi-head attention?