Transformer Architecture — How LLMs Are Built

PM: Skim — 20 min

The Architecture Behind Every Modern LLM

In 2017, a Google paper titled "Attention Is All You Need" introduced the Transformer. Almost every major language model today — GPT, Claude, Gemini, Llama, DeepSeek — is a Transformer or a close descendant.

Understanding the transformer doesn't require the math. Understanding the structure — what's in it and why — gives you the intuitions to reason about model behavior.

The Three-Phase Stack

A transformer processes text through three stages:

Stage 1 — Tokenization and Embedding: Text splits into tokens; each token ID maps to a vector via a lookup table. These are the model's initial representations.
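As a rough illustration of Stage 1, the lookup can be sketched in a few lines of Python. Everything here is a toy assumption: the four-word vocabulary, the whitespace tokenizer (real models use learned subword tokenizers), the 4-dimensional vectors, and the random table values, which a real model would learn during training.

```python
import random

# Hypothetical toy vocabulary; real vocabularies hold tens of thousands of subword tokens.
VOCAB = {"<unk>": 0, "the": 1, "cat": 2, "sat": 3}
EMBED_DIM = 4  # illustrative; real models use hundreds or thousands of dimensions

rng = random.Random(0)
# Embedding table: one row of EMBED_DIM numbers per token ID.
# Random here; in a trained model these values are learned.
EMBEDDING_TABLE = [[rng.uniform(-1, 1) for _ in range(EMBED_DIM)] for _ in VOCAB]


def tokenize(text):
    """Map text to token IDs. A whitespace split stands in for a real tokenizer."""
    return [VOCAB.get(word, VOCAB["<unk>"]) for word in text.lower().split()]


def embed(token_ids):
    """Look up each token ID's row in the embedding table."""
    return [EMBEDDING_TABLE[token_id] for token_id in token_ids]


token_ids = tokenize("The cat sat")
vectors = embed(token_ids)  # one vector per token
```

The point to take away is that embedding is a table lookup, not a computation: token ID 2 always fetches row 2, whatever the surrounding text is. Context only enters in the transformer blocks.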

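Stage 2 — N Transformer Blocks: Vectors pass through a stack of structurally identical blocks, each with its own learned weights (GPT-3 uses 96; smaller models typically use 12–32). Each block refines the representations by letting each token "look at" the other tokens in the context and update its meaning accordingly. In GPT-style models, a token can only look at the tokens that come before it.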

Stage 3 — Output Head: The final vector at the last position is projected to one score per vocabulary entry, and a softmax turns those scores into a probability distribution over the next token. One token is sampled and appended to the input; the process repeats.
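Stage 3 can be sketched as a projection followed by a softmax and a random draw. This is a minimal sketch under stated assumptions: the 3-dimensional vector, the 4-entry vocabulary, and the hand-picked projection weights are all made up for illustration, and the function names are hypothetical.

```python
import math
import random


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def output_head(final_vector, projection):
    """Project the last position's vector to one score per vocabulary entry."""
    logits = [sum(w * x for w, x in zip(row, final_vector)) for row in projection]
    return softmax(logits)


def sample_next_token(probs, rng):
    """Draw one token ID according to the probability distribution."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


# Toy values (assumptions): model dimension 3, vocabulary of 4 tokens.
final_vector = [0.5, -1.0, 2.0]
projection = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [1.0, 1.0, 1.0],
]

probs = output_head(final_vector, projection)
next_token_id = sample_next_token(probs, random.Random(0))
```

Because the token is sampled rather than always taken as the most likely one, the same prompt can produce different continuations from run to run.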

Inside a Transformer Block

Each block has two main sub-components, each wrapped in a residual connection and layer normalization:

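Multi-head self-attention: Each token's new representation is a weighted sum of the value vectors of the tokens it can attend to, itself included, with weights that reflect learned relevance. "Multi-head" means this runs several times in parallel with different learned patterns, so the model can track different kinds of relationships at once. For intuition, picture one head following who is speaking, another the topic, and another the sentiment, although real heads rarely divide the work this cleanly.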
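The weighted sum described above can be sketched for a single head as scaled dot-product attention. This is a simplified sketch, not a full implementation: in a real model the queries, keys, and values come from learned projections of the token vectors, whereas here they are passed in directly, and the causal mask (each position sees only itself and earlier positions) reflects GPT-style models. The function names are illustrative.

```python
import math


def softmax(scores):
    """Turn raw scores into weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def causal_self_attention(queries, keys, values):
    """Single-head attention where position i attends to positions 0..i."""
    d_k = len(keys[0])
    outputs = []
    for i, query in enumerate(queries):
        # Relevance score of each visible position: scaled dot product.
        scores = [
            sum(q * k for q, k in zip(query, keys[j])) / math.sqrt(d_k)
            for j in range(i + 1)
        ]
        weights = softmax(scores)
        # New representation: weighted sum of the visible value vectors.
        outputs.append([
            sum(w * values[j][c] for j, w in enumerate(weights))
            for c in range(len(values[0]))
        ])
    return outputs


# Toy input (assumption): three 2-dimensional token vectors used as Q, K, and V.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs = causal_self_attention(x, x, x)
```

The first position can only see itself, so its output equals its own value vector. Later positions blend in earlier ones in proportion to how well their keys match the query. A multi-head layer would run several of these in parallel, each with its own projections, and combine the results.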

Feed-forward network (FFN): A two-layer neural network applied to each position independently. Interpretability research suggests that much of a model's factual knowledge is stored here. As a rough picture: attention routes information between tokens; the FFN stores what the model knows about concepts.
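A position-wise FFN can be sketched as two linear layers with a nonlinearity between them. The sketch below assumes ReLU, as in the original Transformer paper (many newer models use other activations), and uses tiny hand-picked weights so the result is easy to check by hand; real weights are learned and far larger.

```python
def linear(vector, weights, bias):
    """One linear layer: each output is a weighted sum of the inputs plus a bias."""
    return [sum(w * x for w, x in zip(row, vector)) + b for row, b in zip(weights, bias)]


def feed_forward(vector, w1, b1, w2, b2):
    """Expand to a wider hidden layer, apply ReLU, project back to model size."""
    hidden = [max(0.0, h) for h in linear(vector, w1, b1)]
    return linear(hidden, w2, b2)


# Toy weights (assumptions): model size 2, hidden size 4.
W1 = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
B1 = [0.0, 0.0, 0.0, 0.0]
W2 = [[1.0, 0.0, -1.0, 0.0], [0.0, 1.0, 0.0, -1.0]]
B2 = [0.0, 0.0]

sequence = [[2.0, -3.0], [0.5, 0.5], [2.0, -3.0]]
# "Applied to each position independently": same weights, one position at a time.
ffn_outputs = [feed_forward(vector, W1, B1, W2, B2) for vector in sequence]
```

Positions 0 and 2 hold the same vector and therefore get the same output. The FFN never looks at neighbouring positions; mixing information across tokens is left to attention.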

What Happens as Layers Deepen

Early layers tend to capture syntax and surface patterns. Middle layers build semantic relationships. Final layers organize the output for the specific generation task. This is a gradient rather than a clean division, and it is one reason models with more layers tend to handle nuanced reasoning better.

Positional Encoding

Self-attention is inherently position-agnostic: without extra information, it treats a word identically whether it appears first or last. The original Transformer addressed this by adding a positional encoding to each token's embedding before the transformer blocks.

Most modern models instead use rotary positional encoding (RoPE), which rotates the query and key vectors inside each attention layer, or ALiBi, which adds a distance-based penalty to the attention scores. Both tend to generalize better to longer sequences than the original sinusoidal encoding.
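For reference, the original sinusoidal encoding mentioned above can be sketched as follows. This follows the formula from the 2017 paper (sine on even dimensions, cosine on odd ones, with wavelengths growing geometrically); the function names and the toy embeddings are illustrative assumptions. RoPE and ALiBi work differently and are not shown here.

```python
import math


def sinusoidal_encoding(position, dim):
    """Position vector: sin/cos pairs at geometrically increasing wavelengths."""
    encoding = []
    for i in range(0, dim, 2):
        angle = position / (10000 ** (i / dim))
        encoding.append(math.sin(angle))  # even dimension
        encoding.append(math.cos(angle))  # odd dimension
    return encoding[:dim]


def add_position(embeddings):
    """Add each position's encoding to the token embedding at that position."""
    dim = len(embeddings[0])
    return [
        [e + p for e, p in zip(vector, sinusoidal_encoding(position, dim))]
        for position, vector in enumerate(embeddings)
    ]


# The same token embedding at two positions (toy values, an assumption).
same_word_twice = [[1.0, 1.0, 1.0, 1.0], [1.0, 1.0, 1.0, 1.0]]
positioned = add_position(same_word_twice)
```

After the addition, the two copies of the same word no longer have identical vectors, which is what lets attention distinguish the first occurrence from the second.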

What Parameter Count Means

When you hear "7 billion parameters," those parameters are mostly the weight matrices in:

The embedding table
Every attention layer's Q, K, V, and output projections
Every FFN's two weight matrices

More parameters means some combination of a larger embedding dimension, more layers, and a bigger FFN, which adds up to more representational capacity. Whether that translates to better performance on a specific task depends on training and fine-tuning, not parameter count alone.
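The list above translates into back-of-the-envelope arithmetic. The sketch below is an approximation under stated assumptions: it ignores biases, layer normalization, and positional parameters, and it assumes the FFN hidden layer is 4 times the embedding dimension, a common but not universal choice. The GPT-3 figures used in the example (vocabulary of 50,257, embedding dimension 12,288, 96 layers) come from its published configuration, not from this article.

```python
def estimate_parameters(vocab_size, d_model, n_layers, ffn_multiplier=4):
    """Rough weight count: embedding table plus per-block attention and FFN matrices."""
    embedding = vocab_size * d_model
    # Q, K, V, and output projections: four d_model x d_model matrices.
    attention = 4 * d_model * d_model
    # Two FFN matrices: d_model -> hidden and hidden -> d_model.
    ffn = 2 * d_model * (ffn_multiplier * d_model)
    return embedding + n_layers * (attention + ffn)


gpt3_estimate = estimate_parameters(vocab_size=50257, d_model=12288, n_layers=96)
print(f"{gpt3_estimate / 1e9:.1f}B parameters")  # prints "174.6B parameters"
```

The estimate lands close to the 175 billion usually quoted for GPT-3, and it shows where the parameters sit: the embedding table contributes well under 1% of the total, while the attention and FFN matrices in the 96 blocks account for nearly all of it.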

PM Takeaway

The stack of transformer blocks is a large part of what makes bigger models qualitatively better at complex reasoning: not just "more memory," but a deeper cascade of contextual refinement. When someone says a model is "deeper," they mean it has more transformer blocks, which is distinct from how much data it was trained on.
