Attention Mechanism: How Transformers Read Context
The Problem Attention Solves
Before transformers, language models used RNNs (recurrent neural networks) that processed text left to right, one token at a time. By the time the model reached word 50, information from word 1 had passed through 49 transformations and was largely lost. Long-range dependencies, such as a pronoun on line 3 referring to a subject on line 1, were hard to maintain.
Attention solves this by letting every token look directly at every other token, regardless of distance. The model doesn't need to "remember" what it saw; it can examine the full context all at once.
Queries, Keys, and Values
The mechanism is built on three concepts, abbreviated Q, K, V:
- Query (Q): What is this token looking for?
- Key (K): What does each other token advertise about itself?
- Value (V): What information does each token contribute if selected?
Imagine searching a library. Your search terms (Q) match against index entries (K). Books that match (high score) contribute their content (V) to your answer.
For each token, attention computes a score against every other token's key, normalizes those scores into weights, and produces a weighted sum of the values. The result: a new representation for that token incorporating context from all others, proportional to relevance.
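The score, normalize, and weighted-sum steps can be sketched in a few lines of NumPy. This is a minimal illustration, not a production implementation: the names `attention`, `W_q`, `W_k`, and `W_v` are hypothetical, the weights are random rather than learned, and the division by the square root of the key dimension is the scaling used in standard scaled dot-product attention, which the description above does not spell out.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max before exponentiating for numerical stability.
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention for a single sequence.

    Q, K: arrays of shape (n_tokens, d_k); V: shape (n_tokens, d_v).
    """
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (n, n): every query scored against every key
    weights = softmax(scores, axis=-1)   # normalize each row into weights
    return weights @ V, weights          # weighted sum of the values

rng = np.random.default_rng(0)
n_tokens, d_model = 4, 8
x = rng.normal(size=(n_tokens, d_model))  # toy token embeddings (assumed, random)
W_q, W_k, W_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))

out, weights = attention(x @ W_q, x @ W_k, x @ W_v)
print(out.shape)      # (4, 8): one new representation per token
print(weights.shape)  # (4, 4): one weight per (query token, key token) pair
```

Each row of `weights` sums to 1, so every output row is a relevance-weighted average of the value vectors.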
Multi-Head Attention
A single Q/K/V computation captures one type of relationship. Multi-head attention runs it multiple times in parallel, each with different learned weights, letting different heads specialize:
- One head might track syntactic dependencies (subject-verb agreement)
- Another might track coreference (which pronoun refers to which noun)
- Another might track discourse structure
Head counts grow with model size. GPT-3 (175B) uses 96 attention heads per layer; GPT-4's configuration has not been published. Typical 7B models such as Llama 2 7B use 32.
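Running several heads in parallel might look like the sketch below. It assumes the common layout in which the model dimension is split evenly across heads and the head outputs are concatenated and passed through an output projection; `multi_head_attention`, `W_o`, and the toy sizes are hypothetical names and values chosen for illustration, and the weights are random rather than learned.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(x, W_q, W_k, W_v, W_o, n_heads):
    n_tokens, d_model = x.shape
    assert d_model % n_heads == 0, "d_model must divide evenly across heads"
    d_head = d_model // n_heads

    def split_heads(t):
        # (n_tokens, d_model) -> (n_heads, n_tokens, d_head)
        return t.reshape(n_tokens, n_heads, d_head).transpose(1, 0, 2)

    Q, K, V = split_heads(x @ W_q), split_heads(x @ W_k), split_heads(x @ W_v)
    # Each head computes its own (n_tokens, n_tokens) score matrix.
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)
    weights = softmax(scores, axis=-1)
    heads = weights @ V                                   # (n_heads, n_tokens, d_head)
    # Concatenate the heads back into (n_tokens, d_model), then mix them.
    concat = heads.transpose(1, 0, 2).reshape(n_tokens, d_model)
    return concat @ W_o, weights

rng = np.random.default_rng(0)
n_tokens, d_model, n_heads = 5, 16, 4
x = rng.normal(size=(n_tokens, d_model))
W_q, W_k, W_v, W_o = (rng.normal(size=(d_model, d_model)) for _ in range(4))

out, weights = multi_head_attention(x, W_q, W_k, W_v, W_o, n_heads)
print(out.shape)      # (5, 16)
print(weights.shape)  # (4, 5, 5): a separate attention pattern per head
```

Because each head works on its own slice of the projected vectors, the heads are free to learn different attention patterns over the same tokens.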
Self-Attention vs. Cross-Attention
Self-attention: each token attends to the tokens in its own sequence. Used in both encoder-only models (like BERT) and decoder-only models (like GPT). In decoder-only models, a causal mask restricts each token to itself and the tokens before it.
Cross-attention: tokens in one sequence attend to tokens in a different sequence. Used in encoder-decoder models (T5, BART) to connect the encoder's understanding of the input to the decoder's generation of the output.
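The difference between the two comes down to where the queries and the keys/values originate, which a short sketch can make concrete. The helper `attend` is hypothetical and skips the learned Q/K/V projections for brevity, so it illustrates shapes only, not a trained layer.

```python
import numpy as np

def attend(queries_from, keys_values_from):
    """Dot-product attention without learned projections (omitted for brevity)."""
    d = queries_from.shape[-1]
    scores = queries_from @ keys_values_from.T / np.sqrt(d)
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ keys_values_from, weights

rng = np.random.default_rng(0)
encoder_states = rng.normal(size=(6, 8))  # 6 input tokens (toy values)
decoder_states = rng.normal(size=(3, 8))  # 3 output tokens generated so far

# Self-attention: queries, keys, and values all come from the same sequence.
_, self_weights = attend(decoder_states, decoder_states)

# Cross-attention: queries come from the decoder, keys/values from the encoder.
cross_out, cross_weights = attend(decoder_states, encoder_states)

print(self_weights.shape)   # (3, 3)
print(cross_weights.shape)  # (3, 6): each output token weighs all 6 input tokens
print(cross_out.shape)      # (3, 8)
```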
Why This Enables Long-Context Understanding
Because attention connects any two positions directly, increasing context length doesn't degrade quality the way it did for RNNs. The bottleneck is compute: attention scales quadratically with sequence length (O(n²)), because every token is scored against every other token. This is why context window expansion is expensive, and why efficient attention implementations and variants (FlashAttention, sliding window attention) exist to manage the cost.
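A quick back-of-the-envelope calculation shows what quadratic growth means in practice. The helper names are hypothetical, the fp16 size (2 bytes per score) is an assumption, and the memory figure only applies if the full score matrix is actually materialized; the sliding-window count is an upper bound that assumes each token attends to at most `window` positions.

```python
def score_matrix_entries(n_tokens: int) -> int:
    """Query-key scores one head computes for full attention over n tokens."""
    return n_tokens * n_tokens

def sliding_window_entries(n_tokens: int, window: int) -> int:
    """Upper bound on scores when each token sees at most `window` positions."""
    return n_tokens * min(window, n_tokens)

for n in (1_000, 2_000, 4_000, 8_000):
    full = score_matrix_entries(n)
    windowed = sliding_window_entries(n, window=1_000)
    mib = full * 2 / 2**20  # assumed fp16: 2 bytes per score
    print(f"{n:>6} tokens: {full:>11,} full scores (~{mib:,.1f} MiB/head), "
          f"{windowed:>10,} with a 1,000-token window")
```

Doubling the sequence length quadruples the full-attention count, while the windowed count only doubles.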
What Attention Doesn't Do
Attention is a routing and weighting mechanism. It decides which parts of the context are relevant. The factual knowledge (what to do with that relevant context) lives largely in the feed-forward layers that follow each attention block.
This matters: you can increase a model's context window without adding any knowledge. You can add knowledge via fine-tuning without changing the attention mechanism. They're separable.
A model that "forgets" content at the start of a long context is not running out of memory: the attention weights are spread thin across too many tokens, making early content effectively low-weight. This is a compute architecture constraint, not a memory bug.
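A toy calculation illustrates how the weights thin out. It assumes one relevant token with a fixed score and every other token scoring 0; `weight_on_target` and the score of 2.0 are hypothetical, and a real model's scores are not fixed like this, so treat the numbers as a sketch of the softmax arithmetic rather than a measurement.

```python
import math

def weight_on_target(n_tokens: int, target_score: float = 2.0) -> float:
    """Softmax weight on one token scoring `target_score`
    when the other n_tokens - 1 tokens all score 0."""
    target = math.exp(target_score)
    return target / (target + (n_tokens - 1))

for n in (10, 1_000, 100_000):
    print(f"{n:>7} tokens in context -> weight on the relevant token: "
          f"{weight_on_target(n):.5f}")
```

Softmax weights always sum to 1, so with the scores held fixed, every additional token in the context takes a share away from the one that matters.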
Further Reading
- Transformer Architecture: how attention fits into the full model stack
- Context Windows: the practical implications and costs of long context