Encoder, Decoder, and Encoder-Decoder Models
Three Families, One Foundation
The Transformer architecture can be configured in three distinct ways, each suited to a different class of tasks. Most models you encounter fall into one of these families.
Encoder-Only: Understanding Without Generating
Canonical model: BERT
Encoder-only models use bidirectional attention: each token attends to every other token in the input, in both directions. This produces rich, context-aware representations of the full input.
What they're good at:
- Classification: sentiment analysis, topic labeling, spam detection
- Embedding generation: a single vector representing the semantic meaning of a sentence or document (foundational for RAG and semantic search)
- Named entity recognition: identifying and labeling spans of text
- Extractive QA: finding the answer span in a given passage
What they cannot do: generate fluent, open-ended text. There is no autoregressive decoder; the model produces representations, not new tokens.
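To make the embedding use case above concrete, here is a minimal sketch. An encoder emits one vector per token; one common way to collapse those into a single sentence vector is mean pooling (an assumption here, since pooling strategies vary by model), after which texts can be compared by cosine similarity. The token vectors are made-up numbers standing in for real encoder output, and the function names are illustrative.

```python
import math

def mean_pool(token_vectors):
    """Average per-token vectors into one fixed-size sentence vector."""
    dim = len(token_vectors[0])
    count = len(token_vectors)
    return [sum(vec[i] for vec in token_vectors) / count for i in range(dim)]

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Hypothetical encoder outputs: one 3-dimensional vector per token.
query_tokens = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
doc_a_tokens = [[1.0, 1.0, 0.0], [1.0, 1.0, 0.0], [1.0, 1.0, 0.0]]
doc_b_tokens = [[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]]

query = mean_pool(query_tokens)  # [0.5, 0.5, 0.0]
doc_a = mean_pool(doc_a_tokens)  # [1.0, 1.0, 0.0]
doc_b = mean_pool(doc_b_tokens)  # [0.0, 0.0, 1.0]

sim_a = cosine_similarity(query, doc_a)  # approximately 1.0: same direction
sim_b = cosine_similarity(query, doc_b)  # 0.0: orthogonal
print(sim_a > sim_b)  # True
```

Note that the sentence vector has the same size regardless of how many tokens went in, which is what makes it usable as an index key for search.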
Decoder-Only: Generating Text
Canonical models: GPT-4, Claude, Llama 3, Gemini, Mistral
Decoder-only models use causal attention (masked self-attention): each token can attend only to itself and the tokens before it. This enforces the autoregressive property: only past context is available when predicting the next token.
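The difference between causal and bidirectional attention comes down to which positions each token is allowed to see. The sketch below builds both masks for a four-token sequence and applies one causal row to a set of placeholder attention scores. The helper names are illustrative, and real implementations typically add a large negative value to masked scores before the softmax rather than zeroing the exponentials, which has the same effect.

```python
import math

def attention_mask(seq_len, causal):
    """mask[i][j] is True when token i may attend to token j."""
    return [[(j <= i) if causal else True for j in range(seq_len)]
            for i in range(seq_len)]

def render(mask):
    """Show a mask as rows of 1 (allowed) and 0 (blocked)."""
    return "\n".join(" ".join("1" if ok else "0" for ok in row) for row in mask)

def masked_softmax(scores, allowed):
    """Softmax over scores, giving zero weight to blocked positions."""
    exps = [math.exp(s) if ok else 0.0 for s, ok in zip(scores, allowed)]
    total = sum(exps)
    return [e / total for e in exps]

bidirectional = attention_mask(4, causal=False)  # every entry allowed
causal = attention_mask(4, causal=True)          # lower triangle only

print(render(causal))
# 1 0 0 0
# 1 1 0 0
# 1 1 1 0
# 1 1 1 1

# Second token (row 1) with equal placeholder scores: weight is split
# between positions 0 and 1, and nothing leaks in from the future.
weights = masked_softmax([0.0, 0.0, 0.0, 0.0], causal[1])
print(weights)  # [0.5, 0.5, 0.0, 0.0]
```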
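This constraint is what makes generation work. The model is trained to predict the next token, which teaches it an approximate probability distribution over text. Sampling from that distribution one token at a time, each conditioned on everything generated so far, produces coherent extended text.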
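The generation loop itself is short. The sketch below uses a hypothetical lookup table in place of a trained model, and that table conditions only on the previous token to keep the example tiny; a real decoder conditions on the whole prefix. It also uses greedy decoding (always pick the most likely token), which is one of several decoding strategies.

```python
# Hypothetical next-token probabilities, keyed by the previous token only.
NEXT_TOKEN_PROBS = {
    "<s>": {"the": 0.9, "cat": 0.1},
    "the": {"cat": 0.7, "mat": 0.3},
    "cat": {"sat": 0.8, "</s>": 0.2},
    "sat": {"</s>": 1.0},
    "mat": {"</s>": 1.0},
}

def next_token_distribution(prefix):
    """Stand-in for a model forward pass over the tokens generated so far."""
    return NEXT_TOKEN_PROBS[prefix[-1]]

def generate(max_new_tokens=10):
    """Greedy autoregressive decoding: append one token at a time."""
    tokens = ["<s>"]
    for _ in range(max_new_tokens):
        dist = next_token_distribution(tokens)
        token = max(dist, key=dist.get)  # greedy: most probable next token
        if token == "</s>":              # end-of-sequence marker stops generation
            break
        tokens.append(token)             # the output becomes part of the next input
    return tokens[1:]                    # drop the start marker

generated = generate()
print(generated)  # ['the', 'cat', 'sat']
```

Each iteration feeds the model its own previous output, which is why generation cost grows with output length.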
Modern instruction-tuned assistants (ChatGPT, Claude, Gemini) are decoder-only models fine-tuned with methods such as RLHF or Constitutional AI to follow instructions and produce helpful responses.
Encoder-Decoder: Transforming One Sequence Into Another
Canonical models: T5, BART
These models separate the understanding step (encoder) from the generation step (decoder). The encoder processes the full input with bidirectional attention; the decoder generates output token-by-token, attending to both the encoder's representation (cross-attention) and its own prior outputs.
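Cross-attention is the piece that connects the two halves: the decoder's current state acts as the query, and the encoder's outputs act as keys and values. The sketch below is a simplified single-head version under stated assumptions: it omits the learned query, key, and value projections that real models apply, and the vectors are made-up numbers.

```python
import math

def softmax(xs):
    """Numerically stable softmax."""
    peak = max(xs)
    exps = [math.exp(x - peak) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cross_attention(decoder_state, encoder_states):
    """One decoder query attends over all encoder states.

    Encoder states serve as both keys and values here; learned
    projections are left out to keep the sketch short.
    """
    scale = math.sqrt(len(decoder_state))
    scores = [dot(decoder_state, key) / scale for key in encoder_states]
    weights = softmax(scores)
    dim = len(encoder_states[0])
    context = [sum(w * value[i] for w, value in zip(weights, encoder_states))
               for i in range(dim)]
    return weights, context

# Hypothetical encoder output for a 3-token input, and one decoder state.
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
decoder_state = [2.0, 0.0]

weights, context = cross_attention(decoder_state, encoder_states)
# Input tokens 0 and 2 align with the query and receive more weight than token 1.
print(weights[0] > weights[1], weights[2] > weights[1])  # True True
```

There is no causal mask here: the decoder may look at every input position, because the whole input is already known before generation starts.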
Best suited for tasks where the output is a structured transformation of the input:
- Machine translation (English → French)
- Summarization (long article → short abstract)
- Abstractive QA (passage + question → generated answer)
When to Use Which
| Need | Architecture | Typical examples |
|---|---|---|
| Semantic search, RAG retrieval | Encoder-only | text-embedding-3-small, BAAI/bge-m3 |
| Text generation, chat | Decoder-only | Claude, GPT-4, Llama 3 |
| Translation, summarization | Encoder-decoder | T5, BART |
| Classification tasks | Encoder-only | Fine-tuned DistilBERT |
In practice, frontier decoder-only models (GPT-4, Claude) are now so capable at summarization and translation that pure encoder-decoder models have become niche. But embedding models, which are typically encoder-only, remain essential infrastructure for any retrieval system.
When your team says "the model," they almost certainly mean a decoder-only generative model. Embedding models, which power search and RAG, are a different architecture with different evaluation criteria. They need to be selected and assessed separately.
Further Reading
- Training Pipeline: how these architectures are trained and aligned
- Embeddings: what encoder-only models produce and why it matters