Encoder, Decoder, and Encoder-Decoder Models

PM: Skim — 15 min

Three Families, One Foundation

The Transformer architecture can be configured in three distinct ways, each suited to a different class of tasks. Most models you encounter fall into one of these families.

Encoder-Only: Understanding Without Generating

Canonical model: BERT

Encoder-only models use bidirectional attention — each token attends to all other tokens in both directions. This produces rich, context-aware representations of the full input.

What they're good at:

Classification: sentiment analysis, topic labeling, spam detection
Embedding generation: a single vector representing the semantic meaning of a sentence or document (foundational for RAG and semantic search)
Named entity recognition: identifying and labeling spans of text
Extractive QA: finding the answer span in a given passage

What they cannot do: generate fluent, open-ended text. There is no autoregressive decoder; the model produces representations, not new tokens.
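To make "representations, not new tokens" concrete, here is a minimal sketch of how per-token encoder outputs might be collapsed into one sentence vector and compared for semantic search. The token vectors are made-up three-dimensional values, not output from a real model, and mean pooling is only one common choice (some models use a dedicated classification token instead).

```python
import math

def mean_pool(token_vectors):
    """Average per-token vectors into a single fixed-size sentence vector."""
    dim = len(token_vectors[0])
    count = len(token_vectors)
    return [sum(vec[d] for vec in token_vectors) / count for d in range(dim)]

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; closer to 1.0 means more similar."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical encoder outputs (one vector per token), invented for illustration.
query_tokens = [[0.9, 0.1, 0.0], [0.7, 0.3, 0.0]]
doc_a_tokens = [[0.8, 0.2, 0.0], [0.6, 0.2, 0.2]]
doc_b_tokens = [[0.0, 0.1, 0.9], [0.1, 0.0, 0.8]]

query = mean_pool(query_tokens)
sim_a = cosine_similarity(query, mean_pool(doc_a_tokens))
sim_b = cosine_similarity(query, mean_pool(doc_b_tokens))

# A retrieval system would rank documents by this score.
best = "doc_a" if sim_a > sim_b else "doc_b"
print(f"doc_a: {sim_a:.3f}  doc_b: {sim_b:.3f}  best match: {best}")
```

The output of the model is a vector to be compared, stored, or classified. Nothing in this path produces text.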

Decoder-Only: Generating Text

Canonical models: GPT, Claude, Llama, Gemini, Mistral, DeepSeek

Decoder-only models use causal attention (masked self-attention): each token can only attend to itself and the tokens before it. This enforces the autoregressive property — you can only use past context to predict the next token.
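The difference between bidirectional and causal attention comes down to which positions each token is allowed to see. The sketch below builds both visibility masks for a four-token sequence. The function name `attention_mask` is illustrative and not taken from any library.

```python
def attention_mask(seq_len: int, causal: bool) -> list[list[int]]:
    """Build a seq_len x seq_len mask where mask[i][j] == 1 means
    token i is allowed to attend to token j."""
    return [
        [1 if (not causal or j <= i) else 0 for j in range(seq_len)]
        for i in range(seq_len)
    ]

bidirectional = attention_mask(4, causal=False)  # encoder-style: every token sees every token
causal = attention_mask(4, causal=True)          # decoder-style: each token sees itself and earlier tokens

for name, mask in [("bidirectional (encoder)", bidirectional), ("causal (decoder)", causal)]:
    print(name)
    for row in mask:
        print(" ".join(str(v) for v in row))
```

The causal mask is lower-triangular: row i has ones only up to column i, so no token can look ahead at tokens that come after it.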

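This constraint is what makes generation work: during training the model only ever predicts the next token from the tokens before it, which is exactly the situation it faces when generating. Trained this way on large text corpora, the model learns an approximation of the distribution of language and can generate coherent extended text.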
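The generation loop itself is simple: predict a next token, append it, and repeat. The sketch below uses a hand-written probability table in place of a trained model, and it picks the most likely token at each step (greedy decoding). Note the simplification: this toy table conditions only on the previous token, whereas a real decoder-only model conditions on the entire preceding sequence.

```python
# Hypothetical next-token probabilities standing in for a trained model.
NEXT_TOKEN_PROBS = {
    "<s>": {"the": 0.9, "a": 0.1},
    "the": {"cat": 0.6, "dog": 0.4},
    "a": {"dog": 1.0},
    "cat": {"sat": 0.7, "</s>": 0.3},
    "dog": {"</s>": 1.0},
    "sat": {"</s>": 1.0},
}

def generate(max_new_tokens: int = 10) -> list[str]:
    """Greedy autoregressive decoding: emit one token at a time,
    feeding each output back in as context for the next step."""
    tokens = ["<s>"]  # start-of-sequence marker
    for _ in range(max_new_tokens):
        probs = NEXT_TOKEN_PROBS[tokens[-1]]
        next_token = max(probs, key=probs.get)  # most likely continuation
        if next_token == "</s>":  # end-of-sequence marker stops generation
            break
        tokens.append(next_token)
    return tokens[1:]  # drop the start marker

print(" ".join(generate()))
```

Production systems usually sample from the distribution rather than always taking the top token, which is why the same prompt can yield different responses.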

Modern instruction-tuned assistants (ChatGPT, Claude, Gemini) are decoder-only models further trained with techniques such as RLHF or Constitutional AI to follow instructions and produce helpful responses.

Encoder-Decoder: Transforming One Sequence Into Another

Canonical models: T5, BART

These models separate the understanding step (encoder) from the generation step (decoder). The encoder processes the full input with bidirectional attention; the decoder generates output token-by-token, attending to both the encoder's representation (cross-attention) and its own prior outputs.
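Cross-attention is the link between the two halves: the query comes from the decoder, while the keys and values come from the encoder's output. The sketch below shows a single-head version with made-up two-dimensional vectors. It omits the learned projection matrices that a real Transformer applies to queries, keys, and values, so treat it as an illustration of the data flow only.

```python
import math

def softmax(scores):
    """Turn raw scores into weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(decoder_query, encoder_states):
    """Scaled dot-product attention of one decoder position over all
    encoder positions. Encoder states serve as both keys and values."""
    scale = math.sqrt(len(decoder_query))
    scores = [
        sum(q * k for q, k in zip(decoder_query, state)) / scale
        for state in encoder_states
    ]
    weights = softmax(scores)
    dim = len(encoder_states[0])
    context = [
        sum(w * state[d] for w, state in zip(weights, encoder_states))
        for d in range(dim)
    ]
    return weights, context

# Hypothetical values: three encoded input tokens and one decoder query.
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
decoder_query = [2.0, 0.0]

weights, context = cross_attention(decoder_query, encoder_states)
print("attention over input tokens:", [round(w, 3) for w in weights])
print("context vector for this decoding step:", [round(c, 3) for c in context])
```

Because the encoder states are computed once with full bidirectional context, every decoding step can consult the whole input, while the decoder's self-attention stays causal over its own prior outputs.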

Best suited for tasks where the output is a structured transformation of the input:

Machine translation (English → French)
Summarization (long article → short abstract)
Abstractive QA (passage + question → generated answer)

When to Use Which

Need	Architecture	Typical examples
Semantic search, RAG retrieval	Encoder-only	`text-embedding-3-small`, `BAAI/bge-m3`
Text generation, chat	Decoder-only	Claude, GPT, Gemini, Llama, DeepSeek
Translation, summarization	Encoder-decoder	T5, BART
Classification tasks	Encoder-only	Fine-tuned DistilBERT

In practice, frontier decoder-only models (GPT, Claude, Gemini) are now so capable at summarization and translation that pure encoder-decoder models have become niche. Embedding models, which are typically encoder-only, remain essential infrastructure for any retrieval system.

PM Takeaway

When your team says "the model," they almost certainly mean a decoder-only generative model. Embedding models — which power search and RAG — are a different architecture with different evaluation criteria. They need to be selected and assessed separately.
