Retrieval-Augmented Generation (RAG)

PM: Read in full — 20 min

The Problem: Your Model Doesn't Know What You Know

An LLM's knowledge cuts off at its training date. It doesn't know about your company's internal policies, last quarter's earnings call, or what changed in your product last week. You could retrain it—but retraining is expensive, slow, and requires your private data to enter someone's training pipeline. RAG is how you solve this without retraining.

Retrieval-Augmented Generation (RAG) connects a language model to an external knowledge base at query time. Instead of baking information into the model's weights, you look it up fresh for each request. The model's job shifts from "remember everything" to "reason about what you're given."

How RAG Works: Two Phases

RAG is a two-phase system: an offline indexing phase that runs once (and again whenever your documents change), and an online query phase that runs on every user request.

Phase 1: Build the Index (Offline)

Take your source documents—PDFs, wiki pages, database exports, support tickets, whatever your system needs to know—and process them through a pipeline:

Split documents into chunks of roughly 512 tokens. Chunk boundaries matter: splitting mid-sentence or mid-table degrades retrieval quality.
Embed each chunk by passing it through an embedding model. The output is a dense vector—a list of numbers that encodes the semantic meaning of that chunk.
Store the vector alongside the original text in a vector database (Pinecone, Weaviate, pgvector, etc.).

That's it for the offline step. Your documents are now searchable by meaning, not just by keyword.
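The three indexing steps can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a production pipeline: tokens are approximated by whitespace-separated words, `toy_embed` is a hypothetical hashed bag-of-words stand-in for a real embedding model (it captures word overlap, not meaning), and a plain list stands in for the vector database. All function names are invented for this example.

```python
import hashlib
import math
import re

DIM = 64  # toy vector size; real embedding models use far more dimensions


def split_into_chunks(text, max_tokens=512):
    """Group whole sentences into chunks of at most max_tokens words.

    A single sentence longer than max_tokens becomes its own oversized chunk.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current, count = [], [], 0
    for sentence in sentences:
        if not sentence:
            continue
        n = len(sentence.split())
        # Close the current chunk before it would exceed the budget,
        # so no sentence is ever cut in half.
        if current and count + n > max_tokens:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sentence)
        count += n
    if current:
        chunks.append(" ".join(current))
    return chunks


def toy_embed(text):
    """Stand-in for an embedding model: hashed word counts, L2-normalized."""
    vec = [0.0] * DIM
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        bucket = int(hashlib.md5(word.encode()).hexdigest(), 16) % DIM
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def build_index(documents, max_tokens=512):
    """Return a list of records: the in-memory stand-in for a vector database."""
    index = []
    for doc_id, text in documents.items():
        for i, chunk in enumerate(split_into_chunks(text, max_tokens)):
            index.append(
                {"doc_id": doc_id, "chunk_id": i, "text": chunk, "vector": toy_embed(chunk)}
            )
    return index


docs = {
    "refund-policy": "Refunds are issued within 14 days. Items must be unused. "
                     "Shipping fees are not refunded.",
    "shipping": "Orders ship within 2 business days. International orders take longer.",
}
index = build_index(docs, max_tokens=8)  # tiny budget so the split is visible
for record in index:
    print(record["doc_id"], record["chunk_id"], record["text"])
```

Each record keeps the original text next to its vector, because the text, not the vector, is what later gets placed in the prompt.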

Phase 2: Answer a Question (Online)

When a user submits a query, the same embedding model encodes that query into a vector. The system searches the vector database for the top-K chunks whose vectors are most similar to the query vector (cosine similarity is a common measure). Those chunks, your relevant source material, are injected into the prompt alongside the user's question. The LLM reads the chunks and generates an answer grounded in that retrieved context.
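A rough sketch of the query path, under the same kind of assumptions: `toy_embed` is a hypothetical word-count stand-in for a real embedding model, the index is a plain list rather than a vector database, and the prompt wording is only one possible template. The call to the LLM itself is left out; the sketch stops at the prompt that would be sent.

```python
import math
import re
from collections import Counter


def toy_embed(text):
    """Stand-in for an embedding model: a sparse word-count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, index, k=2):
    """Embed the query with the same function used at indexing time, then rank."""
    query_vec = toy_embed(query)
    scored = [(cosine(query_vec, vec), text) for text, vec in index]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:k]]


def build_prompt(query, chunks):
    """Inject the retrieved chunks into the prompt next to the question."""
    context = "\n\n".join(f"[{i + 1}] {chunk}" for i, chunk in enumerate(chunks))
    return (
        "Answer the question using only the context below. "
        "If the context does not contain the answer, say so.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    )


chunks = [
    "Refunds are issued within 14 days of purchase.",
    "Orders ship within 2 business days.",
    "Support is available by email on weekdays.",
]
index = [(text, toy_embed(text)) for text in chunks]

query = "How many days do refunds take?"
top = retrieve(query, index, k=2)
prompt = build_prompt(query, top)
print(prompt)
```

With this toy data the refund chunk ranks first and the support chunk is left out, because it shares no words with the query. A real embedding model would match on meaning rather than on shared words.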

RAG vs. Fine-Tuning vs. Long Context

These three approaches often get conflated. They're not interchangeable.

RAG vs. fine-tuning: Fine-tuning is best for changing how a model behaves—its tone, its adherence to a format, its domain-specific vocabulary. RAG is best for changing what a model knows. If you need the model to answer questions about documents that change weekly, fine-tuning doesn't help. You'd have to retrain weekly. RAG just requires re-indexing.

RAG vs. long context: Modern frontier models can hold 100K–1M+ tokens in their context window. For a small, stable document set, you could just paste everything in. But for large or growing document collections, RAG sends only the relevant chunks to the model, so you don't pay to process the rest on every query. Long context is simpler (no indexing infrastructure) but costs more per query and stops working once the collection outgrows the context window.
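The trade-off is easy to see with back-of-the-envelope arithmetic. The corpus size, K, chunk size, and question length below are hypothetical numbers chosen for illustration, and the calculation counts input tokens only.

```python
def tokens_per_query(corpus_tokens, k, chunk_tokens, question_tokens=50):
    """Input tokens sent to the model per query under each approach."""
    long_context = corpus_tokens + question_tokens   # paste everything in
    rag = k * chunk_tokens + question_tokens         # send only the top-K chunks
    return long_context, rag


# Hypothetical corpus: 2,000 documents averaging 1,500 tokens each.
corpus_tokens = 2_000 * 1_500
long_ctx, rag = tokens_per_query(corpus_tokens, k=5, chunk_tokens=512)
context_window = 1_000_000  # upper end of the range mentioned above

print(f"long context: {long_ctx:,} input tokens per query")
print(f"RAG (K=5):    {rag:,} input tokens per query")
print(f"corpus fits in a {context_window:,}-token window: {long_ctx <= context_window}")
```

Under these assumptions the full corpus is about three million tokens and does not fit in the window at all, while the RAG prompt stays under three thousand tokens regardless of how large the corpus grows.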

Where RAG Breaks

A RAG system is only as good as its retrieval step. Common failure modes:

Bad chunking: Splitting at arbitrary token boundaries breaks semantic units. A chunk that ends mid-paragraph about pricing and starts mid-paragraph about shipping doesn't retrieve well for either topic.
Weak embedding model: The query vector and document vectors come from the same model. If that model doesn't understand your domain's vocabulary, related concepts land far apart in vector space and the right chunks never surface.
Too few chunks (K too small): The answer is in the sixth-ranked chunk, but you only retrieved the top 3. Increase K, but watch the added token cost and latency.
Too many chunks (K too large): Retrieving 20 loosely related chunks dilutes the context with noise. The LLM may end up ignoring the relevant chunk or synthesizing an answer from several irrelevant ones.
Position effects: The "Lost in the Middle" paper (Liu et al., 2023) measured how well models used information at different positions in a long context and found a U-shaped curve: models use material near the start and end of the context markedly better than material in the middle. In the paper's multi-document question-answering experiments, GPT-3.5-Turbo's accuracy dropped by more than 20% when the answer-bearing document moved from the edges to the middle. If you inject ten chunks, the model effectively discounts those in the middle. Ordering the chunks so the highest-relevance ones sit first and last, rather than buried in the middle, can improve answer quality.
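One simple way to act on the position effect is to reorder the retrieved chunks before building the prompt. The sketch below assumes the chunks arrive ranked best-first from a retriever or reranker; the function name and the alternating scheme are illustrative choices, not a method prescribed by the paper.

```python
def order_for_context_edges(ranked_chunks):
    """Place the strongest chunks at the start and end, the weakest in the middle.

    ranked_chunks must be sorted best-first.
    """
    front, back = [], []
    for i, chunk in enumerate(ranked_chunks):
        # Alternate: ranks 1, 3, 5... fill from the front; ranks 2, 4... from the back.
        if i % 2 == 0:
            front.append(chunk)
        else:
            back.append(chunk)
    return front + back[::-1]


ranked = ["rank1", "rank2", "rank3", "rank4", "rank5"]
print(order_for_context_edges(ranked))
# → ['rank1', 'rank3', 'rank5', 'rank4', 'rank2']
```

The best chunk opens the context, the second-best closes it, and the lowest-ranked chunk lands in the middle, where the model is least likely to rely on it.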

The failure mode that catches teams by surprise is retrieval precision. When the LLM gives a wrong answer, the instinct is to improve the model. Often, the retrieval returned the wrong documents.
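A quick way to check whether retrieval is the culprit is to measure it separately from generation. The sketch below computes a hit rate at K: the fraction of test questions for which at least one known-relevant chunk appears in the top K results. The evaluation data is hypothetical; in practice you would label a small set of real questions with the chunks that answer them.

```python
def hit_rate_at_k(results, k):
    """Fraction of questions whose top-k retrieved chunk IDs include a relevant chunk."""
    hits = sum(1 for retrieved, relevant in results if relevant & set(retrieved[:k]))
    return hits / len(results)


# Hypothetical evaluation set:
# (chunk IDs in the order the retriever ranked them, set of relevant chunk IDs)
results = [
    (["c1", "c7", "c3", "c9", "c4", "c2"], {"c1"}),  # relevant chunk ranked 1st
    (["c8", "c5", "c6", "c2", "c1", "c3"], {"c3"}),  # relevant chunk ranked 6th
    (["c4", "c2", "c9", "c1", "c5", "c7"], {"c9"}),  # relevant chunk ranked 3rd
    (["c5", "c6", "c7", "c8", "c1", "c2"], {"c0"}),  # relevant chunk never retrieved
]

for k in (1, 3, 6):
    print(f"hit rate @ {k}: {hit_rate_at_k(results, k):.2f}")
# → hit rate @ 1: 0.25
# → hit rate @ 3: 0.50
# → hit rate @ 6: 0.75
```

If the hit rate is low, no amount of prompt tuning or model swapping will fix the answers. The fourth question also shows a case that raising K cannot rescue: the relevant chunk is never retrieved, which points to chunking or the embedding model instead.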

Measured real-world accuracy: The CRAG benchmark (Yang et al., 2024) evaluated LLM systems on 4,409 realistic question-answer pairs across five domains. The most advanced LLMs reached at most 34% accuracy without retrieval, and adding RAG in a straightforward way raised that only to 44%. These figures reflect a deliberately hard benchmark built from diverse, real-world queries. For domain-specific enterprise RAG (answering questions about your own documentation, product specs, or internal policies), the retrieval space is narrower and accuracy is often considerably higher. The benchmark measures the ceiling of generality, not the ceiling of focus.

Why This Matters

RAG is now the standard pattern for enterprise AI applications that need to work with proprietary or time-sensitive knowledge. It's the reason you can build a product that answers questions about your internal docs without sending your documents to a foundation model provider's training pipeline. The knowledge stays in your vector database; only the retrieved chunks enter the prompt.

Understanding RAG at this level lets you ask the right questions when something goes wrong: Was it a retrieval failure or a generation failure? Were the retrieved chunks actually relevant? Was the chunk size appropriate for this content type? These aren't questions you can answer if you treat RAG as a black box.

PM Tip

RAG is not a silver bullet. A RAG system is only as good as its retrieval step. Before blaming the LLM for wrong answers, check what was actually retrieved and injected into the prompt—often the retrieval failed, not the generation.

Going Deeper

The patterns here are the conceptual foundation. Advanced RAG Techniques covers what production systems actually do: structure-aware chunking, Contextual Retrieval, hierarchical chunk strategies, HyDE, and reranking.
