Retrieval-Augmented Generation (RAG)

PM: Read in full — 20 min

The Problem: Your Model Doesn't Know What You Know

An LLM trained in 2024 doesn't know about your company's internal policies, last quarter's earnings call, or what changed in your product last week. You could retrain it—but retraining is expensive, slow, and requires your private data to enter someone's training pipeline. RAG is how you solve this without retraining.

Retrieval-Augmented Generation (RAG) connects a language model to an external knowledge base at query time. Instead of baking information into the model's weights, you look it up fresh for each request. The model's job shifts from "remember everything" to "reason about what you're given."

How RAG Works: Two Phases

RAG is a two-phase system: an offline indexing phase that runs once (or whenever your documents change), and an online query phase that runs at every inference call.

Phase 1: Build the Index (Offline)

Take your source documents—PDFs, wiki pages, database exports, support tickets, whatever your system needs to know—and process them through a pipeline:

  1. Split documents into chunks of roughly 512 tokens (a common starting size, not a rule). Chunk boundaries matter: splitting mid-sentence or mid-table degrades retrieval quality.
  2. Embed each chunk by passing it through an embedding model. The output is a dense vector—a list of numbers that encodes the semantic meaning of that chunk.
  3. Store the vector alongside the original text in a vector database (Pinecone, Weaviate, pgvector, etc.).

That's it for the offline step. Your documents are now searchable by meaning, not just by keyword.
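
The three steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a production pipeline: `toy_embed` is a hypothetical stand-in for a real embedding model (it returns normalised word counts as a sparse dict rather than a learned dense vector), the in-memory list stands in for a vector database, and the chunker budgets words rather than tokens. All function names are made up for the example.

```python
import math
import re
from collections import Counter


def split_into_chunks(text, max_words=80):
    """Greedy sentence-aware chunking: pack whole sentences until the word budget is hit.

    A single sentence longer than max_words becomes its own oversized chunk.
    """
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks, current, count = [], [], 0
    for sentence in sentences:
        n = len(sentence.split())
        if current and count + n > max_words:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sentence)
        count += n
    if current:
        chunks.append(" ".join(current))
    return chunks


def toy_embed(text):
    """Hypothetical stand-in for an embedding model: unit-length word counts."""
    counts = Counter(re.findall(r"[a-z0-9]+", text.lower()))
    norm = math.sqrt(sum(c * c for c in counts.values())) or 1.0
    return {word: c / norm for word, c in counts.items()}


def build_index(documents, max_words=80):
    """Chunk, embed, and store. A real system would write records to a vector database."""
    index = []
    for doc_id, text in documents.items():
        for i, chunk in enumerate(split_into_chunks(text, max_words)):
            index.append(
                {"doc_id": doc_id, "chunk_id": i, "text": chunk, "vector": toy_embed(chunk)}
            )
    return index


docs = {
    "pricing": "The Pro plan costs 40 dollars per month. Annual billing gives a discount.",
    "shipping": "Orders ship within two days. International shipping takes longer.",
}
index = build_index(docs, max_words=8)
for record in index:
    print(record["doc_id"], record["chunk_id"], record["text"])
```

With `max_words=8`, each two-sentence document becomes two chunks, and no chunk is cut mid-sentence. Storing the original text next to the vector is what lets the query phase hand readable passages to the LLM.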

Phase 2: Answer a Question (Online)

When a user submits a query, the same embedding model used for indexing encodes that query into a vector. The system searches the vector database for the top-K chunks whose vectors are most similar to the query vector, typically measured by cosine similarity. Those chunks, your relevant source material, get injected into the prompt alongside the user's question. The LLM reads the chunks and generates an answer grounded in that retrieved context.
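
A rough sketch of the query phase, assuming a tiny in-memory index and a toy embedding: `toy_embed`, `retrieve`, and `build_prompt` are hypothetical names, and the word-count embedding is only a stand-in for a real model. The final call to an LLM is left out; the sketch stops at the assembled prompt.

```python
import math
import re
from collections import Counter


def toy_embed(text):
    """Hypothetical stand-in for an embedding model: unit-length word counts."""
    counts = Counter(re.findall(r"[a-z0-9]+", text.lower()))
    norm = math.sqrt(sum(c * c for c in counts.values())) or 1.0
    return {word: c / norm for word, c in counts.items()}


def similarity(a, b):
    """Cosine similarity. Both inputs are unit length, so it reduces to a dot product."""
    return sum(weight * b.get(word, 0.0) for word, weight in a.items())


def retrieve(query, index, k=3):
    """Embed the query with the same function used for indexing and return the top-K records."""
    q = toy_embed(query)
    ranked = sorted(index, key=lambda rec: similarity(q, rec["vector"]), reverse=True)
    return ranked[:k]


def build_prompt(query, chunks):
    """Inject the retrieved chunks into the prompt ahead of the user's question."""
    context = "\n\n".join(f"[{i + 1}] {c['text']}" for i, c in enumerate(chunks))
    return (
        "Answer the question using only the context below. "
        "If the context is insufficient, say so.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    )


texts = [
    "The Pro plan costs 40 dollars per month.",
    "Orders ship within two days.",
    "Refunds are processed within five business days.",
]
index = [{"text": t, "vector": toy_embed(t)} for t in texts]

query = "How much does the Pro plan cost?"
top = retrieve(query, index, k=2)
prompt = build_prompt(query, top)
print(prompt)
```

The pricing chunk ranks first because it shares words with the query. A real embedding model would also match paraphrases that share no words, which is the point of searching by meaning rather than keyword.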

RAG vs. Fine-Tuning vs. Long Context

These three approaches often get conflated. They're not interchangeable.

RAG vs. fine-tuning: Fine-tuning is best for changing how a model behaves—its tone, its adherence to a format, its domain-specific vocabulary. RAG is best for changing what a model knows. If you need the model to answer questions about documents that change weekly, fine-tuning doesn't help. You'd have to retrain weekly. RAG just requires re-indexing.

RAG vs. long context: Many current models can hold 128K or more tokens in their context window. For a small, stable document set, you could just paste everything in. But for large or growing document collections, RAG retrieves only the relevant chunks, so you don't pay for the rest. Long context is simpler (no indexing infrastructure), but every request pays for every token in the window, and it stops working once the collection outgrows the window.
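
The trade-off is easy to see with back-of-the-envelope arithmetic. The numbers below are illustrative assumptions (average document length, question length, and a 128,000-token window), not measurements, and the function names are hypothetical.

```python
CONTEXT_WINDOW = 128_000  # tokens; assumed window size
DOC_TOKENS = 2_000        # hypothetical average document length
CHUNK_TOKENS = 512        # chunk size from the indexing step
QUESTION_TOKENS = 50      # hypothetical question length


def long_context_prompt_tokens(num_docs):
    """Paste every document into the prompt, plus the question."""
    return num_docs * DOC_TOKENS + QUESTION_TOKENS


def rag_prompt_tokens(k):
    """Send only the top-K retrieved chunks, plus the question."""
    return k * CHUNK_TOKENS + QUESTION_TOKENS


for num_docs in (20, 1_000):
    full = long_context_prompt_tokens(num_docs)
    fits = full <= CONTEXT_WINDOW
    print(
        f"{num_docs} docs: long context needs {full:,} tokens (fits: {fits}); "
        f"RAG with K=5 needs {rag_prompt_tokens(5):,}"
    )
```

Under these assumptions, 20 documents fit in the window at 40,050 prompt tokens per request, while 1,000 documents need 2,000,050 tokens and do not fit at all. The RAG prompt stays at 2,610 tokens in both cases because its size depends on K, not on the size of the collection.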

Where RAG Breaks

A RAG system is only as good as its retrieval step. Common failure modes:

  • Bad chunking: Splitting at arbitrary token boundaries breaks semantic units. A chunk that ends mid-paragraph about pricing and starts mid-paragraph about shipping doesn't retrieve well for either topic.
  • Weak embedding model: The query vector and document vectors come from the same model. If that model doesn't understand your domain's vocabulary, similar concepts don't retrieve each other.
  • Too few chunks (K too small): The answer is in the chunk ranked sixth, but you only retrieved the top 3. Increase K, but watch the cost: every extra chunk adds prompt tokens.
  • Too many chunks (K too large): Retrieving 20 loosely-related chunks dilutes the context with noise. The LLM may end up ignoring the relevant chunk or synthesizing an answer from several irrelevant ones.
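
One way to pick K is to measure it on a small labelled set instead of guessing. The sketch below assumes you already have, for each test question, the retriever's ranked chunk IDs and the set of chunk IDs a human marked as relevant. The data and function names are hypothetical.

```python
def hit_rate_at_k(ranked_lists, relevant_sets, k):
    """Fraction of questions with at least one relevant chunk in the top K."""
    hits = sum(
        1 for ranked, relevant in zip(ranked_lists, relevant_sets)
        if relevant & set(ranked[:k])
    )
    return hits / len(ranked_lists)


def precision_at_k(ranked_lists, relevant_sets, k):
    """Average share of the top K chunks that are actually relevant."""
    per_query = [
        len(relevant & set(ranked[:k])) / k
        for ranked, relevant in zip(ranked_lists, relevant_sets)
    ]
    return sum(per_query) / len(per_query)


# Hypothetical retriever output for four test questions, best match first.
ranked = [
    ["c1", "c7", "c3", "c9", "c4", "c2"],
    ["c5", "c8", "c6", "c1", "c3", "c2"],
    ["c4", "c2", "c9", "c7", "c1", "c8"],
    ["c3", "c6", "c8", "c2", "c5", "c7"],
]
# Hand-labelled relevant chunk for each question.
relevant = [{"c1"}, {"c2"}, {"c9"}, {"c5"}]

for k in (1, 3, 6):
    print(
        f"K={k}: hit rate {hit_rate_at_k(ranked, relevant, k):.2f}, "
        f"precision {precision_at_k(ranked, relevant, k):.2f}"
    )
```

In this toy data, K=3 misses two of the four answers (they sit at ranks 5 and 6), while K=6 finds all of them at the price of five irrelevant chunks per prompt. Hit rate tells you whether K is too small; precision tells you how much noise you are paying for.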

The failure mode that catches teams by surprise is poor retrieval precision. When the LLM gives a wrong answer, the instinct is to improve the model. Often the model was fine and retrieval returned the wrong chunks.

Why This Matters

RAG is now the standard pattern for enterprise AI applications that need to work with proprietary or time-sensitive knowledge. It's the reason you can build a product that answers questions about your internal docs without sending your documents to a foundation model provider's training pipeline. The knowledge stays in your vector database; only the retrieved chunks enter the prompt.

Understanding RAG at this level lets you ask the right questions when something goes wrong: Was it a retrieval failure or a generation failure? Were the retrieved chunks actually relevant? Was the chunk size appropriate for this content type? These aren't questions you can answer if you treat RAG as a black box.
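
The first of those questions can be turned into a crude triage check. This sketch assumes you log the retrieved chunks and the model's answer for each request, and that you know a short string the correct answer must contain. Substring matching is a deliberate simplification, and `triage` is a hypothetical helper, not part of any library.

```python
def triage(expected_fact, retrieved_chunks, model_answer):
    """Classify a logged request as ok, a retrieval failure, or a generation failure."""
    fact = expected_fact.lower()
    in_context = any(fact in chunk.lower() for chunk in retrieved_chunks)
    in_answer = fact in model_answer.lower()
    if in_answer:
        return "ok"
    if not in_context:
        # The model never saw the fact, so the retriever is the first suspect.
        return "retrieval_failure"
    # The fact was in the prompt, but the model did not use it.
    return "generation_failure"


pricing_chunk = ["The Pro plan costs 40 dollars per month."]
shipping_chunk = ["Orders ship within two days."]

print(triage("40 dollars", pricing_chunk, "It costs 40 dollars a month."))
print(triage("40 dollars", shipping_chunk, "It costs 25 dollars."))
print(triage("40 dollars", pricing_chunk, "It costs 25 dollars."))
```

Running this over a sample of bad answers gives a quick split between the two failure types, which tells you whether to spend time on chunking and embeddings or on the prompt and model.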

PM Tip

RAG is not a silver bullet. A RAG system is only as good as its retrieval step. Before blaming the LLM for wrong answers, check what was actually retrieved and injected into the prompt—often the retrieval failed, not the generation.

Going Deeper

The patterns here are the conceptual foundation. Advanced RAG Techniques covers what production systems actually do: structure-aware chunking, Contextual Retrieval, hierarchical chunk strategies, HyDE, and reranking.

Further Reading

Lewis, P. et al. NeurIPS 2020 (2020)
Read: Abstract, Figure 1. Skip: Sections 2–5 (model architecture, experiments).
The original RAG paper: retrieve relevant documents at query time and inject them into the prompt — reduces hallucination and makes LLMs usable on private/current knowledge.