Advanced RAG Techniques

PM: Read in full — 25 min

Why Naive RAG Breaks at Scale

The basic RAG pattern — chunk at 512 tokens, embed, retrieve top-K — works in demos. In production, it has three failure modes that compound each other:

Token-boundary chunks destroy semantic units. A chunk that starts mid-paragraph on one topic and ends mid-argument on another retrieves poorly for both.
Decontextualized embeddings lose document structure. A chunk that says "in this context, sanctification refers to ongoing moral transformation" retrieves well for "what is sanctification?" but the embedding can't tell you it's from a chapter on justification where the author is drawing a contrast — the distinction that makes the chunk meaningful.
Query-document semantic gap: a user's question ("why does X fail?") and the chunk that answers it ("X fails because...") are phrased differently, so their embeddings can land in different regions of the vector space. Matching the question to its answer takes more than nearest-neighbor search on the raw query.

These failures don't show up in small demos because small demos have few documents and broad queries. They appear in production where documents are long, queries are specific, and the cost of a wrong answer is visible.

Better Chunking: Semantic Boundaries Over Token Counts

The 512-token target is a soft constraint, not a boundary. The actual boundary should be a semantic unit — a paragraph, a section, a logical argument.

Structure-first chunking

Split text at structural boundaries first:

Double newlines (\n\n) — paragraphs
Heading markers — sections
Sentence boundaries — fallback only when a paragraph exceeds the token limit

If a paragraph is shorter than the target, accumulate the next paragraph before creating a chunk. Never split inside a sentence to hit a token target.

The goal is that every chunk contains one coherent idea. A chunk that covers half of two ideas retrieves poorly for both. A chunk that fully covers one idea retrieves reliably for queries about that idea.
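
The splitting rules above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation. It assumes a whitespace word count as a stand-in for a real tokenizer, Markdown-style "#" lines as heading markers, and a naive regex sentence splitter. The names count_tokens, split_sentences, and chunk_text are hypothetical.

```python
import re


def count_tokens(text: str) -> int:
    # Assumption: whitespace word count approximates a real tokenizer.
    return len(text.split())


def split_sentences(paragraph: str) -> list[str]:
    # Naive splitter: break on whitespace that follows ., ! or ?
    return [s for s in re.split(r"(?<=[.!?])\s+", paragraph.strip()) if s]


def chunk_text(text: str, target_tokens: int = 512) -> list[str]:
    chunks: list[str] = []
    current: list[str] = []
    current_tokens = 0

    def flush() -> None:
        nonlocal current, current_tokens
        if current:
            chunks.append("\n\n".join(current))
        current, current_tokens = [], 0

    for paragraph in re.split(r"\n\s*\n", text.strip()):
        paragraph = paragraph.strip()
        if not paragraph:
            continue
        if paragraph.startswith("#"):
            # Assumed heading marker: a new section always starts a new chunk.
            flush()
        n = count_tokens(paragraph)
        if n > target_tokens:
            # Oversized paragraph: fall back to sentence boundaries.
            flush()
            group: list[str] = []
            group_tokens = 0
            for sentence in split_sentences(paragraph):
                s = count_tokens(sentence)
                if group and group_tokens + s > target_tokens:
                    chunks.append(" ".join(group))
                    group, group_tokens = [], 0
                # A single sentence longer than the target stays whole.
                group.append(sentence)
                group_tokens += s
            if group:
                chunks.append(" ".join(group))
            continue
        if current and current_tokens + n > target_tokens:
            flush()
        # Short paragraphs accumulate until the target would be exceeded.
        current.append(paragraph)
        current_tokens += n
    flush()
    return chunks
```

The token target only decides when to close a chunk at the next boundary; it never cuts through a sentence. In this simple version a heading that is followed by an oversized paragraph ends up as its own chunk, which a real pipeline would probably want to merge forward.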

Contextual Retrieval

Structure-aware chunking solves the within-chunk problem. Contextual Retrieval solves the between-chunk problem: a chunk's embedding encodes what it says, but not where it sits in the document or why it matters.

The insight

Consider a chunk: "faith, then, is the foundation — but it must be distinguished from the mere intellectual assent that many mistake for it." The embedding encodes "faith, foundation, intellectual assent" — reasonable signal. But without knowing this chunk is from the middle of a structured argument about the prerequisites of salvation, from a document that has already established three prior points, the embedding is missing context that determines which queries it should answer.

How it works

For each chunk, prompt a fast LLM (Claude Haiku, GPT-4o-mini, Gemini Flash) with the full document and the chunk, and ask it to write 2–3 sentences situating the chunk in the broader document. Prepend that context to the chunk text before embedding. Store the original chunk text separately for display and citation.

The embedding now captures both the content and the role of the chunk. Anthropic published this technique and reported that contextual embeddings alone reduced the top-20-chunk retrieval failure rate by 35%; combining them with contextual BM25 raised the reduction to 49%, and adding a reranker raised it to 67% (Daniel Ford, "Contextual Retrieval," Anthropic, September 19, 2024, https://www.anthropic.com/news/contextual-retrieval).
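
As a rough sketch of the indexing step described above, assuming the LLM and the embedding model are passed in as plain callables: generate, embed, CONTEXT_PROMPT, and IndexedChunk are hypothetical names for illustration, not any vendor's API, and the prompt wording is only an example.

```python
from dataclasses import dataclass
from typing import Callable

# Illustrative prompt; the exact wording is an assumption.
CONTEXT_PROMPT = (
    "<document>\n{document}\n</document>\n\n"
    "Here is a chunk from that document:\n<chunk>\n{chunk}\n</chunk>\n\n"
    "Write 2-3 sentences situating this chunk within the overall document "
    "to improve search retrieval. Answer with the context only."
)


@dataclass
class IndexedChunk:
    original_text: str        # kept for display and citation
    contextualized_text: str  # generated context + chunk, used for embedding
    embedding: list[float]


def index_document(
    document: str,
    chunks: list[str],
    generate: Callable[[str], str],       # hypothetical: prompt -> LLM text
    embed: Callable[[str], list[float]],  # hypothetical: text -> vector
) -> list[IndexedChunk]:
    records = []
    for chunk in chunks:
        prompt = CONTEXT_PROMPT.format(document=document, chunk=chunk)
        context = generate(prompt).strip()
        # Prepend the generated context before embedding.
        contextualized = f"{context}\n\n{chunk}"
        records.append(IndexedChunk(chunk, contextualized, embed(contextualized)))
    return records
```

Passing the models in as functions keeps the sketch independent of any provider and makes the one-call-per-chunk cost explicit: generate runs exactly once for each chunk.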

Cost and tradeoffs

One LLM call per chunk at indexing time — a one-time cost per document
Use the cheapest fast model; context generation doesn't require reasoning, just reading comprehension
The embedding is generated from the combined text, and the same context-prefixed text is stored and injected into the prompt as-is, so the LLM also reads the generated context at query time
Adds latency to the indexing pipeline, not to query time

Hierarchical Chunking (Parent-Child)

Short chunks retrieve precisely — their embeddings are focused. But when a short chunk lands in the prompt, the LLM may not have enough surrounding context to reason well. Long chunks provide context but dilute the embedding signal, making retrieval less precise.

The solution: separate retrieval size from context size.

The pattern

Index child chunks (small, ~256 tokens) for retrieval. Every child belongs to a parent chunk (a full section, ~1,500 tokens). At query time: find the right child, then inject the parent into the prompt.

The vector search finds the right location with precision. The LLM reads the full section with context. You get both.

Implementation note

Store a parent_id on each child chunk. When retrieval returns child chunks, a second lookup fetches their parent chunks before building the prompt. If two matched child chunks share a parent, inject the parent once — no need to duplicate.
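
The second lookup might look like the following sketch. It assumes child hits arrive as dictionaries carrying a parent_id, ordered best match first, and that parents live in a simple in-memory mapping; a real system would query its document store instead. The function name expand_to_parents is hypothetical.

```python
def expand_to_parents(child_hits: list[dict], parents: dict[str, str]) -> list[str]:
    """Map retrieved child chunks to their parent texts, each parent once.

    child_hits: retrieval results ordered best first, each with a "parent_id".
    parents:    assumed in-memory mapping of parent_id -> parent text.
    """
    seen: set[str] = set()
    ordered: list[str] = []
    for hit in child_hits:
        parent_id = hit["parent_id"]
        if parent_id in seen:
            # Two children share this parent: inject it only once.
            continue
        seen.add(parent_id)
        ordered.append(parents[parent_id])
    return ordered
```

Parents keep the rank of their best-scoring child, so the most relevant section still appears first in the prompt.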

HyDE: Improving the Query

The techniques above improve the index. HyDE (Hypothetical Document Embeddings) improves the query side.

The problem

A user asks: "what causes transformer instability during training?" The query vector lands somewhere in semantic space. The correct document chunk says: "Gradient norm spikes in the early warmup phase are the primary driver of training instability in large Transformers." These two pieces of text are semantically related but not identical — queries are in "question space" and document chunks are in "answer space." The nearest neighbors of the query vector may not be the chunks that actually answer it.

The fix

Instead of embedding the query directly:

Prompt an LLM to generate a hypothetical answer to the query — a short passage that would appear in a document that answers this question
Embed the hypothetical answer
Use that embedding for similarity search

The hypothetical answer is in the same semantic space as real document chunks — it uses document-style vocabulary and framing. Even if it's factually wrong (it often is), it lands near the right neighborhood in vector space.
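
A minimal sketch of the three steps above, assuming a small in-memory index of (text, vector) pairs and caller-supplied generate and embed functions. Both callables, the prompt wording, and the name hyde_search are placeholders for illustration; a production system would use a vector database rather than a brute-force sort.

```python
import math
from typing import Callable, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def hyde_search(
    query: str,
    index: list[tuple[str, list[float]]],  # (chunk text, chunk embedding)
    generate: Callable[[str], str],        # hypothetical: prompt -> LLM text
    embed: Callable[[str], list[float]],   # hypothetical: text -> vector
    top_k: int = 5,
) -> list[str]:
    prompt = (
        "Write a short passage that would appear in a document "
        f"answering this question:\n{query}"
    )
    # Step 1: generate the hypothetical answer (it may be factually wrong).
    hypothetical = generate(prompt)
    # Step 2: embed the hypothetical answer instead of the raw query.
    query_vector = embed(hypothetical)
    # Step 3: rank chunks by similarity to the hypothetical answer.
    ranked = sorted(index, key=lambda item: cosine(query_vector, item[1]), reverse=True)
    return [text for text, _ in ranked[:top_k]]
```

The only change from plain retrieval is which text gets embedded, so HyDE can be switched on per query without touching the index.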

When HyDE helps

Short queries against long, dense documents
Technical or specialized domains where question phrasing differs from document phrasing
Questions phrased as "why does X happen?" against documents that explain X declaratively

When HyDE doesn't help

Factual lookups where the query terms already appear verbatim in the document
Real-time applications where the extra LLM call per query adds unacceptable latency
Queries where the hypothetical answer is likely to be so wrong that it steers the search to the wrong region

Reranking: A Second Opinion on Relevance

Vector similarity search is fast but approximate. It compares embeddings that were computed independently: the document vectors at indexing time and the query vector at request time, neither with any view of the other. A cross-encoder reranker reads (query, chunk) pairs together in a single forward pass, letting it attend to the relationship between them directly.

The two-stage pattern

Retrieve: vector search returns top-50 candidates (fast)
Rerank: pass all 50 (query, chunk) pairs to the reranker; get precise scores
Inject: take the top-5 by reranked score

The reranker is more expensive than vector search but far cheaper than an LLM generation call. Because it scores only 50 short pairs, it typically adds well under a second on suitable hardware. The precision improvement is large enough that many production RAG systems use it.
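
The two-stage pattern can be sketched as follows, assuming the vector search and the pair scorer are injected as functions. vector_search and score_pair are hypothetical stand-ins; in practice score_pair would wrap a cross-encoder model or a managed rerank API, which is not shown here.

```python
from typing import Callable


def retrieve_and_rerank(
    query: str,
    vector_search: Callable[[str, int], list[str]],  # hypothetical: (query, k) -> chunks
    score_pair: Callable[[str, str], float],         # hypothetical: (query, chunk) -> relevance
    candidates: int = 50,
    top_n: int = 5,
) -> list[str]:
    # Stage 1: fast, approximate retrieval of a wide candidate pool.
    pool = vector_search(query, candidates)
    # Stage 2: score every (query, chunk) pair jointly.
    scored = [(score_pair(query, chunk), chunk) for chunk in pool]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    # Stage 3: keep only the best few for the prompt.
    return [chunk for _, chunk in scored[:top_n]]
```

The candidates and top_n defaults mirror the 50 and 5 used above; both are tuning knobs rather than fixed requirements.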

Reranker options

Option	Type	Notes
Cohere Rerank	API (managed)	Easy integration, per-call pricing
BGE-Reranker	Open source	Self-hosted, good quality
ms-marco cross-encoders	Open source	Widely used, multiple size options

Putting It Together

These techniques compose. A production RAG pipeline looks like this:

Indexing (run once, or when documents change):

Structure-aware chunking at paragraph/section boundaries
Build parent-child chunk relationships
Generate contextual prefix per child chunk (LLM call)
Embed context + chunk text together
Store embedding, context-prefixed text, original text, and parent_id

Query (every request):

(Optional) HyDE: generate hypothetical answer, embed it
Vector search: retrieve top-50 child chunks by embedding similarity
Expand to parent chunks via parent_id lookup
Rerank: score all candidates with cross-encoder, take top-5
Inject top-5 parent chunks into the LLM prompt

Which techniques to apply

You don't need all of them in every system. The order of impact:

Technique	When to apply
Structure-aware chunking	Always — eliminates the most common retrieval failures
Contextual Retrieval	When documents are long, hierarchical, or require understanding their own structure
Hierarchical chunking	When chunks need to be short for precision but long for comprehension
Reranking	When retrieval precision matters more than query latency
HyDE	When queries are short and semantically distant from document vocabulary

Start with structure-aware chunking. That alone eliminates most naive failures. Add the rest as evidence from your evals shows the need.

What This Changes for PMs

The naive view is "we need better chunks." The actual levers are four distinct systems:

Indexing quality: how well the pipeline represents your documents (structure-aware chunking, contextual retrieval)
Index structure: how retrieval granularity relates to comprehension context (hierarchical chunking)
Query translation: how well user intent converts to search vectors (HyDE)
Retrieval precision: how well the system filters retrieved candidates before they reach the LLM (reranking)

When a RAG system gives a wrong answer, you now have four places to look before concluding that "the LLM got it wrong."

The operational implication: evals need to instrument each stage separately. Logging only the final answer tells you something went wrong. Logging retrieved chunks before and after reranking tells you where.
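
One way to make that concrete, as a sketch: given a labeled eval set where each example records the known-relevant chunk ID plus the IDs seen before and after reranking, compute a hit rate per stage. The field names relevant_id, retrieved_ids, and reranked_ids and the function name stage_hit_rates are assumptions for illustration.

```python
def stage_hit_rates(examples: list[dict], k: int = 5) -> dict[str, float]:
    """Fraction of eval examples whose relevant chunk survives each stage.

    Each example is assumed to hold:
      relevant_id   - ID of the chunk that answers the question
      retrieved_ids - IDs returned by vector search (the candidate pool)
      reranked_ids  - the same IDs after reranking, best first
    """
    total = len(examples)
    if total == 0:
        return {"retrieval_hit_rate": 0.0, "post_rerank_hit_rate": 0.0}
    in_pool = sum(ex["relevant_id"] in ex["retrieved_ids"] for ex in examples)
    in_top_k = sum(ex["relevant_id"] in ex["reranked_ids"][:k] for ex in examples)
    return {
        "retrieval_hit_rate": in_pool / total,
        "post_rerank_hit_rate": in_top_k / total,
    }
```

A low retrieval hit rate points at indexing or query translation. A high retrieval hit rate with a low post-rerank hit rate points at the reranker or the top-k cutoff.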

PM Tip

The most cost-effective production improvement is usually contextual retrieval. One LLM call per chunk at indexing time — a one-time cost — is cheaper than the compounding cost of wrong answers, user drop-off, and repeated support escalations. Index quality is a fixed investment; retrieval failures are an ongoing cost.
