Advanced RAG Techniques

PM: Read in full, 25 min

Why Naive RAG Breaks at Scale

The basic RAG pattern (chunk at 512 tokens, embed, retrieve top-K) works in demos. In production, it has three failure modes that compound each other:

  1. Token-boundary chunks destroy semantic units. A chunk that ends mid-argument and starts mid-paragraph on a different topic retrieves poorly for both.
  2. Decontextualized embeddings lose document structure. A chunk that says "in this context, sanctification refers to ongoing moral transformation" retrieves well for "what is sanctification?" but the embedding can't tell you it's from a chapter on justification where the author is drawing a contrast, which is the distinction that makes the chunk meaningful.
  3. Query-document semantic gap. A user's question ("why does X fail?") and a document chunk that answers it ("X fails because...") are phrased differently and can land far apart in embedding space. Embedding the question and finding the answer requires more than nearest-neighbor search.

These failures don't show up in small demos because small demos have few documents and broad queries. They appear in production where documents are long, queries are specific, and the cost of a wrong answer is visible.

Better Chunking: Semantic Boundaries Over Token Counts

The 512-token target is a soft constraint, not a boundary. The actual boundary should be a semantic unit: a paragraph, a section, a logical argument.

Structure-first chunking

Split text at structural boundaries first:

  • Double newlines (\n\n): paragraphs
  • Heading markers: sections
  • Sentence boundaries: fallback only when a paragraph exceeds the token limit

If a paragraph is shorter than the target, accumulate the next paragraph before creating a chunk. Never split inside a sentence to hit a token target.

The goal is that every chunk contains one coherent idea. A chunk that covers half of two ideas retrieves poorly for both. A chunk that fully covers one idea retrieves reliably for queries about that idea.
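
The accumulation rule above can be sketched in a few lines. This is a minimal illustration, not a reference implementation: `count_tokens` approximates tokens with whitespace-separated words (an assumption; a real pipeline would use the embedding model's tokenizer), the sentence splitter is deliberately naive, and the function names are hypothetical.

```python
import re


def count_tokens(text: str) -> int:
    # Assumption: whitespace-separated words stand in for real tokenizer tokens.
    return len(text.split())


def split_sentences(paragraph: str) -> list[str]:
    # Naive splitter: break after '.', '!' or '?' when followed by whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", paragraph.strip()) if s]


def pack(units: list[str], max_tokens: int, joiner: str) -> list[str]:
    """Greedily group units into chunks of at most max_tokens, never splitting a unit."""
    chunks: list[str] = []
    current: list[str] = []
    current_len = 0
    for unit in units:
        n = count_tokens(unit)
        if current and current_len + n > max_tokens:
            chunks.append(joiner.join(current))
            current, current_len = [], 0
        current.append(unit)
        current_len += n
    if current:
        chunks.append(joiner.join(current))
    return chunks


def chunk_by_structure(text: str, max_tokens: int = 512) -> list[str]:
    units: list[str] = []
    for paragraph in re.split(r"\n\s*\n", text):
        paragraph = paragraph.strip()
        if not paragraph:
            continue
        if count_tokens(paragraph) <= max_tokens:
            units.append(paragraph)
        else:
            # Fallback: only an oversized paragraph is regrouped at sentence boundaries.
            units.extend(pack(split_sentences(paragraph), max_tokens, " "))
    # Short paragraphs accumulate until adding the next one would exceed the target.
    return pack(units, max_tokens, "\n\n")
```

A single sentence longer than `max_tokens` is kept whole, which treats the target as a soft constraint. Heading-aware splitting is omitted here; it would add one more boundary type ahead of the paragraph split.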

Contextual Retrieval

Structure-aware chunking solves the within-chunk problem. Contextual Retrieval solves the between-chunk problem: a chunk's embedding encodes what it says, but not where it sits in the document or why it matters.

The insight

Consider a chunk: "faith, then, is the foundation, but it must be distinguished from the mere intellectual assent that many mistake for it." The embedding encodes "faith, foundation, intellectual assent", which is reasonable signal. But without knowing this chunk is from the middle of a structured argument about the prerequisites of salvation, from a document that has already established three prior points, the embedding is missing context that determines which queries it should answer.

How it works

For each chunk, prompt a fast LLM (Haiku, GPT-4o-mini) with the full document and the chunk, and ask it to write 2–3 sentences situating the chunk in the broader document. Prepend that context to the chunk text before embedding. Store the original chunk text separately for display and citation.

The embedding now captures both the content and the role of the chunk. Anthropic published this technique in 2024 and reported that contextual embeddings alone reduced top-20 retrieval failures by 35%, and by 49% when combined with contextual BM25.
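
As a rough sketch of the indexing step described above: the prompt wording, the `generate` callable (standing in for whichever LLM client you use), and the record field names are all assumptions made for illustration, not Anthropic's published prompt or any particular SDK's API.

```python
from typing import Callable

# Illustrative prompt wording (an assumption, not a published template).
CONTEXT_PROMPT = (
    "<document>\n{document}\n</document>\n\n"
    "Here is a chunk from that document:\n<chunk>\n{chunk}\n</chunk>\n\n"
    "Write 2-3 sentences situating this chunk within the overall document "
    "to improve search retrieval. Answer with the context only."
)


def contextualize_chunks(
    document: str,
    chunks: list[str],
    generate: Callable[[str], str],  # hypothetical: prompt in, completion text out
) -> list[dict]:
    records = []
    for chunk in chunks:
        prompt = CONTEXT_PROMPT.format(document=document, chunk=chunk)
        context = generate(prompt).strip()
        records.append(
            {
                "original_text": chunk,                     # kept for display and citation
                "embedding_text": f"{context}\n\n{chunk}",  # this is what gets embedded
            }
        )
    return records
```

Passing the model call in as a function keeps the sketch independent of any vendor client and makes it easy to swap in a cheaper model later.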

Cost and tradeoffs

  • One LLM call per chunk at indexing time, a one-time cost per document
  • Use the cheapest fast model; context generation doesn't require reasoning, just reading comprehension
  • The embedding is generated from the combined text, and that same context-prefixed text is stored and injected as-is, so the LLM reads it at query time too
  • Adds latency to the indexing pipeline, not to query time

Hierarchical Chunking (Parent-Child)

Short chunks retrieve precisely because their embeddings are focused. But when a short chunk lands in the prompt, the LLM may not have enough surrounding context to reason well. Long chunks provide context but dilute the embedding signal, making retrieval less precise.

The solution: separate retrieval size from context size.

The pattern

Index child chunks (small, ~256 tokens) for retrieval. Every child belongs to a parent chunk (a full section, ~1,500 tokens). At query time: find the right child, then inject the parent into the prompt.

The vector search finds the right location with precision. The LLM reads the full section with context. You get both.

Implementation note

Store a parent_id on each child chunk. When retrieval returns child chunks, a second lookup fetches their parent chunks before building the prompt. If two matched child chunks share a parent, inject the parent once; there is no need to duplicate it.
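
The second lookup with de-duplication might look like the sketch below. The in-memory `parents` dict stands in for whatever document store holds parent chunks, and the function name is hypothetical; only the `parent_id` field comes from the note above.

```python
def expand_to_parents(matched_children: list[dict], parents: dict[str, str]) -> list[str]:
    """Map retrieved child chunks to their parent texts, injecting each parent once."""
    seen: set[str] = set()
    parent_texts: list[str] = []
    for child in matched_children:  # assumed to be sorted best match first
        parent_id = child["parent_id"]
        if parent_id in seen:
            continue  # a sibling child already pulled in this parent
        seen.add(parent_id)
        parent_texts.append(parents[parent_id])
    return parent_texts
```

Keeping first-seen order means the parent of the best-matching child stays first in the prompt.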

HyDE: Improving the Query

The techniques above improve the index. HyDE (Hypothetical Document Embeddings) improves the query side.

The problem

A user asks: "what causes transformer instability during training?" The query vector lands somewhere in semantic space. The correct document chunk says: "Gradient norm spikes in the early warmup phase are the primary driver of training instability in large Transformers." These two pieces of text are semantically related but not identical: queries are in "question space" and document chunks are in "answer space." The nearest neighbors of the query vector may not be the chunks that actually answer it.

The fix

Instead of embedding the query directly:

  1. Prompt an LLM to generate a hypothetical answer to the query: a short passage that would appear in a document that answers this question
  2. Embed the hypothetical answer
  3. Use that embedding for similarity search

The hypothetical answer is in the same semantic space as real document chunks: it uses document-style vocabulary and framing. Even if it's factually wrong (it often is), it lands near the right neighborhood in vector space.
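
The three steps above can be sketched as follows. `generate` and `embed` are placeholders for your LLM and embedding clients, the prompt wording is illustrative, and the brute-force cosine scan stands in for a real vector index; all of these are assumptions.

```python
import math
from typing import Callable


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def hyde_search(
    query: str,
    index: list[tuple[str, list[float]]],   # (chunk text, chunk embedding) pairs
    generate: Callable[[str], str],          # hypothetical LLM call
    embed: Callable[[str], list[float]],     # hypothetical embedding call
    top_k: int = 5,
) -> list[str]:
    # 1. Ask the LLM for a passage that would answer the question.
    prompt = (
        "Write a short passage that answers the question.\n"
        f"Question: {query}\nPassage:"
    )
    hypothetical = generate(prompt)
    # 2. Embed the hypothetical answer, not the raw query.
    query_vector = embed(hypothetical)
    # 3. Nearest-neighbor search with that embedding.
    ranked = sorted(index, key=lambda item: cosine(query_vector, item[1]), reverse=True)
    return [text for text, _ in ranked[:top_k]]
```

Only the embedding of the hypothetical passage is used for search; the passage itself is discarded and never shown to the user.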

When HyDE helps

  • Short queries against long, dense documents
  • Technical or specialized domains where question phrasing differs from document phrasing
  • Questions phrased as "why does X happen?" against documents that explain X declaratively

When HyDE doesn't help

  • Factual lookups where the query terms already appear verbatim in the document
  • Latency-sensitive, real-time applications: HyDE adds one LLM call per query
  • Queries where the hypothetical is so wrong that it steers search to the wrong region (this happens occasionally)

Reranking: A Second Opinion on Relevance

Vector similarity search is fast but approximate. It compares embeddings that were computed independently: the document vectors at indexing time and the query vector at request time, so neither encoding can take the other into account. A cross-encoder reranker reads (query, chunk) pairs together in a single forward pass, letting it attend to the relationship between them directly.

The two-stage pattern

  1. Retrieve: vector search returns top-50 candidates (fast)
  2. Rerank: pass all 50 (query, chunk) pairs to the reranker; get precise scores
  3. Inject: take the top-5 by reranked score

The reranker is more expensive than vector search but far cheaper than an LLM call. It runs on 50 short pairs, which typically takes milliseconds rather than seconds. The precision improvement is often large enough that many production RAG systems include this stage.
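
A minimal sketch of the retrieve-then-rerank flow, with both stages passed in as callables. `vector_search` and `score_pair` are hypothetical stand-ins: in practice `score_pair` would wrap a cross-encoder or a managed rerank API, and most rerankers score all pairs in one batch rather than one at a time.

```python
from typing import Callable


def retrieve_then_rerank(
    query: str,
    vector_search: Callable[[str, int], list[str]],  # hypothetical: (query, k) -> chunks
    score_pair: Callable[[str, str], float],         # hypothetical: (query, chunk) -> relevance
    n_candidates: int = 50,
    n_final: int = 5,
) -> list[str]:
    # Stage 1: cheap, approximate recall over the whole index.
    candidates = vector_search(query, n_candidates)
    # Stage 2: precise scoring of each (query, chunk) pair.
    scored = [(score_pair(query, chunk), chunk) for chunk in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    # Stage 3: keep only the best few for the prompt.
    return [chunk for _, chunk in scored[:n_final]]
```

The candidate count and final count are the two knobs to tune: a larger first stage gives the reranker more chances to rescue a relevant chunk, at the cost of more pairs to score.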

Reranker options

| Option | Type | Notes |
| --- | --- | --- |
| Cohere Rerank | API (managed) | Easy integration, per-call pricing |
| BGE-Reranker | Open source | Self-hosted, good quality |
| ms-marco cross-encoders | Open source | Widely used, multiple size options |

Putting It Together

These techniques compose. A production RAG pipeline looks like this:

Indexing (run once, or when documents change):

  1. Structure-aware chunking at paragraph/section boundaries
  2. Build parent-child chunk relationships
  3. Generate contextual prefix per child chunk (LLM call)
  4. Embed context + chunk text together
  5. Store embedding, context-prefixed text, original text, and parent_id

Query (every request):

  1. (Optional) HyDE: generate hypothetical answer, embed it
  2. Vector search: retrieve top-50 child chunks by embedding similarity
  3. Expand to parent chunks via parent_id lookup
  4. Rerank: score all candidates with cross-encoder, take top-5
  5. Inject top-5 parent chunks into the LLM prompt
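
The query-side steps can be wired together roughly as below. Every callable is a hypothetical placeholder for a real component, and reranking against the original query rather than the hypothetical answer is an assumption of this sketch, not something the steps above prescribe.

```python
from typing import Callable, Optional


def build_context(
    query: str,
    embed: Callable[[str], list[float]],
    vector_search: Callable[[list[float], int], list[dict]],  # returns child records
    parents: dict[str, str],                                   # parent_id -> parent text
    score_pair: Callable[[str, str], float],
    generate: Optional[Callable[[str], str]] = None,           # HyDE generator, optional
    n_candidates: int = 50,
    n_final: int = 5,
) -> list[str]:
    # 1. (Optional) HyDE: search with a hypothetical answer instead of the raw query.
    search_text = generate(query) if generate is not None else query
    # 2. Vector search over child chunks.
    children = vector_search(embed(search_text), n_candidates)
    # 3. Expand to parents; dict.fromkeys drops duplicates and keeps first-seen order.
    parent_ids = list(dict.fromkeys(child["parent_id"] for child in children))
    # 4. Rerank parents against the original query (assumption: not the hypothetical).
    parent_ids.sort(key=lambda pid: score_pair(query, parents[pid]), reverse=True)
    # 5. Return the top parent texts for injection into the prompt.
    return [parents[pid] for pid in parent_ids[:n_final]]
```

Because each stage is a separate argument, any one of them can be replaced or disabled without touching the others, which also makes per-stage logging straightforward.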

Which techniques to apply

You don't need all of them in every system. The order of impact:

| Technique | When to apply |
| --- | --- |
| Structure-aware chunking | Always; it eliminates the most common retrieval failures |
| Contextual Retrieval | When documents are long, hierarchical, or require understanding their own structure |
| Hierarchical chunking | When chunks need to be short for precision but long for comprehension |
| Reranking | When retrieval precision matters more than query latency |
| HyDE | When queries are short and semantically distant from document vocabulary |

Start with structure-aware chunking. That alone eliminates most naive failures. Add the rest as evidence from your evals shows the need.

What This Changes for PMs

The naive view is "we need better chunks." The actual levers are four distinct systems:

  • Indexing quality: how well the pipeline represents your documents (structure-aware chunking, contextual retrieval)
  • Index structure: how retrieval granularity relates to comprehension context (hierarchical chunking)
  • Query translation: how well user intent converts to search vectors (HyDE)
  • Retrieval precision: how well the system filters retrieved candidates before they reach the LLM (reranking)

When a RAG system gives a wrong answer, you now have four places to look, and three of them are not "the LLM got it wrong."

The operational implication: evals need to instrument each stage separately. Logging only the final answer tells you something went wrong. Logging retrieved chunks before and after reranking tells you where.
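
One way to make per-stage logging concrete is to compute a hit rate at each stage from the logged chunk IDs. The record fields (`relevant_id`, `retrieved_ids`, `reranked_ids`) and the metric names are hypothetical, and the sketch assumes each eval example has exactly one known relevant chunk.

```python
def stage_hit_rates(examples: list[dict], k: int = 5) -> dict[str, float]:
    """Fraction of examples whose relevant chunk survives each stage."""
    n = len(examples)
    if n == 0:
        return {"retrieval_hit_rate": 0.0, "post_rerank_hit_rate": 0.0}
    # Stage 1: did vector search surface the relevant chunk at all?
    in_candidates = sum(ex["relevant_id"] in ex["retrieved_ids"] for ex in examples)
    # Stage 2: did it survive reranking into the top-k that reach the prompt?
    in_top_k = sum(ex["relevant_id"] in ex["reranked_ids"][:k] for ex in examples)
    return {
        "retrieval_hit_rate": in_candidates / n,
        "post_rerank_hit_rate": in_top_k / n,
    }
```

A low first number points at chunking, indexing, or query translation; a high first number with a low second number points at the reranker.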

PM Tip

The most cost-effective production improvement is usually contextual retrieval. One LLM call per chunk at indexing time, a one-time cost, is cheaper than the compounding cost of wrong answers, user drop-off, and repeated support escalations. Index quality is a fixed investment; retrieval failures are an ongoing cost.

Further Reading

Gao, L. et al., arXiv (2022)
Read: Abstract, Figure 1, Section 3 (the HyDE method). Skip: Sections 4–6 (experiments, results tables).
HyDE: instead of embedding the user's query directly, generate a hypothetical document that would answer it, then use that embedding for retrieval. This bridges the semantic gap between short queries and long document chunks.