The Systematic RAG Improvement Runbook
Your RAG system is not bad at answering. It is bad at finding. This guide covers the full improvement loop — from measuring retrieval before you touch anything else, to running production experiments that actually tell you what changed and why.
Measure retrieval before you do anything else
The typical RAG improvement cycle goes: bad answer → tweak the prompt → still bad answer → switch the model → still bad answer → add more context → still bad answer. The problem is that none of these interventions touch retrieval — and retrieval is almost always where the failure originates.
Before you change any prompt, model, or chunking strategy, you need a number: what percentage of the time does your retrieval system actually return the right chunk? Synthetic data gives you that number in hours, not weeks.
What the number should be
On your own knowledge base, with questions generated from your own chunks, retrieval recall (the share of questions whose source chunk appears in the top-k results) should be close to 95–98%. If it is not, you have a retrieval problem, and no synthesis improvement will compensate for a retrieval layer that regularly misses the relevant document.
Building the baseline in four steps
1. Generate questions from each chunk
Use an LLM to produce 3–5 questions that the chunk answers directly. Vary the phrasing: some keyword-heavy, some paraphrased, some inferential. The goal is to stress-test retrieval across different query styles.
2. Test each question against your retrieval system
For each synthetic question, run retrieval and check whether the source chunk appears in the top-k results. Record hit or miss per question. This is your raw recall measurement.
3. Test full-text and vector search separately
Run the same questions against each method independently before combining them. The performance gap between them reveals which failure mode dominates your dataset: keyword mismatch or semantic drift.
4. Lock the number as your baseline
Every experiment going forward is measured against this number. A change that does not improve it is not an improvement, regardless of how the answers feel subjectively.
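The idea of testing each search method separately on the same question set can be sketched as follows. This is a toy illustration, not a real retrieval stack: the corpus, the evaluation set, the synonym map, and every function name here are hypothetical. In particular, `semantic_stand_in_search` is only a placeholder for an embedding-based retriever so that the example runs without external services.

```python
from typing import Callable, Dict, List, Set, Tuple

# Toy corpus and evaluation set; all contents are made up for illustration.
CHUNKS: Dict[str, str] = {
    "c1": "Reset your password from the account settings page",
    "c2": "Invoices are emailed on the first day of each month",
    "c3": "API rate limits are 100 requests per minute",
}

# (question, id of the chunk the question was generated from)
EVAL_SET: List[Tuple[str, str]] = [
    ("how do I reset my password", "c1"),
    ("when are invoices emailed", "c2"),
    ("how can I change my login credentials", "c1"),  # paraphrase, no shared keywords
]

# Hypothetical synonym map, used only by the stand-in "semantic" retriever.
SYNONYMS = {"change": "reset", "login": "password", "credentials": "password"}


def tokenize(text: str) -> Set[str]:
    return set(text.lower().split())


def rank_by_overlap(query_tokens: Set[str], top_k: int) -> List[str]:
    """Rank chunks by number of shared tokens; drop chunks with no overlap."""
    scores = {cid: len(query_tokens & tokenize(text)) for cid, text in CHUNKS.items()}
    ranked = sorted((cid for cid, s in scores.items() if s > 0),
                    key=lambda cid: -scores[cid])
    return ranked[:top_k]


def keyword_search(query: str, top_k: int) -> List[str]:
    return rank_by_overlap(tokenize(query), top_k)


def semantic_stand_in_search(query: str, top_k: int) -> List[str]:
    # Placeholder for a vector retriever: expands the query with synonyms.
    tokens = tokenize(query)
    expanded = tokens | {SYNONYMS[t] for t in tokens if t in SYNONYMS}
    return rank_by_overlap(expanded, top_k)


def recall_at_k(search: Callable[[str, int], List[str]],
                eval_set: List[Tuple[str, str]], top_k: int) -> float:
    """Share of questions whose source chunk appears in the top-k results."""
    if not eval_set:
        return 0.0
    hits = sum(1 for question, source_id in eval_set
               if source_id in search(question, top_k))
    return hits / len(eval_set)


methods = {"keyword": keyword_search, "semantic_stand_in": semantic_stand_in_search}
results = {name: recall_at_k(fn, EVAL_SET, top_k=1) for name, fn in methods.items()}
for name, recall in results.items():
    print(f"{name}: recall@1 = {recall:.1%}")
```

In this toy setup the paraphrased question is the one the keyword method misses, which is the kind of gap the per-method comparison is meant to surface. With a real system you would pass your own full-text and vector search callables into `recall_at_k` instead.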
What the data actually looks like
- Vector on long-form docs: ~96% (high recall, semantics match well)
- BM25 on long-form docs: ~94% (slightly lower, much faster)
- BM25 on structured data: ~60% (keyword mismatch dominates)
Long-form prose and structured data behave completely differently. Do not assume your retrieval method transfers across content types — always measure per corpus type.
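One way to measure per corpus type is to tag each hit-or-miss record with the kind of content its source chunk came from and aggregate recall per tag. The sketch below assumes you already have such records; the `recall_by_corpus_type` helper, the tag names, and the sample records are hypothetical placeholders, not measured results.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def recall_by_corpus_type(records: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Aggregate (corpus_type, hit) records into recall per corpus type."""
    hits: Dict[str, int] = defaultdict(int)
    totals: Dict[str, int] = defaultdict(int)
    for corpus_type, hit in records:
        totals[corpus_type] += 1
        if hit:
            hits[corpus_type] += 1
    return {ctype: hits[ctype] / totals[ctype] for ctype in totals}


# Placeholder records: one entry per synthetic question.
records = [
    ("long_form", True),
    ("long_form", True),
    ("long_form", False),
    ("structured", True),
    ("structured", False),
]

per_type = recall_by_corpus_type(records)
for ctype, recall in per_type.items():
    print(f"{ctype}: recall = {recall:.1%}")
```

Reporting one number per corpus type keeps a strong long-form score from hiding a weak structured-data score in the overall average.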
# Minimal synthetic evaluation loop
# Assumes `knowledge_base`, `llm`, and `retriever` are provided by your own stack.
hits = 0
total = 0
for chunk in knowledge_base:
    questions = llm.generate_questions(chunk.text, n=4)
    for question in questions:
        results = retriever.search(question, top_k=5)
        retrieved_ids = [r.chunk_id for r in results]
        if chunk.id in retrieved_ids:
            hits += 1
        total += 1
recall = hits / total if total else 0.0  # guard against an empty knowledge base
print(f"Retrieval recall: {recall:.1%}")
# If this number is below 0.90, fix retrieval first
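Locking the measured recall as a baseline can be as simple as writing it to a file and comparing every later run against it. The following is a minimal sketch under assumptions: the file name, the `save_baseline` and `compare_to_baseline` helpers, and the recall values are all hypothetical placeholders, and a real setup would likely also record the dataset version and retriever configuration.

```python
import json
import tempfile
from pathlib import Path


def save_baseline(path: Path, recall: float, top_k: int) -> None:
    """Persist the baseline recall together with the top_k it was measured at."""
    path.write_text(json.dumps({"recall": recall, "top_k": top_k}))


def compare_to_baseline(path: Path, new_recall: float) -> float:
    """Return the change in recall relative to the stored baseline."""
    baseline = json.loads(path.read_text())
    return new_recall - baseline["recall"]


# Placeholder values; substitute the recall from your own evaluation runs.
baseline_path = Path(tempfile.mkdtemp()) / "retrieval_baseline.json"
save_baseline(baseline_path, recall=0.91, top_k=5)

delta = compare_to_baseline(baseline_path, new_recall=0.94)
print(f"Change vs baseline: {delta:+.1%}")
if delta <= 0:
    print("No improvement over baseline; treat the change as a non-improvement.")
```

Storing `top_k` next to the recall value is a small safeguard: a later run measured at a different cutoff is not comparable to the baseline.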