Free Guide · The Agent Camp

The Systematic RAG Improvement Runbook

Your RAG system is not bad at answering. It is bad at finding. This guide covers the full improvement loop — from measuring retrieval before you touch anything else, to running production experiments that actually tell you what changed and why.

01 Synthetic Data
02 Metadata
03 Hybrid Search
04 User Feedback
05 Topic Clustering
06 Monitoring
07 Latency Trade-offs
01 Synthetic Data

Measure retrieval before you do anything else

The typical RAG improvement cycle goes: bad answer → tweak the prompt → still bad answer → switch the model → still bad answer → add more context → still bad answer. The problem is that none of these interventions touch retrieval — and retrieval is almost always where the failure originates.

Before you change any prompt, model, or chunking strategy, you need a number: what percentage of the time does your retrieval system actually return the right chunk? Synthetic data gives you that number in hours, not weeks.

What the number should be

On your own knowledge base, with questions generated from your own chunks, retrieval recall should be close to 95–98%. If it is not, you have a retrieval problem. No synthesis improvement will compensate for a retrieval layer that regularly misses the relevant document.
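
One way to pin the number down, assuming each synthetic question q has exactly one source chunk c_q and the retriever returns a ranked top-k list R_k(q). This notation is introduced here for illustration and is not part of the original guide:

```latex
\[
\text{recall@}k \;=\; \frac{1}{|Q|} \sum_{q \in Q} \mathbf{1}\!\left[\, c_q \in R_k(q) \,\right]
\]
```

For example, if 188 of 200 synthetic questions return their source chunk in the top 5, recall@5 is 188 / 200 = 94%.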

Building the baseline in four steps

  1. Generate questions from each chunk

    Use an LLM to produce 3–5 questions that the chunk answers directly. Vary the phrasing: some keyword-heavy, some paraphrased, some inferential. The goal is to stress-test retrieval across different query styles.

  2. Test each question against your retrieval system

    For each synthetic question, run retrieval and check whether the source chunk appears in the top-k results. Record hit or miss per question. This is your raw recall measurement.

  3. Test full-text and vector search separately

    Run the same questions against each method independently before combining them. The performance gap between them reveals which failure mode dominates your dataset: keyword mismatch or semantic drift.

  4. Lock the number as your baseline

    Every experiment going forward is measured against this number. A change that does not improve it is not an improvement, regardless of how the answers feel subjectively.
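
Steps 2 and 3 can be sketched as one scoring helper that accepts any search function and is run once per method. This is a minimal sketch under assumptions: `recall_at_k`, `keyword_search`, and `trigram_search` are hypothetical names, the corpus and questions are made-up illustrations, and the two search functions are toy stand-ins for your real full-text and vector retrievers (character trigram overlap is not an embedding model).

```python
import re
from typing import Callable, Dict, List, Set, Tuple

# Toy corpus: chunk_id -> text (illustrative data, not from a real system)
CHUNKS: Dict[str, str] = {
    "c1": "Refunds are processed within five business days.",
    "c2": "The API rate limit is 100 requests per minute.",
    "c3": "Passwords must contain at least twelve characters.",
}

# Synthetic questions paired with the chunk each one was generated from
QUESTIONS: List[Tuple[str, str]] = [
    ("How long do refunds take?", "c1"),
    ("What is the API rate limit?", "c2"),
    ("How many characters must passwords contain?", "c3"),
]


def tokens(text: str) -> Set[str]:
    """Lowercased word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def trigrams(text: str) -> Set[str]:
    """Lowercased character trigrams."""
    t = text.lower()
    return {t[i:i + 3] for i in range(len(t) - 2)}


def keyword_search(query: str, top_k: int) -> List[str]:
    """Stand-in for full-text search: rank chunks by shared word count."""
    q = tokens(query)
    ranked = sorted(CHUNKS, key=lambda cid: len(q & tokens(CHUNKS[cid])), reverse=True)
    return ranked[:top_k]


def trigram_search(query: str, top_k: int) -> List[str]:
    """Stand-in for vector search: rank chunks by trigram Jaccard similarity."""
    q = trigrams(query)

    def jaccard(cid: str) -> float:
        c = trigrams(CHUNKS[cid])
        union = q | c
        return len(q & c) / len(union) if union else 0.0

    return sorted(CHUNKS, key=jaccard, reverse=True)[:top_k]


def recall_at_k(
    questions: List[Tuple[str, str]],
    search: Callable[[str, int], List[str]],
    k: int,
) -> float:
    """Fraction of questions whose source chunk appears in the top-k results."""
    if not questions:
        return 0.0
    hits = sum(1 for question, source_id in questions if source_id in search(question, k))
    return hits / len(questions)


# Score each method independently on the same question set
for name, search in [("keyword", keyword_search), ("trigram", trigram_search)]:
    print(f"{name}: recall@1 = {recall_at_k(QUESTIONS, search, 1):.1%}")
```

Keeping the scorer independent of the search method means the same question set and the same k are used for every retriever, so the gap between the two numbers reflects the method rather than the test.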

What the data actually looks like

Vector on long-form docs: ~96% (high recall, semantics match well)
BM25 on long-form docs: ~94% (slightly lower, much faster)
BM25 on structured data: ~60% (keyword mismatch dominates)

Long-form prose and structured data behave completely differently. Do not assume your retrieval method transfers across content types — always measure per corpus type.
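
Measuring per corpus type can be as simple as tagging each hit-or-miss record with the kind of content its source chunk came from and aggregating separately. A minimal sketch, assuming a hypothetical `recall_by_corpus_type` helper and made-up labels and records; none of the values below are measurements:

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def recall_by_corpus_type(records: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Aggregate (corpus_type, hit) records into one recall value per corpus type."""
    hits: Dict[str, int] = defaultdict(int)
    totals: Dict[str, int] = defaultdict(int)
    for corpus_type, hit in records:
        totals[corpus_type] += 1
        hits[corpus_type] += int(hit)
    return {corpus_type: hits[corpus_type] / totals[corpus_type] for corpus_type in totals}


# Illustrative records only: one entry per synthetic question
records = [
    ("long_form", True),
    ("long_form", True),
    ("long_form", True),
    ("long_form", False),
    ("structured", True),
    ("structured", False),
]

for corpus_type, recall in recall_by_corpus_type(records).items():
    print(f"{corpus_type}: {recall:.1%}")
```

A single blended recall figure would hide exactly the gap this section describes, which is why the records stay split by type until the final report.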

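# Minimal synthetic evaluation loop.
# knowledge_base, llm, and retriever are placeholders for your own objects.
def evaluate_recall(knowledge_base, llm, retriever, top_k=5, n_questions=4):
    hits = 0
    total = 0
    for chunk in knowledge_base:
        # Generate questions that this chunk should answer directly
        questions = llm.generate_questions(chunk.text, n=n_questions)
        for question in questions:
            results = retriever.search(question, top_k=top_k)
            retrieved_ids = {r.chunk_id for r in results}
            # A hit means the source chunk came back in the top-k results
            if chunk.id in retrieved_ids:
                hits += 1
            total += 1
    # Guard against an empty knowledge base
    return hits / total if total else 0.0

recall = evaluate_recall(knowledge_base, llm, retriever)
print(f"Retrieval recall: {recall:.1%}")
# If this number is below 0.90, fix retrieval first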
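
Locking the number as a baseline can be as light as writing it to a file and comparing every later run against it. A minimal sketch, assuming hypothetical helpers `lock_baseline` and `is_improvement` and a JSON file layout chosen here for illustration; the recall values in the demo are made up:

```python
import json
import tempfile
from pathlib import Path


def lock_baseline(path: Path, recall: float, top_k: int) -> None:
    """Persist the baseline recall together with the k it was measured at."""
    path.write_text(json.dumps({"recall": recall, "top_k": top_k}))


def is_improvement(path: Path, new_recall: float) -> bool:
    """True only if the new run strictly beats the stored baseline."""
    baseline = json.loads(path.read_text())
    return new_recall > baseline["recall"]


if __name__ == "__main__":
    baseline_file = Path(tempfile.mkdtemp()) / "baseline.json"
    lock_baseline(baseline_file, recall=0.91, top_k=5)  # illustrative value
    print(is_improvement(baseline_file, 0.93))
    print(is_improvement(baseline_file, 0.91))
```

Storing `top_k` next to the recall value is a reminder that later experiments should be scored at the same k, otherwise the comparison is not like for like.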