Free Guide · The Agent Camp

The Systematic RAG
Improvement Runbook

Your RAG system is not bad at answering. It is bad at finding. This guide covers the full improvement loop — from measuring retrieval before you touch anything else, to running production experiments that actually tell you what changed and why.

01 Synthetic Data
02 Metadata
03 Hybrid Search
04 User Feedback
05 Topic Clustering
06 Monitoring
07 Latency Trade-offs


01 Synthetic Data

Measure retrieval before you do anything else

The typical RAG improvement cycle goes: bad answer → tweak the prompt → still bad answer → switch the model → still bad answer → add more context → still bad answer. The problem is that none of these interventions touch retrieval — and retrieval is almost always where the failure originates.

Before you change any prompt, model, or chunking strategy, you need a number: what percentage of the time does your retrieval system actually return the right chunk? Synthetic data gives you that number in hours, not weeks.

What the number should be

On your own knowledge base, with questions generated from your own chunks, retrieval recall should be close to 95–98%. If it is not, you have a retrieval problem. No synthesis improvement will compensate for a retrieval layer that regularly misses the relevant document.

Building the baseline in four steps

1. Generate questions from each chunk
Use an LLM to produce 3–5 questions that the chunk answers directly. Vary the phrasing: some keyword-heavy, some paraphrased, some inferential. The goal is to stress-test retrieval across different query styles.

2. Test each question against your retrieval system
For each synthetic question, run retrieval and check whether the source chunk appears in the top-k results. Record hit or miss per question. This is your raw recall measurement.

3. Test full-text and vector search separately
Run the same questions against each method independently before combining them. The performance gap between them reveals which failure mode dominates your dataset: keyword mismatch or semantic drift.

4. Lock the number as your baseline
Every experiment going forward is measured against this number. A change that does not improve it is not an improvement, regardless of how the answers feel subjectively.
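
Step 3 can be sketched as one recall function run against two interchangeable search functions. This is a minimal illustration, not a production setup: the corpus and questions are made up, `keyword_search` is a token-overlap stand-in for full-text search, and `similarity_search` uses character-trigram cosine similarity as a stand-in for an embedding model. All names here are hypothetical.

```python
import math
from collections import Counter

# Toy corpus and synthetic questions (illustrative data, not measurements)
CHUNKS = {
    "c1": "Refunds are processed within five business days of the return.",
    "c2": "The API rate limit is 100 requests per minute per key.",
    "c3": "Passwords must be rotated every ninety days.",
}
QUESTIONS = [
    ("How long do refunds take to process?", "c1"),
    ("What is the API rate limit per key?", "c2"),
    ("How often must passwords be rotated?", "c3"),
]

def tokens(text):
    # Lowercase, whitespace-split, strip basic punctuation
    return [t.strip(".,?!").lower() for t in text.split()]

def keyword_search(query, top_k):
    """Stand-in for full-text search: rank chunks by shared token count."""
    q = set(tokens(query))
    ranked = sorted(CHUNKS, key=lambda cid: -len(q & set(tokens(CHUNKS[cid]))))
    return ranked[:top_k]

def trigrams(text):
    s = text.lower()
    return Counter(s[i:i + 3] for i in range(len(s) - 2))

def cosine(a, b):
    dot = sum(a[g] * b[g] for g in a if g in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def similarity_search(query, top_k):
    """Stand-in for vector search: rank chunks by trigram cosine similarity."""
    q = trigrams(query)
    ranked = sorted(CHUNKS, key=lambda cid: -cosine(q, trigrams(CHUNKS[cid])))
    return ranked[:top_k]

def recall_at_k(search, top_k):
    # Fraction of questions whose source chunk appears in the top-k results
    hits = sum(1 for question, source_id in QUESTIONS if source_id in search(question, top_k))
    return hits / len(QUESTIONS)

# Measure each method on its own before combining them
report = {
    "full_text": recall_at_k(keyword_search, top_k=1),
    "vector_stand_in": recall_at_k(similarity_search, top_k=1),
}
for name, value in report.items():
    print(f"{name}: recall@1 = {value:.0%}")
```

Because both methods share the same question set and the same `recall_at_k` function, any gap between the two numbers comes from the retrieval method rather than from the evaluation harness.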

What the data actually looks like

Vector on long-form docs: ~96% (high recall, semantics match well)
BM25 on long-form docs: ~94% (slightly lower, much faster)
BM25 on structured data: ~60% (keyword mismatch dominates)

Long-form prose and structured data behave completely differently. Do not assume your retrieval method transfers across content types — always measure per corpus type.
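
One way to measure per corpus type is to tag every evaluated question with the kind of content its source chunk came from, then report recall per tag rather than a single blended number. The sketch below assumes hit/miss results have already been recorded; the `results` list and the `long_form` / `structured` labels are invented for illustration and are not measurements.

```python
from collections import defaultdict

# Hypothetical per-question outcomes: (corpus_type, hit)
results = [
    ("long_form", True), ("long_form", True), ("long_form", False), ("long_form", True),
    ("structured", True), ("structured", False), ("structured", False), ("structured", True),
]

def recall_by_corpus(records):
    """Return {corpus_type: recall} from (corpus_type, hit) pairs."""
    counts = defaultdict(lambda: [0, 0])  # corpus_type -> [hits, total]
    for corpus_type, hit in records:
        counts[corpus_type][0] += int(hit)
        counts[corpus_type][1] += 1
    return {corpus: hits / total for corpus, (hits, total) in counts.items()}

per_corpus = recall_by_corpus(results)
overall = sum(hit for _, hit in results) / len(results)

print(f"overall: {overall:.1%}")
for corpus, value in per_corpus.items():
    print(f"{corpus}: {value:.1%}")
```

The blended figure can hide a weak corpus: in this toy data the overall number sits between the two per-type numbers and describes neither of them.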


# Minimal synthetic evaluation loop
hits = 0
total = 0

for chunk in knowledge_base:
    questions = llm.generate_questions(chunk.text, n=4)
    for question in questions:
        results = retriever.search(question, top_k=5)
        retrieved_ids = [r.chunk_id for r in results]
        if chunk.id in retrieved_ids:
            hits += 1
        total += 1

# Guard against an empty knowledge base to avoid dividing by zero
recall = hits / total if total else 0.0
print(f"Retrieval recall: {recall:.1%}")
# If this number is below 0.90, fix retrieval first
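
Locking the baseline (step 4) can be as simple as writing the measured recall to a file and comparing every later experiment against it. The following is a rough sketch under stated assumptions: the file name, the JSON layout, the `save_baseline` and `compare_to_baseline` helpers, and the 0.90 and 0.93 values are all hypothetical placeholders.

```python
import json
import tempfile
from pathlib import Path

def save_baseline(path, recall, top_k):
    """Persist the baseline recall along with the top_k it was measured at."""
    path.write_text(json.dumps({"recall": recall, "top_k": top_k}))

def compare_to_baseline(path, new_recall, min_gain=0.0):
    """Compare an experiment's recall with the stored baseline."""
    baseline = json.loads(path.read_text())["recall"]
    delta = new_recall - baseline
    return {
        "baseline": baseline,
        "new": new_recall,
        "delta": delta,
        "improved": delta > min_gain,  # equal or lower recall is not an improvement
    }

# Placeholder values; in practice these come from the evaluation loop
baseline_path = Path(tempfile.mkdtemp()) / "retrieval_baseline.json"
save_baseline(baseline_path, recall=0.90, top_k=5)

outcome = compare_to_baseline(baseline_path, new_recall=0.93)
print(outcome)
```

Storing `top_k` next to the recall value is a deliberate choice: recall measured at different cutoffs is not comparable, so the baseline should record how it was measured.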
