You can't improve retrieval you don't measure. The offline eval harness that lets us change embeddings, chunking, and rerankers with confidence instead of vibes — with the metrics that actually predict production quality.
Every RAG quality problem we've debugged traced back to retrieval, not generation. The model can only answer from what you put in its context; if the right chunk wasn't retrieved, no prompt engineering saves you. Yet most teams evaluate RAG by eyeballing a few answers. That doesn't scale and it doesn't catch regressions. Here's the offline harness that let us change embeddings, chunking, and rerankers and know whether quality moved.
The first principle: measure retrieval independently. End-to-end answer quality conflates two failures — bad retrieval and bad generation — and you can't fix what you can't isolate. Build two harnesses:
Most of your iteration happens in the retrieval harness (1), because retrieval is where most quality lives and it's the part you can evaluate in milliseconds.
You need labeled data: queries paired with the chunk(s) that should be retrieved. Sources:
golden = [
    {"query": "how do I rotate the signing key",
     "relevant_chunk_ids": ["doc_42#3", "doc_42#4"]},
    {"query": "what's the default retention period",
     "relevant_chunk_ids": ["doc_17#1"]},
    # ...
]
def recall_at_k(retrieved_ids, relevant_ids, k):
    # Fraction of the relevant chunks that appear in the top k results.
    top_k = retrieved_ids[:k]
    hit = len(set(top_k) & set(relevant_ids))
    return hit / len(relevant_ids)

def reciprocal_rank(retrieved_ids, relevant_ids):
    # 1 / rank of the first relevant chunk; 0.0 if none was retrieved.
    for i, rid in enumerate(retrieved_ids, start=1):
        if rid in relevant_ids:
            return 1.0 / i
    return 0.0
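As a quick sanity check of the two metrics, here is a worked example on a made-up ranking. The retrieved chunk ids are hypothetical, and the two functions are repeated only so the snippet runs on its own.

```python
def recall_at_k(retrieved_ids, relevant_ids, k):
    top_k = retrieved_ids[:k]
    hit = len(set(top_k) & set(relevant_ids))
    return hit / len(relevant_ids)


def reciprocal_rank(retrieved_ids, relevant_ids):
    for i, rid in enumerate(retrieved_ids, start=1):
        if rid in relevant_ids:
            return 1.0 / i
    return 0.0


# Hypothetical ranking returned for "how do I rotate the signing key".
retrieved = ["doc_07#2", "doc_42#4", "doc_13#1", "doc_42#3"]
relevant = ["doc_42#3", "doc_42#4"]

r_at_2 = recall_at_k(retrieved, relevant, k=2)  # only doc_42#4 is in the top 2
r_at_4 = recall_at_k(retrieved, relevant, k=4)  # both relevant chunks are in the top 4
rr = reciprocal_rank(retrieved, relevant)       # first relevant hit is at rank 2

print(r_at_2, r_at_4, rr)  # 0.5 1.0 0.5
```

Recall@k rewards finding every relevant chunk somewhere in the top k, while reciprocal rank only cares how early the first one shows up, which is why the two are worth tracking together.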
The harness turns "I think the new embedding model is better" into a number:
from statistics import mean

def evaluate(retriever, golden, k=8):
    # Average recall@k and MRR over every labeled query in the golden set.
    recalls, rrs = [], []
    for case in golden:
        ids = retriever.search(case["query"], k=k)
        recalls.append(recall_at_k(ids, case["relevant_chunk_ids"], k))
        rrs.append(reciprocal_rank(ids, case["relevant_chunk_ids"]))
    return {"recall@k": mean(recalls), "mrr": mean(rrs)}
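The only thing the harness assumes about a retriever is that `search(query, k=k)` returns a ranked list of chunk ids. A minimal end-to-end sketch of that contract follows. `KeywordRetriever` and the three-chunk corpus are hypothetical stand-ins invented for illustration, not our production retriever, and the metric functions are repeated so the snippet is self-contained.

```python
from statistics import mean


def recall_at_k(retrieved_ids, relevant_ids, k):
    return len(set(retrieved_ids[:k]) & set(relevant_ids)) / len(relevant_ids)


def reciprocal_rank(retrieved_ids, relevant_ids):
    for i, rid in enumerate(retrieved_ids, start=1):
        if rid in relevant_ids:
            return 1.0 / i
    return 0.0


def evaluate(retriever, golden, k=8):
    recalls, rrs = [], []
    for case in golden:
        ids = retriever.search(case["query"], k=k)
        recalls.append(recall_at_k(ids, case["relevant_chunk_ids"], k))
        rrs.append(reciprocal_rank(ids, case["relevant_chunk_ids"]))
    return {"recall@k": mean(recalls), "mrr": mean(rrs)}


class KeywordRetriever:
    """Hypothetical stand-in: ranks chunks by word overlap with the query."""

    def __init__(self, chunks):
        self.chunks = chunks  # {chunk_id: text}

    def search(self, query, k):
        words = set(query.lower().split())
        ranked = sorted(
            self.chunks,
            key=lambda cid: len(words & set(self.chunks[cid].lower().split())),
            reverse=True,
        )
        return ranked[:k]


# Made-up corpus and labels, for illustration only.
chunks = {
    "doc_42#3": "rotate the signing key with the admin cli",
    "doc_17#1": "the default retention period is thirty days",
    "doc_99#1": "billing and invoices overview",
}
golden = [
    {"query": "how do I rotate the signing key", "relevant_chunk_ids": ["doc_42#3"]},
    {"query": "what's the default retention period", "relevant_chunk_ids": ["doc_17#1"]},
]

scores = evaluate(KeywordRetriever(chunks), golden, k=2)
print(scores)
```

Because the harness depends only on `search`, swapping in a different embedding model or adding a reranker means passing a different object to `evaluate`, with the golden set and metrics left unchanged.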
We run this in CI on the golden set for every change to the retrieval stack. A PR that swaps the embedding model now shows recall@8: 0.82 → 0.89 or recall@8: 0.82 → 0.74 — and the second one doesn't merge.
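One way to enforce that merge rule is a small gate that compares the candidate's metrics against a stored baseline and reports a failing exit code on a regression. This is a sketch under assumptions: `gate`, `main`, and the `tolerance` threshold are hypothetical names, the MRR values are illustrative, and a real job would load both dicts from the harness output instead of hard-coding them.

```python
def gate(baseline, candidate, tolerance=0.01):
    """Return metrics that dropped by more than `tolerance` (an assumed threshold)."""
    regressions = {}
    for name, base_value in baseline.items():
        new_value = candidate.get(name, 0.0)  # a missing metric counts as a drop
        if new_value < base_value - tolerance:
            regressions[name] = (base_value, new_value)
    return regressions


def main(baseline, candidate):
    regressions = gate(baseline, candidate)
    for name, (old, new) in regressions.items():
        print(f"REGRESSION {name}: {old:.2f} -> {new:.2f}")
    return 1 if regressions else 0


baseline = {"recall@8": 0.82, "mrr": 0.61}   # illustrative numbers
candidate = {"recall@8": 0.74, "mrr": 0.63}

exit_code = main(baseline, candidate)
# In a CI job, the script would finish with: sys.exit(exit_code)
```

The small tolerance is there so that run-to-run noise does not block a merge; how large it should be depends on the size of your golden set.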
A single aggregate recall number hides the failures that matter. Always break down by query category (factual, multi-hop, time-sensitive, rare-term). A model that's +5% on average but -20% on the multi-hop slice is a regression for your hardest users. The aggregate is for the dashboard; the slices are for the decision.
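The per-slice breakdown can be sketched like this, assuming each golden case is tagged with a `category` label and per-query recall has already been computed. The `category` field and the numbers below are hypothetical; the golden set shown earlier carries no such field.

```python
from collections import defaultdict
from statistics import mean


def recall_by_slice(results):
    """Group per-query recall by category and average within each slice.

    `results` is a list of {"category": str, "recall": float} dicts;
    the `category` label is an assumed addition to the golden set.
    """
    buckets = defaultdict(list)
    for row in results:
        buckets[row["category"]].append(row["recall"])
    return {category: mean(values) for category, values in buckets.items()}


# Illustrative per-query recalls, not real measurements.
results = [
    {"category": "factual", "recall": 1.0},
    {"category": "factual", "recall": 1.0},
    {"category": "factual", "recall": 0.5},
    {"category": "multi-hop", "recall": 0.5},
    {"category": "multi-hop", "recall": 0.0},
]

overall = mean(row["recall"] for row in results)
slices = recall_by_slice(results)

print(f"overall: {overall:.2f}")
for category, value in sorted(slices.items()):
    print(f"{category}: {value:.2f}")
```

In this toy data the overall number looks tolerable while the multi-hop slice is far worse, which is exactly the pattern an aggregate hides.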
Retrieval quality is a measurable engineering quantity, not a feel. Build the golden set once, automate recall@k and MRR in CI, separate retrieval from generation, and slice by query type. Then every change to embeddings, chunking, or reranking becomes a number you can defend — and regressions stop reaching production disguised as improvements.