You can't improve retrieval you don't measure. The offline eval harness that lets us change embeddings, chunking, and rerankers with confidence instead of vibes — with the metrics that actually predict production quality.


RAG Retrieval Evaluation — Building an Offline Eval Harness Before You Ship

Every RAG quality problem we've debugged traced back to retrieval, not generation. The model can only answer from what you put in its context; if the right chunk wasn't retrieved, no prompt engineering saves you. Yet most teams evaluate RAG by eyeballing a few answers. That doesn't scale and it doesn't catch regressions. Here's the offline harness that let us change embeddings, chunking, and rerankers and know whether quality moved.

Separate retrieval evaluation from generation evaluation #

The first principle: measure retrieval independently. End-to-end answer quality conflates two failures — bad retrieval and bad generation — and you can't fix what you can't isolate. Build two harnesses:

1. Retrieval eval: given a query, did we fetch the chunks that contain the answer? (Metrics: recall@k, MRR, nDCG.) Fast, deterministic, cheap, with no LLM call.
2. Generation eval: given the right chunks, did the model produce a faithful answer? (Faithfulness, answer relevance.) Slower, needs an LLM judge.

Most of your iteration happens in the retrieval harness, because retrieval is where most quality lives and it's the part you can evaluate in milliseconds.

Build a golden set #

You need labeled data: queries paired with the chunk(s) that should be retrieved. To build it:

Mine production logs: real user queries are gold. Sample them.
Annotate the answer location: for each query, mark which document/chunk contains the answer. This is the tedious part; 100–200 well-labeled queries beat 10,000 unlabeled ones.
Cover the failure modes you've seen: acronyms, multi-hop questions, near-duplicate documents, time-sensitive queries.

```python
golden = [
    {"query": "how do I rotate the signing key",
     "relevant_chunk_ids": ["doc_42#3", "doc_42#4"]},
    {"query": "what's the default retention period",
     "relevant_chunk_ids": ["doc_17#1"]},
    # ...
]
```
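Because hand labeling is tedious, a quick sanity check on the golden set can catch mistakes before they skew the metrics. The following is a minimal sketch, not part of the harness described here: `validate_golden` and its optional `known_chunk_ids` argument are hypothetical names, and the checks assume the record shape shown above.

```python
def validate_golden(golden, known_chunk_ids=None):
    """Return a list of problems found; an empty list means the set looks usable.

    `known_chunk_ids` is an optional set of ids that exist in the index
    (an assumption of this sketch; pass None to skip that check).
    """
    problems = []
    seen_queries = set()
    for i, case in enumerate(golden):
        query = case.get("query", "").strip()
        relevant = case.get("relevant_chunk_ids", [])
        if not query:
            problems.append(f"case {i}: empty query")
        elif query.lower() in seen_queries:
            problems.append(f"case {i}: duplicate query {query!r}")
        seen_queries.add(query.lower())
        if not relevant:
            problems.append(f"case {i}: no relevant chunk ids")
        if len(relevant) != len(set(relevant)):
            problems.append(f"case {i}: repeated chunk id in labels")
        if known_chunk_ids is not None:
            missing = [cid for cid in relevant if cid not in known_chunk_ids]
            if missing:
                problems.append(f"case {i}: unknown chunk ids {missing}")
    return problems
```

Running this before every eval keeps stale labels (for example, chunk ids that disappeared after a re-chunking) from silently lowering recall.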

The core metrics #

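```python
def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the relevant chunks that appear in the top k results."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0  # nothing labeled for this query; avoid dividing by zero
    hit = len(set(retrieved_ids[:k]) & relevant)
    return hit / len(relevant)

def reciprocal_rank(retrieved_ids, relevant_ids):
    """1 / rank of the first relevant chunk, or 0.0 if none was retrieved."""
    relevant = set(relevant_ids)
    for i, rid in enumerate(retrieved_ids, start=1):
        if rid in relevant:
            return 1.0 / i
    return 0.0
```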

recall@k answers "is the answer even in the context we'll send the model?" If recall@8 is 0.7, about 30% of the labeled relevant chunks never reach the model; with one relevant chunk per query, that means 30% of queries cannot be answered correctly no matter how good the model is. This is the single most important retrieval number.
MRR (mean reciprocal rank) answers "how high up is the first relevant chunk?" It matters because position in the context window affects how well the model uses a chunk: chunks buried in the middle get used less.
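nDCG is listed among the retrieval metrics earlier but not shown. A minimal sketch of nDCG@k follows, assuming binary relevance (a chunk is either labeled relevant or not) and unique ids in the retrieved list; `ndcg_at_k` is a name chosen for this example, and graded relevance would need a different gain term.

```python
import math

def ndcg_at_k(retrieved_ids, relevant_ids, k):
    """nDCG@k with binary relevance: 1.0 means every relevant chunk
    that could fit in the top k is ranked ahead of all irrelevant ones."""
    relevant = set(relevant_ids)
    if not relevant or k <= 0:
        return 0.0
    # Discounted gain: a hit at rank r contributes 1 / log2(r + 1).
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, rid in enumerate(retrieved_ids[:k], start=1)
        if rid in relevant
    )
    # Ideal ordering puts all relevant chunks (up to k of them) first.
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg
```

Unlike MRR, which only looks at the first relevant chunk, this rewards ranking all relevant chunks high, which is useful when a query has several labeled chunks.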

Run it on every change #

The harness turns "I think the new embedding model is better" into a number:

```python
from statistics import mean

def evaluate(retriever, golden, k=8):
    """Average recall@k and MRR over the golden set."""
    recalls, rrs = [], []
    for case in golden:
        # retriever.search is expected to return chunk ids, best match first
        ids = retriever.search(case["query"], k=k)
        recalls.append(recall_at_k(ids, case["relevant_chunk_ids"], k))
        rrs.append(reciprocal_rank(ids, case["relevant_chunk_ids"]))
    return {"recall@k": mean(recalls), "mrr": mean(rrs)}
```

We run this in CI on the golden set for every change to the retrieval stack. A PR that swaps the embedding model now shows recall@8: 0.82 → 0.89 or recall@8: 0.82 → 0.74 — and the second one doesn't merge.
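One way to turn that comparison into a merge gate is sketched below. This is an illustration under assumptions, not the actual CI setup: `check_regression`, `gate`, and the `tolerance` value are hypothetical, and the baseline is assumed to be a metrics dict saved from the main branch.

```python
def check_regression(baseline, candidate, tolerance=0.01):
    """Return the metrics where the candidate is worse than the baseline
    by more than `tolerance`, mapped to (baseline_value, candidate_value)."""
    failures = {}
    for name, base_value in baseline.items():
        new_value = candidate.get(name)
        # A missing metric counts as a failure so it cannot slip through.
        if new_value is None or new_value < base_value - tolerance:
            failures[name] = (base_value, new_value)
    return failures

def gate(baseline, candidate, tolerance=0.01):
    """Print any regressions and return a process exit code (0 = pass)."""
    failures = check_regression(baseline, candidate, tolerance)
    for name, (old, new) in failures.items():
        print(f"REGRESSION {name}: {old} -> {new}")
    return 1 if failures else 0
```

A CI step could end with `sys.exit(gate(baseline, evaluate(retriever, golden)))`, so a drop such as recall@8 0.82 → 0.74 fails the build while 0.82 → 0.89 passes. The small tolerance is there so harmless noise does not block merges; what value is appropriate depends on the size of the golden set.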

What it caught that vibes missed #

An embedding upgrade that improved average recall but tanked it specifically on acronym queries — invisible in spot checks, obvious when we sliced metrics by query type.
A chunking change (bigger chunks) that raised recall but pushed the relevant text to the bottom of larger chunks, hurting generation faithfulness downstream. We'd have shipped it on retrieval metrics alone; the generation harness flagged it.
A reranker that improved MRR but added 180ms p95 for a recall gain that didn't justify the latency. Now a measurable tradeoff, not a guess.
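To put a latency number next to the quality numbers, the same golden queries can be timed. The sketch below is one possible approach, assuming the `retriever.search(query, k=k)` interface used in this article; `search_latency_ms` and `p95` are names invented for the example, and wall-clock timing on a CI runner is noisy, so treat the result as indicative.

```python
import time
from statistics import quantiles

def search_latency_ms(retriever, queries, k=8):
    """Time each search call and return the latencies in milliseconds."""
    samples = []
    for query in queries:
        start = time.perf_counter()
        retriever.search(query, k=k)
        samples.append((time.perf_counter() - start) * 1000.0)
    return samples

def p95(samples):
    """95th percentile of the samples (needs at least two data points)."""
    # quantiles(n=100) returns 99 cut points; index 94 is the 95th percentile.
    return quantiles(samples, n=100)[94]
```

Reporting `p95(search_latency_ms(...))` for the baseline and the candidate alongside recall@k and MRR makes the reranker tradeoff visible in the same PR output.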

Slice, don't average #

A single aggregate recall number hides the failures that matter. Always break down by query category (factual, multi-hop, time-sensitive, rare-term). A model that's +5% on average but -20% on the multi-hop slice is a regression for your hardest users. The aggregate is for the dashboard; the slices are for the decision.
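A sketch of the per-slice breakdown follows. It assumes each golden record carries an extra `"category"` field, which the golden set shown earlier does not include, and that `retriever.search` returns ranked chunk ids; `recall_by_category` is a hypothetical helper name.

```python
from collections import defaultdict
from statistics import mean

def recall_by_category(retriever, golden, k=8):
    """Mean recall@k per query category instead of one aggregate number."""
    per_category = defaultdict(list)
    for case in golden:
        relevant = set(case["relevant_chunk_ids"])
        if not relevant:
            continue  # skip unlabeled cases rather than divide by zero
        ids = retriever.search(case["query"], k=k)
        hits = len(set(ids[:k]) & relevant)
        # Records without a category land in a catch-all bucket.
        category = case.get("category", "uncategorized")
        per_category[category].append(hits / len(relevant))
    return {category: mean(values) for category, values in per_category.items()}
```

Comparing this dict between baseline and candidate, slice by slice, is what surfaces a change that helps on average but hurts one category.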

The discipline #

Retrieval quality is a measurable engineering quantity, not a feel. Build the golden set once, automate recall@k and MRR in CI, separate retrieval from generation, and slice by query type. Then every change to embeddings, chunking, or reranking becomes a number you can defend — and regressions stop reaching production disguised as improvements.

About Kiril Urbonas
