I spent three weeks chasing an answer-quality regression that turned out to be a prompt serialization change in a library upgrade. Here's what I learned about evaluating RAG.

Field Notes: RAG Retrieval Quality Evaluation

About a year ago I joined a team running a customer-facing assistant on top of a RAG pipeline. The system worked, mostly. Then quality slowly started getting worse — answers more vague, more "I don't have that information" responses, more support tickets where users complained the bot got something wrong.

I thought it would take a day to diagnose. It took 3 weeks. Most of those weeks were spent learning that we couldn't measure what we thought we were measuring. This post is what I'd tell my past self.

What we had at the start #

A reasonable-looking eval setup, on paper:

200 hand-curated question/answer pairs covering the main use cases
A pipeline that ran the assistant against each question
A judge LLM (gpt-4o) that graded each response on a 1-5 scale
A weekly cron that ran the eval and posted average scores to Slack

The scores hovered around 4.1-4.3 for months. Then they slowly drifted to 3.8 over four weeks. The drift was real, but the scores didn't tell us why.

Mistake 1: I started with the model #

The first instinct was that something had changed in the model. We were on gpt-4o-mini. I checked the OpenAI changelog: nothing relevant. I tried running the eval against gpt-4o (the bigger model) — scores went up to 4.3, which "explained" things, except it didn't, because gpt-4o-mini had been giving 4.2 for months and now wasn't.

I spent four days A/B testing model versions before I realised the model wasn't the variable that had changed.

Mistake 2: I trusted the score average #

The drop from 4.1 to 3.8 was real, but the average hides a lot. I finally pulled the per-question scores into a spreadsheet and sorted by score change. The pattern was striking: about 30 questions had dropped from 5 to 2-3. The other 170 were unchanged.

Whatever was wrong was specific to a subset. Looking at the 30 affected questions: they all involved looking up specific product SKUs.
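The spreadsheet step is easy to script. Below is a minimal sketch, assuming per-question scores are available as two dicts keyed by question ID. The function name `score_deltas` and the data are made up for illustration; the numbers are synthetic, only shaped like the incident (200 questions, 30 of which collapse).

```python
from statistics import mean


def score_deltas(previous, current):
    """Return (question_id, delta) pairs, biggest drop first."""
    shared = previous.keys() & current.keys()
    deltas = [(qid, current[qid] - previous[qid]) for qid in shared]
    # Sort by delta, then by ID so ties come out in a stable order.
    return sorted(deltas, key=lambda pair: (pair[1], pair[0]))


# Synthetic data: 30 questions scored 5.0, the other 170 scored 4.0.
previous = {f"q_{i:03d}": 5.0 if i <= 30 else 4.0 for i in range(1, 201)}
# The 30 high scorers fall to 2.5; everything else is unchanged.
current = {qid: (2.5 if score == 5.0 else score) for qid, score in previous.items()}

deltas = score_deltas(previous, current)
dropped = [qid for qid, delta in deltas if delta <= -1.0]

print(f"mean moved by {mean(current.values()) - mean(previous.values()):+.3f}")
print(f"{len(dropped)} questions dropped by 1.0 or more")
print("worst five:", deltas[:5])
```

In this synthetic case the mean moves by only about 0.375, while the sorted list shows 30 questions each losing 2.5 points. That is the gap between "something is a bit off" and "one category is broken".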

Mistake 3: I assumed retrieval was working #

The next obvious step was to check retrieval. I added per-query logging that showed which chunks were retrieved for each question. For the 30 broken questions, I expected to see "wrong chunks retrieved." That's not what I saw.

The chunks were correct. The retrieval was returning exactly what it should — the SKU's product page, the relevant pricing section, the right specs. But the LLM's answer was generic anyway.

What was actually happening #

Two weeks in, I had a colleague look at it with fresh eyes. She asked something I hadn't: "what does the LLM see, exactly?"

I'd assumed it saw the chunk text. What it actually saw was the chunk text after our retrieval library had serialized it into the prompt. We had recently upgraded langchain to a new minor version. The new version changed how it serialized retrieved documents into the prompt: specifically, it stripped a metadata field that we used to format SKUs.

So before:

code

Product: SKU-123-ABC
Description: Widget for X applications
Price: $49.99

After (post-upgrade):

code

Description: Widget for X applications
Price: $49.99

The SKU was gone from what the LLM saw. The chunks looked correct in our logging because we logged the chunk source object, but the actual prompt-time serialization had silently dropped a field.
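The gap between "what we logged" and "what the model saw" can be reproduced in a few lines. This is a toy sketch, not langchain's actual API: `Chunk`, `render_with_metadata`, `render_text_only`, and `missing_fields` are hypothetical names standing in for the source object, the two serialization behaviours, and a check that would have flagged the difference.

```python
from dataclasses import dataclass, field


@dataclass
class Chunk:
    text: str
    metadata: dict = field(default_factory=dict)


def render_with_metadata(chunk):
    """Stand-in for the pre-upgrade behaviour: the SKU metadata is rendered."""
    return f"Product: {chunk.metadata['sku']}\n{chunk.text}"


def render_text_only(chunk):
    """Stand-in for the post-upgrade behaviour: metadata is silently dropped."""
    return chunk.text


def missing_fields(chunk, rendered):
    """Metadata keys whose values never made it into the rendered prompt text."""
    return [key for key, value in chunk.metadata.items() if str(value) not in rendered]


chunk = Chunk(
    text="Description: Widget for X applications\nPrice: $49.99",
    metadata={"sku": "SKU-123-ABC"},
)

# Logging `chunk` looks identical in both cases; only the rendered text differs.
print(missing_fields(chunk, render_with_metadata(chunk)))
print(missing_fields(chunk, render_text_only(chunk)))
```

The source object is the same in both calls, which is why logging it showed nothing wrong. Only a check against the rendered string can see the dropped field.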

The fix took ten minutes #

We shipped a monkeypatch that put the SKU field back into the serialized prompt (we filed an upstream issue, but had to ship something fast). The eval scores went back to 4.1 within a day.

What I learned about evaluating RAG #

Three principles I'd carry forward.

1. Log the prompt that actually went to the model, not the retrieval result #

This is the bug that ate three weeks. Logging the chunks isn't enough. Logging the prompt-as-sent — the exact bytes the LLM received — is what lets you debug "why didn't the model use this information."

Now we log the full prompt for every eval run, hashed and stored. We can diff prompts across versions when scores change.
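A minimal sketch of what hashed prompt storage and diffing could look like, using only the standard library. The directory layout, the function names (`store_prompt`, `load_prompt`, `diff_prompts`), and the run IDs are assumptions for illustration, not a description of our actual logger.

```python
import difflib
import hashlib
import json
import tempfile
from pathlib import Path


def store_prompt(log_dir, run_id, question_id, prompt):
    """Write the exact prompt to disk and return its SHA-256 hex digest."""
    digest = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
    run_dir = Path(log_dir) / run_id
    run_dir.mkdir(parents=True, exist_ok=True)
    record = {"question_id": question_id, "sha256": digest, "prompt": prompt}
    (run_dir / f"{question_id}.json").write_text(json.dumps(record), encoding="utf-8")
    return digest


def load_prompt(log_dir, run_id, question_id):
    path = Path(log_dir) / run_id / f"{question_id}.json"
    return json.loads(path.read_text(encoding="utf-8"))


def diff_prompts(log_dir, old_run, new_run, question_id):
    """Return a unified diff of two stored prompts, or '' if the hashes match."""
    old = load_prompt(log_dir, old_run, question_id)
    new = load_prompt(log_dir, new_run, question_id)
    if old["sha256"] == new["sha256"]:
        return ""
    lines = difflib.unified_diff(
        old["prompt"].splitlines(),
        new["prompt"].splitlines(),
        fromfile=old_run,
        tofile=new_run,
        lineterm="",
    )
    return "\n".join(lines)


with tempfile.TemporaryDirectory() as log_dir:
    store_prompt(log_dir, "before", "q_173", "Product: SKU-123-ABC\nPrice: $49.99")
    store_prompt(log_dir, "after", "q_173", "Price: $49.99")
    diff = diff_prompts(log_dir, "before", "after", "q_173")

print(diff)
```

Comparing hashes first keeps the common case cheap: most prompts are identical between runs, so the full diff only runs for the questions where something actually changed.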

2. Per-question score, not average #

The average being 3.8 told me there was a problem. The per-question scores told me where the problem was. If I'd started from per-question on day one I'd have caught the SKU-specific pattern immediately.

We now run the eval twice per week and post a Slack summary that includes:

Average score (still useful as a single number)
Top 5 biggest movers vs last week (the five largest score changes in either direction; a change of more than 1.0 gets flagged)
Bottom 5 absolute scores (worst-performing questions overall)

The "biggest movers" section is the one that catches drift early.
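The three parts of that summary can be sketched as follows. This assumes scores are dicts keyed by question ID; `weekly_summary` is a hypothetical helper, the sample scores are invented, and actually posting to Slack is left out.

```python
from statistics import mean


def weekly_summary(previous, current, top_n=5):
    """Build the text summary: average, biggest movers, lowest absolute scores."""
    shared = sorted(previous.keys() & current.keys())
    deltas = {qid: current[qid] - previous[qid] for qid in shared}
    # Rank movers by size of change regardless of direction; ties break on ID.
    movers = sorted(deltas.items(), key=lambda item: (-abs(item[1]), item[0]))[:top_n]
    bottom = sorted(current.items(), key=lambda item: (item[1], item[0]))[:top_n]

    lines = [f"Score: {mean(current.values()):.2f} (last week: {mean(previous.values()):.2f})"]
    lines.append(f"Top {top_n} movers vs last week:")
    lines += [f"  {delta:+.1f}  {qid}" for qid, delta in movers]
    lines.append(f"Bottom {top_n} absolute:")
    lines += [f"  {score:.1f}  {qid}" for qid, score in bottom]
    return "\n".join(lines)


# Invented scores, just to show the output shape.
previous = {"q_001": 5.0, "q_034": 2.5, "q_088": 3.0, "q_173": 3.5}
current = {"q_001": 5.0, "q_034": 2.5, "q_088": 2.5, "q_173": 5.0}

summary = weekly_summary(previous, current, top_n=2)
print(summary)
```

Ranking movers by absolute change means a sharp improvement shows up next to a sharp regression. Both are worth a look, since an unexplained jump upward can also mean something changed that nobody intended.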

3. The judge is part of the system #

We assumed the judge LLM (gpt-4o grading) was a fixed point. It isn't. OpenAI updates the model occasionally; in our case a quiet update changed how strictly the judge scored "specific" vs "generic" answers, which made the SKU regression look slightly worse than it would have looked a month earlier.

We pinned the judge to a specific snapshot version. If we want to update the judge, we re-baseline all eval scores against the new judge before comparing to history.
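One way to make the re-baselining rule hard to forget is to record the judge snapshot with every run and refuse to compare across snapshots. The sketch below assumes that approach; `EvalRun` and the helper functions are hypothetical, and the snapshot string is only an example of a dated identifier, not necessarily the one we pinned.

```python
from dataclasses import dataclass
from statistics import mean

# Example of a dated snapshot identifier (assumption: any pinned snapshot works the same way).
JUDGE_MODEL = "gpt-4o-2024-08-06"


@dataclass(frozen=True)
class EvalRun:
    run_date: str      # ISO date, so string comparison orders runs correctly
    judge_model: str   # snapshot that graded this run
    scores: dict       # question_id -> score


def comparable_history(history, current):
    """Keep only past runs graded by the same judge snapshot as the current run."""
    return [run for run in history if run.judge_model == current.judge_model]


def delta_vs_last_comparable(history, current):
    """Mean-score change vs the latest same-judge run, or None if there is none."""
    same_judge = comparable_history(history, current)
    if not same_judge:
        return None
    latest = max(same_judge, key=lambda run: run.run_date)
    return mean(current.scores.values()) - mean(latest.scores.values())


old_run = EvalRun("2026-04-18", "some-older-judge", {"q_001": 4.0, "q_002": 4.0})
current = EvalRun("2026-04-25", JUDGE_MODEL, {"q_001": 5.0, "q_002": 4.0})

# No history under the new judge yet, so there is nothing valid to compare against.
print(delta_vs_last_comparable([old_run], current))

# After re-grading last week's answers with the new judge, comparison is possible.
rebaselined = EvalRun("2026-04-18", JUDGE_MODEL, {"q_001": 4.0, "q_002": 4.0})
print(delta_vs_last_comparable([old_run, rebaselined], current))
```

Returning `None` instead of a number forces whoever reads the report to notice that the baseline is missing, rather than silently comparing scores from two different graders.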

What our eval looks like now #

The eval runs in CI as a non-blocking job on every PR that touches retrieval, prompts, or model config. It also runs as a cron twice per week. The output:

code

Eval run 2026-04-25 (cron)

Score: 4.13 (last week: 4.11, +0.02)
Coverage: 200/200 questions answered
Errors: 0

Top 5 movers vs last week:
  +1.5  q_173: "What's the warranty on SKU-456?"
  +0.8  q_201: "Compare X to Y"
  -0.3  q_088: "Return policy?"
  -0.5  q_142: "Setup instructions for Z"
  -0.5  q_159: "Bulk pricing tiers"

Bottom 5 absolute:
  2.5  q_034: "What's compatible with my XYZ-2018?"
  2.8  q_088: "Return policy?"
  ...

Top 5 absolute:
  5.0  q_001: "Where can I find pricing?"
  ...

The "top movers" section is what the team scans on Monday morning. If something has dropped >1.0, someone investigates.

What we still don't have right #

Coverage breadth. Our 200 hand-curated questions are a fraction of what users actually ask. We've added a sampling pipeline that takes 50 random real user questions per week and adds them to a "candidate eval set" for review. About 10-15 per week make it into the official eval after manual review.
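The sampling step is small enough to sketch. This version assumes the week's user questions are available as a list of strings; `sample_candidates` is a hypothetical name, and skipping questions already in the eval set is my assumption about what a sensible pipeline would do, not something stated above.

```python
import random


def sample_candidates(user_questions, already_in_eval, k=50, seed=None):
    """Pick up to k distinct user questions that are not already in the eval set."""
    # Deduplicate and sort so that a given seed always yields the same sample.
    pool = sorted(set(user_questions) - set(already_in_eval))
    rng = random.Random(seed)
    return rng.sample(pool, min(k, len(pool)))


# Placeholder data standing in for one week of logged user questions.
week_log = [f"question {i}" for i in range(500)]
eval_set = {"question 0", "question 1"}

candidates = sample_candidates(week_log, eval_set, k=50, seed=17)
print(len(candidates), "candidates queued for manual review")
```

Passing a seed makes a given week's sample reproducible, which helps when a reviewer asks why a particular question ended up in the candidate set.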

Eval as a deploy gate. We ran experiments with making the eval block deploys. It's too noisy: legitimate variance in the judge's scoring (±0.1-0.2) means a clean change can sometimes show a drop large enough to "fail." We instead surface eval changes prominently in the deploy Slack message, but don't block.
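One way to put a number on "too noisy" is to grade the same build several times and treat the spread of the averages as a noise band. This is a sketch of that idea, not something we run today: `noise_band`, `classify_change`, the two-standard-deviation width, and the repeat scores are all assumptions for illustration.

```python
from statistics import stdev


def noise_band(repeat_means, widths=2.0):
    """Estimate judge noise as a multiple of the standard deviation of repeated runs."""
    return widths * stdev(repeat_means)


def classify_change(baseline_mean, new_mean, band):
    """Label a change in the average as noise, improvement, or regression."""
    delta = new_mean - baseline_mean
    if abs(delta) <= band:
        return "within noise"
    return "improved" if delta > 0 else "regressed"


# Hypothetical averages from grading the same unchanged build five times.
repeats = [4.10, 4.25, 3.95, 4.20, 4.00]
band = noise_band(repeats)

print(f"noise band: +/-{band:.2f}")
print(classify_change(4.10, 3.95, band))
print(classify_change(4.10, 3.80, band))
```

With a band this wide, a gate on the average would only trip on large drops. That is part of why the per-question movers are a better early signal than a pass/fail threshold.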

Cost. Running 200 questions twice a week means 400 assistant queries plus 400 judge calls, which comes to roughly $3-5/week. Manageable, but it's there.

What I'd tell someone building eval for the first time #

Start small. 30 high-quality questions beat 300 mediocre ones. Each question should have:

A clear expected answer (or rubric)
Coverage of a use case you actually care about
A label (category, difficulty) so you can group later
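Those three properties map naturally onto a small record type. The sketch below is one possible shape, with hypothetical field names and placeholder expected answers; the category labels and scores are invented to show how grouping works.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class EvalQuestion:
    question_id: str
    question: str
    expected_answer: str  # or a grading rubric
    category: str         # the use case this question covers
    difficulty: str


def mean_score_by_category(questions, scores):
    """Average the per-question scores within each category label."""
    grouped = defaultdict(list)
    for q in questions:
        if q.question_id in scores:
            grouped[q.category].append(scores[q.question_id])
    return {category: mean(values) for category, values in grouped.items()}


questions = [
    EvalQuestion("q_173", "What's the warranty on SKU-456?", "<expected answer>", "sku_lookup", "easy"),
    EvalQuestion("q_034", "What's compatible with my XYZ-2018?", "<expected answer>", "sku_lookup", "hard"),
    EvalQuestion("q_088", "Return policy?", "<expected answer>", "policy", "easy"),
]
scores = {"q_173": 2.0, "q_034": 3.0, "q_088": 4.5}  # invented scores

by_category = mean_score_by_category(questions, scores)
print(by_category)
```

With labels in place, a regression confined to one category shows up as a single low group average instead of a small dip in the overall number.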

Build a per-question logger before you trust the average. The first time the average drifts, you'll be glad you have per-question history.

Pin the judge model snapshot. When you do upgrade, re-baseline.

And always, always, log the actual prompt sent to the model. The bug that ate three weeks of my life was hiding in a place I wasn't looking, in a layer I assumed was transparent. The real fix to my eval setup was learning that no layer is transparent. Log everything that crosses a boundary.

Closing aside #

The colleague who finally asked the right question hadn't worked on RAG before. She walked into the problem fresh, didn't have my assumptions, and asked the dumb question. Most production debugging eventually comes down to someone asking the dumb question that the original engineer dismissed too early. If you're stuck for more than a day, find a colleague who hasn't been steeped in the problem and explain it from scratch.

About Kiril Urbonas