I spent 3 weeks chasing an answer-quality regression that turned out to be a tokenizer mismatch in a library upgrade. Here's what I learned about evaluating RAG.
About a year ago I joined a team running a customer-facing assistant on top of a RAG pipeline. The system worked, mostly. Then quality slowly started getting worse — answers more vague, more "I don't have that information" responses, more support tickets where users complained the bot got something wrong.
I thought it would take a day to diagnose. It took 3 weeks. Most of those weeks were spent learning that we couldn't measure what we thought we were measuring. This post is what I'd tell my past self.
A reasonable-looking eval setup, on paper:
The scores hovered around 4.1-4.3 for months. Then they slowly drifted to 3.8 over four weeks. The drift was real, but the scores didn't tell us why.
The first instinct was that something had changed in the model. We were on gpt-4o-mini. I checked the OpenAI changelog: nothing relevant. I tried running the eval against gpt-4o (the bigger model) — scores went up to 4.3, which "explained" things, except it didn't, because gpt-4o-mini had been giving 4.2 for months and now wasn't.
I spent four days A/B testing model versions before I realised the model wasn't the variable that had changed.
The drop from 4.1 to 3.8 was real, but the average hides a lot. I finally pulled the per-question scores into a spreadsheet and sorted by score change. The pattern was striking: about 30 questions had dropped from 5 to 2-3. The other 170 were unchanged.
Whatever was wrong was specific to a subset. Looking at the 30 affected questions: they all involved looking up specific product SKUs.
The next obvious step was to check retrieval. I added per-query logging that showed which chunks were retrieved for each question. For the 30 broken questions, I expected to see "wrong chunks retrieved." That's not what I saw.
The chunks were correct. The retrieval was returning exactly what it should — the SKU's product page, the relevant pricing section, the right specs. But the LLM's answer was generic anyway.
Two weeks in, I had a colleague look at it with fresh eyes. She asked something I hadn't: "what does the LLM see, exactly?"
I'd assumed it saw the chunk text. What it actually saw was tokenized chunk text reassembled by our retrieval library. We had recently upgraded langchain to a new minor version. The new version changed how it serialized retrieved documents into the prompt — specifically, it stripped a metadata field that we used to format SKUs.
So before:
Product: SKU-123-ABC
Description: Widget for X applications
Price: $49.99
After (post-upgrade):
Description: Widget for X applications
Price: $49.99
The SKU was gone from what the LLM saw. The chunks looked correct in our logging because we logged the chunk source object, but the actual prompt-time serialization had silently dropped a field.
A monkeypatch (we filed an upstream issue, but had to ship something fast). The eval scores went back to 4.1 within a day.
Three principles I'd carry forward.
This is the bug that ate three weeks. Logging the chunks isn't enough. Logging the prompt-as-sent — the exact bytes the LLM received — is what lets you debug "why didn't the model use this information."
Now we log the full prompt for every eval run, hashed and stored. We can diff prompts across versions when scores change.
The average being 3.8 told me there was a problem. The per-question scores told me where the problem was. If I'd started from per-question on day one I'd have caught the SKU-specific pattern immediately.
We now run the eval twice per week and post a Slack summary that includes:
The "biggest movers" section is the one that catches drift early.
We assumed the judge LLM (gpt-4o grading) was a fixed point. It isn't. OpenAI updates the model occasionally; in our case a quiet update changed how strictly the judge scored "specific" vs "generic" answers, which made the SKU regression look slightly worse than it would have looked a month earlier.
We pinned the judge to a specific snapshot version. If we want to update the judge, we re-baseline all eval scores against the new judge before comparing to history.
The eval runs in CI as a non-blocking job on every PR that touches retrieval, prompts, or model config. It also runs as a cron twice per week. The output:
Eval run 2026-04-25 (cron)
Score: 4.13 (last week: 4.11, +0.02)
Coverage: 200/200 questions answered
Errors: 0
Top 5 movers vs last week:
+1.5 q_173: "What's the warranty on SKU-456?"
+0.8 q_201: "Compare X to Y"
-0.3 q_088: "Return policy?"
-0.5 q_142: "Setup instructions for Z"
-0.5 q_159: "Bulk pricing tiers"
Bottom 5 absolute:
2.5 q_034: "What's compatible with my XYZ-2018?"
2.8 q_088: "Return policy?"
...
Top 5 absolute:
5.0 q_001: "Where can I find pricing?"
...
The "top movers" section is what the team scans on Monday morning. If something has dropped >1.0, someone investigates.
Coverage breadth. Our 200 hand-curated questions are a fraction of what users actually ask. We've added a sampling pipeline that takes 50 random real user questions per week and adds them to a "candidate eval set" for review. About 10-15 per week make it into the official eval after manual review.
Eval as a deploy gate. We ran experiments with making the eval block deploys. It's too noisy: legitimate variance in the judge's scoring (±0.1-0.2) means a clean change can sometimes show a drop large enough to "fail." We instead surface eval changes prominently in the deploy Slack message, but don't block.
Cost. Running 200 questions × 2 weekly = 400 eval queries plus their judges = ~$3-5/week. Manageable, but it's there.
Start small. 30 high-quality questions beat 300 mediocre ones. Each question should have:
Build a per-question logger before you trust the average. The first time the average drifts, you'll be glad you have per-question history.
Pin the judge model snapshot. When you do upgrade, re-baseline.
And always — always — log the actual prompt sent to the model. The bug eating three weeks of my life was hiding in a place I wasn't looking, in a layer I assumed was transparent. The fix to my eval setup was learning that no layer is transparent. Log everything that crosses a boundary.
The colleague who finally asked the right question hadn't worked on RAG before. She walked into the problem fresh, didn't have my assumptions, and asked the dumb question. Most production debugging eventually comes down to someone asking the dumb question that the original engineer dismissed too early. If you're stuck for more than a day, find a colleague who hasn't been steeped in the problem and explain it from scratch.
Get the latest tutorials, guides, and insights on AI, DevOps, Cloud, and Infrastructure delivered directly to your inbox.
We changed a system prompt for what we thought was a tone improvement and broke a customer-critical extraction overnight. The version control and regression tests we built next.
How AI agents are moving from read-only copilots to autonomous automation with guardrails. Best practices for approval gates and rollback.
Explore more articles in this category
Embedding indexes degrade silently. The signals that catch drift, how often to re-embed, and the operational patterns we built after one quiet quality regression.
Streaming LLM responses is easy until the client disconnects, the model stalls, or the user cancels. The patterns that keep streaming responsive without leaking spend.
When LLMs can call tools that change real state, the design decisions that matter most are about what's gated, what's automatic, and what triggers a human checkpoint.
Evergreen posts worth revisiting.