Most LLM eval suites correlate poorly with what real users experience. The eval patterns we run that move with prod metrics — and the ones that lied to us.
For a while our LLM eval scores would go up and our prod quality metrics (user thumbs-down, support tickets) wouldn't move. We'd celebrate a "10% improvement" in eval scores; users wouldn't notice. Then a "small change" with a 1% eval score regression would tank user satisfaction. The eval suite was telling us things, but they didn't correlate with what mattered.
This post is what we changed to get evals that predict production. None of it is glamorous — it's mostly the discipline of choosing eval examples carefully and measuring whether the eval moves when users feel things move.
A few reasons our suite failed to track prod:
Synthetic test cases are too easy. A hand-curated 50-question eval often has clean inputs the model handles well. Real production queries are messy: typos, partial context, multi-turn references, off-topic interjections. The eval scored 95%; prod scored 70%.
Wrong scoring rubric. "Did the answer mention the relevant facts?" is too loose a criterion. Two answers can both mention the facts and have wildly different user value: one is concise and clear, the other rambles and buries the key info at the end.
Eval ignores the failure modes that matter most. A model that's right 99% of the time and confidently wrong 1% might be worse than one that's right 95% of the time and admits uncertainty in the other 5%. Eval rubrics that ignore "calibration" or "knows when it doesn't know" miss this.
Distribution doesn't match prod. The eval set has a uniform distribution across topics; prod queries are 60% one topic. Eval improvements on the long tail don't move prod metrics dominated by the head.
After a few rounds of recalibrating, the patterns that stuck:
1. Sample real production queries as the eval base. Take 200 queries from last month's prod logs (anonymized). These have all the messiness of real users. No more "5% better on hand-curated" measures.
2. Score against multiple criteria, not one composite. For each query we score five dimensions: correctness, relevance, clarity, calibration, and length.
A change can improve correctness while regressing clarity. A single composite score masks this.
3. Use a judge LLM for the bulk; humans for sampling. Scoring 200 queries × 5 dimensions = 1,000 judgments per run. That is too many for humans to score on every change, so the judge LLM handles the bulk; it is fast and consistent enough. Humans score a sample (~50 per release) to validate that the judge's scoring isn't drifting.
4. Hold out a "regression set" of failure cases. Every time we discover a real production failure, we add the query (and the expected behavior) to a separate eval set. This set never gets used for tuning — it's a regression guard. New versions must not regress on these.
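A regression guard like the one in item 4 can be sketched as a simple comparison of pass/fail results. This is a minimal illustration, not our actual tooling: the function name `find_regressions` and the idea of storing one boolean per case id are assumptions made for the example.

```python
from __future__ import annotations


def find_regressions(
    baseline_pass: dict[str, bool],
    candidate_pass: dict[str, bool],
) -> list[str]:
    """Return ids of regression-set cases the candidate breaks.

    A case counts as a regression when it passed on the baseline and
    either fails on the candidate or was not scored at all.
    """
    return sorted(
        case_id
        for case_id, passed in baseline_pass.items()
        if passed and not candidate_pass.get(case_id, False)
    )


def release_allowed(
    baseline_pass: dict[str, bool],
    candidate_pass: dict[str, bool],
) -> bool:
    """Block the release if any previously passing case now fails."""
    return not find_regressions(baseline_pass, candidate_pass)
```

Treating a missing case as a failure is a deliberate choice here: a case that silently drops out of the run should not count as passing.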
Within the 200-query eval set, we deliberately have a mix:
The 60-20-10-10 split matches our prod query distribution roughly. If your distribution is different, match yours.
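One way to build a sample that matches a target split is stratified random sampling from labeled prod logs. The sketch below assumes each logged query already carries a category label; the category names in `PROD_SHARES` are placeholders, not our real categories.

```python
from __future__ import annotations

import random
from collections import defaultdict

# Placeholder category names; only the 60-20-10-10 shares come from the post.
PROD_SHARES = {"category_a": 0.6, "category_b": 0.2, "category_c": 0.1, "category_d": 0.1}


def stratified_sample(
    queries: list[tuple[str, str]],
    shares: dict[str, float],
    total: int,
    seed: int = 0,
) -> list[tuple[str, str]]:
    """Randomly pick `total` (category, text) pairs in the given proportions."""
    rng = random.Random(seed)  # fixed seed keeps the eval set reproducible
    by_category: dict[str, list[str]] = defaultdict(list)
    for category, text in queries:
        by_category[category].append(text)

    sample: list[tuple[str, str]] = []
    for category, share in shares.items():
        want = round(total * share)
        pool = by_category.get(category, [])
        if len(pool) < want:
            raise ValueError(f"need {want} queries for {category!r}, have {len(pool)}")
        # Random choice within each category avoids picking queries
        # the model is already known to handle well.
        sample.extend((category, text) for text in rng.sample(pool, want))
    return sample
```

Calling `stratified_sample(logged_queries, PROD_SHARES, total=200)` would give a 200-query set in the proportions above, provided each category has enough logged queries.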
For each (query, response) pair, we ask a judge LLM (typically GPT-4 class) to score:
You are evaluating an AI assistant's response to a user query.
Score 1-5 on each:
- Correctness: factual accuracy
- Relevance: does it answer the question
- Clarity: is it well-structured
- Calibration: appropriate confidence given the evidence
- Length: appropriate brevity (5 = ideal length)
Query: {query}
Response: {response}
Return JSON: {"correctness": N, "relevance": N, ...}
The judge LLM is more expensive than running the eval through your prod model. We accept this — the eval cost is bounded (~$5-15 per full eval run); the value is in the judgment quality.
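Judge output is worth validating before it enters any aggregate, since a malformed or out-of-range score corrupts the averages without anyone noticing. A minimal sketch, assuming the judge returns one integer per dimension under the lowercase keys shown here (the key names beyond `correctness` and `relevance` are inferred from the rubric, not confirmed by the prompt):

```python
from __future__ import annotations

import json

# Assumed JSON keys, one per rubric dimension in the judge prompt.
DIMENSIONS = ("correctness", "relevance", "clarity", "calibration", "length")


def parse_judge_scores(raw: str) -> dict[str, int]:
    """Parse the judge's JSON reply and check every score is an int in 1-5.

    Raises ValueError on invalid JSON, a missing dimension, or a score
    outside the rubric's range, so bad judgments fail loudly.
    """
    data = json.loads(raw)  # json.JSONDecodeError is a ValueError subclass
    if not isinstance(data, dict):
        raise ValueError("judge reply is not a JSON object")

    scores: dict[str, int] = {}
    for dim in DIMENSIONS:
        value = data.get(dim)
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, int) or not 1 <= value <= 5:
            raise ValueError(f"bad or missing score for {dim!r}: {value!r}")
        scores[dim] = value
    return scores
```

A caller would typically retry the judge once on `ValueError` and drop the sample from the run if the second attempt also fails.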
To check judge quality: monthly, two humans independently score a 30-sample subset, and we compute agreement with the judge. When agreement drops below 80%, we recalibrate the judge prompt.
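The agreement check can be sketched as below. The post does not pin down what "agreement" means, so this example assumes one plausible definition: the judge agrees on an item when its score is within one point of the mean of the two human scores. Both the definition and the `tolerance` parameter are assumptions for illustration.

```python
from __future__ import annotations

AGREEMENT_THRESHOLD = 0.80  # below this, recalibrate the judge prompt


def judge_agreement(
    judge: list[int],
    human_a: list[int],
    human_b: list[int],
    tolerance: float = 1.0,
) -> float:
    """Fraction of items where the judge is within `tolerance` of the human mean.

    Each list holds one 1-5 score per (sample, dimension) item, in the same order.
    """
    if not judge or not (len(judge) == len(human_a) == len(human_b)):
        raise ValueError("score lists must be non-empty and the same length")
    hits = sum(
        abs(j - (a + b) / 2) <= tolerance
        for j, a, b in zip(judge, human_a, human_b)
    )
    return hits / len(judge)


def needs_recalibration(agreement: float) -> bool:
    """True when agreement has dropped below the 80% threshold."""
    return agreement < AGREEMENT_THRESHOLD
```

It is also worth checking how often the two humans agree with each other: if they disagree frequently, a low judge score may reflect an ambiguous rubric and not judge drift.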
For each eval run (which we do per model/prompt change):
We don't compute a single "eval score." The dimensions are different signals.
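Reporting per-dimension deltas against the baseline keeps trade-offs visible. A minimal sketch, assuming each scored response is a dict of dimension name to 1-5 score (the function names are illustrative):

```python
from __future__ import annotations

from statistics import mean


def dimension_means(scored: list[dict[str, int]]) -> dict[str, float]:
    """Average each dimension across all scored responses in a run."""
    if not scored:
        raise ValueError("no scored responses")
    return {dim: mean(row[dim] for row in scored) for dim in scored[0]}


def dimension_deltas(
    baseline: list[dict[str, int]],
    candidate: list[dict[str, int]],
) -> dict[str, float]:
    """Candidate mean minus baseline mean, per dimension."""
    base = dimension_means(baseline)
    cand = dimension_means(candidate)
    return {dim: cand[dim] - base[dim] for dim in base}
```

If a change raises correctness by one point and lowers clarity by one point, `dimension_deltas` shows both movements, while an average across dimensions would report no change at all.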
The thing that finally calibrated our evals: explicit measurement of correlation between eval scores and prod metrics. Quarterly:
The first time we did this, the correlation was weak. We saw "eval score +5%, thumbs-down -0.2%" right next to "eval score -1%, thumbs-down +3%". That was the moment we knew the eval was lying.
Each round of eval improvements (better query sampling, judge prompt refinement, regression set additions) tightened the correlation. After three quarters of work, eval-score and prod-quality moved together reliably.
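The correlation check itself is a small calculation. The sketch below uses Pearson correlation over per-release deltas; the choice of Pearson is an assumption, and a rank correlation would be a reasonable alternative with few releases or outliers.

```python
from __future__ import annotations

import math


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two series of equal length with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / math.sqrt(var_x * var_y)
```

Feed it one pair per release: the eval-score change and the change in thumbs-down rate. Because a better eval score should mean fewer thumbs-down, a healthy eval shows a clearly negative coefficient; a value near zero means the eval is not tracking what users feel. With only a handful of releases per quarter the estimate is noisy, so treat it as a trend to watch over time.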
A few patterns that look like they'd help and don't:
Public benchmarks (MMLU, HellaSwag). Our model is good or bad at our task; performance on academic benchmarks is loosely correlated at best. We don't track them.
Continuous online eval against live traffic, with every prod request scored by the judge LLM. It is cost-prohibitive and the signal-to-noise ratio is poor. Sampled production review works better.
Single "leaderboard" eval. Forces a composite score that hides too much. Multiple dimensions, multiple sets.
Letting the prod model judge itself. The judge should be at least as capable as the prod model, ideally from a different model family (different blind spots). A model judging its own output tends to over-score.
The rhythm:
Total eval-engineering time: maybe a day a month. The setup is worth that.
Picking eval queries the model is known to do well on. Selection bias. Sample randomly from prod logs.
Skipping the regression set. Without it, "shipping doesn't regress" is a vibe, not a guarantee. Every real failure goes into the regression set.
Updating the eval and the model in the same release. You can't tell which change moved the score. Update eval; baseline against current model; then update model.
Judge drift without validation. The judge LLM evolves; what it considers a "5" today isn't quite what it considered a "5" six months ago. Monthly human spot-checks catch this.
LLM evals are infrastructure. The first version doesn't predict prod and that's normal. The discipline of tightening the correlation — quarter by quarter, sample by sample — is what turns evals into a reliable signal. Without it, every release feels like betting.