Most LLM eval suites correlate poorly with what real users experience. The eval patterns we run that move with prod metrics — and the ones that lied to us.

LLM Evals That Actually Predict Production Quality

For a while our LLM eval scores would go up and our prod quality metrics (user thumbs-down, support tickets) wouldn't move. We'd celebrate a "10% improvement" in eval scores; users wouldn't notice. Then a "small change" with a 1% eval score regression would tank user satisfaction. The eval suite was telling us things, but they didn't correlate with what mattered.

This post is what we changed to get evals that predict production. None of it is glamorous — it's mostly the discipline of choosing eval examples carefully and measuring whether the eval moves when users feel things move.

Why eval scores often don't correlate with prod

A few reasons we found in our suite:

Synthetic test cases are too easy. A hand-curated 50-question eval often has clean inputs the model handles well. Real production queries are messy: typos, partial context, multi-turn references, off-topic interjections. The eval scored 95%; prod scored 70%.

Wrong scoring rubric. A rubric like "Did the answer mention the relevant facts?" grades too loosely. Two answers can both mention the facts and have wildly different user value: one is concise and clear, the other rambles and buries the key info at the end.

Eval ignores the failure modes that matter most. A model that's right 99% of the time and confidently wrong 1% might be worse than one that's right 95% of the time and admits uncertainty in the other 5%. Eval rubrics that ignore "calibration" or "knows when it doesn't know" miss this.

Distribution doesn't match prod. The eval set has a uniform distribution across topics; prod queries are 60% one topic. Eval improvements on the long tail don't move prod metrics dominated by the head.
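A small worked sketch of the distribution mismatch. The topic names, per-topic accuracies, and the even split of the remaining 40% are invented for illustration; only the 60% head share echoes the example above.

```python
# Illustrative only: topics, accuracies, and traffic shares are invented.
def weighted_score(per_topic_score, weights):
    """Average per-topic scores using topic weights that sum to 1."""
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return sum(per_topic_score[topic] * w for topic, w in weights.items())


topics = ["billing", "setup", "api", "exports", "misc"]

# The eval set weights every topic equally; prod traffic is 60% one topic.
uniform = {topic: 1 / len(topics) for topic in topics}
prod_share = {"billing": 0.60, "setup": 0.10, "api": 0.10, "exports": 0.10, "misc": 0.10}

# A change that only improves the long-tail topics.
before = {"billing": 0.70, "setup": 0.80, "api": 0.80, "exports": 0.80, "misc": 0.80}
after = {"billing": 0.70, "setup": 0.90, "api": 0.90, "exports": 0.90, "misc": 0.90}

eval_gain = weighted_score(after, uniform) - weighted_score(before, uniform)
prod_gain = weighted_score(after, prod_share) - weighted_score(before, prod_share)

print(f"uniform eval gain:  {eval_gain:.2f}")
print(f"prod-weighted gain: {prod_gain:.2f}")
```

With these made-up numbers the uniform eval reports about twice the gain that prod-weighted traffic would see, because the head topic did not move at all.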

The eval shape that works for us

After a few rounds of recalibrating, the patterns that stuck:

1. Sample real production queries as the eval base. Take 200 queries from last month's prod logs (anonymized). These have all the messiness of real users. No more "5% better on the hand-curated set" claims.

2. Score against multiple criteria, not one composite. For each query we score:

Correctness (factual content)
Relevance (does it actually answer the question)
Clarity (well-structured, readable)
Calibration (does it admit uncertainty when appropriate)
Length (concise vs overly verbose)

A change can improve correctness while regressing clarity. A single composite score masks this.

3. Use a judge LLM for the bulk; humans for sampling. Scoring 200 queries × 5 dimensions means 1,000 judgments per run. Humans score a sample (~50 per release) to check that the judge LLM's scoring isn't drifting. The judge LLM is fast and consistent enough for the bulk.

4. Hold out a "regression set" of failure cases. Every time we discover a real production failure, we add the query (and the expected behavior) to a separate eval set. This set never gets used for tuning — it's a regression guard. New versions must not regress on these.
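The regression guard in step 4 could be sketched roughly as follows. This is a minimal illustration, not our actual harness: `RegressionCase`, `failed_cases`, `release_allowed`, the per-case `check` predicate, and the stub model are all hypothetical names introduced here.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class RegressionCase:
    query: str
    expected_behavior: str          # human-readable description of the right outcome
    check: Callable[[str], bool]    # hypothetical predicate applied to the response


def failed_cases(cases: List[RegressionCase],
                 generate: Callable[[str], str]) -> List[RegressionCase]:
    """Run every case through the model and return the ones that fail."""
    return [case for case in cases if not case.check(generate(case.query))]


def release_allowed(cases: List[RegressionCase],
                    generate: Callable[[str], str]) -> bool:
    """The gate passes only when no regression case fails."""
    return not failed_cases(cases, generate)


# Example with a stub standing in for the real model call.
cases = [
    RegressionCase(
        query="What is the refund policy for plan X?",
        expected_behavior="Admit the information is unavailable instead of guessing.",
        check=lambda response: "don't have that information" in response,
    )
]


def stub_model(query: str) -> str:
    return "I don't have that information."


print(release_allowed(cases, stub_model))
```

Keeping the check next to the query makes each case self-describing, which helps when someone has to work out months later why a case is in the set.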

Categories of eval queries

Within the 200-query eval set, we deliberately have a mix:

Common path (60%): the queries that match the bulk of prod traffic.
Edge cases (20%): unusual phrasings, multi-language, very long queries, very short queries.
Adversarial (10%): attempts to bypass instructions, prompt injections, requests for inappropriate content.
Should-refuse (10%): queries where the right answer is "I don't have that information." If the model confidently makes something up, that's a regression.

The 60-20-10-10 split matches our prod query distribution roughly. If your distribution is different, match yours.
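One way to assemble a set with that mix is quota sampling per category. The sketch below assumes the anonymized prod queries have already been labeled by category, a step not shown here; `build_eval_set`, the category keys, and the fixed seed are illustrative choices.

```python
import random

# Shares from the split described above; adjust to your own distribution.
SPLIT = {"common": 0.60, "edge": 0.20, "adversarial": 0.10, "should_refuse": 0.10}


def build_eval_set(pool, total=200, split=SPLIT, seed=0):
    """Draw a fixed quota per category from a labeled query pool.

    pool maps category name -> list of anonymized queries.
    A fixed seed keeps the draw reproducible between runs.
    """
    rng = random.Random(seed)
    eval_set = []
    for category, share in split.items():
        quota = round(total * share)
        candidates = pool[category]
        if len(candidates) < quota:
            raise ValueError(f"need {quota} {category!r} queries, have {len(candidates)}")
        # random.sample draws without replacement, so no query repeats.
        for query in rng.sample(candidates, quota):
            eval_set.append({"category": category, "query": query})
    return eval_set


# Example with placeholder queries.
pool = {name: [f"{name} query {i}" for i in range(150)] for name in SPLIT}
eval_set = build_eval_set(pool)
print(len(eval_set))
```

Sampling at random within each category, instead of hand-picking, avoids quietly selecting the queries the model already handles well.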

The judge LLM setup

For each (query, response) pair, we ask a judge LLM (typically GPT-4 class) to score:

You are evaluating an AI assistant's response to a user query.

Score 1-5 on each:
- Correctness: factual accuracy
- Relevance: does it answer the question
- Clarity: is it well-structured
- Calibration: appropriate confidence given the evidence
- Length: appropriate brevity (5 = ideal length)

Query: {query}
Response: {response}

Return JSON: {"correctness": N, "relevance": N, ...}

Judge calls cost more than generating the responses with your prod model. We accept this: the eval cost is bounded (~$5-15 per full eval run), and the value is in the judgment quality.
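Judge replies are worth validating before they enter any statistics. The sketch below assumes the judge returns a JSON object keyed by the five dimension names in lowercase, possibly wrapped in extra prose; `parse_judge_reply` is a hypothetical helper, and the actual call to the judge model is deliberately left out.

```python
import json

DIMENSIONS = ("correctness", "relevance", "clarity", "calibration", "length")


def parse_judge_reply(reply: str) -> dict:
    """Extract and validate the 1-5 scores from a judge reply.

    Raises ValueError when the JSON is missing or malformed, or when a
    dimension is absent or outside the 1-5 range.
    """
    # Assumption: the reply contains a single JSON object, maybe with prose around it.
    start, end = reply.find("{"), reply.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object in judge reply")
    data = json.loads(reply[start:end + 1])  # JSONDecodeError is a ValueError

    scores = {}
    for dim in DIMENSIONS:
        value = data.get(dim)
        # bool is a subclass of int, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, int) or not 1 <= value <= 5:
            raise ValueError(f"bad score for {dim!r}: {value!r}")
        scores[dim] = value
    return scores


reply = ('Here are the scores: {"correctness": 4, "relevance": 5, '
         '"clarity": 3, "calibration": 4, "length": 5}')
print(parse_judge_reply(reply))
```

Failing loudly on a malformed reply is a deliberate choice here: a silently defaulted score would distort the per-dimension distributions.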

To check judge quality: monthly, two humans independently score a 30-sample subset, and we compute agreement with the judge. When agreement drops below 80%, we recalibrate the judge prompt.
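The agreement check could be computed along these lines. The definition of agreement used here is an assumption for illustration: the judge agrees on an item when its score is within one point of the mean of the two human scores. `agreement_rate` and `needs_recalibration` are hypothetical names; only the 80% threshold comes from the text above.

```python
def agreement_rate(judge, human_a, human_b, tolerance=1.0):
    """Fraction of items where the judge is within `tolerance` of the human mean.

    Assumed definition: the source does not specify how agreement is measured.
    """
    if not judge or not (len(judge) == len(human_a) == len(human_b)):
        raise ValueError("need three non-empty score lists of equal length")
    hits = 0
    for j, a, b in zip(judge, human_a, human_b):
        if abs(j - (a + b) / 2) <= tolerance:
            hits += 1
    return hits / len(judge)


def needs_recalibration(rate, threshold=0.80):
    """Recalibrate the judge prompt when agreement drops below the threshold."""
    return rate < threshold


# Invented scores for five items on one dimension.
judge_scores = [5, 4, 3, 2, 1]
human_a_scores = [5, 4, 1, 2, 1]
human_b_scores = [5, 3, 1, 2, 3]

rate = agreement_rate(judge_scores, human_a_scores, human_b_scores)
print(rate, needs_recalibration(rate))
```

A stricter variant would require exact matches or use a chance-corrected statistic; whichever definition you pick, keep it fixed so the monthly numbers stay comparable.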

What we measure

For each eval run (which we do per model/prompt change):

Score distribution per dimension (not just the mean). A change that moves the median up but introduces low-score outliers is worth knowing about.
Regression-set pass rate. Should be 100%. Anything less blocks the release.
Comparison with the previous version. Win-rate, loss-rate, tie-rate. Wins must significantly exceed losses for promotion.

We don't compute a single "eval score." The dimensions are different signals.
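A sketch of the version comparison for one dimension. Reading "significantly" as a one-sided sign test over the non-tied queries is an assumption made for this example, as are the function names and the scores.

```python
from math import comb


def win_loss_tie(old_scores, new_scores):
    """Count queries where the new version scores higher, lower, or the same."""
    if len(old_scores) != len(new_scores):
        raise ValueError("score lists must be paired per query")
    wins = sum(new > old for old, new in zip(old_scores, new_scores))
    losses = sum(new < old for old, new in zip(old_scores, new_scores))
    ties = len(old_scores) - wins - losses
    return wins, losses, ties


def sign_test_p_value(wins, losses):
    """One-sided exact sign test.

    Probability of seeing at least `wins` wins among the non-tied pairs
    if wins and losses were equally likely.
    """
    n = wins + losses
    if n == 0:
        return 1.0
    return sum(comb(n, k) for k in range(wins, n + 1)) / 2 ** n


# Invented clarity scores for the same six queries under two versions.
old_clarity = [3, 3, 4, 2, 5, 3]
new_clarity = [4, 3, 3, 4, 5, 4]

wins, losses, ties = win_loss_tie(old_clarity, new_clarity)
print(wins, losses, ties, sign_test_p_value(wins, losses))
```

Running this per dimension keeps the signals separate, so a correctness win cannot hide a clarity loss.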

How we tied eval to prod quality

The thing that finally calibrated our evals: explicit measurement of correlation between eval scores and prod metrics. Quarterly:

Pick the last 10 model/prompt changes that shipped.
For each, note the eval score delta and the prod-quality delta (user thumbs-down rate, support ticket rate over the following week).
Plot. The correlation should be visible.
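Alongside the plot, a correlation coefficient gives a number to track from quarter to quarter. The sketch below computes a plain Pearson correlation; the deltas are invented placeholders, not our measurements, and with only ten changes the coefficient is a rough guide at best.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length samples."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two samples of equal length, at least 2 points")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / sqrt(var_x * var_y)


# Invented placeholders: one entry per shipped change.
eval_delta = [5.0, -1.0, 2.0, 0.5, 3.0]          # eval score change, in points
thumbs_down_delta = [-0.2, 3.0, -0.5, 0.1, -1.0]  # thumbs-down rate change, in points

# A useful eval should show a clearly negative value here:
# eval goes up, thumbs-down goes down.
print(round(pearson(eval_delta, thumbs_down_delta), 2))
```

Since the eval reports several dimensions, the same calculation can be repeated per dimension to see which ones track prod metrics.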

The first time we did this, the correlation was weak. We saw "eval score +5%, thumbs-down -0.2%" right next to "eval score -1%, thumbs-down +3%". That was the moment we knew the eval was lying.

Each round of eval improvements (better query sampling, judge prompt refinement, regression set additions) tightened the correlation. After three quarters of work, eval-score and prod-quality moved together reliably.

What we deliberately don't do

A few patterns that look like they'd help and don't:

Public benchmarks (MMLU, HellaSwag). Our model is good or bad at our task; performance on academic benchmarks is loosely correlated at best. We don't track them.

Continuous online eval against live traffic. Scoring every prod request with a judge LLM is cost-prohibitive, and the signal-to-noise ratio is bad. Sampled production review is better.

Single "leaderboard" eval. Forces a composite score that hides too much. Multiple dimensions, multiple sets.

Letting the prod model judge itself. The judge should be at least as capable as the prod model, ideally from a different model family (different blind spots). Self-judgment tends to over-score.

Cadence

The rhythm:

Every model/prompt change: full eval run on the standard set + regression set. Takes ~10 minutes.
Monthly: human-validate 30 sample scores against judge LLM. Recalibrate if needed.
Quarterly: review correlation with prod metrics; refresh eval set with last quarter's queries.

Total eval-engineering time: maybe a day a month. The setup is worth that.

Common mistakes

Picking eval queries the model is known to do well on. Selection bias. Sample randomly from prod logs.

Skipping the regression set. Without it, "shipping doesn't regress" is a vibe, not a guarantee. Every real failure goes into the regression set.

Updating the eval and the model in the same release. You can't tell which change moved the score. Update eval; baseline against current model; then update model.

Judge drift without validation. The judge LLM evolves, so what it considers a "5" today isn't quite what it considered a "5" six months ago. Monthly human spot-checks catch this.

What to read next

Prompt engineering best practices — the prompts the evals are evaluating
Field notes: prompt versioning and regression testing — adjacent discipline for prompts specifically
Field notes: RAG retrieval quality evaluation — RAG-specific eval patterns
AI observability: monitoring LLM performance in production — the production side of the picture

LLM evals are infrastructure. The first version doesn't predict prod and that's normal. The discipline of tightening the correlation — quarter by quarter, sample by sample — is what turns evals into a reliable signal. Without it, every release feels like betting.

About Kiril Urbonas