We changed a system prompt for what we thought was a tone improvement and broke a customer-critical extraction overnight. Here are the version control and regression tests we built next.
The incident: a tweak to our customer-support assistant's system prompt — three sentences added to nudge tone toward "warmer, more conversational" — silently broke a JSON extraction the same prompt was responsible for. About 8% of responses started returning malformed JSON instead of the expected schema. We caught it 11 hours after deploy via a downstream pipeline alert. By then, ~30,000 user interactions had degraded outputs.
That was the day we stopped treating prompts as configuration and started treating them like code.
Before the incident, prompts lived in a prompts.py file checked into the repo. They were updated by anyone, reviewed lightly (often "LGTM, looks better"), and deployed with the next service push. Three things failed:
This post is what we changed for each.
Step one was to pull prompts out of source code into a separate, versioned store. Each prompt is now a row in a database table:
CREATE TABLE prompts (
    id UUID PRIMARY KEY,
    name TEXT NOT NULL, -- "support_assistant_v1"
    version INT NOT NULL,
    content TEXT NOT NULL,
    variables JSONB NOT NULL, -- schema of expected interpolated variables
    created_by TEXT NOT NULL,
    created_at TIMESTAMPTZ NOT NULL,
    notes TEXT,
    UNIQUE(name, version)
);
The application code references prompts by name and a default version. We can pin a specific version per environment if needed. Updating a prompt = inserting a new row with version = N + 1. The old version is preserved and can be rolled back to in seconds.
This solved one immediate problem: rollbacks. Before, rolling back a prompt meant reverting a commit and redeploying. Now it's a version flag flip in our control plane.
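A minimal sketch of the versioned store, under stated assumptions: SQLite stands in for the real database so the example runs anywhere, the column types are simplified, and `publish_prompt` and `get_prompt` are hypothetical helper names rather than our actual API. The sketch also assumes "default version" means "latest", which may not match every setup.

```python
import sqlite3
import uuid
from datetime import datetime, timezone

# SQLite stand-in for the prompts table; types simplified (assumption).
SCHEMA = """
CREATE TABLE prompts (
    id TEXT PRIMARY KEY,
    name TEXT NOT NULL,
    version INTEGER NOT NULL,
    content TEXT NOT NULL,
    variables TEXT NOT NULL,
    created_by TEXT NOT NULL,
    created_at TEXT NOT NULL,
    notes TEXT,
    UNIQUE(name, version)
)
"""

def publish_prompt(conn, name, content, variables_json, created_by, notes=None):
    """Insert a new row with version = N + 1 and return the new version."""
    row = conn.execute(
        "SELECT COALESCE(MAX(version), 0) FROM prompts WHERE name = ?", (name,)
    ).fetchone()
    new_version = row[0] + 1
    conn.execute(
        "INSERT INTO prompts VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
        (str(uuid.uuid4()), name, new_version, content, variables_json,
         created_by, datetime.now(timezone.utc).isoformat(), notes),
    )
    conn.commit()
    return new_version

def get_prompt(conn, name, pinned_version=None):
    """Return (version, content): the pinned version if given, else the latest."""
    if pinned_version is None:
        row = conn.execute(
            "SELECT version, content FROM prompts WHERE name = ? "
            "ORDER BY version DESC LIMIT 1", (name,)
        ).fetchone()
    else:
        row = conn.execute(
            "SELECT version, content FROM prompts WHERE name = ? AND version = ?",
            (name, pinned_version),
        ).fetchone()
    if row is None:
        raise KeyError(f"no prompt {name!r} at version {pinned_version}")
    return row

conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)
publish_prompt(conn, "support_assistant_v1", "You are a support assistant.", "{}", "alice")
publish_prompt(conn, "support_assistant_v1", "You are a warm support assistant.", "{}", "bob")

latest = get_prompt(conn, "support_assistant_v1")
rolled_back = get_prompt(conn, "support_assistant_v1", pinned_version=1)
```

Rolling back here is just a read with `pinned_version=1`; no row is ever updated or deleted. The `UNIQUE(name, version)` constraint means two concurrent publishes of the same version fail loudly instead of creating duplicates.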
The next layer was a regression eval that runs on every prompt change. We had a small eval set already (200 hand-curated test cases for the support assistant); we expanded it to cover specific behaviours we cared about:
When someone proposes a prompt change, the eval runs against both the current and proposed versions and reports the diff:
PROMPT EVAL: support_assistant_v1 (proposed v=8 vs current v=7)
Tone (judge): v7=4.2 v8=4.5 ✓ improved
JSON parse rate: v7=99.2% v8=92.1% ✗ REGRESSION (-7.1pp)
Refusal correctness: v7=98% v8=98% ✓ stable
Mean response length: v7=180 v8=205 ⚠ +14% — review
p95 latency tokens: v7=420 v8=485 ⚠ +15% — review
The JSON parse rate regression is the one that would have flagged the original incident. The eval catches this kind of thing now, automatically, before merge.
Two of the metrics are blocking: a JSON parse rate regression of more than 2pp blocks merge, and so does a refusal-correctness regression of more than 1pp. The others (tone, length, latency tokens) are warnings that surface in the PR but don't block.
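The blocking rule can be sketched in a few lines. This is an illustration, not our CI code: the metric keys and the `gate` function are hypothetical names, and the numbers fed in are the v7 and v8 values from the report above.

```python
# Maximum allowed drop, in percentage points, for each blocking metric.
BLOCKING_THRESHOLDS_PP = {"json_parse_rate": 2.0, "refusal_correctness": 1.0}

def gate(current, proposed):
    """Compare two dicts of metric name -> percent; return (blocked, messages)."""
    blocked = False
    messages = []
    for metric, limit in BLOCKING_THRESHOLDS_PP.items():
        drop = current[metric] - proposed[metric]
        if drop > limit:
            blocked = True
            messages.append(f"BLOCK {metric}: -{drop:.1f}pp (limit {limit:.1f}pp)")
    return blocked, messages

blocked, messages = gate(
    {"json_parse_rate": 99.2, "refusal_correctness": 98.0},  # v7
    {"json_parse_rate": 92.1, "refusal_correctness": 98.0},  # v8
)
```

With these inputs the JSON parse rate drop of 7.1pp exceeds the 2pp limit, so the change is blocked; refusal correctness is unchanged and adds no message.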
When the eval flags a regression, the PR shows a comment with the failing test cases. For each, you see:
This makes the failure inspectable. You're not guessing why JSON parsing dropped — you can see, for example, that the new prompt occasionally introduces conversational phrasing INSIDE the JSON block. Fix the prompt; re-run; iterate.
Most prompt changes hit one minor regression on the first try. The cycle takes 5-10 minutes per iteration. Total time from "I want to update this prompt" to "PR merged" is typically 30-60 minutes for a non-trivial change. Slower than before, but the changes ship without breaking things.
Even with the eval, we don't trust 100% rollouts of prompt changes. New prompt versions ship to 5% of traffic for an hour, then 20% for two hours, then 100%. Each step has a hard gate based on production metrics:
If any of those regress more than 10% from baseline during a stage, the rollout halts and reverts.
This caught one real issue: a prompt update passed the eval, rolled to 5%, and the cost-per-request metric ticked up 22%. The eval hadn't caught it because our test cases happened to use short queries where the issue didn't manifest. Production traffic, with longer queries, did. Reverted in 8 minutes.
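A rough sketch of the stage gate, assuming a hypothetical `read_metrics(percent)` callable that returns production metrics for the current stage. The metric names are illustrative, and a real controller would poll for the whole hold period instead of reading once per stage.

```python
STAGES = [(5, 60), (20, 120), (100, 0)]  # (traffic percent, minutes to hold)
MAX_REGRESSION = 0.10  # halt when a metric is more than 10% worse than baseline

def regressed(baseline, observed, higher_is_worse):
    """Return the names of metrics that moved more than 10% in the bad direction."""
    bad = []
    for name, base in baseline.items():
        change = (observed[name] - base) / base
        if not higher_is_worse[name]:
            change = -change  # for "higher is better" metrics, a drop is the regression
        if change > MAX_REGRESSION:
            bad.append(name)
    return bad

def run_rollout(baseline, read_metrics, higher_is_worse):
    """Walk the stages; return ("halted", percent, bad) or ("complete", 100, [])."""
    for percent, _hold_minutes in STAGES:
        # Simplification: one metrics read per stage instead of polling for the hold time.
        bad = regressed(baseline, read_metrics(percent), higher_is_worse)
        if bad:
            return ("halted", percent, bad)
    return ("complete", 100, [])

# Hypothetical numbers mirroring the incident above: cost per request up 22% at 5%.
baseline = {"cost_per_request": 0.0100, "json_parse_rate": 99.2}
higher_is_worse = {"cost_per_request": True, "json_parse_rate": False}

def read_metrics(percent):
    return {"cost_per_request": 0.0122, "json_parse_rate": 99.0}

result = run_rollout(baseline, read_metrics, higher_is_worse)
```

Each metric needs a direction flag because "regress" means "goes up" for cost and "goes down" for parse rate. In this example the rollout halts at the 5% stage on `cost_per_request` alone.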
It's worth being concrete about the eval set. It has ~250 test cases, each structured like this (the json_schema block is present only for categories where the expected response is JSON):
{
  "id": "support-001",
  "category": "billing-question",
  "input": "I was charged twice for my subscription this month, can you help?",
  "expected": {
    "must_include": ["sorry", "investigate"],
    "must_not_include": ["please contact your bank"],
    "json_schema": {
      "type": "object",
      "required": ["intent", "next_action", "needs_human"],
      "properties": {
        "intent": {"enum": ["billing_dispute", "billing_question", "unknown"]},
        "next_action": {"type": "string"},
        "needs_human": {"type": "boolean"}
      }
    }
  },
  "tone_target": "warm",
  "max_tokens": 200
}
The eval runs each test through both the current and proposed prompts, then evaluates each output against its expected block. The check types are simple — string contains, JSON schema validation, a judge LLM for tone.
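The two deterministic check types might look roughly like this. It's a sketch under assumptions: `check_output` and `check_json` are hypothetical names, and the hand-rolled schema check covers only the keywords used in the test case above (`required`, `enum`, `type`). A real setup would more likely use a full JSON Schema validator.

```python
import json

# Only the JSON Schema types that appear in the example test case.
PY_TYPES = {"string": str, "boolean": bool, "object": dict}

def check_json(output_text, schema):
    """Parse the output as JSON and apply a minimal subset of JSON Schema."""
    try:
        data = json.loads(output_text)
    except json.JSONDecodeError:
        return ["output is not valid JSON"]
    if not isinstance(data, dict):
        return ["output is not a JSON object"]
    failures = []
    for key in schema.get("required", []):
        if key not in data:
            failures.append(f"missing key: {key}")
    for key, rules in schema.get("properties", {}).items():
        if key not in data:
            continue
        if "enum" in rules and data[key] not in rules["enum"]:
            failures.append(f"bad value for {key}: {data[key]!r}")
        if "type" in rules and not isinstance(data[key], PY_TYPES[rules["type"]]):
            failures.append(f"wrong type for {key}")
    return failures

def check_output(output_text, expected):
    """Return a list of failure strings; an empty list means the case passed."""
    failures = []
    lowered = output_text.lower()
    for phrase in expected.get("must_include", []):
        if phrase.lower() not in lowered:
            failures.append(f"missing phrase: {phrase}")
    for phrase in expected.get("must_not_include", []):
        if phrase.lower() in lowered:
            failures.append(f"forbidden phrase: {phrase}")
    if "json_schema" in expected:
        failures.extend(check_json(output_text, expected["json_schema"]))
    return failures

expected = {
    "must_include": ["sorry", "investigate"],
    "must_not_include": ["please contact your bank"],
    "json_schema": {
        "type": "object",
        "required": ["intent", "next_action", "needs_human"],
        "properties": {
            "intent": {"enum": ["billing_dispute", "billing_question", "unknown"]},
            "next_action": {"type": "string"},
            "needs_human": {"type": "boolean"},
        },
    },
}

good = ('{"intent": "billing_dispute", "next_action": '
        '"Sorry about that, we will investigate the duplicate charge.", "needs_human": true}')
bad = 'Happy to help! {"intent": "billing_dispute"}'

good_failures = check_output(good, expected)
bad_failures = check_output(bad, expected)
```

The `bad` string mimics the original incident: friendly text wrapped around the JSON makes the whole output unparseable, which is exactly what the parse-rate metric counts.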
Building the eval set took about a sprint. We seeded it with examples we already had from production (sampled from real interactions, anonymized), added edge cases the team thought of, and kept growing it. New categories of input that emerge in production get added; old ones that never represent real traffic get retired.
Eval coverage of edge cases we don't anticipate. The cost regression I mentioned above was caught by production rollout, not by eval. That's expected — eval can only test what we thought of. Staged rollout is the safety net.
Multi-turn conversation eval. Our eval is single-turn. Many real interactions are multi-turn. We've experimented with multi-turn eval; it's noisier and slower. We do it manually for major changes and skip it for small ones.
Prompt cost prediction. Estimating "what would this prompt cost at production volume" is harder than it sounds. We have a rough estimate from eval-time tokens × production volume, but it consistently undershoots reality by 10-15%. We add a buffer.
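The rough estimate is simple arithmetic. In this sketch every number is hypothetical (token count, request volume, price), and treating the buffer as a flat 15% multiplier is an assumption about how to compensate for the undershoot.

```python
def estimate_monthly_cost(mean_tokens_per_request, requests_per_month,
                          price_per_1k_tokens, buffer=0.15):
    """Eval-time tokens x production volume x price, padded by a safety buffer."""
    raw = mean_tokens_per_request / 1000 * requests_per_month * price_per_1k_tokens
    return raw * (1 + buffer)

# Hypothetical inputs: 600 tokens per request, 1M requests, 0.002 per 1k tokens.
unbuffered = estimate_monthly_cost(600, 1_000_000, 0.002, buffer=0.0)
buffered = estimate_monthly_cost(600, 1_000_000, 0.002)
```

With these inputs the unbuffered estimate comes to about 1,200 and the buffered one to about 1,380 in whatever currency the price is quoted in.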
We retired the "anyone can edit prompts" workflow. Not because we don't trust people, but because we'd accumulated patterns where a prompt would get tweaked without any test, get deployed, and either break nothing (most of the time) or break a downstream pipeline. The cost of the broken cases far outweighed the convenience of the working cases.
Prompts now follow the same review process as code. PRs require review, eval must pass, staged rollout is automatic.
Two questions come up consistently:
"Isn't this overkill for a small team?" Maybe. The threshold for "worth it" is when prompts are powering production user-facing decisions. If your prompts are internal tooling for engineers, you can probably skip the rollout staging. If they're customer-facing, the cost of one incident exceeds the cost of building this.
"How big should the eval set be?" The smallest you can make work. We started with 30 test cases. Got to 100 within a month. Settled at ~250. Beyond ~300 we found we were adding noise more than signal — every change had something flagged because the eval was so granular. Pruned aggressively. Quality > quantity for eval cases.
Start with one prompt — your most critical one — and build the workflow around just that prompt. Get the eval right for one prompt before you generalize.
The judge LLM evaluation has variance. Run each eval at least 3 times and average. We had false-positive regressions early because a single judge evaluation can swing ±0.3 points. Three runs averaged is much steadier.
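Averaging the judge is a small wrapper. This sketch assumes a hypothetical `judge` callable that takes the model output and returns a numeric tone score; the stub below replays fixed scores so the example is deterministic, where a real judge would call an LLM.

```python
from statistics import mean

def averaged_judge_score(judge, output_text, runs=3):
    """Call the judge `runs` times on the same output and return the mean score."""
    if runs < 1:
        raise ValueError("runs must be at least 1")
    return mean(judge(output_text) for _ in range(runs))

# Stub judge for illustration: replays scores that swing by 0.3 around 4.5.
_scores = iter([4.2, 4.5, 4.8])

def stub_judge(output_text):
    return next(_scores)

score = averaged_judge_score(stub_judge, "Sorry about the double charge, we'll investigate.")
```

Any single call to this stub could be off by 0.3, while the three-run average lands on 4.5.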
Don't try to make eval block on all metrics. Pick the 1-2 that matter most (for us: JSON parse rate, refusal correctness). The others surface as warnings. Blocking on too many metrics means the deploy treadmill grinds to a halt over fluctuations.
The real value isn't the regression eval — it's the version history. Six months in, when someone asks "why did the assistant start saying X around April?", you can pull up the prompt diff that caused it and reason about the change. Without versioning, you're guessing.