We changed a system prompt for what we thought was a tone improvement and broke a customer-critical extraction overnight. This is the version control and regression testing we built next.

Field Notes: Prompt Versioning and Regression Testing

The incident: a tweak to our customer-support assistant's system prompt — three sentences added to nudge tone toward "warmer, more conversational" — silently broke a JSON extraction the same prompt was responsible for. About 8% of responses started returning malformed JSON instead of the expected schema. We caught it 11 hours after deploy via a downstream pipeline alert. By then, ~30,000 user interactions had degraded outputs.

That was the day we stopped treating prompts as configuration and started treating them like code.

What was wrong with our setup before

Prompts lived in a prompts.py file checked into the repo. They were updated by anyone, reviewed lightly (often "LGTM, looks better"), and deployed with the next service push. Three things failed:

No regression test. The prompt had been "good enough" historically; nobody systematically checked that a change didn't break something else.
No diff tooling that surfaced behaviour change. A prompt diff in a PR looks like a text diff — adding three sentences looks small. The behaviour change can be enormous.
No staged rollout. Once merged, the new prompt hit 100% of users immediately. Errors were maximally amplified.

This post is what we changed for each.

Treating the prompt like a versioned artifact

Step one was to pull prompts out of source code into a separate, versioned store. Each prompt is now a row in a database table:

CREATE TABLE prompts (
  id UUID PRIMARY KEY,
  name TEXT NOT NULL,           -- "support_assistant_v1"
  version INT NOT NULL,
  content TEXT NOT NULL,
  variables JSONB NOT NULL,     -- schema of expected interpolated variables
  created_by TEXT NOT NULL,
  created_at TIMESTAMPTZ NOT NULL,
  notes TEXT,
  UNIQUE(name, version)
);

The application code references prompts by name and a default version. We can pin a specific version per environment if needed. Updating a prompt = inserting a new row with version = N + 1. The old version is preserved and can be rolled back to in seconds.
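A minimal sketch of the publish-and-resolve flow described above, using SQLite as a stand-in for the real database. The function names `publish` and `resolve`, the `pins` mapping, and the trimmed-down schema are assumptions for illustration, not our actual control plane:

```python
import sqlite3

# Simplified SQLite stand-in for the prompts table above. Only the columns
# needed for version resolution are kept; the rest are omitted for brevity.
SCHEMA = """
CREATE TABLE prompts (
    name    TEXT NOT NULL,
    version INTEGER NOT NULL,
    content TEXT NOT NULL,
    UNIQUE (name, version)
);
"""


def publish(conn, name, content):
    """Insert content as version N + 1 for this prompt name and return N + 1."""
    (latest,) = conn.execute(
        "SELECT COALESCE(MAX(version), 0) FROM prompts WHERE name = ?", (name,)
    ).fetchone()
    new_version = latest + 1
    # UNIQUE(name, version) rejects the insert if two writers race for the same N + 1.
    conn.execute(
        "INSERT INTO prompts (name, version, content) VALUES (?, ?, ?)",
        (name, new_version, content),
    )
    conn.commit()
    return new_version


def resolve(conn, name, pins, env):
    """Return (version, content): the version pinned for env if any, else the latest."""
    pinned = pins.get((env, name))
    if pinned is not None:
        row = conn.execute(
            "SELECT version, content FROM prompts WHERE name = ? AND version = ?",
            (name, pinned),
        ).fetchone()
    else:
        row = conn.execute(
            "SELECT version, content FROM prompts WHERE name = ? "
            "ORDER BY version DESC LIMIT 1",
            (name,),
        ).fetchone()
    if row is None:
        raise LookupError(f"no matching version for prompt {name!r}")
    return row
```

In this sketch a rollback is nothing more than changing the entry in `pins` for an environment; no rows are deleted or rewritten.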

This solved one immediate problem: rollbacks. Before, "roll back the prompt" meant reverting a commit and redeploying. Now it's a version flag flip in our control plane.

Regression eval as a deploy gate

The next layer was a regression eval that runs on every prompt change. We had a small eval set already (200 hand-curated test cases for the support assistant); we expanded it to cover specific behaviours we cared about:

Tone (qualitative — judge LLM grades "warmth" 1-5)
Specific JSON extraction (deterministic — does the response parse and match the schema?)
Refusal correctness (when asked something out of scope, does it refuse?)
Length distribution (we don't want responses doubling in length silently)
Latency (p95 of token count, since latency is roughly token-count-bound for LLMs)

When someone proposes a prompt change, the eval runs against both the current and proposed versions and reports the diff:

PROMPT EVAL: support_assistant_v1 (proposed v=8 vs current v=7)

Tone (judge):           v7=4.2  v8=4.5  ✓ improved
JSON parse rate:        v7=99.2% v8=92.1%  ✗ REGRESSION (-7.1pp)
Refusal correctness:    v7=98%  v8=98%  ✓ stable
Mean response length:   v7=180  v8=205  ⚠ +14% — review
p95 latency tokens:     v7=420  v8=485  ⚠ +15% — review

The JSON parse rate regression is the one that would have flagged the original incident. The eval catches this kind of thing now, automatically, before merge.

Two of the metrics are blocking: a JSON parse regression of >2pp blocks merge, and a refusal-correctness regression of >1pp blocks merge. The others (tone, length, latency tokens) are warnings that surface in the PR but don't block.
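The gating rule can be sketched roughly like this. The blocking thresholds are the ones stated above; the metric key names and the 10% warning threshold for length and latency are assumptions made for the example:

```python
# Blocking thresholds from the text, expressed as drops in percentage points.
BLOCKING_DROP_PP = {"json_parse_rate": 2.0, "refusal_correctness": 1.0}

# Warning threshold for relative increases. The 10% value is an assumption
# for illustration; the text does not state the exact cut-off.
WARN_INCREASE_PCT = {"mean_response_length": 10.0, "p95_latency_tokens": 10.0}


def gate(current, proposed):
    """Compare two metric dicts and return (blocking, warnings) as lists of strings."""
    blocking, warnings = [], []
    for metric, limit in BLOCKING_DROP_PP.items():
        drop = current[metric] - proposed[metric]
        if drop > limit:
            blocking.append(f"{metric}: -{drop:.1f}pp")
    for metric, limit in WARN_INCREASE_PCT.items():
        increase = (proposed[metric] - current[metric]) / current[metric] * 100
        if increase > limit:
            warnings.append(f"{metric}: +{increase:.0f}%")
    if proposed["tone"] < current["tone"]:
        warnings.append("tone: judge score dropped")
    return blocking, warnings


# The v7 and v8 numbers from the report above.
v7 = {"tone": 4.2, "json_parse_rate": 99.2, "refusal_correctness": 98.0,
      "mean_response_length": 180, "p95_latency_tokens": 420}
v8 = {"tone": 4.5, "json_parse_rate": 92.1, "refusal_correctness": 98.0,
      "mean_response_length": 205, "p95_latency_tokens": 485}

blocking, warnings = gate(v7, v8)
merge_allowed = not blocking  # warnings are shown in the PR but never block
```

With the report's numbers, the JSON parse drop lands in `blocking` and the length and latency increases land in `warnings`, which matches how the report above is annotated.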

Eval failures: what to do

When the eval flags a regression, the PR shows a comment with the failing test cases. For each, you see:

The test input
The output from the current version (v7)
The output from the proposed version (v8)
The judge's reasoning, if it's a qualitative metric

This makes the failure inspectable. You're not guessing why JSON parsing dropped — you can see, for example, that the new prompt occasionally introduces conversational phrasing INSIDE the JSON block. Fix the prompt; re-run; iterate.

Most prompt changes hit one minor regression on the first try. The cycle takes 5-10 minutes per iteration. Total time from "I want to update this prompt" to "PR merged" is typically 30-60 minutes for a non-trivial change. Slower than before, but the changes ship without breaking things.

Staged rollout

Even with the eval, we don't trust 100% rollouts of prompt changes. New prompt versions ship to 5% of traffic for an hour, then 20% for two hours, then 100%. Each step has a hard gate based on production metrics:

Downstream JSON parse failure rate (logged from the parsing service that consumes our LLM output)
User feedback signals (thumbs-up/down on responses, where collected)
Cost per request (caught a few cases where a prompt change roughly doubled output token count)

If any of those regress more than 10% from baseline during a stage, the rollout halts and reverts.
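A rough sketch of that stage logic, under a few assumptions: the metric names are hypothetical, users are bucketed by a hash of their ID so each user stays in the same cohort, and "regress" means moving more than 10% in the bad direction relative to baseline:

```python
import hashlib

# (traffic fraction, minimum soak time in minutes) from the text.
# The final stage is full rollout and has no soak time.
STAGES = [(0.05, 60), (0.20, 120), (1.00, 0)]
MAX_REGRESSION = 0.10  # 10% relative to baseline

# Direction in which each metric gets worse. Metric names are hypothetical.
HIGHER_IS_WORSE = {
    "json_parse_failure_rate": True,
    "thumbs_up_rate": False,
    "cost_per_request": True,
}


def in_rollout(user_id, fraction):
    """Deterministically assign a user to the new prompt for a given traffic fraction."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 10_000  # 0..9999
    return bucket < fraction * 10_000


def regressions(baseline, observed):
    """Return the metrics that moved more than MAX_REGRESSION in the bad direction.

    Assumes every baseline value is non-zero.
    """
    failed = []
    for metric, worse_when_higher in HIGHER_IS_WORSE.items():
        change = (observed[metric] - baseline[metric]) / baseline[metric]
        if not worse_when_higher:
            change = -change
        if change > MAX_REGRESSION:
            failed.append(metric)
    return failed


def next_stage(stage_index, elapsed_minutes, baseline, observed):
    """Return the stage to run next, or -1 to signal halt and revert."""
    if regressions(baseline, observed):
        return -1
    _, soak_minutes = STAGES[stage_index]
    is_last = stage_index == len(STAGES) - 1
    if is_last or elapsed_minutes < soak_minutes:
        return stage_index
    return stage_index + 1
```

Hashing the user ID rather than sampling per request keeps a given user on one prompt version for the whole stage, which makes feedback signals easier to attribute.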

This caught one real issue: a prompt update passed the eval, rolled to 5%, and the cost-per-request metric ticked up 22%. The eval hadn't caught it because our test cases happened to use short queries where the issue didn't manifest. Production traffic, with longer queries, did. Reverted in 8 minutes.

What our eval set looks like

Worth being concrete about. Our eval set has ~250 test cases, structured as:

{
  "id": "support-001",
  "category": "billing-question",
  "input": "I was charged twice for my subscription this month, can you help?",
  "expected": {
    "must_include": ["sorry", "investigate"],
    "must_not_include": ["please contact your bank"],
    "json_schema": {  # expected response is JSON for this category
      "type": "object",
      "required": ["intent", "next_action", "needs_human"],
      "properties": {
        "intent": {"enum": ["billing_dispute", "billing_question", "unknown"]},
        "next_action": {"type": "string"},
        "needs_human": {"type": "boolean"}
      }
    }
  },
  "tone_target": "warm",
  "max_tokens": 200
}

The eval runs each test through both the current and proposed prompts, then evaluates each output against its expected block. The check types are simple — string contains, JSON schema validation, a judge LLM for tone.
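As an illustration of the deterministic checks, here is a sketch that evaluates one output against an `expected` block shaped like the test case above. The `check_schema` helper is a deliberately tiny stand-in that handles only `type`, `required`, `properties`, and `enum`; a real setup would more likely use a full JSON Schema validator. Case-insensitive phrase matching is also an assumption:

```python
import json

# Only the JSON types used in the example test case are mapped here.
_TYPES = {"object": dict, "string": str, "boolean": bool}


def check_schema(value, schema):
    """Minimal validator for 'type', 'required', 'properties', and 'enum' only."""
    python_type = _TYPES.get(schema.get("type"))
    if python_type is not None and not isinstance(value, python_type):
        return False
    if "enum" in schema and value not in schema["enum"]:
        return False
    if isinstance(value, dict):
        if any(key not in value for key in schema.get("required", [])):
            return False
        for key, subschema in schema.get("properties", {}).items():
            if key in value and not check_schema(value[key], subschema):
                return False
    return True


def run_checks(output, expected):
    """Return a list of failure descriptions; an empty list means the case passed."""
    failures = []
    lowered = output.lower()
    for phrase in expected.get("must_include", []):
        if phrase.lower() not in lowered:
            failures.append(f"missing phrase: {phrase}")
    for phrase in expected.get("must_not_include", []):
        if phrase.lower() in lowered:
            failures.append(f"forbidden phrase: {phrase}")
    schema = expected.get("json_schema")
    if schema is not None:
        try:
            parsed = json.loads(output)
        except json.JSONDecodeError:
            failures.append("output is not valid JSON")
        else:
            if not check_schema(parsed, schema):
                failures.append("output does not match schema")
    return failures
```

Note that `json.loads` on the whole output is strict: any conversational text before or after the JSON object counts as a parse failure, which is the failure mode from the original incident.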

Building the eval set took about a sprint. We seeded it with examples we already had from production (sampled from real interactions, anonymized), added edge cases the team thought of, and kept growing it. New categories of input that emerge in production get added; old ones that never represent real traffic get retired.

What we still don't have right

Eval coverage of edge cases we don't anticipate. The cost regression I mentioned above was caught by production rollout, not by eval. That's expected — eval can only test what we thought of. Staged rollout is the safety net.

Multi-turn conversation eval. Our eval is single-turn. Many real interactions are multi-turn. We've experimented with multi-turn eval; it's noisier and slower. We do it manually for major changes and skip it for small ones.

Prompt cost prediction. Estimating "what would this prompt cost at production volume" is harder than it sounds. We have a rough estimate from eval-time tokens × production volume, but it consistently undershoots reality by 10-15%. We add a buffer.
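The rough estimate amounts to the arithmetic below. The function name, the per-1k-token pricing parameter, and the default 15% buffer are assumptions for the sketch; the buffer here is applied as a simple multiplier on the raw estimate:

```python
def estimate_monthly_cost(eval_output_tokens, monthly_requests,
                          price_per_1k_tokens, buffer=0.15):
    """Extrapolate eval-time token counts to production volume, plus a buffer.

    eval_output_tokens: per-test-case output token counts from the eval run.
    buffer: fractional padding to offset the known undershoot (assumed 15%).
    """
    mean_tokens = sum(eval_output_tokens) / len(eval_output_tokens)
    raw_estimate = mean_tokens * monthly_requests * price_per_1k_tokens / 1000
    return raw_estimate * (1 + buffer)
```

This only counts output tokens and assumes production queries resemble the eval set, which is exactly the assumption that tends to fail when real queries run longer.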

What we deleted

The "anyone can edit prompts" workflow. Not because we don't trust people, but because we'd accumulated patterns where a prompt would get tweaked without any test, deployed, and then either break nothing (most of the time) or break a downstream pipeline. The cost of the broken cases far outweighed the convenience of the working cases.

Prompts now follow the same review process as code. PRs require review, eval must pass, staged rollout is automatic.

What other teams ask us about this

Two questions come up consistently:

"Isn't this overkill for a small team?" Maybe. The threshold for "worth it" is when prompts are powering production user-facing decisions. If your prompts are internal tooling for engineers, you can probably skip the rollout staging. If they're customer-facing, the cost of one incident exceeds the cost of building this.

"How big should the eval set be?" The smallest you can make work. We started with 30 test cases. Got to 100 within a month. Settled at ~250. Beyond ~300 we found we were adding noise more than signal — every change had something flagged because the eval was so granular. Pruned aggressively. Quality > quantity for eval cases.

What I'd tell someone building this for the first time

Start with one prompt — your most critical one — and build the workflow around just that prompt. Get the eval right for one prompt before you generalize.

The judge LLM evaluation has variance. Run each eval at least 3 times and average. We had false-positive regressions early because a single judge evaluation can swing ±0.3 points. Three runs averaged is much steadier.
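Averaging the judge is a small wrapper. In this sketch `judge` is a hypothetical callable that takes a response and returns a 1-5 score; how it calls the judge model is left out:

```python
from statistics import mean


def averaged_judge_score(judge, response, runs=3):
    """Call the judge `runs` times on the same response and return the mean score."""
    if runs < 1:
        raise ValueError("runs must be at least 1")
    return mean(judge(response) for _ in range(runs))
```

The trade-off is cost: three judge calls per test case triples judge spend for the qualitative metrics, so it is worth applying only to metrics that actually rely on the judge.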

Don't try to make eval block on all metrics. Pick the 1-2 that matter most (for us: JSON parse rate, refusal correctness). The others surface as warnings. Blocking on too many metrics means the deploy treadmill grinds to a halt over fluctuations.

The real value isn't the regression eval — it's the version history. Six months in, when someone asks "why did the assistant start saying X around April?", you can pull up the prompt diff that caused it and reason about the change. Without versioning, you're guessing.

About Kiril Urbonas