How we deploy LLM-powered features. The deployment patterns are mostly normal; the validation is where the differences are.
Deploying an LLM-powered feature shouldn't require new deployment tooling. We use the same Kubernetes deployments, GitOps pipeline, and canary rollouts we use for non-AI services. The differences are at the edges: validating that a model change is safe to ship, handling third-party API dependencies, and rolling back when quality regresses. This is the deployment playbook we've landed on.
For LLM features that call external APIs (OpenAI, Anthropic):
For self-hosted inference (our smaller use cases):
nvidia.com/gpu: 1) and use a specific node pool with appropriate GPUs.
The "deployment is interesting" stuff happens around prompt updates and model changes, not around the service deployments themselves.
Prompts are code. They live in Git. Changes go through code review. But unlike code, the impact of a prompt change is hard to verify mechanically — a small wording change can degrade quality on the long tail.
Our pipeline for prompt changes:
gpt-4o.
The regression suite has 100-300 cases per prompt, hand-curated to cover normal traffic + known edge cases. Curating these is real work; it's the part of LLM development that takes the most effort.
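As a rough illustration of what a hand-curated regression suite can look like, here is a minimal sketch. It assumes each case pairs an input with a programmatic check on the output; `RegressionCase`, `run_suite`, and `fake_generate` are hypothetical names, and `fake_generate` stands in for "render the prompt and call the pinned model".

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class RegressionCase:
    name: str
    user_input: str
    check: Callable[[str], bool]  # True if the model output is acceptable


def run_suite(
    generate: Callable[[str], str], cases: list[RegressionCase]
) -> tuple[float, list[str]]:
    """Run every case; return (pass_rate, names of failed cases)."""
    failed = [c.name for c in cases if not c.check(generate(c.user_input))]
    pass_rate = 1.0 - len(failed) / len(cases)
    return pass_rate, failed


# Hypothetical stand-in for the real prompt + model call.
def fake_generate(user_input: str) -> str:
    return f"Refund policy: 30 days. You asked: {user_input}"


cases = [
    RegressionCase(
        "mentions_refund_window",
        "How long do I have to return an item?",
        lambda out: "30 days" in out,
    ),
    RegressionCase("no_empty_answer", "??", lambda out: len(out.strip()) > 0),
    RegressionCase(
        "declines_legal_advice",
        "Can I sue you?",
        lambda out: "lawyer" in out.lower(),
    ),
]

pass_rate, failed = run_suite(fake_generate, cases)
print(f"pass rate: {pass_rate:.0%}, failed: {failed}")
```

Running the same cases against the old and new prompt and comparing pass rates gives a mechanical signal, although it only covers the input shapes someone thought to write down.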
Standard canary works fine for code changes. For prompt or model changes, we tweak it:
Quality metrics in the canary analysis, not just error rate. We compare the canary's response quality (Layer 1 programmatic checks, see our observability post) against stable. If canary quality drops > 5pp, abort.
Longer canary duration for prompt changes. We run canary for 60 minutes for prompt changes vs 20 minutes for code-only. Quality issues take time to manifest.
Sticky routing per user. Within a canary, we route consistently — a single user gets either the canary or the stable version, not random. This avoids confusing users with two slightly different experiences.
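The quality-based abort rule above could be sketched roughly as follows. This assumes each response has already been scored pass/fail by programmatic checks; `canary_verdict` and the `min_samples` guard are illustrative assumptions, not part of the pipeline described here.

```python
def pass_rate(results: list[bool]) -> float:
    """Fraction of responses that passed the programmatic checks."""
    return sum(results) / len(results)


def canary_verdict(
    stable: list[bool],
    canary: list[bool],
    max_drop_pp: float = 5.0,
    min_samples: int = 200,  # assumed guard against noisy early readings
) -> str:
    """Return 'wait', 'abort', or 'continue' for the current canary step."""
    if len(stable) < min_samples or len(canary) < min_samples:
        return "wait"
    # Drop in percentage points: positive means canary is worse than stable.
    drop_pp = (pass_rate(stable) - pass_rate(canary)) * 100
    return "abort" if drop_pp > max_drop_pp else "continue"
```

A minimum sample count is one reason a longer canary window helps: at low traffic percentages it takes a while to collect enough scored responses for the comparison to mean anything.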
For self-hosted models, we deploy two versions side by side and route traffic between them via a simple proxy. Same canary semantics as for prompt changes.
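Sticky per-user routing in such a proxy is commonly done by hashing the user ID into a fixed bucket. The sketch below is one way to do that, not necessarily how our proxy does it; `bucket`, `route`, and the salt value are hypothetical.

```python
import hashlib


def bucket(user_id: str, salt: str) -> int:
    """Map a user to a stable bucket in [0, 100)."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % 100


def route(user_id: str, canary_percent: int, salt: str = "rollout-1") -> str:
    """Same user + same salt always gets the same answer for a given percent."""
    return "canary" if bucket(user_id, salt) < canary_percent else "stable"
```

Because the bucket is fixed per user, raising the canary percentage only adds users to the canary; nobody who already saw the new version is moved back to stable mid-rollout. Changing the salt per rollout avoids always exposing the same users first.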
External LLM APIs have outages. How we handle them:
Multi-provider routing for the most important features. Same prompt, two providers (OpenAI + Anthropic). Cost is duplicated for the warm path, but if one is down, the other takes traffic.
Aggressive timeout and retry. We set 30s timeouts on LLM calls (vs default 600s+). On timeout, we fall back to a degraded response ("I'm having trouble — please try again in a minute") rather than holding the user's connection open.
Status page integration. OpenAI and Anthropic publish status feeds. We watch them and route around regional issues automatically when possible.
Cached fallbacks for common queries. When the API is down, our query-similarity cache (described in the cost optimization post) returns recent answers for similar questions. Quality is degraded but the feature isn't dead.
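Put together, the fallbacks above form a chain. The sketch below assumes a particular order (providers first, then cache, then a degraded message) and hides each vendor SDK behind a plain callable; `Provider`, `answer`, and `cache_lookup` are hypothetical names, and the degraded wording is illustrative.

```python
from typing import Callable, Optional

# (prompt, timeout_seconds) -> completion text. A real adapter would pass the
# timeout to the vendor SDK and raise on timeout or connection failure.
Provider = Callable[[str, float], str]

DEGRADED_MESSAGE = "I'm having trouble. Please try again in a minute."


def answer(
    prompt: str,
    providers: list[Provider],
    cache_lookup: Callable[[str], Optional[str]],
    timeout_s: float = 30.0,
) -> tuple[str, str]:
    """Return (text, source), where source names the path that answered."""
    for i, call in enumerate(providers):
        try:
            return call(prompt, timeout_s), f"provider_{i}"
        except Exception:
            # Sketch only: real code should catch the SDK's specific timeout
            # and connection errors, and log which provider failed.
            continue
    cached = cache_lookup(prompt)
    if cached is not None:
        return cached, "cache"
    return DEGRADED_MESSAGE, "degraded"
```

Returning the source alongside the text makes it easy to count how much traffic is being served by each path during an outage.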
We had a 4-hour OpenAI outage last year. Multi-provider routing and cached fallbacks kept ~70% of our LLM features functional during it.
OpenAI and Anthropic ship model snapshots and roll users forward over time. "gpt-4o" without a date is a moving target — what you tested today is not necessarily what you ship next month.
Our policy:
gpt-4o-2024-08-06, not gpt-4o.
When a snapshot is being deprecated, we usually get 30-60 days' notice. We test the new snapshot in our regression suite, fix any prompt incompatibilities, then update the pin.
Without pinning, we've seen silent quality drops when the provider rolled their version forward. Pinning is non-optional in production.
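One way to enforce pinning is a check in CI that rejects model names without a date suffix. This is a minimal sketch assuming model names live in a feature-to-model mapping; `is_pinned`, `unpinned_models`, and the feature names are hypothetical, and the pattern only covers the `YYYY-MM-DD` and `YYYYMMDD` suffix styles.

```python
import re

# Matches a trailing snapshot date such as -2024-08-06 or -20240620.
_DATED_SUFFIX = re.compile(r"-(\d{4}-\d{2}-\d{2}|\d{8})$")


def is_pinned(model: str) -> bool:
    """True if the model name ends in a dated snapshot suffix."""
    return bool(_DATED_SUFFIX.search(model))


def unpinned_models(config: dict[str, str]) -> list[str]:
    """Return the features whose configured model is a moving alias."""
    return sorted(f for f, model in config.items() if not is_pinned(model))


config = {
    "assistant": "gpt-4o-2024-08-06",
    "classifier": "gpt-4o",  # moving target: should fail the check
}
print(unpinned_models(config))
```

Failing the build when the returned list is non-empty keeps an unpinned alias from reaching production by accident.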
When a deploy goes bad, the rollback path differs by what changed:
Code rollback: standard GitOps revert. Argo syncs back to previous version.
Prompt rollback: the prompts are in Git, so this is also a git revert. Same flow as code.
Model snapshot rollback: the snapshot is also in Git (env var or config file). Same flow.
Self-hosted model rollback: bigger blast radius — re-deploy the previous model image. Slower because images are large (5-15GB).
For all of these, the rollback should not take longer than a few minutes. If it would take longer (e.g., model image needs to be pulled), we keep both versions running simultaneously and toggle the routing config.
Different LLM features have different risk profiles. Our policies:
| Feature | Canary | Quality threshold | Rollback path |
|---|---|---|---|
| Customer-facing assistant | 5% → 25% → 100% over 60min | > 3pp drop aborts | Git revert |
| Internal classifier | 5% → 100% over 30min | > 5pp drop aborts | Git revert |
| Background batch jobs | None (low risk, async) | Run regression weekly | Manual |
| New experimental feature | Behind feature flag, 1% of users | Manual review | Toggle flag off |
Higher-risk features get more careful rollout. Lower-risk features ship faster.
Every new LLM feature is gated by a feature flag. The flag controls:
When a feature has a problem, we can toggle the flag off without redeploying. Faster than a code rollback.
Flags also enable A/B testing — we can run the new version vs the old version on different user cohorts and compare quality metrics directly. We use this for any change we're not 100% confident about.
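A flag check of this kind can be very small. The sketch below assumes an in-memory store with a kill switch and a percentage rollout; `LlmFlag`, `FlagStore`, and the flag name are hypothetical, and a real setup would read flags from a flag service or config that can change without a redeploy.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class LlmFlag:
    enabled: bool         # kill switch: False turns the feature off for everyone
    rollout_percent: int  # share of users (0-100) who get the feature


class FlagStore:
    def __init__(self) -> None:
        self._flags: dict[str, LlmFlag] = {}

    def set(self, name: str, flag: LlmFlag) -> None:
        self._flags[name] = flag

    def is_on(self, name: str, user_id: str) -> bool:
        """Unknown or disabled flags are off; otherwise bucket the user."""
        flag = self._flags.get(name)
        if flag is None or not flag.enabled:
            return False
        digest = hashlib.sha256(f"{name}:{user_id}".encode()).digest()
        return int.from_bytes(digest[:8], "big") % 100 < flag.rollout_percent


store = FlagStore()
store.set("summarize", LlmFlag(enabled=True, rollout_percent=1))
use_llm = store.is_on("summarize", "user-123")
```

Defaulting unknown flags to off means a missing or mistyped flag name fails closed, which is usually the safer direction for an experimental feature.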
We have three: dev, staging, prod.
A subtlety: LLM behavior in dev/staging isn't perfectly representative of prod, because:
We accept this limitation and rely heavily on canary + monitoring in prod for catching prod-only issues.
A few times the pipeline didn't catch problems:
A prompt change passed regression tests because the test cases didn't cover the regression's shape. A specific kind of input (multi-turn with context references) wasn't well represented in the test suite. The new prompt did worse on those, but no test case was that shape. Canary caught it; we added the missing shapes to the test suite.
A snapshot upgrade introduced a refusal rate increase. The new model refused some queries the old one would answer (over-cautious safety tuning). Regression suite caught a few; the rest surfaced in production. We rolled back to the previous snapshot and waited for the next one.
A retrieval change broke fact accuracy in ways the judge LLM missed. The judge couldn't tell that "John was born in 1985" vs "John was born in 1995" was a regression because both were plausible. Surfaced via user feedback. We added more fact-checking test cases.
Each incident teaches you something the pipeline didn't catch. The pipeline gets better over time, but it never catches everything.
Pin model snapshots. Provider-side rolling forward is the most common silent regression.
Build the regression suite before you need it. Write 20 hand-curated test cases per LLM feature on day one. Add to it whenever you find a regression.
Use feature flags for new AI features. The ability to toggle off without redeploying is valuable when something is uncertain.
Treat prompt changes like code changes. Review, test, canary. Not "edit and ship."
Multi-provider redundancy where it matters. Single-provider dependence is fine for nice-to-have features; not fine for revenue-critical ones.
Canary against quality metrics, not just error rate. Standard canary checks won't see most LLM-specific issues.
The deployment patterns aren't exotic. They're standard CD with a few additions for LLM-specific risks. The discipline is in applying them consistently — every LLM change goes through the gate, even when it's "just a small prompt tweak."