How we deploy LLM-powered features. The deployment patterns are mostly normal; the validation is where the differences are.

Deploying AI Features: Patterns That Work in Production

Deploying an LLM-powered feature shouldn't require new deployment tooling. We use the same Kubernetes deployments, GitOps pipeline, and canary rollouts we use for non-AI services. The differences are at the edges: validating that a model change is safe to ship, handling third-party API dependencies, and rolling back when quality regresses. This is the deployment playbook we've landed on.

The deployments are normal #

For LLM features that call external APIs (OpenAI, Anthropic):

Service is a regular containerized app, deployed via our standard GitOps (Argo CD).
Health checks are standard HTTP endpoints.
Rolling updates work the way they always do.

For self-hosted inference (our smaller use cases):

Inference server is a regular Kubernetes deployment.
Pods request a GPU (nvidia.com/gpu: 1) and use a specific node pool with appropriate GPUs.
Image is built with the model weights baked in (or downloaded at startup from S3).
Otherwise, normal deployment.

The "deployment is interesting" stuff happens around prompt updates and model changes, not around the service deployments themselves.

Prompt deployments: the tricky part #

Prompts are code. They live in Git. Changes go through code review. But unlike code, the impact of a prompt change is hard to verify mechanically — a small wording change can degrade quality on the long tail.

Our pipeline for prompt changes:

Engineer modifies the prompt in a feature branch.
CI runs the regression suite against both the old and the new prompt. For each test case, gpt-4o judges whether the new response is better than, the same as, or worse than the old one.
Diff dashboard. CI produces a comparison: "27 better, 45 same, 8 worse." If "worse" exceeds a threshold, the PR is blocked.
Human reviews the worse cases. Sometimes "worse" is actually better; sometimes it's a real regression.
Merged → deployed via standard GitOps. Canary as usual.

The regression suite has 100-300 cases per prompt, hand-curated to cover normal traffic + known edge cases. Curating these is real work; it's the part of LLM development that takes the most effort.
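The blocking rule from the diff dashboard step can be sketched in a few lines of Python. This is an illustration rather than our actual CI code: the verdict labels, the `summarize`, `gate`, and `report` names, and the 5% threshold are all assumptions.

```python
from collections import Counter
from typing import Iterable

# Hypothetical threshold: block the PR when more than 5% of cases get worse.
MAX_WORSE_FRACTION = 0.05


def summarize(verdicts: Iterable[str]) -> Counter:
    """Count judge verdicts; each must be 'better', 'same', or 'worse'."""
    counts = Counter({"better": 0, "same": 0, "worse": 0})
    for verdict in verdicts:
        if verdict not in counts:
            raise ValueError(f"unexpected verdict: {verdict!r}")
        counts[verdict] += 1
    return counts


def gate(counts: Counter, max_worse_fraction: float = MAX_WORSE_FRACTION) -> bool:
    """Return True if the PR may proceed, False if it should be blocked."""
    total = sum(counts.values())
    if total == 0:
        return False  # an empty suite proves nothing, so fail closed
    return counts["worse"] / total <= max_worse_fraction


def report(counts: Counter) -> str:
    """Render the one-line summary shown on the dashboard."""
    return f"{counts['better']} better, {counts['same']} same, {counts['worse']} worse"
```

A blocked PR is not a rejected PR: the gate only forces the human review of the "worse" cases described above.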

Canary deploys for AI features #

Standard canary works fine for code changes. For prompt or model changes, we tweak it:

Quality metrics in the canary analysis, not just error rate. We compare the canary's response quality (Layer 1 programmatic checks, see our observability post) against stable. If canary quality drops by more than 5 percentage points (pp), we abort.

Longer canary duration for prompt changes. We run canary for 60 minutes for prompt changes vs 20 minutes for code-only. Quality issues take time to manifest.

Sticky routing per user. Within a canary, routing is consistent: a given user always gets either the canary or the stable version, never a random pick per request. This avoids confusing users with two slightly different experiences.
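One common way to get sticky routing is to hash the user ID into a stable bucket. The sketch below assumes that approach; the `bucket` and `route` names and the salt value are hypothetical, and our proxy may do this differently.

```python
import hashlib


def bucket(user_id: str, salt: str) -> int:
    """Map a user to a stable bucket in the range [0, 100)."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % 100


def route(user_id: str, canary_percent: int, salt: str = "prompt-rollout") -> str:
    """Return 'canary' or 'stable'; the same user always gets the same answer."""
    return "canary" if bucket(user_id, salt) < canary_percent else "stable"
```

Because the bucket is fixed per user, raising the canary percentage only moves users from stable to canary, never back. Changing the salt per rollout keeps the same users from always being the first to see new versions.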

For self-hosted models, we deploy two versions side by side and route traffic between them via a simple proxy. Same canary semantics as for prompt changes.
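The quality comparison used in canary analysis reduces to a pass-rate difference in percentage points. A minimal sketch, assuming quality is measured as the share of responses that pass the programmatic checks; `QualitySample`, `should_abort`, and the minimum sample size are illustrative names and values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QualitySample:
    passed: int  # responses that passed the programmatic checks
    total: int   # responses that were checked

    @property
    def pass_rate(self) -> float:
        return self.passed / self.total if self.total else 0.0


def should_abort(
    stable: QualitySample,
    canary: QualitySample,
    max_drop_pp: float = 5.0,
    min_samples: int = 200,  # assumed guard against deciding on noise
) -> bool:
    """Abort when canary quality is more than max_drop_pp points below stable."""
    if canary.total < min_samples:
        return False  # not enough canary traffic yet; keep observing
    drop_pp = (stable.pass_rate - canary.pass_rate) * 100
    return drop_pp > max_drop_pp
```

The minimum sample guard is one reason prompt canaries run longer: at 5% of traffic it can take a while to collect enough responses to compare.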

Handling provider outages #

External LLM APIs have outages. Our pipeline:

Multi-provider routing for the most important features. Same prompt, two providers (OpenAI + Anthropic). Cost is duplicated for the warm path, but if one is down, the other takes traffic.

Aggressive timeout and retry. We set 30s timeouts on LLM calls (vs default 600s+). On timeout, we fall back to a degraded response ("I'm having trouble — please try again in a minute") rather than holding the user's connection open.

Status page integration. OpenAI and Anthropic publish status feeds. We watch them and route around regional issues automatically when possible.

Cached fallbacks for common queries. When the API is down, our query-similarity cache (described in the cost optimization post) returns recent answers for similar questions. Quality is degraded but the feature isn't dead.
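To show the shape of a cached fallback, here is a small sketch. It uses string similarity from the standard library's `difflib` as a stand-in; the real query-similarity cache is described in the cost optimization post and is not reproduced here. `FallbackCache` and the 0.8 cutoff are assumptions.

```python
import difflib
from typing import Optional


class FallbackCache:
    """Returns a recent answer for a similar query when the API is down."""

    def __init__(self, min_similarity: float = 0.8) -> None:
        self._answers: dict[str, str] = {}
        self.min_similarity = min_similarity

    @staticmethod
    def _normalize(query: str) -> str:
        # Lowercase and collapse whitespace so trivial differences still match.
        return " ".join(query.lower().split())

    def put(self, query: str, answer: str) -> None:
        self._answers[self._normalize(query)] = answer

    def get(self, query: str) -> Optional[str]:
        matches = difflib.get_close_matches(
            self._normalize(query),
            list(self._answers),
            n=1,
            cutoff=self.min_similarity,
        )
        return self._answers[matches[0]] if matches else None
```

A miss returns `None`, which lets the caller fall through to the degraded response instead of serving an unrelated answer.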

We had a 4-hour OpenAI outage last year. Multi-provider routing and cached fallbacks kept ~70% of our LLM features functional during it.
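Provider failover and the timeout fallback can be combined in one call path. The sketch below is an assumption about how the pieces fit together, not our production gateway: providers are passed in as plain async callables so no real SDK is involved, and `complete` and `DEGRADED_RESPONSE` are hypothetical names.

```python
import asyncio
from typing import Awaitable, Callable, Sequence

# Shown to the user when every provider fails or times out.
DEGRADED_RESPONSE = "I'm having trouble, please try again in a minute."

# A provider is any async function that takes a prompt and returns text.
Provider = Callable[[str], Awaitable[str]]


async def complete(
    prompt: str,
    providers: Sequence[Provider],
    timeout_s: float = 30.0,
) -> str:
    """Try each provider in order; return a degraded response if all fail."""
    for call in providers:
        try:
            return await asyncio.wait_for(call(prompt), timeout=timeout_s)
        except Exception:
            # Timeout or provider error: move on to the next provider.
            continue
    return DEGRADED_RESPONSE
```

Note that with two providers the worst case is two full timeouts back to back, so a per-request deadline on top of the per-call timeout may be worth adding.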

Model version pinning #

OpenAI and Anthropic ship model snapshots and roll users forward over time. "gpt-4o" without a date is a moving target — what you tested today is not necessarily what you ship next month.

Our policy:

Always pin to dated snapshots: gpt-4o-2024-08-06, not gpt-4o.
Treat a snapshot upgrade as a code change. PR, regression tests, canary.
Snapshot upgrades happen on our schedule, not the provider's deprecation schedule.
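The pinning rule is easy to enforce mechanically. A minimal sketch of a CI check, assuming snapshot names end in a date written as either `YYYY-MM-DD` or `YYYYMMDD`; the function names and the config shape (feature name mapped to model name) are hypothetical.

```python
import re

# Assumed naming: a pinned snapshot ends in -YYYY-MM-DD or -YYYYMMDD.
DATED_SUFFIX = re.compile(r"-(\d{4}-\d{2}-\d{2}|\d{8})$")


def is_pinned(model: str) -> bool:
    """True if the model name ends in a dated snapshot suffix."""
    return DATED_SUFFIX.search(model) is not None


def unpinned_models(config: dict[str, str]) -> list[str]:
    """Return the features whose configured model is a moving alias."""
    return sorted(feature for feature, model in config.items() if not is_pinned(model))
```

Failing the build when `unpinned_models` returns anything keeps a floating alias from slipping in through an unrelated config change.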

When a snapshot is being deprecated, we usually get 30-60 days' notice. We test the new snapshot in our regression suite, fix any prompt incompatibilities, then update the pin.

Without pinning, we've seen silent quality drops when the provider rolled their version forward. Pinning is non-optional in production.

Rolling back #

When a deploy goes bad, the rollback path differs by what changed:

Code rollback: standard GitOps revert. Argo syncs back to previous version.

Prompt rollback: the prompts are in Git, so this is also a git revert. Same flow as code.

Model snapshot rollback: the snapshot is also in Git (env var or config file). Same flow.

Self-hosted model rollback: bigger blast radius — re-deploy the previous model image. Slower because images are large (5-15GB).

For all of these, the rollback should not take longer than a few minutes. If it would take longer (e.g., model image needs to be pulled), we keep both versions running simultaneously and toggle the routing config.

Per-feature rollout policies #

Different LLM features have different risk profiles. Our policies:

Feature	Canary	Quality threshold	Rollback path
Customer-facing assistant	5% → 25% → 100% over 60min	> 3pp drop aborts	Git revert
Internal classifier	5% → 100% over 30min	> 5pp drop aborts	Git revert
Background batch jobs	None (low risk, async)	Run regression weekly	Manual
New experimental feature	Behind feature flag, 1% of users	Manual review	Toggle flag off

Higher-risk features get more careful rollout. Lower-risk features ship faster.
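Policies like these are easier to enforce when they live in code next to the deploy tooling. The sketch below encodes the table as data; the `RolloutPolicy` type, the feature keys, and the choice to default unknown features to the strictest policy are assumptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RolloutPolicy:
    canary_steps: tuple[int, ...]          # percent of traffic at each step
    duration_min: Optional[int]            # total canary duration, if any
    max_quality_drop_pp: Optional[float]   # abort above this drop, if gated
    rollback: str


POLICIES = {
    "customer_facing_assistant": RolloutPolicy((5, 25, 100), 60, 3.0, "git revert"),
    "internal_classifier": RolloutPolicy((5, 100), 30, 5.0, "git revert"),
    "background_batch_jobs": RolloutPolicy((), None, None, "manual"),
    "experimental_feature": RolloutPolicy((1,), None, None, "toggle flag off"),
}


def policy_for(feature: str) -> RolloutPolicy:
    """Look up a feature's policy; unknown features get the strictest one."""
    return POLICIES.get(feature, POLICIES["customer_facing_assistant"])
```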

Feature flags as a safety layer #

Every new LLM feature is gated by a feature flag. The flag controls:

Whether the feature is enabled at all
What percentage of users see it
Per-customer overrides (some customers opt in to beta, some opt out)

When a feature has a problem, we can toggle the flag off without redeploying. Faster than a code rollback.
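The three controls compose in a fixed order: the kill switch first, then per-customer overrides, then the percentage. A minimal sketch of that evaluation, not tied to any particular flag service; `Flag` and `is_on` are hypothetical names.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class Flag:
    name: str
    enabled: bool = False                 # global kill switch
    percent: int = 0                      # share of users who see the feature
    overrides: dict[str, bool] = field(default_factory=dict)  # per customer


def is_on(flag: Flag, user_id: str, customer_id: str) -> bool:
    """Evaluate a flag for one user of one customer."""
    if not flag.enabled:
        return False  # the kill switch wins over everything else
    if customer_id in flag.overrides:
        return flag.overrides[customer_id]  # explicit opt-in or opt-out
    # Stable per-user bucket in [0, 100), salted with the flag name.
    digest = hashlib.sha256(f"{flag.name}:{user_id}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % 100 < flag.percent
```

Putting the kill switch ahead of overrides is a design choice: it means toggling the flag off also turns the feature off for beta customers.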

Flags also enable A/B testing — we can run the new version vs the old version on different user cohorts and compare quality metrics directly. We use this for any change we're not 100% confident about.

Pre-production environments #

We run three environments: dev, staging, and prod.

Dev: every PR's preview environment. LLM calls are real (cost is small at this volume) but use a separate API key with low quota.
Staging: continuous deployment from main. Regression suite runs here. Some QA happens here.
Prod: canary-gated promotion from staging.

A subtlety: LLM behavior in dev/staging isn't perfectly representative of prod, because:

Prod traffic distribution is different from QA test cases.
Prod has rare edge cases that don't show up in lower environments.
Provider model versions can differ between regions or accounts.

We accept this limitation and rely heavily on canary + monitoring in prod for catching prod-only issues.

Specific deployment incidents #

A few times the pipeline didn't catch problems:

A prompt change passed regression tests because the test cases didn't cover the regression's shape. A specific kind of input (multi-turn with context references) wasn't well represented in the test suite. The new prompt did worse on those, but no test case was that shape. Canary caught it; we added the missing shapes to the test suite.

A snapshot upgrade introduced a refusal rate increase. The new model refused some queries the old one would answer (over-cautious safety tuning). Regression suite caught a few; the rest surfaced in production. We rolled back to the previous snapshot and waited for the next one.

A retrieval change broke fact accuracy in ways the judge LLM missed. The judge couldn't tell that a response changing from "John was born in 1985" to "John was born in 1995" was a regression, because both answers were plausible. The problem surfaced via user feedback. We added more fact-checking test cases.
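Fact regressions like the birth-year one are a good fit for a programmatic check, because an exact comparison does not care whether the wrong answer sounds plausible. A minimal sketch, assuming each test case lists the numbers that must appear in the response; `extract_numbers` and `facts_preserved` are hypothetical helpers.

```python
import re
from typing import Iterable


def extract_numbers(text: str) -> set[str]:
    """Return every integer or decimal number that appears in the text."""
    return set(re.findall(r"\d+(?:\.\d+)?", text))


def facts_preserved(expected_numbers: Iterable[str], response: str) -> bool:
    """True only if every expected number appears verbatim in the response."""
    found = extract_numbers(response)
    return all(number in found for number in expected_numbers)
```

This only covers numeric facts stated verbatim; names, units, and paraphrased values need their own checks.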

Each incident teaches you something the pipeline didn't catch. The pipeline gets better over time, but it never catches everything.

What I'd tell a team starting #

Pin model snapshots. Provider-side rolling forward is the most common silent regression.

Build the regression suite before you need it. Write 20 hand-curated test cases per LLM feature on day one. Add to it whenever you find a regression.

Use feature flags for new AI features. The ability to toggle off without redeploying is valuable when something is uncertain.

Treat prompt changes like code changes. Review, test, canary. Not "edit and ship."

Multi-provider redundancy where it matters. Single-provider dependence is fine for nice-to-have features; not fine for revenue-critical ones.

Canary against quality metrics, not just error rate. Standard canary checks won't see most LLM-specific issues.

The deployment patterns aren't exotic. They're standard CD with a few additions for LLM-specific risks. The discipline is in applying them consistently — every LLM change goes through the gate, even when it's "just a small prompt tweak."


About Kiril Urbonas