Canary deploys have been our default for stateless services for about two years. The mechanics — route 5% of traffic to a new version, watch metrics, expand or roll back — are straightforward. The hard part is what to watch and how to decide. This post is about that side.
The point of a canary isn't "deploy slowly." It's to catch a bad version while its exposure is small enough not to cause an incident. To do that the canary has to:
Each of these is a design decision with trade-offs.
We do three steps for most services:
The percentages and durations are tuned per service. High-traffic services (>10k req/s) hit statistical significance fast; we sometimes do shorter durations. Low-traffic services (<10 req/s) need longer to accumulate enough data — sometimes the canary stays at 5% for an hour before we have enough signal.
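To make the traffic-versus-duration trade-off concrete, here is a minimal sketch that estimates how long a canary step needs to collect a target number of requests. The function name is hypothetical, and the 1,000-request target is an assumption borrowed from the minimum-data rule described later in this post; it is an illustration, not the tooling we run.

```python
def minutes_to_collect(min_requests: int, service_rps: float, canary_weight: float) -> float:
    """Estimate minutes for the canary to receive min_requests.

    service_rps is the total request rate of the service; canary_weight is
    the fraction of traffic routed to the canary (0.05 for a 5% step).
    """
    if service_rps <= 0 or not 0 < canary_weight <= 1:
        raise ValueError("service_rps must be > 0 and canary_weight in (0, 1]")
    canary_rps = service_rps * canary_weight
    return min_requests / canary_rps / 60


# High-traffic service: 10k req/s at 5% collects 1,000 requests in seconds.
print(minutes_to_collect(1000, 10_000, 0.05))
# Low-traffic service: 10 req/s at 5% needs roughly half an hour.
print(minutes_to_collect(1000, 10, 0.05))
```

At 5 req/s the same sample takes over an hour, which is why the low-traffic services sit at 5% for so long.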
The 25% middle step is debatable. Some teams skip it and go 5% → 100%. We keep it because the second decision point catches issues that show up only under more traffic (some bugs are concurrency-related and don't appear at 5%).
We use Argo Rollouts on top of Kubernetes. The traffic split is implemented by an Istio VirtualService that Argo Rollouts manages:
    spec:
      strategy:
        canary:
          steps:
          - setWeight: 5
          - pause: { duration: 10m }
          - setWeight: 25
          - pause: { duration: 20m }
          - setWeight: 100
          analysis:
            templates:
            - templateName: success-rate
            - templateName: latency-p99
            startingStep: 1
The analysis runs the named AnalysisTemplates against Prometheus metrics. If any fail, the rollout aborts and reverts.
For services not on the mesh, we use Argo Rollouts with the simpler "two ReplicaSets" approach: scale the new version to 5% of the desired pod count and let kube-proxy load balancing split traffic roughly in proportion to pod count. This is less precise than the mesh-based split but works for most cases.
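The loss of precision comes from pods being whole units. The sketch below shows the arithmetic under two assumptions: traffic spreads evenly across pods, and at least one canary pod is always kept. The function is hypothetical and is not how Argo Rollouts computes replica counts internally.

```python
import math


def replica_split(desired_pods: int, target_percent: int) -> tuple[int, int, float]:
    """Return (canary_pods, stable_pods, actual_canary_share).

    Assumes even load balancing, so the canary's traffic share equals
    its share of the pod count.
    """
    if desired_pods < 2:
        raise ValueError("need at least 2 pods to run canary and stable side by side")
    # Round up so the canary never ends up with zero pods.
    canary = max(1, math.ceil(desired_pods * target_percent / 100))
    # Always leave at least one stable pod.
    canary = min(canary, desired_pods - 1)
    stable = desired_pods - canary
    return canary, stable, canary / desired_pods


print(replica_split(10, 5))   # (1, 9, 0.1): a "5%" canary really gets 10%
print(replica_split(40, 5))   # (2, 38, 0.05): exact at 40 pods
```

With few pods, the real canary share can be well above the target, which is worth remembering when reading the canary's metrics.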
Two metrics decide promotion or rollback:
Error rate. HTTP 5xx percentage on the canary, compared to the stable. If canary's error rate is meaningfully higher (statistical test, see below), abort.
Latency p99. Same comparison. If canary p99 is more than 20% above stable's, abort.
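The latency rule is simple enough to write down directly. This is a minimal sketch with hypothetical names, assuming both p99 values are measured over the same window and in the same unit; the error-rate rule is stricter and is covered further down.

```python
def latency_ok(canary_p99_ms: float, stable_p99_ms: float, max_ratio: float = 1.2) -> bool:
    """True if the canary's p99 is at most 20% above stable's (by default)."""
    if stable_p99_ms <= 0:
        raise ValueError("stable_p99_ms must be positive")
    return canary_p99_ms <= stable_p99_ms * max_ratio


print(latency_ok(230, 200))  # True: 15% above stable
print(latency_ok(250, 200))  # False: 25% above stable, abort
```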
We considered adding more metrics (CPU, memory, custom business metrics). We've kept it to two for a specific reason: more metrics = more flake. We'd rather have two reliable signals than five that fire false positives.
Custom business metrics matter, but we run them on a longer cadence (post-deploy, not during the canary). E.g., "did checkout completion rate drop?" is checked an hour after full rollout, not during the canary itself.
The naive "is canary's error rate higher than stable's?" check fires on noise. With 5% of traffic, the canary's error rate fluctuates a lot: two failed requests out of a hundred is a 2% error rate, even on a fine version.
We use a relative error-rate test with an absolute floor and a minimum sample size:
    Promote if:
        canary_error_rate < (stable_error_rate * 1.5) + 0.005
      OR
        total_canary_requests < 1000   # not enough data yet
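Translation: the canary fails only when its error rate reaches 1.5x stable's plus 0.5 percentage points. The "+0.005" floor prevents tripping when stable is 0.1% and canary is 0.2% (a 2x relative difference, but the absolute rate is fine). Below 1,000 canary requests there isn't enough data to judge, so the check passes by default.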
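The rule translates almost line for line into code. The sketch below is an illustration with hypothetical names, taking raw request and error counts as inputs; the constants are the ones quoted above.

```python
def should_promote(
    canary_errors: int,
    canary_total: int,
    stable_errors: int,
    stable_total: int,
    ratio: float = 1.5,
    floor: float = 0.005,
    min_requests: int = 1000,
) -> bool:
    """Apply the promote rule: pass on too little data, else compare rates."""
    if canary_total < min_requests:
        return True  # not enough data yet, so the check does not fail
    canary_rate = canary_errors / canary_total
    stable_rate = stable_errors / stable_total if stable_total else 0.0
    return canary_rate < stable_rate * ratio + floor


# Stable at 0.1%, canary at 0.2%: threshold is 0.65%, so promote.
print(should_promote(10, 5000, 100, 100_000))   # True
# Stable at 0.1%, canary at 1%: above the threshold, so abort.
print(should_promote(50, 5000, 100, 100_000))   # False
```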
We tuned these numbers from production data. False positives (rolling back a fine version) used to happen ~30% of the time; now ~5%.
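One way to do this kind of tuning is to replay past deploys against candidate thresholds and count the mistakes. The sketch below assumes you have, for each past deploy, the canary and stable error rates and a label saying whether the version was actually bad. The function name is hypothetical and the sample history is made up purely to show the mechanics; it is not our production data.

```python
def evaluate_threshold(deploys, ratio: float, floor: float) -> tuple[float, float]:
    """Return (false_positive_rate, false_negative_rate) for one threshold.

    deploys is a list of (canary_rate, stable_rate, was_bad) tuples.
    """
    false_pos = false_neg = good = bad = 0
    for canary_rate, stable_rate, was_bad in deploys:
        aborted = canary_rate >= stable_rate * ratio + floor
        if was_bad:
            bad += 1
            false_neg += not aborted   # bad version that slipped through
        else:
            good += 1
            false_pos += aborted       # fine version that got rolled back
    fp_rate = false_pos / good if good else 0.0
    fn_rate = false_neg / bad if bad else 0.0
    return fp_rate, fn_rate


# Made-up history: (canary_rate, stable_rate, was_bad)
history = [
    (0.002, 0.001, False),
    (0.004, 0.001, False),
    (0.030, 0.001, True),
    (0.006, 0.001, True),
]
print(evaluate_threshold(history, ratio=1.5, floor=0.005))  # (0.0, 0.5)
print(evaluate_threshold(history, ratio=1.5, floor=0.001))  # (0.5, 0.0)
```

The two calls show the trade-off: a higher floor cuts false positives but lets a mildly bad version through, and a lower floor does the opposite.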
For this to work, we need accurate per-version metrics. The setup:
Every metric carries a version label (version=v123, version=v124), and the analysis queries filter on it:
    sum(rate(http_requests_total{status=~"5..", version="v124"}[5m]))
The version label has to flow from the deployment all the way to the metric. We enforce this with a sidecar that injects the version into all emitted metrics. Without this, your canary analysis is making decisions on noisy aggregated data.
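The query above gives the 5xx request rate; an error rate needs it divided by the total request rate for the same version. Here is a small sketch that builds that ratio as a PromQL string. The helper name is hypothetical, and the metric and label names are taken from the query above.

```python
def error_ratio_query(version: str, window: str = "5m") -> str:
    """Build PromQL for the fraction of requests answered with 5xx by one version."""
    errors = (
        f'sum(rate(http_requests_total{{status=~"5..", version="{version}"}}[{window}]))'
    )
    total = f'sum(rate(http_requests_total{{version="{version}"}}[{window}]))'
    return f"{errors} / {total}"


# Run the same query shape for canary and stable so the comparison is like for like.
print(error_ratio_query("v124"))
print(error_ratio_query("v123"))
```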
Things we've learned the hard way:
Insufficient traffic to the canary. A 5% canary on a service that gets 5 req/min is 1 request every 4 minutes. You can't make decisions from that. We have a minimum-traffic threshold; if not met, we stay at the current step longer or skip the analysis entirely (riskier but pragmatic for low-traffic services).
Unrelated outages during the canary. The canary trips because of a downstream service failure that affects both versions equally. The relative-error-rate test handles this if both versions are affected — they look similar, no rollback. But if the canary is unlucky and gets more of the bad traffic, it fails. We added "if stable is also degraded, suppress the rollback" logic.
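A sketch of that suppression logic, under assumptions: "degraded" is defined here by comparing stable's current error rate against its own pre-deploy baseline, and the `degraded_factor` of 3 is an illustrative value, not our tuned one. All names are hypothetical.

```python
def should_roll_back(
    canary_rate: float,
    stable_rate: float,
    stable_baseline_rate: float,
    ratio: float = 1.5,
    floor: float = 0.005,
    degraded_factor: float = 3.0,  # assumption: how far above baseline counts as degraded
) -> bool:
    """Roll back only if the canary fails AND stable itself looks healthy."""
    canary_fails = canary_rate >= stable_rate * ratio + floor
    stable_degraded = stable_rate >= stable_baseline_rate * degraded_factor + floor
    return canary_fails and not stable_degraded


# Downstream outage: stable jumped from 0.1% to 4%, canary is at 8%. Suppress.
print(should_roll_back(0.08, 0.04, 0.001))   # False
# Stable is healthy at 0.1%, canary is at 2%. Roll back.
print(should_roll_back(0.02, 0.001, 0.001))  # True
```

Suppressing the rollback does not mean promoting; holding the rollout at its current step until stable recovers is the safer reading.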
Slow leaks. A memory leak that takes 4 hours to manifest won't show up in a 30-minute canary. Our defense is monitoring after rollout — if memory grows oddly post-rollout, we alert and may roll back. Doesn't catch everything but catches some.
Sticky sessions. Some traffic patterns are session-affinity-based. The 5% of traffic we send to canary may all be from a few "stuck" users, not a representative sample. We use header-based routing (a hash of the user ID, modded by 100, < 5 → canary) to ensure variety.
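The bucketing rule can be sketched as follows. In practice the routing decision lives in the mesh or ingress layer; this Python version only illustrates the hashing, and the choice of SHA-256 and the function names are assumptions.

```python
import hashlib


def canary_bucket(user_id: str) -> int:
    """Map a user ID to a stable bucket in [0, 100)."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def routes_to_canary(user_id: str, canary_percent: int = 5) -> bool:
    """Users whose bucket falls below canary_percent go to the canary."""
    return canary_bucket(user_id) < canary_percent


# Synthetic IDs: the canary share should land close to 5%.
users = [f"user-{i}" for i in range(10_000)]
share = sum(routes_to_canary(u) for u in users) / len(users)
print(f"canary share: {share:.3f}")
```

A cryptographic hash is used instead of Python's built-in `hash()` because the latter is randomized per process for strings, which would move users between versions on every restart.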
Schema migrations. Canary doesn't help when the bad change is a database migration. We do schema migrations in a separate change with their own rollback plan — never combined with a code deploy.
Sometimes the analysis says fail but the engineer knows it's a false positive ("I bumped a dependency that has a known performance regression we're accepting"). The engineer can manually promote. Every override is logged, requires a written reason, and is restricted to senior engineers.
We've used this maybe 10 times in two years. In about half of those, the engineer was correctly overriding a false positive. In the other half, the engineer was wrong and rolled back the next day. The override is worth having, but it should be used sparingly.
Some changes don't go through canary:
A full canary takes 30-90 minutes per service depending on tuning. That's the trade for catching bad versions: deploys are slower than kubectl set image, but bad versions affect fewer users.
We don't run canaries in dev or staging — those go straight to 100% to keep iteration fast. Canary is a production-only mechanism.
Specific bad versions that canary stopped:
Without canary, each of these would have been a full-blast incident affecting all users for as long as the rollback took.
Specific bad versions that canary missed:
Canary catches most bad versions. It doesn't catch all. Layered defenses (post-deploy alerts, monitoring, customer reports) cover the rest.
Make canary the default for stateless services. It's a real reduction in incident impact for moderate operational cost.
Two metrics are enough. Error rate and latency. Adding more sounds appealing but each adds noise.
Tune the statistical test against your real data. Off-the-shelf thresholds will either flake or miss. Look at your last 50 deploys and tune to reduce both false positives and false negatives.
Have a manual override but make it visible. The override is sometimes the right call. Logging it keeps people honest.
Don't combine canary with schema migrations or config changes. Different change types, different rollback strategies. Keep them separate.
Measure how many bad versions you've caught. It maintains the case for the canary investment when someone asks "why are deploys slow?"
Canary is one of those operational practices that compounds slowly. The first month is mostly setup and false positives. By month six, it's just how deploys work and you've stopped thinking about it. By year two, you've forgotten what it was like to deploy without canary, and the few times you have to skip it (urgent hotfix), you feel naked.