We've run canary deploys on most services for two years. The mechanics are easy; the metrics that decide "promote or roll back" are where the design is.

Canary Releases: A Gradual Rollout Strategy

Canary deploys have been our default for stateless services for about two years. The mechanics — route 5% of traffic to a new version, watch metrics, expand or roll back — are straightforward. The hard part is what to watch and how to decide. This post is about that side.

What we want from a canary

The point of a canary isn't "deploy slowly." It's: catch a bad version while it's small enough to not cause an incident. To do that the canary has to:

Receive enough traffic to surface real-world bugs (not just synthetic checks)
Run long enough to catch slow-burn issues (memory leaks, intermittent errors)
Have automated metrics that decide "looks okay" or "looks bad"
Roll back fast if "looks bad"

Each of these is a design decision with trade-offs.

Traffic split: 5%, 25%, 100%

We do three steps for most services:

5% for 10 minutes
25% for 20 minutes
100% (full)

The percentages and durations are tuned per service. High-traffic services (>10k req/s) hit statistical significance fast; we sometimes do shorter durations. Low-traffic services (<10 req/s) need longer to accumulate enough data — sometimes the canary stays at 5% for an hour before we have enough signal.
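To make the traffic arithmetic concrete, here is a minimal sketch of how long a step has to run before the canary has served a given number of requests. It assumes steady traffic, and it borrows the 1,000-request minimum from the statistical test later in this post. The function name and signature are illustrative, not part of our tooling.

```python
def seconds_to_sample(total_rps: float, canary_weight_pct: float,
                      min_requests: int = 1000) -> float:
    """Seconds until the canary has served `min_requests`, assuming steady traffic."""
    if total_rps <= 0 or canary_weight_pct <= 0:
        raise ValueError("traffic rate and canary weight must be positive")
    # Requests per second that actually reach the canary.
    canary_rps = total_rps * canary_weight_pct / 100.0
    return min_requests / canary_rps


if __name__ == "__main__":
    for rps in (10_000, 100, 10):
        secs = seconds_to_sample(rps, canary_weight_pct=5)
        print(f"{rps:>6} req/s at 5%: {secs / 60:.1f} min to reach 1000 canary requests")
```

At 10 req/s, a 5% canary needs about 33 minutes to see 1,000 requests, so a 10-minute pause is too short for a service that small. At 10k req/s the same sample arrives in about 2 seconds.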

The 25% middle step is debatable. Some teams skip it and go 5% → 100%. We keep it because the second decision point catches issues that show up only under more traffic (some bugs are concurrency-related and don't appear at 5%).

How we route the traffic

We use Argo Rollouts on top of Kubernetes. The traffic split is implemented by an Istio VirtualService that Argo Rollouts manages:

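spec:
  strategy:
    canary:
      steps:
        - setWeight: 5               # send 5% of traffic to the canary
        - pause: { duration: 10m }   # hold here while the analysis runs
        - setWeight: 25
        - pause: { duration: 20m }
        - setWeight: 100
      # Background analysis: runs alongside the steps; a failure aborts the rollout
      analysis:
        templates:
          - templateName: success-rate
          - templateName: latency-p99
        startingStep: 1              # start at step index 1 (the first pause), once the canary has traffic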

The analysis runs the named AnalysisTemplates against Prometheus metrics. If any fail, the rollout aborts and reverts.

For services not on the mesh, we use Argo Rollouts with the simpler "two ReplicaSets" approach: scale the new version to 5% of the desired pod count and let kube-proxy load balancing split traffic roughly in proportion to pod counts. This is less precise than the mesh-based split, because the achievable percentages depend on the replica count, but it works for most cases.
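The loss of precision in the pod-count approach is easy to see with a little arithmetic. The sketch below is an illustration only: the rounding policy (round up, keep at least one pod on each side) is an assumption of this example, not a description of how Argo Rollouts computes replica counts.

```python
import math


def replica_split(desired_replicas: int, canary_weight_pct: float) -> tuple[int, int, float]:
    """Return (canary_pods, stable_pods, effective_canary_pct) for a pod-count split."""
    if desired_replicas < 2:
        raise ValueError("need at least 2 replicas to run canary and stable side by side")
    if not 0 < canary_weight_pct < 100:
        raise ValueError("canary weight must be strictly between 0 and 100")
    # Assumed policy: round up so the canary always gets at least one pod.
    canary = max(1, math.ceil(desired_replicas * canary_weight_pct / 100.0))
    # Keep at least one stable pod during a partial step.
    canary = min(canary, desired_replicas - 1)
    stable = desired_replicas - canary
    return canary, stable, 100.0 * canary / desired_replicas
```

With 10 replicas, a requested 5% becomes one canary pod, which is an effective 10% of traffic. With 40 replicas the same request yields two pods and an exact 5%.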

What we measure: error rate and latency, that's it

Two metrics decide promotion or rollback:

Error rate. HTTP 5xx percentage on the canary, compared to the stable. If canary's error rate is meaningfully higher (statistical test, see below), abort.

Latency p99. Same comparison. If canary p99 is more than 20% above stable's, abort.

We considered adding more metrics (CPU, memory, custom business metrics). We've kept it to two for a specific reason: more metrics = more flake. We'd rather have two reliable signals than five that fire false positives.

Custom business metrics matter, but we run them on a longer cadence (post-deploy, not during the canary). E.g., "did checkout completion rate drop?" is checked an hour after full rollout, not during the canary itself.

The statistical test we use

The naive "is canary's error rate higher than stable's?" check fires on noise. With 5% of traffic, the error rate fluctuates a lot: two bad requests out of a hundred give a 2% error rate even on a fine version.

We use a relative error-rate threshold with an absolute floor and a minimum sample size:

Pass (don't abort) if:
  canary_error_rate < (stable_error_rate * 1.5) + 0.005

OR

  total_canary_requests < 1000  # not enough data yet

Translation: the canary fails only if its error rate reaches 1.5x stable's plus 0.5 percentage points. The "+0.005" floor prevents tripping when stable is 0.1% and canary is 0.2% (a 2x relative difference, but the absolute difference is fine). Below 1,000 canary requests the check never fails, because there isn't enough data to judge.
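The rule above can be sketched as a small function. This is a minimal illustration of the threshold logic, not our actual AnalysisTemplate. The function and parameter names are made up for the example; the defaults (1.5, 0.005, 1000) are the values quoted above.

```python
def error_rate_check(canary_errors: int, canary_total: int,
                     stable_errors: int, stable_total: int,
                     ratio: float = 1.5, floor: float = 0.005,
                     min_requests: int = 1000) -> bool:
    """Return True if the canary passes, False if the rollout should abort."""
    if canary_total < min_requests:
        return True  # not enough data yet, so do not fail on noise
    canary_rate = canary_errors / canary_total
    # Treat a stable side with no traffic as a 0% baseline.
    stable_rate = stable_errors / stable_total if stable_total else 0.0
    return canary_rate < stable_rate * ratio + floor


if __name__ == "__main__":
    # Stable at 0.1%, canary at 0.2%: under the 0.65% threshold, so it passes.
    print(error_rate_check(4, 2000, 100, 100_000))
    # Stable at 0.1%, canary at 1.1%: over the threshold, so it aborts.
    print(error_rate_check(22, 2000, 100, 100_000))
```

With stable at 0.1%, the effective threshold is 0.15% + 0.5% = 0.65%, so a canary at 0.2% passes and a canary at 1.1% aborts.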

We tuned these numbers from production data. False positives (rolling back a fine version) used to happen ~30% of the time; now ~5%.

The metrics infrastructure

For this to work, we need accurate per-version metrics. The setup:

Pods are labeled with version (version=v123, version=v124)
Prometheus scrapes per-pod metrics with the version label
Queries aggregate by version: sum(rate(http_requests_total{status=~"5..", version="v124"}[5m]))
Argo Rollouts queries Prometheus directly for analysis

The version label has to flow from the deployment all the way to the metric. We enforce this with a sidecar that injects the version into all emitted metrics. Without this, your canary analysis is making decisions on noisy aggregated data.
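The query in the list above gives the 5xx rate for one version. To compare versions, the analysis needs an error ratio (5xx over all requests). A minimal sketch of building that expression per version is shown below. The helper name is hypothetical; the metric and label names are the ones used above.

```python
import re


def error_ratio_query(version: str, window: str = "5m") -> str:
    """Build a PromQL expression for the fraction of requests answered with 5xx."""
    # Reject anything that could break out of the label matcher.
    if not re.fullmatch(r"[A-Za-z0-9_.-]+", version):
        raise ValueError(f"unexpected characters in version label: {version!r}")
    errors = f'sum(rate(http_requests_total{{status=~"5..", version="{version}"}}[{window}]))'
    total = f'sum(rate(http_requests_total{{version="{version}"}}[{window}]))'
    return f"{errors} / {total}"


if __name__ == "__main__":
    print(error_ratio_query("v124"))
```

If a version has served no requests in the window, the ratio has no usable value, so whatever consumes the result should treat an empty or NaN answer as "no data" rather than as a pass or a fail.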

What goes wrong

Things we've learned the hard way:

Insufficient traffic to the canary. A 5% canary on a service that gets 5 req/min is 1 request every 4 minutes. You can't make decisions from that. We have a minimum-traffic threshold; if not met, we stay at the current step longer or skip the analysis entirely (riskier but pragmatic for low-traffic services).

Unrelated outages during the canary. The canary trips because of a downstream service failure that affects both versions equally. The relative-error-rate test handles this if both versions are affected — they look similar, no rollback. But if the canary is unlucky and gets more of the bad traffic, it fails. We added "if stable is also degraded, suppress the rollback" logic.

Slow leaks. A memory leak that takes 4 hours to manifest won't show up in a 30-minute canary. Our defense is monitoring after rollout — if memory grows oddly post-rollout, we alert and may roll back. Doesn't catch everything but catches some.

Sticky sessions. Some traffic patterns are session-affinity-based. The 5% of traffic we send to canary may all be from a few "stuck" users, not a representative sample. We use header-based routing (a hash of the user ID, modded by 100, < 5 → canary) to ensure variety.

Schema migrations. Canary doesn't help when the bad change is a database migration. We do schema migrations in a separate change with their own rollback plan — never combined with a code deploy.
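The hash-based assignment described under sticky sessions can be sketched as follows. In practice the mesh does the routing from a request header; this only illustrates the bucketing logic. The choice of SHA-256 and the function name are assumptions of the example. Python's built-in hash() is avoided because string hashes are randomized per process, which would reassign users on every restart.

```python
import hashlib


def routes_to_canary(user_id: str, canary_pct: int = 5) -> bool:
    """Deterministically assign a user to the canary based on a hash of their ID."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    # Use the first 8 bytes as an integer and fold it into a bucket in 0..99.
    bucket = int.from_bytes(digest[:8], "big") % 100
    return bucket < canary_pct


if __name__ == "__main__":
    users = [f"user-{i}" for i in range(10_000)]
    share = sum(routes_to_canary(u) for u in users) / len(users)
    print(f"canary share at 5%: {share:.1%}")
```

Because the bucket for a given user never changes, anyone routed to the canary at 5% stays on it when the weight moves to 25%, which avoids flipping users between versions mid-rollout.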

Manual override

Sometimes the analysis says fail but the engineer knows it's a false positive ("I bumped a dependency that has a known performance regression we're accepting"). The engineer can manually promote. Every override is logged, requires a written reason, and is restricted to senior engineers.

We've used this maybe 10 times in two years. About half of those were correctly overriding a false positive. The other half were the engineer being wrong and rolling back the next day. The override is fine to have, but it should be used sparingly.

What's not on canary

Some changes don't go through canary:

Configuration changes in our config management system go fleet-wide immediately. The argument was that canary on config is too slow and config changes are usually small. We've debated this; we're staying with the current setup for now.
Schema migrations as mentioned above.
Rollbacks themselves. If we're rolling back a bad version, we go fast — no canary on the rollback. Get back to safety quickly.
Stateful services. Databases, queues, any service where the canary and stable versions can't easily share state. We use blue-green for these instead.

How long it takes

A full canary takes 30-90 minutes per service depending on tuning. That's the trade-off for catching bad versions: deploys are slower than kubectl set image, but bad versions affect fewer users.

We don't run canaries in dev or staging — those go straight to 100% to keep iteration fast. Canary is a production-only mechanism.

What we caught

Specific bad versions that canary stopped:

A change that caused 1% of requests to time out (an off-by-one in connection pool math). At 5% of traffic, the elevated error rate was clear within 10 minutes; rollback at zero customer impact.
A null pointer in a rare code path that affected ~3% of requests. Caught at 25% step; ~30 customers had errors during the canary window. Rollback completed in 90 seconds.
A latency regression (a missing index on a new database query). p99 jumped from 200ms to 800ms. Caught immediately at 5% step.

Without canary, each of these would have been a full-blast incident affecting all users for as long as the rollback took.

What we didn't catch

Specific bad versions that canary missed:

A bug that only manifested for a specific customer's data shape. The customer was small (< 1% of traffic) and didn't appear in the canary sample. They reported the issue 2 hours after rollout.
A memory leak that ate 50MB/hour. Canary runs were 30 minutes; the leak was invisible at that timescale. Discovered overnight when the OOM killer started.
A dependency upgrade that was incompatible with one specific browser. The canary's traffic was a representative sample of browsers, but the failing browser accounted for only 0.5% of it, too small to detect statistically.

Canary catches most bad versions. It doesn't catch all. Layered defenses (post-deploy alerts, monitoring, customer reports) cover the rest.
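As an illustration of the post-deploy layer, here is a minimal sketch of a memory-growth check that could flag the kind of leak described above. It fits a least-squares line through (hours since rollout, memory in MB) samples and compares the slope to a threshold. The function names and the 20 MB/hour threshold are hypothetical, not what our alerting uses.

```python
def growth_mb_per_hour(samples: list[tuple[float, float]]) -> float:
    """Least-squares slope of memory (MB) over time (hours since rollout)."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_m = sum(m for _, m in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        raise ValueError("samples must span more than one timestamp")
    cov = sum((t - mean_t) * (m - mean_m) for t, m in samples)
    return cov / var_t


def looks_like_leak(samples: list[tuple[float, float]],
                    threshold_mb_per_hour: float = 20.0) -> bool:
    """True if memory is growing faster than the (hypothetical) threshold."""
    return growth_mb_per_hour(samples) > threshold_mb_per_hour


if __name__ == "__main__":
    # A steady 50 MB/hour leak sampled every 30 minutes for two hours.
    leaking = [(0.0, 400.0), (0.5, 425.0), (1.0, 450.0), (1.5, 475.0), (2.0, 500.0)]
    print(growth_mb_per_hour(leaking), looks_like_leak(leaking))
```

A 50 MB/hour leak adds only about 25 MB during a 30-minute canary, which is easy to mistake for warm-up. Over a few hours of post-rollout samples the slope is much harder to miss.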

What I'd tell a team starting

Make canary the default for stateless services. It's a real reduction in incident impact for moderate operational cost.

Two metrics are enough. Error rate and latency. Adding more sounds appealing but each adds noise.

Tune the statistical test against your real data. Off-the-shelf thresholds will either flake or miss. Look at your last 50 deploys and tune to reduce both false positives and false negatives.

Have a manual override but make it visible. The override is sometimes the right call. Logging it keeps people honest.

Don't combine canary with schema migrations or config changes. Different change types, different rollback strategies. Keep them separate.

Measure how many bad versions you've caught. That number maintains the case for the canary investment when someone asks "why are deploys slow?"

Canary is one of those operational practices that compounds slowly. The first month is mostly setup and false positives. By month six, it's just how deploys work and you've stopped thinking about it. By year two, you've forgotten what it was like to deploy without canary, and the few times you have to skip it (urgent hotfix), you feel naked.

About Kiril Urbonas