Argo CD ships your manifests; Argo Rollouts ships them gradually with automated quality gates. The setup, the analysis templates that earn their place, and what we measure.
Argo CD is good at one thing: making cluster state match Git. It will happily ship a broken deployment as fast as it can pull manifests. Argo Rollouts adds the missing layer — progressive delivery, automated analysis, abort-on-failure. Together they're a complete deploy story; Argo CD alone is half of one. This post is what running Rollouts in production looks like.
Plain Kubernetes Deployments do rolling updates: take down N pods of the old version, bring up N of the new, repeat until all replicas are on the new version. Fast, but the only gate between steps is pod readiness; nothing checks error rate or latency. If the new version is broken, you find out when user errors start coming in.
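For contrast, here is a minimal sketch of what that rolling update looks like on a plain Deployment. The name, image, and probe path are hypothetical; the 25% values are the Kubernetes defaults written out explicitly.

```yaml
# Sketch only: name, image, and probe path are hypothetical.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api
spec:
  replicas: 6
  selector:
    matchLabels:
      app: api
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 25%        # extra pods allowed above replicas during the update
      maxUnavailable: 25%  # pods allowed to be unavailable during the update
  template:
    metadata:
      labels:
        app: api
    spec:
      containers:
        - name: api
          image: registry.example.com/api:1.2.3  # hypothetical image
          readinessProbe:                         # the only gate between batches
            httpGet:
              path: /healthz                      # hypothetical path
              port: 8080
```

Nothing in this spec looks at request metrics, which is the gap the Rollout resource fills.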
Argo Rollouts replaces the Deployment resource with a Rollout resource. Same shape, plus a strategy section that adds progressive traffic steps, automated analysis, and abort-on-failure.
The Rollout custom resource is the entry point; everything else hangs off it.
A Rollout we use for one of our HTTP APIs:
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: api
spec:
  replicas: 6
  selector:
    matchLabels:
      app: api
  template:
    # standard Pod spec
    ...
  strategy:
    canary:
      maxSurge: "25%"
      maxUnavailable: 0
      steps:
        - setWeight: 10
        - pause: { duration: 5m }
        - analysis:
            templates:
              - templateName: success-rate
              - templateName: latency-p99
            # The templates reference {{args.service-name}}, so the step must pass it.
            # "api" is assumed to match the service label on the metrics.
            args:
              - name: service-name
                value: api
        - setWeight: 25
        - pause: { duration: 10m }
        - analysis:
            templates:
              - templateName: success-rate
              - templateName: latency-p99
            args:
              - name: service-name
                value: api
        - setWeight: 50
        - pause: { duration: 10m }
        - setWeight: 100
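Flow:
1. Shift 10% of traffic to the canary and hold for 5 minutes.
2. Run the success-rate and latency-p99 analyses; a failed analysis aborts the rollout and traffic returns to stable.
3. Shift to 25%, hold for 10 minutes, and run both analyses again.
4. Shift to 50% and hold for 10 minutes.
5. Promote to 100%.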
Total time: ~30 minutes for a typical rollout. Slower than kubectl set image but with quality gates between each step.
Analysis templates are the part most teams under-invest in. An analysis template defines a metric, a query, and a success condition. Rollouts runs the query during the rollout's analysis steps.
The two we use universally:
success-rate.yaml:
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: success-rate
spec:
  args:
    - name: service-name
  metrics:
    - name: success-rate
      interval: 30s
      # Assumed value, not in the original: bounds the run so an inline analysis step can finish.
      count: 5
      successCondition: result[0] >= 0.99
      failureLimit: 3
      provider:
        prometheus:
          address: http://prometheus.monitoring:9090
          query: |
            sum(rate(http_requests_total{
              service="{{args.service-name}}",
              status!~"5..",
              version=~".*canary.*"
            }[2m]))
            /
            sum(rate(http_requests_total{
              service="{{args.service-name}}",
              version=~".*canary.*"
            }[2m]))
Reads as: every 30 seconds, query Prometheus for "non-5xx rate / total rate on the canary." Each check below 99% counts as a failed measurement; once failures exceed failureLimit (3 here, so on the fourth failure, consecutive or not), the analysis fails and the rollout aborts.
The version=~".*canary.*" matcher is the key: it measures only canary traffic rather than an aggregate across versions. Without that filter, the stable version's traffic drowns out canary errors.
latency-p99.yaml:
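apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: latency-p99
spec:
  # The query uses {{args.service-name}}, so the argument has to be declared here.
  args:
    - name: service-name
  metrics:
    - name: latency-p99
      interval: 30s
      # Assumed value, not in the original: bounds the run so an inline analysis step can finish.
      count: 5
      successCondition: result[0] < 1.0
      failureLimit: 3
      provider:
        prometheus:
          address: http://prometheus.monitoring:9090
          query: |
            histogram_quantile(0.99,
              sum(rate(http_request_duration_seconds_bucket{
                service="{{args.service-name}}",
                version=~".*canary.*"
              }[2m])) by (le)
            )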
P99 latency on the canary; a measurement fails when it reaches 1 second or more.
We also have a couple of service-specific ones (a synthetic that hits a critical endpoint, a check that the canary's writing to the right metric topic). The general pattern: one or two universal analyses + a couple service-specific ones.
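As an illustration of the service-specific kind, a synthetic check can be sketched with the web metric provider. This is not one of our actual templates: the template name, the canary-host argument, the endpoint path, and the {"ok": true} response shape are all hypothetical.

```yaml
# Sketch only: template name, URL path, and response shape are hypothetical.
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: critical-endpoint-synthetic
spec:
  args:
    - name: canary-host   # e.g. the canary Service's in-cluster DNS name
  metrics:
    - name: critical-endpoint
      interval: 30s
      count: 5
      failureLimit: 0      # a single failed probe fails the analysis
      # The endpoint is assumed to return JSON like {"ok": true}.
      successCondition: result == true
      provider:
        web:
          url: "http://{{args.canary-host}}/internal/synthetic/checkout"
          timeoutSeconds: 10
          jsonPath: "{$.ok}"
```

Because the condition inspects the response body rather than the status code, the same shape can also serve as a basic content check.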
The canary pattern relies on the service mesh (or ingress, or whatever's in front) routing traffic to the canary based on pod labels. Rollouts manages a separate set of pods labeled version: canary; your Service or VirtualService weights traffic between version: stable and version: canary based on the rollout step.
For Istio, the integration is built-in — Rollouts manages the VirtualService weights. For Nginx ingress, plain Kubernetes Services, or other routers, Rollouts can manage them too via specific provider configs.
The metrics queries in your analyses also filter by these labels — that's why the version=~".*canary.*" selector appears in the queries above. If you don't filter, you compare against the whole service and miss the canary signal.
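A hedged sketch of how the routing and the labels can be wired up with Istio. The Service names, the VirtualService name, and the route name are hypothetical, and the canaryMetadata/stableMetadata labels are one way to get version labels onto the pods, not necessarily how every setup does it.

```yaml
# Fragment of a Rollout spec; Service, VirtualService, and route names are hypothetical.
strategy:
  canary:
    canaryService: api-canary   # Service that Rollouts points at canary pods
    stableService: api-stable   # Service that Rollouts points at stable pods
    canaryMetadata:
      labels:
        version: canary         # applied to canary pods while the rollout runs
    stableMetadata:
      labels:
        version: stable
    trafficRouting:
      istio:
        virtualService:
          name: api             # an existing VirtualService
          routes:
            - primary           # named HTTP route whose weights Rollouts rewrites
    steps:
      - setWeight: 10
      - pause: { duration: 5m }
```

The VirtualService route is expected to list both Services as destinations; Rollouts then adjusts the two weights at each setWeight step.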
Real examples where Rollouts stopped a bad deploy before it reached 100% of users, along with the ones it missed at first:
Memory leak in a Node service. New version had a closure that kept references; memory grew steadily. Analysis didn't catch this on the latency template (latency was fine for the first 20 minutes). Caught it on a custom analysis we added later — "memory growth rate" — which is now in the standard template set for that service.
Database connection pool misconfigured. New version had a higher pool size; old version's connections were still open. Result: more concurrent DB connections, some timing out. P99 latency rose; the analysis aborted at step 2 (25% weight). Fixed the config, retried.
Successful-2xx-rate looks fine but content is wrong. A bug in a response handler returned 200 with empty body. Latency was fine; status was fine; users were broken. The analysis didn't catch it because we hadn't built a content-correctness check. Added one. This is the kind of mistake that drives template improvements.
The pattern: every aborted deploy teaches you something. The deploys that reach 100% and regress silently teach you nothing.
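For the memory leak case, a growth-rate check might look roughly like the following. This is a sketch, not our production template: it assumes the service exports process_resident_memory_bytes with the same service and version labels as the HTTP metrics, and the threshold, interval, and count are illustrative.

```yaml
# Sketch only: metric labels and threshold are assumptions.
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: memory-growth
spec:
  args:
    - name: service-name
  metrics:
    - name: memory-growth
      interval: 1m
      count: 10
      failureLimit: 2
      # deriv() returns bytes per second; 100000 B/s is roughly 6 MB per minute.
      successCondition: result[0] < 100000
      provider:
        prometheus:
          address: http://prometheus.monitoring:9090
          query: |
            max(deriv(process_resident_memory_bytes{
              service="{{args.service-name}}",
              version=~".*canary.*"
            }[10m]))
```

A slow leak needs a longer window than latency does, so a check like this only pays off when the pause steps are long enough for the trend to show.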
Honest list of where Rollouts isn't worth it:
Internal-only services with one user (the team). A staging deploy that 5 engineers use doesn't need a 30-minute canary. kubectl set image is fine.
Services with very low traffic. If the canary at 10% gets 3 requests per minute, your analysis is statistically meaningless. Either skip canary or use blue-green instead (full switch with smoke tests).
Services without good metrics. If you can't reliably measure success rate and latency, you can't analyze a rollout. Fix observability first.
One-off scripts and batch jobs. Run-to-completion workloads don't fit the Rollout model.
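For the low-traffic case, a blue-green Rollout with a smoke test before the switch could be sketched like this. All names and the image are hypothetical, and smoke-test stands in for whatever AnalysisTemplate exercises the preview Service.

```yaml
# Sketch only: names, image, and the smoke-test template are hypothetical.
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: low-traffic-api
spec:
  replicas: 2
  selector:
    matchLabels:
      app: low-traffic-api
  template:
    metadata:
      labels:
        app: low-traffic-api
    spec:
      containers:
        - name: app
          image: registry.example.com/low-traffic-api:1.0.0
  strategy:
    blueGreen:
      activeService: low-traffic-api-active    # receives production traffic
      previewService: low-traffic-api-preview  # points at the new version before the switch
      autoPromotionEnabled: true
      prePromotionAnalysis:                    # must pass before traffic switches
        templates:
          - templateName: smoke-test
```

The gate here is a test you drive yourself against the preview Service, so it does not depend on having enough organic traffic to measure.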
The mechanical setup includes converting each workload (swapping kind: Deployment for kind: Rollout).
Per service: a working canary rollout takes maybe a day to set up the first time, an hour for subsequent services. The cost is real but bounded.
For ongoing operations: the Argo Rollouts dashboard shows in-progress rollouts; kubectl argo rollouts list rollouts does the same from the CLI. Aborted rollouts page on-call; we have a runbook for "what to do when a canary aborts."
A few features we've left alone:
Experiment resources for full A/B testing with traffic split based on user attributes. Useful for some teams; we have a separate experimentation platform.
Header-based routing for canary (route only certain users to canary). Powerful but adds complexity. We use weighted routing.
Blue-green strategy. We use canary almost exclusively. Blue-green fits best where two versions can't serve traffic side by side, or where traffic is too low for canary analysis, and we don't have many of those.
Argo Rollouts is the missing piece between "Argo CD applied my manifest" and "my users got a working deploy." Once you have good metrics and a few analysis templates, the gate between bad deploys and your users is automated. The teams that lean on this hardest are the ones who ship multiple times per day; even at lower velocity it pays back.