Argo CD ships your manifests; Argo Rollouts ships them gradually with automated quality gates. The setup, the analysis templates that earn their place, and what we measure.

Argo Rollouts: Progressive Delivery Beyond Argo CD

Argo CD is good at one thing: making cluster state match Git. It will happily ship a broken deployment as fast as it can pull manifests. Argo Rollouts adds the missing layer — progressive delivery, automated analysis, abort-on-failure. Together they're a complete deploy story; Argo CD alone is half of one. This post is what running Rollouts in production looks like.

What Rollouts adds #

Plain Kubernetes Deployments do rolling updates: bring up a few pods of the new version, take down a few of the old, and repeat until every replica is on the new version. Fast, but the only gate between steps is the readiness probe. If the new version is broken in a way the probe doesn't see, you find out when the user errors come in.

Argo Rollouts replaces the Deployment resource with a Rollout resource. Same shape, plus:

Canary / blue-green strategies with configurable step weights
Analysis templates that run between steps — Prometheus queries, smoke tests, anything that returns success/failure
Auto-abort and roll back if analysis fails
Manual pause + promote controls when you want a human in the loop
A nice dashboard showing which rollout step you're at

The Rollout custom resource is the entry point; everything else hangs off it.
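
As a rough illustration of the manual pause and promote control from the list above, a canary step list can include a pause with no duration. This is a minimal sketch, not our production config: the name api-manual-gate, the image, and the weights are made up for the example.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: api-manual-gate   # hypothetical name for this sketch
spec:
  replicas: 6
  selector:
    matchLabels:
      app: api-manual-gate
  template:
    metadata:
      labels:
        app: api-manual-gate
    spec:
      containers:
        - name: api
          image: registry.example.com/api:1.2.3   # placeholder image
  strategy:
    canary:
      steps:
        - setWeight: 10
        # No duration: the rollout waits here until a human promotes it,
        # e.g. with `kubectl argo rollouts promote api-manual-gate`.
        - pause: {}
        - setWeight: 100
```

A timed pause (pause: { duration: 5m }) resumes on its own; the empty form is the one that keeps a human in the loop.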

A typical rollout #

A Rollout we use for one of our HTTP APIs:

yaml.yaml

apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: api
spec:
  replicas: 6
  selector:
    matchLabels:
      app: api
  template:
    # standard Pod spec
    ...
  strategy:
    canary:
      maxSurge: "25%"
      maxUnavailable: 0
      steps:
        - setWeight: 10
        - pause: { duration: 5m }
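        - analysis:
            templates:
              - templateName: success-rate
              - templateName: latency-p99
            # Both templates take service-name as an argument.
            # "api" assumes the Prometheus `service` label matches the Rollout name.
            args:
              - name: service-name
                value: api
        - setWeight: 25
        - pause: { duration: 10m }
        - analysis:
            templates:
              - templateName: success-rate
              - templateName: latency-p99
            args:
              - name: service-name
                value: api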
        - setWeight: 50
        - pause: { duration: 10m }
        - setWeight: 100

Flow:

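Push a new image; Rollout starts.
Canary weight goes to 10%. With a traffic router that means 10% of requests; without one, Rollouts approximates the weight with replica counts (1 canary pod out of 6 here).
5-minute pause.
Analysis: query Prometheus for success rate and p99 latency on the canary. If either template fails, the rollout aborts and traffic goes back to stable.
If analysis passes: bump to 25%, pause 10 minutes, run the same analysis. Then 50% with a 10-minute pause. Then 100%.

Total time: ~30 minutes for a typical rollout (25 minutes of pauses plus the analysis runs). Slower than kubectl set image, but with quality gates between steps.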

Analysis templates that earn their place #

This is the part most teams under-invest in. An analysis template defines a metric, a query, and a success condition. Rollouts queries it during the rollout's analysis steps.

The two we use universally:

success-rate.yaml:

yaml.yaml

apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: success-rate
spec:
  args:
    - name: service-name
  metrics:
    - name: success-rate
      interval: 30s
      successCondition: result[0] >= 0.99
      failureLimit: 3
      provider:
        prometheus:
          address: http://prometheus.monitoring:9090
          query: |
            sum(rate(http_requests_total{
              service="{{args.service-name}}",
              status!~"5..",
              version=~".*canary.*"
            }[2m]))
            /
            sum(rate(http_requests_total{
              service="{{args.service-name}}",
              version=~".*canary.*"
            }[2m]))

Reads as: every 30 seconds, query Prometheus for "non-5xx rate / total rate on the canary." A measurement below 99% counts as a failure. failureLimit: 3 tolerates up to three failed measurements over the run (they don't have to be consecutive); the fourth fails the analysis and aborts the rollout.

The version=~".*canary.*" matcher is the key: it scopes the query to canary traffic instead of aggregating across versions. Without that filter, the stable version's traffic drowns out canary errors.
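
One detail worth checking when a template like this runs as an inline step: as far as the Rollouts metric spec describes it, a metric with an interval but no count keeps measuring until the run is stopped, so an inline analysis step built on it may never finish on its own. A bounded variant could look like the sketch below. The template name is hypothetical, and count: 10 is an assumed value (about five minutes at a 30s interval), not a number from our templates.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: success-rate-bounded   # hypothetical name
spec:
  args:
    - name: service-name
  metrics:
    - name: success-rate
      interval: 30s
      count: 10                # assumed: stop after 10 measurements (~5 minutes)
      successCondition: result[0] >= 0.99
      failureLimit: 3
      provider:
        prometheus:
          address: http://prometheus.monitoring:9090
          query: |
            sum(rate(http_requests_total{
              service="{{args.service-name}}",
              status!~"5..",
              version=~".*canary.*"
            }[2m]))
            /
            sum(rate(http_requests_total{
              service="{{args.service-name}}",
              version=~".*canary.*"
            }[2m]))
```

With a count set, the run ends as successful once all measurements are taken and the failure limit was not exceeded, which is what lets the rollout move on to the next setWeight.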

latency-p99.yaml:

yaml.yaml

apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: latency-p99
spec:
  args:
    - name: service-name
  metrics:
    - name: latency-p99
      interval: 30s
      successCondition: result[0] < 1.0
      failureLimit: 3
      provider:
        prometheus:
          address: http://prometheus.monitoring:9090
          query: |
            histogram_quantile(0.99,
              sum(rate(http_request_duration_seconds_bucket{
                service="{{args.service-name}}",
                version=~".*canary.*"
              }[2m])) by (le)
            )

P99 latency on the canary. A measurement fails when p99 reaches 1 second or more, with the same failureLimit of 3.
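
A fixed 1-second threshold says nothing about how the canary compares to the version it replaces. If a relative check is wanted, one option is to divide the canary's p99 by the stable p99. The sketch below assumes stable pods carry a version label matching .*stable.*; the template name is hypothetical, and both the 1.2 ratio (canary at most 20% slower) and count: 10 are illustrative values, not ones we have tuned.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: latency-p99-relative   # hypothetical name
spec:
  args:
    - name: service-name
  metrics:
    - name: latency-p99-relative
      interval: 30s
      count: 10                          # assumed measurement count
      successCondition: result[0] <= 1.2 # assumed ratio threshold
      failureLimit: 3
      provider:
        prometheus:
          address: http://prometheus.monitoring:9090
          query: |
            histogram_quantile(0.99,
              sum(rate(http_request_duration_seconds_bucket{
                service="{{args.service-name}}",
                version=~".*canary.*"
              }[2m])) by (le)
            )
            /
            histogram_quantile(0.99,
              sum(rate(http_request_duration_seconds_bucket{
                service="{{args.service-name}}",
                version=~".*stable.*"
              }[2m])) by (le)
            )
```

The trade-off: a ratio tracks regressions even when absolute latency is low, but it gets noisy when either side has little traffic.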

We also have a couple of service-specific ones (a synthetic that hits a critical endpoint, a check that the canary's writing to the right metric topic). The general pattern: one or two universal analyses + a couple service-specific ones.

Pod labels for traffic shaping #

The canary pattern relies on the service mesh (or ingress, or whatever's in front) splitting traffic between canary and stable pods. Rollouts runs the new version as a separate ReplicaSet and, when canaryMetadata and stableMetadata are set on the strategy, labels those pods version: canary and version: stable; the router weights traffic between the two sets based on the rollout step.

For Istio, the integration is built in: Rollouts manages the VirtualService weights. NGINX ingress and several other routers have their own trafficRouting provider configs. With plain Kubernetes Services there is no weighted routing, so Rollouts approximates the weight through the ratio of canary to stable replicas.

The metrics queries in your analyses also filter by these labels — that's why the version=~".*canary.*" selector appears in the queries above. If you don't filter, you compare against the whole service and miss the canary signal.
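
To make the labeling concrete, here is a sketch of the relevant part of a canary strategy. canaryMetadata, stableMetadata and trafficRouting.istio are Rollouts fields; the Service and VirtualService names (api-canary, api-stable, api-vsvc, route primary) are placeholders, and the sketch assumes your Prometheus scrape config copies the pod's version label onto the request metrics.

```yaml
# Fragment of a Rollout spec; replicas, selector and template omitted.
spec:
  strategy:
    canary:
      canaryService: api-canary    # placeholder Service pointing at canary pods
      stableService: api-stable    # placeholder Service pointing at stable pods
      canaryMetadata:
        labels:
          version: canary          # applied to canary pods during the rollout
      stableMetadata:
        labels:
          version: stable          # applied to stable pods
      trafficRouting:
        istio:
          virtualService:
            name: api-vsvc         # placeholder VirtualService
            routes:
              - primary            # placeholder HTTP route name inside it
      steps:
        - setWeight: 10
        - pause: { duration: 5m }
```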

Rollouts we've blocked #

Real examples where Rollouts saved a bad deploy from reaching 100% of users:

Memory leak in a Node service. New version had a closure that kept references; memory grew steadily. Analysis didn't catch this on the latency template (latency was fine for the first 20 minutes). Caught it on a custom analysis we added later — "memory growth rate" — which is now in the standard template set for that service.

Database connection pool misconfigured. New version had a higher pool size; old version's connections were still open. Result: more concurrent DB connections, some timing out. P99 latency rose; the analysis aborted at step 2 (25% weight). Fixed the config, retried.

Successful-2xx-rate looks fine but content is wrong. A bug in a response handler returned 200 with empty body. Latency was fine; status was fine; users were broken. The analysis didn't catch it because we hadn't built a content-correctness check. Added one. This is the kind of mistake that drives template improvements.

The pattern: every aborted deploy teaches you something. The deploys that reach 100% while silently regressing teach you nothing.
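
The content-correctness check from the third example can be expressed as a job-based metric: Rollouts runs a Kubernetes Job and treats a failed Job as a failed measurement. This is a sketch of the idea rather than our actual template; the template name, the image, the canary-url argument and the check itself (body must be non-empty) are assumptions for illustration.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: content-check          # hypothetical name
spec:
  args:
    - name: canary-url         # hypothetical arg: a URL served by the canary
  metrics:
    - name: non-empty-body
      failureLimit: 0          # a single failed check fails the analysis
      provider:
        job:
          spec:
            backoffLimit: 0    # do not retry the pod; fail the measurement instead
            template:
              spec:
                restartPolicy: Never
                containers:
                  - name: check
                    image: curlimages/curl   # assumed image; pin a tag in practice
                    command: ["sh", "-c"]
                    args:
                      # -f makes curl fail on HTTP errors;
                      # grep -q . fails when the body is empty
                      - |
                        curl -fsS "{{args.canary-url}}" | grep -q .
```

A real check would assert on specific fields in the response; the non-empty test is only the smallest version that would have caught a 200 with an empty body.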

When Rollouts isn't worth it #

Honest list:

Internal-only services with one user (the team). A staging deploy that 5 engineers use doesn't need a 30-minute canary. kubectl set image is fine.

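Services with very low traffic. If the canary at 10% gets 3 requests per minute, your analysis is statistically meaningless: a 2-minute rate window covers about 6 requests, so a single 5xx drops the measured success rate to roughly 83%. Either skip canary or use blue-green instead (full switch with smoke tests).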

Services without good metrics. If you can't reliably measure success rate and latency, you can't analyze a rollout. Fix observability first.

One-off scripts and batch jobs. Run-to-completion workloads don't fit the Rollout model.
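
For the low-traffic case, a blue-green Rollout with a pre-promotion smoke test could look roughly like this. activeService, previewService and prePromotionAnalysis are Rollouts fields; the Rollout name, image, Service names and the smoke-test template are placeholders that would have to exist in your cluster.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: low-traffic-api        # hypothetical name
spec:
  replicas: 2
  selector:
    matchLabels:
      app: low-traffic-api
  template:
    metadata:
      labels:
        app: low-traffic-api
    spec:
      containers:
        - name: api
          image: registry.example.com/low-traffic-api:1.0.0   # placeholder
  strategy:
    blueGreen:
      activeService: low-traffic-api           # placeholder: serves live traffic
      previewService: low-traffic-api-preview  # placeholder: reaches the new version only
      prePromotionAnalysis:
        templates:
          - templateName: smoke-test           # hypothetical AnalysisTemplate
      # activeService is switched to the new ReplicaSet only after
      # the smoke test succeeds.
```

Because the smoke test generates its own requests against the preview Service, it does not depend on organic traffic volume.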

Operating Rollouts #

The mechanical setup:

Deploy the Argo Rollouts controller (one Helm chart, one CRD set).
Replace your Deployments with Rollouts (similar spec: swap apiVersion: apps/v1 and kind: Deployment for apiVersion: argoproj.io/v1alpha1 and kind: Rollout, then replace the strategy block).
Wire your service mesh / ingress / Service to read the canary labels Rollouts emits.
Write AnalysisTemplates for your standard metrics.
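
If rewriting every Deployment in one go is too invasive, Rollouts can also reference an existing Deployment's pod template through workloadRef instead of embedding it. A minimal sketch, assuming a Deployment named api already exists in the same namespace:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: api
spec:
  replicas: 6
  selector:
    matchLabels:
      app: api
  workloadRef:                 # reuse the pod template of an existing Deployment
    apiVersion: apps/v1
    kind: Deployment
    name: api                  # assumed Deployment name
  strategy:
    canary:
      steps:
        - setWeight: 10
        - pause: { duration: 5m }
```

With this config the referenced Deployment is not scaled down for you, so plan to scale it to zero once the Rollout's pods are healthy; otherwise both sets of pods keep running.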

Per service: a working canary rollout takes maybe a day to set up the first time, an hour for subsequent services. The cost is real but bounded.

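For ongoing operations: the Argo Rollouts dashboard shows in-progress rollouts; kubectl argo rollouts list rollouts does the same from the CLI (via the kubectl plugin), and kubectl argo rollouts get rollout api --watch follows a single rollout step by step. Aborted rollouts page on-call; we have a runbook for "what to do when a canary aborts."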

What we don't bother with #

A few features we've left alone:

Experiment resources for full A/B testing with traffic split based on user attributes. Useful for some teams; we have a separate experimentation platform.

Header-based routing for canary (route only certain users to canary). Powerful but adds complexity. We use weighted routing.

Blue-green strategy. We use canary almost exclusively. Blue-green mainly fits workloads where two versions can't serve traffic side by side (plus the low-traffic case above), and we don't have many of those.

What to read next #

GitOps with Argo CD: automating Kubernetes deployments — the deployment layer Rollouts sits on top of
Canary releases — a gradual rollout strategy — the general pattern Rollouts implements
Burn-rate alerting — the SLO discipline that prevents alert fatigue — the alerting flip-side of analysis templates
Kubernetes autoscaling: HPA, VPA, and cluster autoscaler — scaling under load is adjacent

Argo Rollouts is the missing piece between "Argo CD applied my manifest" and "my users got a working deploy." Once you have good metrics and a few analysis templates, the gate between bad deploys and your users is automated. The teams that lean on this hardest are the ones who ship multiple times per day; even at lower velocity it pays back.

About Kiril Urbonas