Blue/green sounds simple until your green cluster has a memory leak and you've already sent 50% of traffic there. The guardrails are what make it safe.

Blue/Green Deployment Guardrails

Blue/green deployments are taught as if they're a flip: old version on blue, new version on green, push the load balancer over, done. In practice the moment between "we sent some traffic to green" and "we know green is healthy" is where the interesting things actually happen. This post is about the guardrails that live in that moment.

We've been running blue/green deploys for three production services for about 18 months. Here's what's prevented the dumb mistakes.

The shape we use #

Both colours are full deployments — same Helm chart, different targetRevision. Traffic routing is via an Istio VirtualService that splits between two Kubernetes services. The split is normally 100/0; during a deploy it goes 95/5, then 90/10, 75/25, 50/50, 25/75, 0/100, with verification at every step.
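To make the step sequence concrete, here is a minimal Python sketch that renders the weighted routes for each step in the shape of an Istio VirtualService. The host names `myapp-blue` and `myapp-green`, the resource name, and the helper names are assumptions for illustration; only the step percentages come from the description above.

```python
# (blue, green) percentages for each step, starting from the steady state.
STEPS = [(100, 0), (95, 5), (90, 10), (75, 25), (50, 50), (25, 75), (0, 100)]


def route_for(blue: int, green: int) -> list[dict]:
    """Build the weighted route list for one VirtualService http rule."""
    if blue + green != 100:
        raise ValueError("weights must sum to 100")
    return [
        {"destination": {"host": "myapp-blue"}, "weight": blue},    # hypothetical host
        {"destination": {"host": "myapp-green"}, "weight": green},  # hypothetical host
    ]


def virtual_service(blue: int, green: int) -> dict:
    """Return a VirtualService manifest (as a dict) for the given split."""
    return {
        "apiVersion": "networking.istio.io/v1beta1",
        "kind": "VirtualService",
        "metadata": {"name": "myapp"},  # hypothetical resource name
        "spec": {"hosts": ["myapp"], "http": [{"route": route_for(blue, green)}]},
    }
```

Dumping `virtual_service(95, 5)` to YAML and applying it would be one way to perform the first step; how the manifest actually gets applied depends on your pipeline.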

Each step has hard gates. The pipeline does not advance until the gates clear. Most "automation" stories about blue/green skip this part, but the gates are 80% of the value.
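The "does not advance until the gates clear" rule can be sketched as a small loop. This is an illustration under assumptions, not a description of any particular pipeline tool: `run_rollout`, `Gate`, and the injected `set_green_weight` callable are hypothetical names.

```python
from typing import Callable, Iterable

# A gate takes the current green percentage and returns True if it clears.
Gate = Callable[[int], bool]


def run_rollout(
    green_steps: Iterable[int],
    gates: list[Gate],
    set_green_weight: Callable[[int], None],
) -> bool:
    """Advance green one step at a time; return to 0% green on the first failed gate."""
    for pct in green_steps:
        set_green_weight(pct)
        if not all(gate(pct) for gate in gates):
            set_green_weight(0)  # back to 100% blue
            return False
    return True
```

For example, `run_rollout([5, 10, 25, 50, 75, 100], gates, apply_weight)` either reaches 100% green and returns `True`, or stops at the first failing step and returns `False`. The pre-traffic readiness check (Gate 1 below) would run before this loop starts.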

Gate 1: green is up before any traffic moves #

Trivial in theory; bites people in practice. Green's pods can be Running without being Ready. They can be Ready without having warmed their connection pools or compiled their JIT cache. They can have done both and still fail their first 50 requests because of a config issue that only triggers at runtime.

We require three signals before sending the first 5%:

yaml.yaml

# Step 1 of pipeline
- name: green-readiness
  checks:
    # Wait for the green Deployment to finish rolling out
    - kubectl rollout status deployment/myapp-green --timeout=180s
    # 60 seconds of synthetic traffic; require 0 failures, p95 < 200ms
    - >-
      vegeta attack -duration 60s -rate 5
      -targets <(echo "GET https://myapp-green.internal/health")
      | vegeta report
    # Custom check: connection pool > 50% utilization, GC pauses < 100ms
    - check_warm_metrics --service=myapp-green --window=2m

The check_warm_metrics part is the one we added after a real incident: green looked healthy in synthetic checks but its DB connection pool was empty because nothing had triggered a real query. The first user request waited for a connection, timed out, and we counted it as a failure. Now we drive 60 seconds of synthetic traffic before any real user sees green.
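The "0 failures, p95 < 200ms" requirement in the config comment has to be enforced by something. A minimal sketch of that check, assuming the report is produced with `vegeta report -type=json` and that the JSON exposes `success` as a 0..1 ratio and `latencies["95th"]` in nanoseconds (verify these field names against the vegeta version you run). The function name is hypothetical.

```python
import json


def synthetic_check_passes(report_json: str, max_p95_ms: float = 200.0) -> bool:
    """Return True only if every request succeeded and p95 latency is under the limit.

    Assumed report shape: `success` is a 0..1 ratio, `latencies["95th"]` is in nanoseconds.
    """
    report = json.loads(report_json)
    p95_ms = report["latencies"]["95th"] / 1_000_000  # nanoseconds to milliseconds
    return report["success"] == 1.0 and p95_ms < max_p95_ms
```

A pipeline step could pipe the JSON report into a script built around this function and exit non-zero when it returns `False`.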

Gate 2: error rate in green is no worse than blue #

The first wave of real traffic to green is 5%. We measure for 5 minutes. Then we compare:

promql.promql

# Error rate on green
sum(rate(http_requests_total{service="myapp", color="green", code=~"5.."}[5m]))
  / sum(rate(http_requests_total{service="myapp", color="green"}[5m]))

# Error rate on blue (baseline)
sum(rate(http_requests_total{service="myapp", color="blue", code=~"5.."}[5m]))
  / sum(rate(http_requests_total{service="myapp", color="blue"}[5m]))

If green's rate is more than 1.1× blue's rate, the deploy halts and rolls traffic back to 100% blue. Green doesn't have to beat blue to be promoted; it has to be no worse than blue, within that tolerance.

The 1.1× tolerance was tuned empirically. We started at 1.0× (no tolerance at all) and rolled back too often on noise. We tried 1.5× and missed a real regression once. 1.1× has been stable for a year now.
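The comparison itself is a one-liner, but the edge cases are worth spelling out. A minimal sketch, assuming the two PromQL results have already been fetched as raw counts; the function names and the choice to fail when green saw no traffic are assumptions, not something the pipeline is known to do.

```python
def error_rate(errors: float, total: float) -> float:
    """5xx responses divided by all responses; 0.0 when there was no traffic."""
    return errors / total if total > 0 else 0.0


def error_gate_passes(
    green: tuple[float, float],
    blue: tuple[float, float],
    tolerance: float = 1.1,
) -> bool:
    """green and blue are (5xx count, total count) over the same 5-minute window."""
    if green[1] == 0:
        return False  # no traffic reached green, so there is nothing to judge
    return error_rate(*green) <= tolerance * error_rate(*blue)
```

One consequence of a purely multiplicative tolerance: if blue had zero errors in the window, a single green error fails the gate. Whether that is desirable or needs a small absolute floor is a per-service tuning decision.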

Gate 3: latency at the same percentiles #

Same shape, different metric:

promql.promql

histogram_quantile(0.95,
  sum by (le) (rate(http_request_duration_seconds_bucket{service="myapp", color="green"}[5m]))
)

We run the same query for blue and compare the two. The tolerance is 1.15× for p95 and 1.25× for p99, because p99 is noisier.
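It helps to know what `histogram_quantile` is doing when reading these numbers: it finds the bucket containing the target rank and interpolates linearly inside it, so the result is only as precise as the bucket boundaries. The sketch below is a simplified Python model of that behaviour, not Prometheus's code; the bucket values in the example are made up.

```python
import math


def histogram_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Estimate a quantile from cumulative (upper_bound, count) buckets.

    Buckets must be sorted by upper bound and end with the +Inf bucket.
    """
    if len(buckets) < 2 or buckets[-1][1] == 0:
        return math.nan
    rank = q * buckets[-1][1]
    # First bucket whose cumulative count reaches the target rank.
    idx = next(i for i, (_, count) in enumerate(buckets) if count >= rank)
    upper, count = buckets[idx]
    if math.isinf(upper):
        return buckets[-2][0]  # cannot interpolate into +Inf; use the last finite bound
    lower, lower_count = buckets[idx - 1] if idx > 0 else (0.0, 0.0)
    if count == lower_count:
        return upper
    return lower + (upper - lower) * (rank - lower_count) / (count - lower_count)


def latency_gate_passes(green: float, blue: float, tolerance: float) -> bool:
    """True if green's quantile is within tolerance of blue's."""
    return green <= tolerance * blue
```

With buckets `[(0.1, 50), (0.2, 90), (0.5, 99), (math.inf, 100)]`, the p95 lands in the 0.2 to 0.5 bucket and is interpolated rather than measured. Coarse buckets around your typical latency will make the green/blue comparison jumpy.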

Gate 4: cost-shaped check #

This is the one most teams skip. A regression that doubles the LLM API calls per request, or doubles the database queries per request, looks fine on error rate and latency for the duration of the canary — until the bill or the DB load catches up. By then you've promoted to 100%.

Our cost-shaped check varies per service. For our payments service, it's external API calls per request. For our retrieval service, it's vector DB queries per request. For the marketing site renderer, it's database read count per request.

If green's cost-shaped metric is more than 1.10× blue's, we halt. This caught a real bug where someone refactored a cache to no-op for a particular query type — the user-facing behaviour was identical but we'd have spent ~$300/day extra in DB load.
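The important detail is normalising by request count: green only sees a slice of traffic, so raw totals are not comparable. A minimal sketch, with hypothetical metric names standing in for whatever counter represents cost in each service:

```python
# Hypothetical mapping of service to the counter that stands for "cost".
COST_METRIC = {
    "payments": "external_api_calls_total",
    "retrieval": "vector_db_queries_total",
    "marketing-renderer": "db_reads_total",
}


def cost_per_request(cost_events: float, requests: float) -> float:
    """Cost-shaped events (API calls, DB queries) per served request."""
    if requests <= 0:
        raise ValueError("need at least one request to compute cost per request")
    return cost_events / requests


def cost_gate_passes(
    green_cost: float,
    green_requests: float,
    blue_cost: float,
    blue_requests: float,
    tolerance: float = 1.10,
) -> bool:
    """True if green's cost per request is within tolerance of blue's."""
    green = cost_per_request(green_cost, green_requests)
    blue = cost_per_request(blue_cost, blue_requests)
    return green <= tolerance * blue
```

At a 95/5 split, green might issue 1,000 queries for 500 requests while blue issues 9,500 for 9,500. Green's total is far smaller, but its per-request cost is double, and the gate fails.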

Gate 5: saturation #

Green's CPU and memory per request must not exceed 1.15× blue's. This catches subtle regressions that don't trip latency under canary load but would under peak load.

We learned this the hard way. We promoted a deploy that looked clean at 50% canary, only to have it tip over at full load that evening because the new code path was 30% more CPU-hungry per request. The error rate canary didn't catch it because we had headroom during the canary window. Saturation gating fixed this.
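The arithmetic behind that incident is simple once it is written down. The sketch below projects CPU utilisation at peak from a per-request cost; all numbers in the usage note are hypothetical, chosen only to show how a 30% per-request increase can stay invisible with headroom and still saturate at peak.

```python
def projected_utilization(cpu_seconds_per_request: float, peak_rps: float, cores: float) -> float:
    """Fraction of available CPU used at peak load; above 1.0 means saturated."""
    return cpu_seconds_per_request * peak_rps / cores


def saturation_gate_passes(
    green_per_request: float,
    blue_per_request: float,
    tolerance: float = 1.15,
) -> bool:
    """True if green's per-request resource use is within tolerance of blue's."""
    return green_per_request <= tolerance * blue_per_request
```

Suppose blue uses 0.020 CPU-seconds per request and green uses 0.026. At 1,000 requests per second on 24 cores, blue projects to about 83% utilisation and green to about 108%. The per-request gate flags the 1.3× ratio during the canary, long before peak traffic arrives.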

Pause-at-50% during peak windows #

Our traffic peaks during specific windows. If a deploy reaches the 50% step during one of those windows, the pipeline pauses for 30 minutes instead of advancing. The pause window catches load-dependent regressions that wouldn't show at 50% during off-peak hours.

This is annoying for engineers. It means a Tuesday afternoon deploy can take 90 minutes instead of 20. We accept the trade because the alternative is being surprised by load issues at 100% on a Tuesday evening.
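The pause rule reduces to a small time check. In this sketch the peak windows are placeholders (the real ones depend on your traffic), and the function names are hypothetical.

```python
from datetime import datetime, time, timedelta

# Placeholder peak windows as (start, end) local times; substitute your own.
PEAK_WINDOWS = [(time(12, 0), time(14, 0)), (time(18, 0), time(21, 0))]
PEAK_PAUSE = timedelta(minutes=30)


def in_peak_window(now: datetime) -> bool:
    """True if `now` falls inside any configured peak window."""
    return any(start <= now.time() < end for start, end in PEAK_WINDOWS)


def hold_before_advancing(green_pct: int, now: datetime) -> timedelta:
    """How long to hold at the current step before moving to the next one."""
    if green_pct == 50 and in_peak_window(now):
        return PEAK_PAUSE
    return timedelta(0)
```

The gates would keep being evaluated during the hold, which is the point: half of peak traffic on green for 30 minutes is a much harsher test than half of off-peak traffic for 5.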

The rollback path #

Every gate failure triggers an automatic rollback to 100% blue. The rollback completes in under 10 seconds (it's an Istio VirtualService update; no pod restarts needed). The pipeline marks the deploy as failed, posts to Slack, and pages the on-call if the failure happened at the 25% step or later (i.e. real customer impact, even if brief).

The on-call engineer doesn't have to do anything in the rollback path. The rollback already happened by the time they look at Slack. They look at the dashboard, decide whether the failure was a real regression or a flake, and either re-trigger the deploy or open an incident.
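The ordering matters: roll back first, notify second. A minimal sketch of a failure handler under that rule, with the routing, Slack, and paging integrations passed in as plain callables (all names here are hypothetical):

```python
from typing import Callable

PAGE_AT_GREEN_PCT = 25  # failures at or beyond this step also page the on-call


def handle_gate_failure(
    green_pct: int,
    gate_name: str,
    set_green_weight: Callable[[int], None],
    post_to_slack: Callable[[str], None],
    page_on_call: Callable[[str], None],
) -> str:
    """Roll back, then tell people. Returns the message that was posted."""
    set_green_weight(0)  # a single routing update: 100% blue, no pod restarts
    message = (
        f"Deploy failed at {green_pct}% green (gate: {gate_name}); "
        "rolled back to 100% blue."
    )
    post_to_slack(message)
    if green_pct >= PAGE_AT_GREEN_PCT:
        page_on_call(message)
    return message
```

Because the weight change happens before any notification, a slow or failing Slack call cannot delay the rollback itself.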

What we don't do #

We don't auto-promote on weekends. Friday afternoon → Monday morning is a no-deploy window unless someone explicitly waives it. Too many incidents in the past involved a deploy that broke something on Friday evening and nobody noticed until Monday.
We don't deploy two blue/green services at once. If service A goes through a deploy and service B starts one at the same time, attribution gets confusing fast. Pipeline locks per service, with a global lock for shared dependencies.
We don't keep separate data stores for blue and green. Both colours read from and write to the same database. Schema changes are gated separately (expand-then-contract pattern, separate post).
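The weekend rule is easy to encode as a pre-flight check. The exact cutoffs below (Friday 12:00 and Monday 09:00, local time) are assumptions for illustration, since "Friday afternoon" and "Monday morning" are not pinned to specific hours here; the function names are hypothetical too.

```python
from datetime import datetime

FRIDAY_CUTOFF_HOUR = 12  # assumed start of the freeze on Friday
MONDAY_RESUME_HOUR = 9   # assumed end of the freeze on Monday


def in_no_deploy_window(now: datetime) -> bool:
    """True from Friday afternoon until Monday morning (weekday(): Monday is 0)."""
    day, hour = now.weekday(), now.hour
    if day == 4:  # Friday
        return hour >= FRIDAY_CUTOFF_HOUR
    if day in (5, 6):  # Saturday, Sunday
        return True
    if day == 0:  # Monday
        return hour < MONDAY_RESUME_HOUR
    return False


def may_start_deploy(now: datetime, waived: bool = False) -> bool:
    """Deploys are blocked inside the window unless someone explicitly waives it."""
    return waived or not in_no_deploy_window(now)
```

Keeping the waiver as an explicit argument means the override shows up in the pipeline invocation rather than in someone's memory.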

What broke that the gates wouldn't catch #

Two real incidents that snuck through despite the gates:

The first: a deploy that was correct in every way but introduced a memory leak that took 8 hours to manifest. By then we were at 100% green and blue was already drained. Recovery was a forward-fix because we had no easy "go back to last week" path. Now we keep blue's container image hot for 24 hours after a promotion specifically for this case.

The second: a deploy where green's error rate was lower than blue's because green was rejecting a class of malformed requests blue used to accept. Customers who had been sending those malformed requests started seeing 400s. Our gates were happy because errors went down. Now we also alert on a SUDDEN change in error rate (in either direction) during canary windows; a sudden drop is a signal too.
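A two-sided check like that can be sketched as a symmetric ratio test. The factor of 2 is an assumed threshold, and `sudden_change` is a hypothetical name; the point is only that the comparison treats a drop the same way as a spike.

```python
def sudden_change(baseline_rate: float, canary_rate: float, factor: float = 2.0) -> bool:
    """True if the error rate moved by more than `factor` in either direction."""
    if baseline_rate == 0 and canary_rate == 0:
        return False
    if baseline_rate == 0 or canary_rate == 0:
        return True  # errors appearing from nothing, or vanishing entirely
    ratio = canary_rate / baseline_rate
    return ratio > factor or ratio < 1 / factor
```

An alert from this check is a prompt to look, not a rollback trigger: a real fix also lowers the error rate, so a human has to decide which kind of drop it was.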

Building this for a new service #

If I were setting up blue/green from scratch tomorrow, I'd do it in this order:

1. Get blue/green routing working at all. Pick one service, get traffic split working manually with a flag.
2. Add gate 2 (error rate vs blue). Just that. Don't try to add all gates at once.
3. Run for two weeks, collect false positives, tune thresholds.
4. Add latency and saturation gates.
5. Add the cost-shaped gate after you've identified what cost-shaped means for your service.
6. Add peak-window pause last.

The trap is to try to set this all up in week one. The thresholds matter and they're learned empirically. Setting them all at once means setting them all wrong.

The thing I'd never skip: the cost-shaped gate. Errors and latency miss the bugs that cost money silently. Once you have a service that calls a metered external API or runs metered DB queries, that gate pays for itself within a quarter.

About Kiril Urbonas