Blue/green sounds simple until your green cluster has a memory leak and you've already sent 50% of traffic there. The guardrails are what make it safe.
Blue/green deployments are taught as if they're a flip: old version on blue, new version on green, push the load balancer over, done. In practice, the interesting things happen in the window between "we sent some traffic to green" and "we know green is healthy". This post is about the guardrails that live in that window.
We've been running blue/green deploys for three production services for about 18 months. Here's what's prevented the dumb mistakes.
Both colours are full deployments — same Helm chart, different targetRevision. Traffic routing is via an Istio VirtualService that splits between two Kubernetes services. The split is normally 100/0; during a deploy it goes 95/5, then 90/10, 75/25, 50/50, 25/75, 0/100, with verification at every step.
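The step schedule can be written down as data plus a small helper that builds the weighted routes for each step. This is a minimal sketch, not our pipeline code: the host names `myapp-blue` and `myapp-green` and the function name `weighted_routes` are assumptions for illustration. The returned structure mirrors the `route` list of an Istio VirtualService HTTP route.

```python
# Green's share of traffic at each step, taken from the schedule above.
GREEN_WEIGHT_STEPS = [5, 10, 25, 50, 75, 100]


def weighted_routes(green_weight: int,
                    blue_host: str = "myapp-blue",
                    green_host: str = "myapp-green") -> list[dict]:
    """Build the `route` list for a VirtualService HTTP route.

    Host names are hypothetical. Blue always gets the remainder,
    so the two weights sum to 100.
    """
    if not 0 <= green_weight <= 100:
        raise ValueError("green_weight must be between 0 and 100")
    return [
        {"destination": {"host": blue_host}, "weight": 100 - green_weight},
        {"destination": {"host": green_host}, "weight": green_weight},
    ]
```

Calling `weighted_routes(5)` gives the 95/5 split, and `weighted_routes(0)` is the rollback state.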
Each step has hard gates. The pipeline does not advance until the gates clear. Most "automation" stories about blue/green skip this part, but the gates are 80% of the value.
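The "does not advance until the gates clear" rule can be sketched as a loop like the one below. The names `run_rollout`, `Gate`, and `set_green_weight` are hypothetical, and a real pipeline would also wait out a measurement window before evaluating each gate.

```python
from typing import Callable, Iterable

# A gate takes green's current weight and returns True if it clears.
Gate = Callable[[int], bool]


def run_rollout(steps: Iterable[int],
                gates: list[Gate],
                set_green_weight: Callable[[int], None]) -> bool:
    """Walk through the steps; roll back to 0% green on the first gate failure."""
    for weight in steps:
        set_green_weight(weight)
        if not all(gate(weight) for gate in gates):
            set_green_weight(0)  # back to 100% blue
            return False
    return True
```

The important design choice is that the loop has no "continue anyway" branch: a failed gate always ends the rollout.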
Trivial in theory; bites people in practice. Green's pods can be Running without being Ready. They can be Ready without having warmed their connection pools or their JIT caches. They can be Ready and warm and still fail their first 50 requests because of a config issue that only triggers at runtime.
We require three signals before sending the first 5%:
# Step 1 of pipeline
- name: green-readiness
  checks:
    - kubectl rollout status deployment/myapp-green --timeout=180s
    - vegeta attack -duration 60s -rate 5 -targets <(echo "GET https://myapp-green.internal/health")
      | vegeta report
    # require 0 failures, p95 < 200ms
    - check_warm_metrics --service=myapp-green --window=2m
    # custom check: connection pool > 50% utilization, GC pauses < 100ms
The check_warm_metrics part is the one we added after a real incident: green looked healthy in synthetic checks but its DB connection pool was empty because nothing had triggered a real query. The first user request waited for a connection, timed out, and we counted it as a failure. Now we drive 60 seconds of synthetic traffic before any real user sees green.
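`check_warm_metrics` is our own tool, so the following is only a sketch of the decision it makes, using the two thresholds from the comment above. The `WarmMetrics` fields and the `is_warm` name are assumptions for illustration; how the numbers are collected is left out.

```python
from dataclasses import dataclass


@dataclass
class WarmMetrics:
    """Hypothetical snapshot of green's metrics over the check window."""
    pool_connections_in_use: int
    pool_size: int
    max_gc_pause_ms: float


def is_warm(m: WarmMetrics,
            min_pool_utilization: float = 0.5,
            max_gc_pause_ms: float = 100.0) -> bool:
    """Warm means: pool more than 50% utilized and GC pauses under 100 ms."""
    if m.pool_size <= 0:
        return False  # no pool configured counts as not warm
    utilization = m.pool_connections_in_use / m.pool_size
    return utilization > min_pool_utilization and m.max_gc_pause_ms < max_gc_pause_ms
```

An empty pool, which is what bit us in the incident, fails this check even when every health probe is green.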
The first wave of real traffic to green is 5%. We measure for 5 minutes. Then we compare:
# Error rate on green
sum(rate(http_requests_total{service="myapp", color="green", code=~"5.."}[5m]))
/ sum(rate(http_requests_total{service="myapp", color="green"}[5m]))
# Error rate on blue (baseline)
sum(rate(http_requests_total{service="myapp", color="blue", code=~"5.."}[5m]))
/ sum(rate(http_requests_total{service="myapp", color="blue"}[5m]))
If green's rate is more than 1.1× blue's rate, the deploy halts and rolls traffic back to 100% blue. Green doesn't have to beat blue, but it has to stay within that tolerance.
The 1.1× tolerance was tuned empirically. We started at 1.0× (no tolerance: green had to be at least as good as blue) and rolled back too often on noise. We tried 1.5× and missed a real regression once. 1.1× has been stable for a year now.
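Once the two queries have returned, the comparison itself is small. A minimal sketch, assuming the rates are already fractions between 0 and 1; the function name is hypothetical, and treating a zero baseline as "green must also be zero" is an assumption of this sketch, not something we tuned.

```python
def error_gate_passes(green_rate: float,
                      blue_rate: float,
                      tolerance: float = 1.1) -> bool:
    """True if green's error rate is within `tolerance` times blue's.

    With blue_rate == 0 this only passes when green_rate is also 0.
    """
    if green_rate < 0 or blue_rate < 0:
        raise ValueError("error rates cannot be negative")
    return green_rate <= tolerance * blue_rate
```

For example, with blue at 1.0% errors, green passes at 1.05% and fails at 1.2%.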
Same shape, different metric:
histogram_quantile(0.95,
  sum by (le) (rate(http_request_duration_seconds_bucket{service="myapp", color="green"}[5m]))
)
We run the same query for blue and compare. The tolerance is 1.15× for p95 and 1.25× for p99 (p99 is noisier).
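Because each quantile has its own tolerance, it helps to keep them in one table and report which quantiles failed. A sketch under the assumption that the quantile values (in seconds) have already been fetched for both colours; `LATENCY_TOLERANCES` and `latency_violations` are illustrative names.

```python
# Per-quantile tolerances from the text: p99 is noisier, so it gets more slack.
LATENCY_TOLERANCES = {"p95": 1.15, "p99": 1.25}


def latency_violations(green: dict[str, float],
                       blue: dict[str, float],
                       tolerances: dict[str, float] = LATENCY_TOLERANCES) -> list[str]:
    """Return the quantiles where green exceeds tolerance times blue."""
    return [
        quantile
        for quantile, tolerance in tolerances.items()
        if green[quantile] > tolerance * blue[quantile]
    ]
```

An empty list means the gate clears. Returning the failing quantiles rather than a bare boolean makes the Slack message more useful.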
This is the one most teams skip. A regression that doubles the LLM API calls per request, or doubles the database queries per request, looks fine on error rate and latency for the duration of the canary — until the bill or the DB load catches up. By then you've promoted to 100%.
Our cost-shaped check varies per service. For our payments service, it's external API calls per request. For our retrieval service, it's vector DB queries per request. For the marketing site renderer, it's database read count per request.
If green's cost-shaped metric is more than 1.10× blue's, we halt. This caught a real bug where someone refactored a cache to no-op for a particular query type — the user-facing behaviour was identical but we'd have spent ~$300/day extra in DB load.
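The detail that matters is normalizing by request count, because at the early steps green sees far less traffic than blue. A minimal sketch; the function names are hypothetical, and "cost events" stands for whatever the service meters (external API calls, vector DB queries, DB reads).

```python
def per_request(cost_events: float, requests: float) -> float:
    """Cost events per request; refuses to divide by an empty window."""
    if requests <= 0:
        raise ValueError("need at least one request in the window")
    return cost_events / requests


def cost_gate_passes(green_events: float, green_requests: float,
                     blue_events: float, blue_requests: float,
                     tolerance: float = 1.10) -> bool:
    """True if green's cost per request is within tolerance of blue's."""
    green = per_request(green_events, green_requests)
    blue = per_request(blue_events, blue_requests)
    return green <= tolerance * blue
```

With illustrative numbers at the 5% step: blue makes 19,000 queries for 9,500 requests (2.0 per request) and green makes 1,500 for 500 (3.0 per request). Green's absolute count is far lower, but the gate fails, which is the outcome we want.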
Green's CPU and memory per request must not exceed 1.15× blue's. This catches subtle regressions that don't trip latency under canary load but would under peak load.
We learned this the hard way. We promoted a deploy that looked clean at 50% canary, only to have it tip over at full load that evening because the new code path was 30% more CPU-hungry per request. The error rate canary didn't catch it because we had headroom during the canary window. Saturation gating fixed this.
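The arithmetic behind that incident is worth spelling out. The numbers below are made up for illustration (8 ms of CPU per request on blue, 10 cores for green, 200 and 1,000 requests per second); only the 30% increase and the 1.15× tolerance come from our setup.

```python
def cpu_utilization(cpu_ms_per_request: float,
                    requests_per_second: float,
                    cores: int) -> float:
    """Fraction of available CPU in use; 1.0 means fully saturated."""
    return cpu_ms_per_request * requests_per_second / (cores * 1000.0)


def saturation_gate_passes(green_per_request: float,
                           blue_per_request: float,
                           tolerance: float = 1.15) -> bool:
    """True if green's per-request resource use is within tolerance of blue's."""
    return green_per_request <= tolerance * blue_per_request


# Illustrative numbers only.
blue_cpu_ms = 8.0
green_cpu_ms = blue_cpu_ms * 1.30  # 30% more CPU per request
cores = 10

canary = cpu_utilization(green_cpu_ms, requests_per_second=200, cores=cores)
peak = cpu_utilization(green_cpu_ms, requests_per_second=1000, cores=cores)
```

Under these assumptions green sits at roughly 21% CPU during the canary and roughly 104% at peak, so latency and errors look fine until the evening. The per-request comparison fails immediately, regardless of how much headroom there is.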
Our traffic peaks during specific windows. If a deploy reaches the 50% step during one of those windows, the pipeline pauses for 30 minutes instead of advancing. The pause window catches load-dependent regressions that wouldn't show at 50% during off-peak hours.
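The pause rule is a small time check in front of the step advance. In this sketch the 17:00 to 21:00 window is a placeholder rather than our real peak window, and the function names are hypothetical.

```python
from datetime import datetime, time, timedelta

# Placeholder peak window; real windows would come from traffic data.
PEAK_WINDOWS = [(time(17, 0), time(21, 0))]
PAUSE_STEP = 50  # green weight at which the pipeline holds
PAUSE_LENGTH = timedelta(minutes=30)


def in_peak_window(now: datetime, windows=PEAK_WINDOWS) -> bool:
    """True if `now` falls inside any (start, end) window on the same day."""
    return any(start <= now.time() < end for start, end in windows)


def hold_before_advancing(green_weight: int, now: datetime) -> timedelta:
    """How long to hold at the current step before moving to the next one."""
    if green_weight == PAUSE_STEP and in_peak_window(now):
        return PAUSE_LENGTH
    return timedelta(0)
```

Only the 50% step is affected; every other step advances as soon as its gates clear.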
This is annoying for engineers. It means a Tuesday afternoon deploy can take 90 minutes instead of 20. We accept the trade because the alternative is being surprised by load issues at 100% on a Tuesday evening.
Every gate failure triggers an automatic rollback to 100% blue. The rollback completes in under 10 seconds (it's an Istio VirtualService update; no pod restarts needed). The pipeline marks the deploy as failed, posts to Slack, and pages the on-call if the failure is at 25% or higher (i.e. real customer impact, even if briefly).
The on-call engineer doesn't have to do anything in the rollback path. The rollback already happened by the time they look at Slack. They look at the dashboard, decide whether the failure was a real regression or a flake, and either re-trigger the deploy or open an incident.
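The failure handling described above reduces to one decision: whether to page. A sketch with hypothetical names; the rollback and the Slack post happen on every failure, and the 25% threshold is the one from our pipeline.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailureActions:
    rollback_to_blue: bool
    post_to_slack: bool
    page_on_call: bool


def actions_for_failure(green_weight_at_failure: int,
                        page_threshold: int = 25) -> FailureActions:
    """Every gate failure rolls back and notifies; paging depends on exposure."""
    return FailureActions(
        rollback_to_blue=True,
        post_to_slack=True,
        page_on_call=green_weight_at_failure >= page_threshold,
    )
```

A failure at the 10% step produces a Slack message only; a failure at 25% or above also pages.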
Two real incidents that snuck through despite the gates:
The first: a deploy that was correct in every way but introduced a memory leak that took 8 hours to manifest. By then we were at 100% green and blue was already drained. Recovery was a forward-fix because we had no easy "go back to last week" path. Now we keep blue's container image hot for 24 hours after a promotion specifically for this case.
The second: a deploy where green's error rate was lower than blue's because green was rejecting a class of malformed requests blue used to accept. Customers who relied on those malformed requests started seeing 400s. Our gates were happy because errors went down. Now we also alert on a sudden change in error rate, in either direction, during canary windows; a sudden drop is a signal too.
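One way to express the "either direction" alert is a symmetric ratio check against the baseline. This is a sketch of the idea rather than our alert rule: the function name and the 2× threshold are placeholders, and the right threshold is something you would tune like the others.

```python
def sudden_change(baseline_rate: float,
                  canary_rate: float,
                  max_ratio: float = 2.0) -> bool:
    """True if the canary's rate moved by more than max_ratio, up or down."""
    if max_ratio <= 1.0:
        raise ValueError("max_ratio must be greater than 1")
    if baseline_rate == 0 and canary_rate == 0:
        return False
    if baseline_rate == 0 or canary_rate == 0:
        return True  # errors appeared from nothing, or vanished entirely
    ratio = canary_rate / baseline_rate
    return ratio > max_ratio or ratio < 1.0 / max_ratio
```

Unlike the gates, this alerts rather than rolls back: a large drop may be a genuine improvement, but someone should confirm that before the deploy is promoted.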
If I were setting up blue/green from scratch tomorrow, I'd do it in this order:
The trap is to try to set this all up in week one. The thresholds matter and they're learned empirically. Setting them all at once means setting them all wrong.
The thing I'd never skip: the cost-shaped gate. Errors and latency miss the bugs that cost money silently. Once you have a service that calls a metered external API or runs metered DB queries, that gate pays for itself within a quarter.