We use blue-green for stateful services where canary doesn't fit. This post covers the actual mechanics, the data-layer subtleties, and when blue-green isn't the right answer.
Blue-green is one of the oldest deployment patterns and still one of the cleanest for specific use cases. We use it for stateful services where canary doesn't fit cleanly, and for cutovers between major versions. This post is the practical version: when we choose blue-green over alternatives, how the cutover actually works, and where the data layer makes things complicated.
Blue-green deployment: run two identical environments. "Blue" is current production. "Green" is the new version. When green is verified healthy, you switch traffic atomically from blue to green. Blue stays running for fast rollback if green has problems.
The atomic switch is the key feature. There's no "some traffic on old, some on new" intermediate state. Every request after the switch hits the new version.
Compare to canary: canary deliberately runs both versions concurrently. Blue-green deliberately doesn't.
Specific cases:
Stateful services where two versions can't coexist. A service with strong session affinity, or that holds in-memory state, or where having two versions during deploy would break the data contract. We can't do canary safely; blue-green's atomic cutover is the right pattern.
Database schema migrations bundled with code. Old code can't talk to new schema; new code can't talk to old schema. We run blue-green with the new schema on green and the old schema on blue, then cut over atomically.
Major version migrations. v1 → v2 with significant API changes. Blue runs v1; green runs v2; users (or their agents) cut over at known times.
Risky changes where partial rollout would be confusing. A change that touches authentication, payments, or other "either it works or it doesn't" features. Atomic switch reduces the time window of "some users see this; some don't."
Most regular deploys go through canary instead. Specifically, we don't use blue-green for:
Stateless services with normal deploys. Canary is cheaper (you don't run two full environments) and gives you per-request quality data.
Database migrations decoupled from code. Schema changes go in their own deploy ahead of code; code that uses the new schema rolls out separately. Each step works with both old and new data shapes.
Anything where running 2x infrastructure is unaffordable. Blue-green doubles your runtime cost during the deploy window.
For a typical service:
The cutover step itself is the only "deploy" event from the user's perspective. Before, blue. After, green. No partial state.
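The shape described so far (verify green, switch all traffic at once, keep blue around for rollback) can be sketched as a small state holder. This is a minimal illustration, not our deploy tooling: `BlueGreenDeploy`, `is_healthy`, and `route_all_to` are hypothetical names, and the two callables stand in for whatever smoke tests and routing API you actually have.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class BlueGreenDeploy:
    """Tracks which environment is live. Both callables are hypothetical hooks."""

    is_healthy: Callable[[str], bool]    # e.g. run smoke tests against an environment
    route_all_to: Callable[[str], None]  # e.g. update the load balancer or mesh
    live: str = "blue"
    log: List[str] = field(default_factory=list)

    def cutover(self) -> bool:
        """Switch all traffic to green, but only if green passes verification."""
        if not self.is_healthy("green"):
            self.log.append("green unhealthy; staying on blue")
            return False
        self.route_all_to("green")
        self.live = "green"
        self.log.append("cut over to green; blue kept for rollback")
        return True

    def rollback(self) -> None:
        """Send all traffic back to blue. Only valid while blue still exists."""
        self.route_all_to("blue")
        self.live = "blue"
        self.log.append("rolled back to blue")
```

There is deliberately no method that routes a fraction of traffic: the only two states are "all blue" and "all green."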
The "switch traffic from blue to green" mechanics depend on what's in front:
Load balancer with weighted routing. Update routing to 0% blue, 100% green. Most cloud load balancers support this; the change propagates in seconds.
DNS-based switch. Update a Route 53 record. Slower (DNS TTL determines actual cutover time, can be 60+ seconds), and clients with stale DNS see the old version.
Service mesh. Adjust VirtualService weights. Fast and granular.
Kubernetes Service. Two Deployments, the Service's selector switches between them. Atomic for new connections; existing connections stay on the previous Deployment.
We mostly use the load-balancer and service-mesh approaches. DNS-based is too slow for a clean cutover.
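As a rough illustration of two of the options above, the sketch below builds the data you would hand to the routing layer rather than calling any API. The label keys `app` and `color`, the service name `checkout`, and both function names are assumptions made for the example.

```python
import json


def cutover_weights(target: str) -> dict:
    """Weights for weighted routing: all traffic to one side, none to the other."""
    if target not in ("blue", "green"):
        raise ValueError(f"unknown target: {target!r}")
    return {
        "blue": 100 if target == "blue" else 0,
        "green": 100 if target == "green" else 0,
    }


def selector_patch(app: str, color: str) -> str:
    """JSON patch body that points a Kubernetes Service at one color's pods.

    Assumes the two Deployments label their pods with `app` and `color`.
    """
    if color not in ("blue", "green"):
        raise ValueError(f"unknown color: {color!r}")
    body = {"spec": {"selector": {"app": app, "color": color}}}
    return json.dumps(body, sort_keys=True)


print(cutover_weights("green"))
print(selector_patch("checkout", "green"))
```

The string from `selector_patch` is the kind of body you could pass to `kubectl patch service checkout -p '<body>'`. As noted above, that switch only affects new connections.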
Blue-green works cleanly when blue and green share the same data layer (database, cache, queue). Both versions read and write the same data.
When code AND schema change, things get harder:
Forward-compatible schema migrations. New schema accepts queries from both old code and new code. Old code's queries continue working; new code's queries work too. Then deploy the new code as blue-green; cut over; old code goes away. This is the cleanest pattern but takes more thought.
Two databases, one cutover. Blue uses the old DB; green uses the new DB. Data is replicated from old to new during the deploy window. At cutover, replication stops and the new DB becomes the source of truth. Risk: data written to blue during the cutover window is lost or duplicated.
Read-only mode during cutover. For deployments where data consistency matters and replication is hard, we put the service in read-only mode briefly (~minutes), do the cutover, switch the database, come back to read-write. Brief unavailability is the trade for consistency.
We use the first pattern (forward-compatible schema) for most cases. It requires more deploy steps but doesn't require downtime or risk data loss.
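A minimal sketch of the forward-compatible pattern, using an in-memory SQLite database so it runs anywhere. The `users` table, the `display_name` column, and the function names are invented for the example; the point is only that an additive, nullable column leaves the old code's queries working while the new code starts using it.

```python
import sqlite3


def make_db() -> sqlite3.Connection:
    """Create the old schema: the only shape the old (blue) code knows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
    return conn


def migrate(conn: sqlite3.Connection) -> None:
    """Additive migration: a new nullable column, nothing renamed or dropped."""
    conn.execute("ALTER TABLE users ADD COLUMN display_name TEXT")


# Old code path: never mentions display_name, so it works before and after migrate().
def old_insert(conn, name):
    conn.execute("INSERT INTO users (name) VALUES (?)", (name,))


def old_read(conn):
    return [row[0] for row in conn.execute("SELECT name FROM users ORDER BY id")]


# New code path: writes the new column, and falls back to `name` for old rows.
def new_insert(conn, name, display_name):
    conn.execute(
        "INSERT INTO users (name, display_name) VALUES (?, ?)", (name, display_name)
    )


def new_read(conn):
    query = "SELECT COALESCE(display_name, name) FROM users ORDER BY id"
    return [row[0] for row in conn.execute(query)]
```

Because the old read path still works on rows written by the new code, switching traffic back to blue after cutover does not break on the data green wrote.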
Running two environments doubles your infrastructure cost during the deploy window. For a service running on $5,000/month of EC2, blue-green doubles that for the deploy duration.
Mitigations:
Short cutover windows. Don't keep blue and green running for days. Cut over quickly, tear down blue.
Deploy during off-peak. When traffic is lower, capacity needs are lower; running two environments costs less.
Use spot for the new environment. Until cutover, the new environment isn't customer-serving. Spot interruption during pre-cutover is annoying but not catastrophic.
Don't over-provision green. Green should be sized for production traffic, not for headroom. Resist the urge to "make green bigger just in case."
For our shape of services, blue-green deploys cost $50-300 extra per deploy in infrastructure. Real money, but the use cases where blue-green is the right tool (high-stakes deploys) justify the cost.
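A back-of-the-envelope check on those numbers, using the $5,000/month example from above. It assumes green is the same size as blue, on-demand pricing, and roughly 730 hours in a month; the overlap durations are illustrative, not measurements.

```python
HOURS_PER_MONTH = 730  # approximation: 365 * 24 / 12


def extra_cost(monthly_cost: float, overlap_hours: float) -> float:
    """Extra spend from running a second, equally sized environment."""
    hourly = monthly_cost / HOURS_PER_MONTH
    return round(hourly * overlap_hours, 2)


print(extra_cost(5000, 8))   # one working day of overlap
print(extra_cost(5000, 24))  # a 24-hour soak before tearing blue down
```

For that example service, an 8-hour overlap is about $55 and a 24-hour overlap is about $164, which is why the length of the overlap window is the first thing to control.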
Specific incidents:
Cutover that didn't fully cut over. A misconfigured load balancer rule meant some traffic stayed on blue after the "cutover." We didn't notice for hours because both versions were healthy. Now we verify the cutover by checking blue's traffic count goes to zero post-cutover.
Database connection pool not draining on blue. After cutover, blue still had open connections to the database. When we tore down blue, those connections were terminated abruptly. The database alerted on the connection drop. Now we drain blue's connections gracefully before teardown.
Forgot to clean up blue's now-unused infrastructure. A cutover swapped green to be the new "blue"; the old "blue" was supposed to be torn down. We forgot. Discovered three weeks later, $400 of unused EC2 burning. Now there's a TTL on every blue-green deploy: blue auto-tears-down after 24 hours.
Stale DNS caching by clients. A DNS-based cutover left some clients on the old version for hours. They had aggressive DNS caching. Now we use load-balancer-based cutovers exclusively for blue-green.
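The check from the first incident, that blue's traffic actually goes to zero, can be sketched as below. Where the per-interval request counts come from (load balancer metrics, access logs) is left out, and `cutover_complete` and `quiet_samples` are hypothetical names.

```python
from typing import Sequence


def cutover_complete(blue_request_counts: Sequence[int], quiet_samples: int = 3) -> bool:
    """True if the most recent `quiet_samples` samples from blue are all zero.

    `blue_request_counts` is ordered oldest to newest, one count per interval.
    """
    if quiet_samples <= 0:
        raise ValueError("quiet_samples must be positive")
    if len(blue_request_counts) < quiet_samples:
        return False  # not enough data yet to call it done
    return all(count == 0 for count in blue_request_counts[-quiet_samples:])
```

Requiring several consecutive zero samples avoids declaring success on a single quiet interval. If health checks still hit blue, those requests need to be excluded from the counts or the check never passes.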
Sometimes blue-green isn't enough granularity. Feature flags add a finer-grained layer:
This gives you the best of both worlds: clean infrastructure cutover, gradual user-visible change. Used together, they reduce risk significantly.
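One common way to get that gradual exposure is a percentage flag keyed on a stable hash of the user ID. The sketch below is a generic illustration under that assumption, not a description of any particular flag system; `flag_enabled` and its parameters are made up for the example.

```python
import hashlib


def flag_enabled(flag: str, user_id: str, percent: int) -> bool:
    """Return True for a stable `percent` of users for the given flag."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    digest = hashlib.sha256(f"{flag}:{user_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100  # stable bucket in 0..99
    return bucket < percent
```

Green can go live with every flag at 0%, so the infrastructure cutover changes nothing users can see. Raising the percentage later only adds users, because each user's bucket is fixed.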
The other defining feature of blue-green: rollback is fast.
If green has problems after cutover, switch traffic back to blue. As long as blue hasn't been torn down, this is one routing change away. Time to roll back: under 1 minute typically.
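Compare to canary: rolling back a canary is also fast (just abort the rollout), and only the canary's share of traffic ever hit the bad version. With blue-green, all traffic hit green between cutover and rollback, so catching problems quickly matters more. Once traffic is switched back, nobody is on green, provided the rollback happens before blue is torn down.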
Rollback assumes the data layer didn't already change. If the new code wrote data in a new format, going back to old code might break (old code can't read new format). This is why forward-compatible schema design matters — both old and new code can read both old and new data.
Blue-green is for atomic cutovers. When "some users on old, some on new" is bad, blue-green is the answer.
Canary first, blue-green for specific cases. Most deploys benefit from per-request quality data; canary gives you that. Blue-green is for the cases where canary doesn't fit.
Forward-compatible schema migrations. Decouple schema changes from code changes. Both work with both. Deploys (blue-green or otherwise) become simpler.
Verify cutover completion. Don't assume the routing change took effect; check that traffic actually moved.
TTL on blue. Auto-tear-down after a soak period. Forgotten environments waste money.
Feature flags on top. Atomic infrastructure deploy + gradual feature exposure. Best of both.
Blue-green isn't the deploy pattern for everyone or for every change. For the right use cases (stateful services, atomic cutovers, high-stakes changes), it's the cleanest pattern available. For everything else, canary or rolling deploys are usually better fits. The skill is knowing which deploy pattern matches the change you're making.