We use blue-green for stateful services where canary doesn't fit. This post covers the actual mechanics, the data-layer subtleties, and when blue-green isn't the right answer.


Blue-Green Deployments for Zero-Downtime Releases

Blue-green is one of the oldest deployment patterns and still one of the cleanest for specific use cases. We use it for stateful services where canary doesn't fit cleanly, and for cutovers between major versions. This post is the practical version: when we choose blue-green over alternatives, how the cutover actually works, and where the data layer makes things complicated.

The basics #

Blue-green deployment: run two identical environments. "Blue" is current production. "Green" is the new version. When green is verified healthy, you switch traffic atomically from blue to green. Blue stays running for fast rollback if green has problems.

The atomic switch is the key feature. There's no "some traffic on old, some on new" intermediate state. Every new request after the switch hits the new version; only requests and connections already in flight finish on the old one.

Compare to canary: canary deliberately runs both versions concurrently. Blue-green deliberately doesn't.

When we use blue-green #

Specific cases:

Stateful services where two versions can't coexist. A service with strong session affinity, or that holds in-memory state, or where having two versions during deploy would break the data contract. We can't do canary safely; blue-green's atomic cutover is the right pattern.

Database schema migrations bundled with code. Old code can't talk to new schema; new code can't talk to old schema. Blue-green where green has the new schema, blue has the old, and we cut over atomically.

Major version migrations. v1 → v2 with significant API changes. Blue runs v1; green runs v2; users (or their agents) cut over at known times.

Risky changes where partial rollout would be confusing. A change that touches authentication, payments, or other "either it works or it doesn't" features. Atomic switch reduces the time window of "some users see this; some don't."

When we DON'T use blue-green #

Most regular deploys go through canary instead. We specifically avoid blue-green for:

Stateless services with normal deploys. Canary is cheaper (you don't run two full environments) and gives you per-request quality data.

Database migrations decoupled from code. Schema changes go in their own deploy ahead of code; code that uses the new schema rolls out separately. Each step works with both old and new data shapes.

Anything where running 2x infrastructure is unaffordable. Blue-green doubles your runtime cost during the deploy window.

How a blue-green cutover actually works #

For a typical service:

Provision green. Same infrastructure as blue (same Kubernetes deployment, same instance count, same database access).
Deploy the new version to green. Standard deployment process. Green is running but receives no production traffic.
Smoke test green. Synthetic requests, internal validation. Make sure green is healthy.
Optionally, mirror a small fraction of production traffic to green as a sanity check. This is shadowing, not canary: green's responses are discarded and never returned to users; we just verify that nothing terrible happens.
Cut over. Update the routing layer to send 100% of traffic to green.
Monitor. Watch error rates and latency for 10-30 minutes.
Tear down blue after a soak period (usually 1-24 hours, depending on confidence).

The cutover step itself is the only "deploy" event from the user's perspective. Before, blue. After, green. No partial state.
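The gate-then-switch logic in the steps above can be sketched in a few lines. This is a minimal illustration, not our actual tooling: `Router`, `cut_over`, and `roll_back` are hypothetical names, and the in-memory weight table stands in for whatever routing layer sits in front of the service.

```python
from dataclasses import dataclass, field


@dataclass
class Router:
    """Hypothetical in-memory stand-in for the routing layer."""

    weights: dict = field(default_factory=lambda: {"blue": 100, "green": 0})

    def set_live(self, env: str) -> None:
        # The whole weight table is replaced in one assignment, so there is
        # no intermediate state where both environments receive traffic.
        self.weights = {name: (100 if name == env else 0) for name in self.weights}


def cut_over(router: Router, smoke_test) -> bool:
    """Switch to green only if the smoke test passes; otherwise leave blue live."""
    if not smoke_test():
        return False
    router.set_live("green")
    return True


def roll_back(router: Router) -> None:
    """Rollback is the same routing change in reverse (blue must still exist)."""
    router.set_live("blue")
```

The point of the sketch is the ordering: the smoke test gates the switch, and the switch itself is a single write rather than a gradual ramp.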

Implementation: where the cutover happens #

The "switch traffic from blue to green" mechanics depend on what's in front:

Load balancer with weighted routing. Update routing to 0% blue, 100% green. Most cloud load balancers support this; the change propagates in seconds.

DNS-based switch. Update a Route 53 record. Slower (DNS TTL determines actual cutover time, can be 60+ seconds), and clients with stale DNS see the old version.

Service mesh. Adjust the route weights (in Istio, the VirtualService weights). Fast and granular.

Kubernetes Service. Run two Deployments and switch the Service's selector from one to the other. Atomic for new connections; existing connections stay on the previous Deployment.

We mostly use the load-balancer and service-mesh approaches. DNS-based is too slow for a clean cutover.
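For the Kubernetes Service variant, the switch amounts to changing one label value in the Service's selector. The sketch below only builds the patch body; the label keys `app` and `slot` and the app name are assumptions for illustration, and how the patch gets applied depends on the client you use.

```python
SLOTS = ("blue", "green")


def selector_patch(app: str, slot: str) -> dict:
    """Build a patch body that points a Service at one slot's pods.

    Assumes both Deployments label their pods with `app` and `slot`
    (hypothetical label keys, not something Kubernetes mandates).
    """
    if slot not in SLOTS:
        raise ValueError(f"unknown slot: {slot!r}")
    return {"spec": {"selector": {"app": app, "slot": slot}}}
```

Rejecting unknown slot names matters more than it looks: a selector that matches no pods would silently send traffic nowhere.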

The data layer: where it gets hard #

Blue-green works cleanly when blue and green share the same data layer (database, cache, queue). Both versions read and write the same data.

When code AND schema change, things get harder:

Forward-compatible schema migrations. New schema accepts queries from both old code and new code. Old code's queries continue working; new code's queries work too. Then deploy the new code as blue-green; cut over; old code goes away. This is the cleanest pattern but takes more thought.

Two databases, one cutover. Blue uses the old DB; green uses the new DB. Data is replicated from old to new during the deploy window. At cutover, replication stops; green starts using the new DB. Risk: data written to blue during the cutover window is lost or duplicated.

Read-only mode during cutover. For deployments where data consistency matters and replication is hard, we put the service in read-only mode briefly (~minutes), do the cutover, switch the database, come back to read-write. Brief unavailability is the trade for consistency.

We use the first pattern (forward-compatible schema) for most cases. It requires more deploy steps but doesn't require downtime or risk data loss.
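A small, self-contained illustration of the forward-compatible pattern, using an in-memory SQLite database. The `users` table, the `display_name` column, and the function names are invented for the example; the idea is only that the schema change is additive, so old and new code can run against the same tables.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")

# Schema deploy, shipped ahead of any code change: an additive, nullable column.
conn.execute("ALTER TABLE users ADD COLUMN display_name TEXT")


def old_code_insert(conn, name):
    # Old code names its columns explicitly and does not know display_name.
    conn.execute("INSERT INTO users (name) VALUES (?)", (name,))


def old_code_read(conn, user_id):
    row = conn.execute("SELECT name FROM users WHERE id = ?", (user_id,)).fetchone()
    return row[0]


def new_code_insert(conn, name, display_name):
    conn.execute(
        "INSERT INTO users (name, display_name) VALUES (?, ?)",
        (name, display_name),
    )


def new_code_read(conn, user_id):
    # New code tolerates rows written by old code (display_name is NULL).
    name, display_name = conn.execute(
        "SELECT name, display_name FROM users WHERE id = ?", (user_id,)
    ).fetchone()
    return display_name if display_name is not None else name
```

Because old code never relies on `SELECT *` or positional inserts, the extra column is invisible to it, which is also what keeps rollback to blue safe.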

Cost: blue-green is expensive during deploy #

Running two environments doubles your infrastructure cost during the deploy window. For a service running on $5,000/month of EC2, blue-green doubles that for the deploy duration.
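To put a rough number on that, here is a back-of-the-envelope calculation. It assumes green is sized identically to blue, that cost scales linearly with time, and that a month is about 730 hours; none of this accounts for reserved or spot pricing.

```python
HOURS_PER_MONTH = 730  # assumption: average month length in hours


def extra_cost(monthly_cost: float, overlap_hours: float) -> float:
    """Extra spend from running a second, identically sized environment."""
    hourly = monthly_cost / HOURS_PER_MONTH
    return round(hourly * overlap_hours, 2)
```

For the $5,000/month example, `extra_cost(5000, 1)` comes to about $7 and `extra_cost(5000, 24)` to about $164, so the length of the overlap is what drives the bill.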

Mitigations:

Short cutover windows. Don't keep blue and green running for days. Cut over quickly, tear down blue.

Deploy during off-peak. When traffic is lower, capacity needs are lower; running two environments costs less.

Use spot for the new environment. Until cutover, the new environment isn't customer-serving. Spot interruption during pre-cutover is annoying but not catastrophic.

Don't over-provision green. Green should be sized for production traffic, not for headroom. Resist the urge to "make green bigger just in case."

For our shape of services, blue-green deploys cost $50-300 extra per deploy in infrastructure. Real money, but the use cases where blue-green is the right tool (high-stakes deploys) justify the cost.

What we've learned the hard way #

Specific incidents:

Cutover that didn't fully cut over. A misconfigured load balancer rule meant some traffic stayed on blue after the "cutover." We didn't notice for hours because both versions were healthy. Now we verify the cutover by checking blue's traffic count goes to zero post-cutover.

Database connection pool not draining on blue. After cutover, blue still had open connections to the database. When we tore down blue, those connections were terminated abruptly. The database alerted on the connection drop. Now we drain blue's connections gracefully before teardown.

Forgot to clean up blue's now-unused infrastructure. A cutover swapped green to be the new "blue"; the old "blue" was supposed to be torn down. We forgot. We discovered it three weeks later, after $400 of unused EC2 had burned. Now there's a TTL on every blue-green deploy: blue is torn down automatically after 24 hours.

Stale DNS caching by clients. A DNS-based cutover left some clients on the old version for hours because they cached DNS aggressively. Now we never use DNS-based cutovers for blue-green; the switch happens at the load balancer or service mesh.
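The check from the first incident (blue's traffic count must go to zero) can be sketched as a small predicate. This assumes you can sample blue's request count at a regular interval with health-check traffic excluded; the function name and the three-sample threshold are arbitrary choices for the example.

```python
def cutover_complete(blue_request_counts, quiet_samples=3):
    """True if the most recent `quiet_samples` samples for blue are all zero.

    `blue_request_counts` is a list of per-interval request counts for blue,
    oldest first, with health-check traffic excluded (an assumption).
    """
    if len(blue_request_counts) < quiet_samples:
        # Not enough data yet to say the cutover is complete.
        return False
    return all(count == 0 for count in blue_request_counts[-quiet_samples:])
```

Requiring several consecutive quiet samples avoids declaring success on a single lucky interval while a misrouted rule still trickles traffic to blue.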

Combining with feature flags #

Sometimes blue-green isn't granular enough. Feature flags add a finer-grained layer:

Deploy new code via blue-green (atomic infrastructure cutover)
New behavior is gated behind feature flags
After cutover is verified, gradually enable the new behavior via flags

This gives you the best of both worlds: clean infrastructure cutover, gradual user-visible change. Used together, they reduce risk significantly.
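As an illustration of the "gradually enable" step, here is one common way to implement a percentage rollout: hash the flag name and user ID into a stable bucket. This is a generic sketch, not a description of any particular flag service; `flag_enabled` and its parameters are hypothetical.

```python
import hashlib


def flag_enabled(flag: str, user_id: str, rollout_percent: int) -> bool:
    """Deterministically place a user in a bucket 0-99 and compare to the rollout."""
    digest = hashlib.sha256(f"{flag}:{user_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100
    return bucket < rollout_percent
```

Because the bucket depends only on the flag and the user, a user who sees the new behavior at 10% keeps seeing it at 50%, and dropping the percentage back to 0 turns the behavior off without touching the deployment.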

Rollback #

The defining feature of blue-green: rollback is fast.

If green has problems after cutover, switch traffic back to blue. As long as blue hasn't been torn down, this is one routing change away. Time to roll back: under 1 minute typically.

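Compare to canary: rolling back a canary is also fast (just abort the rollout), and only the canary's share of traffic ever hit the bad version. With blue-green, all traffic hit green between cutover and rollback, so catching problems quickly matters more. Once the switch is reversed, though, nobody is left on green, and blue was never modified.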

Rollback assumes the data layer didn't already change. If the new code wrote data in a new format, going back to old code might break (old code can't read new format). This is why forward-compatible schema design matters — both old and new code can read both old and new data.

What I'd tell a team starting #

Blue-green is for atomic cutovers. When "some users on old, some on new" is bad, blue-green is the answer.

Canary first, blue-green for specific cases. Most deploys benefit from per-request quality data; canary gives you that. Blue-green is for the cases where canary doesn't fit.

Forward-compatible schema migrations. Decouple schema changes from code changes, so that old and new code both work with old and new schemas. Deploys (blue-green or otherwise) become simpler.

Verify cutover completion. Don't assume the routing change took effect; check that traffic actually moved.

TTL on blue. Auto-tear-down after a soak period. Forgotten environments waste money.

Feature flags on top. Atomic infrastructure deploy + gradual feature exposure. Best of both.

Blue-green isn't the deploy pattern for everyone or for every change. For the right use cases (stateful services, atomic cutovers, high-stakes changes), it's the cleanest pattern available. For everything else, canary or rolling deploys are usually better fits. The skill is knowing which deploy pattern matches the change you're making.


About Kiril Urbonas