Blue/green is the canonical zero-downtime deploy pattern for stateless services. Doing it for a 3.2 TB primary Postgres cluster with peak ~8k connections is a different sport. Here's how we cut over from PG14 → PG16 with 27 seconds of write blackout and zero data loss.
## Why not pg_upgrade?

We considered three options:
| Option | Downtime | Risk | Rollback |
|---|---|---|---|
| `pg_upgrade --link` in place | 5–15 min | Single shot, no rollback | Restore from backup (hours) |
| Logical replication blue/green | < 1 min | Months of edge cases possible | Cut traffic back instantly |
| Read replica promotion | 30 s–2 min | Lose recent writes (RPO > 0) | Hard |
Logical replication won despite being the most work. The instant rollback property was non-negotiable for us.
```
                 ┌──────────────┐
write traffic ──▶│   PgBouncer  │──▶ BLUE (PG14 primary, current production)
                 │   (HAProxy)  │        │
                 └──────────────┘        │ logical replication
                        ▲                ▼
                        │           GREEN (PG16, new primary, lagging)
                        │
                  cutover flips
                  upstream pool
```
```sql
-- On BLUE (publisher)
CREATE PUBLICATION app_pub FOR ALL TABLES;

-- On GREEN (subscriber, after schema dump/restore)
CREATE SUBSCRIPTION app_sub
  CONNECTION 'host=blue port=5432 dbname=app user=replicator'
  PUBLICATION app_pub
  WITH (copy_data = true, slot_name = 'app_sub_slot');
```
Initial copy of 3.2 TB took 18 hours. Over the following 4 hours, replication lag shrank to within a few seconds.
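To watch the catch-up, a query along these lines on blue (a sketch; `app_sub_slot` is the slot name from the subscription above) reports how much WAL the subscriber still has to consume:

```sql
-- On BLUE: bytes of WAL not yet confirmed by the subscriber's slot
SELECT slot_name,
       pg_size_pretty(
         pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn)
       ) AS replay_lag
FROM pg_replication_slots
WHERE slot_name = 'app_sub_slot';
```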
Gotcha #1: logical replication does not replicate sequences. We snapshotted sequence values and pre-bumped them on green by 100k:
```sql
-- On BLUE: generate the setval statements to run later on GREEN.
-- COALESCE handles sequences that were never used (last_value IS NULL),
-- and schema-qualifying the name avoids search_path surprises.
SELECT format('SELECT setval(%L, %s);',
              schemaname || '.' || sequencename,
              coalesce(last_value, 0) + 100000)
FROM pg_sequences;
```
Then re-ran a final setval script during cutover.
Gotcha #2: tables without a primary key or replica identity won't replicate updates. We had three. Adding REPLICA IDENTITY FULL works but bloats WAL — better to add a primary key first.
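A catalog query along these lines (our sketch, no extensions needed) finds such tables up front, before the subscriber starts throwing errors on the first `UPDATE`:

```sql
-- Tables whose UPDATE/DELETE logical replication cannot encode:
-- no primary key and replica identity still default or none.
SELECT c.oid::regclass AS tbl
FROM pg_class c
JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE c.relkind = 'r'
  AND c.relreplident IN ('d', 'n')   -- default (= PK) or nothing
  AND n.nspname NOT IN ('pg_catalog', 'information_schema')
  AND NOT EXISTS (
    SELECT 1 FROM pg_index i
    WHERE i.indrelid = c.oid AND i.indisprimary
  );
```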
We dual-routed read queries to green and compared results. The diff tool we wrote was simple but invaluable:
```python
def shadow_query(query, params):
    blue_result = blue_pool.execute(query, params)
    try:
        green_result = green_pool.execute(query, params)
        if normalize(blue_result) != normalize(green_result):
            log_diff(query, blue_result, green_result)
    except Exception as e:
        log_shadow_failure(query, e)
    return blue_result  # always serve from blue
```
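`normalize` is doing real work in that snippet. A hypothetical version (the implementation below is our sketch, not the original tool) absorbs harmless differences such as row order and floating-point noise so they don't show up as diffs:

```python
def normalize(rows):
    """Make result sets comparable across the two clusters.

    Sketch: sorts rows (queries without ORDER BY may return rows in a
    different physical order on each cluster) and rounds floats
    (minor precision drift between versions/platforms).
    """
    def norm_value(v):
        if isinstance(v, float):
            return round(v, 9)
        return v

    normed = [tuple(norm_value(v) for v in row) for row in rows]
    return sorted(normed, key=repr)

# Same logical result, different order and float noise:
a = [(1, 0.300000000000001), (2, "x")]
b = [(2, "x"), (1, 0.3)]
```

Anything that survives normalization gets logged as a genuine behavioral difference.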
The shadow flag was off in production by default; we enabled it for 0.1% of traffic, then 1%, then 10% over two weeks.
This caught:
This caught:

- a behavior difference in `json_to_recordset` output between the two versions
- a `SET` default that differed between the two clusters' `postgresql.conf`

We did the full cutover sequence against staging, end to end, twice. Both runs uncovered something: the second found that our PgBouncer reload script had a 4-second pause we hadn't measured.
The cutover window itself totaled 27 seconds of write blackout. This is the actual sequence (timings in seconds from t=0):
```bash
#!/usr/bin/env bash
set -euo pipefail

# t=0: announce
slack_post "🚧 Starting PG cutover. Write blackout begins."

# t=0: stop application writes via PgBouncer PAUSE
psql -h pgbouncer -p 6432 -d pgbouncer -c "PAUSE app;"

# t=2: wait for green to catch up to blue's last LSN
# (-A: unaligned output, so the LSNs compare cleanly as strings)
BLUE_LSN=$(psql -h blue -Atc "SELECT pg_current_wal_lsn();")
while true; do
  GREEN_LSN=$(psql -h green -Atc "SELECT received_lsn FROM pg_stat_subscription;")
  if [[ "$GREEN_LSN" == "$BLUE_LSN" ]]; then break; fi
  sleep 1
done

# t=14: bump sequences on green
psql -h green -f /tmp/sequence-bumps.sql

# t=16: swap PgBouncer upstream
sed -i 's/host=blue/host=green/' /etc/pgbouncer/pgbouncer.ini
psql -h pgbouncer -p 6432 -d pgbouncer -c "RELOAD;"

# t=18: drop subscription on green so it doesn't try to replicate to itself
psql -h green -c "ALTER SUBSCRIPTION app_sub DISABLE;"
psql -h green -c "ALTER SUBSCRIPTION app_sub SET (slot_name = NONE);"
psql -h green -c "DROP SUBSCRIPTION app_sub;"

# t=22: resume traffic
psql -h pgbouncer -p 6432 -d pgbouncer -c "RESUME app;"

# t=27: verify; stop here (rolled back) if the health check fails,
# rather than falling through to the success message
curl -fsS https://app/health/db || { slack_post "🔥 ROLLBACK"; rollback.sh; exit 1; }
slack_post "✅ Cutover complete. New primary: green (PG16)"
```
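String equality on the two LSNs works here because writes are paused, but a numeric comparison is more robust (for instance if green's `received_lsn` ever moves past the snapshot of blue's). A small helper to that effect, our own sketch rather than part of the runbook:

```python
def lsn_to_int(lsn: str) -> int:
    """Convert a pg_lsn string like '16/B374D848' to a byte offset.

    The text form is two hex numbers: the high 32 bits and the low 32 bits.
    """
    hi, lo = lsn.strip().split("/")
    return (int(hi, 16) << 32) | int(lo, 16)

def caught_up(green_lsn: str, blue_lsn: str) -> bool:
    # Green is caught up once it has received at least blue's snapshot LSN.
    return lsn_to_int(green_lsn) >= lsn_to_int(blue_lsn)
```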
At t=12, the LSN check was looping but not converging. Green was 8MB behind blue and it wasn't catching up. We had budgeted 60 seconds in the runbook before triggering rollback.
Cause: a long-running COPY had just started on blue right before the PAUSE. The PAUSE blocked new transactions but let in-flight ones finish. The COPY took 9 more seconds to commit, then green caught up immediately.
Lesson: the PAUSE time depends on your slowest in-flight transaction. We now run a "no slow queries in flight" precondition check before starting the cutover.
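That precondition check can be a simple query against `pg_stat_activity`, run right before the PAUSE; something like this (the 5-second threshold is illustrative):

```sql
-- Abort the cutover if any client transaction has been open for > 5 s:
-- PAUSE will wait for it while the blackout clock is already ticking.
SELECT pid, now() - xact_start AS xact_age, left(query, 60) AS query
FROM pg_stat_activity
WHERE backend_type = 'client backend'
  AND xact_start IS NOT NULL
  AND now() - xact_start > interval '5 seconds';
```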
If the post-cutover health check failed:
```bash
# 1. Pause on green
psql -h pgbouncer -p 6432 -d pgbouncer -c "PAUSE app;"
# 2. Re-create subscription on BLUE pointing at GREEN (reverse direction)
#    This is fast because we kept the WAL on blue with a slot
psql -h blue -c "CREATE SUBSCRIPTION app_rollback ..."
# 3. Wait for blue to catch up (any writes that hit green during the failure window)
# 4. Flip PgBouncer back to blue
# 5. Resume
```
We tested this end-to-end in staging. The reverse cutover took 41 seconds.
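The elided `CREATE SUBSCRIPTION` above would look roughly like the following. The names mirror the forward setup and are our guess, not the exact runbook; the crucial option is `copy_data = false`, since blue already has every row up to the cutover point:

```sql
-- On GREEN (now the writer): publish everything for the reverse path
CREATE PUBLICATION rollback_pub FOR ALL TABLES;

-- On BLUE: subscribe without re-copying 3.2 TB
CREATE SUBSCRIPTION app_rollback
  CONNECTION 'host=green port=5432 dbname=app user=replicator'
  PUBLICATION rollback_pub
  WITH (copy_data = false);
```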
| Metric | Before | After |
|---|---|---|
| Postgres version | 14.10 | 16.2 |
| Primary cutover blackout | n/a | 27s |
| Read traffic blackout | 0s | 0s |
| WAL retained by replication slots | 4 GB | 0 |
| Total project elapsed | n/a | 8 weeks |
| Total engineer-hours | n/a | ~140h |
For databases under ~100 GB with a forgiving maintenance window, plain pg_upgrade --link in 5 minutes is fine. The complexity of logical replication is justified when downtime is more expensive than weeks of engineering.
For us at 3.2 TB and 8k connections, that math was easy. Your numbers may differ — but the questions to ask are the same.