Blue/green is the canonical zero-downtime deploy pattern for stateless services. Doing it for a 3.2 TB primary Postgres cluster with peak ~8k connections is a different sport. Here's how we cut over from PG14 → PG16 with 27 seconds of write blackout and zero data loss.
## Why not pg_upgrade?

We considered three options:
| Option | Downtime | Risk | Rollback |
|---|---|---|---|
| `pg_upgrade --link` in place | 5–15 min | Single shot, no rollback | Restore from backup (hours) |
| Logical replication blue/green | < 1 min | Months of edge cases possible | Cut traffic back instantly |
| Read replica promotion | 30 s–2 min | Lose recent writes (RPO > 0) | Hard |
Logical replication won despite being the most work. The instant rollback property was non-negotiable for us.
```
                 ┌──────────────┐
write traffic ──▶│   PgBouncer  │──▶ BLUE (PG14 primary, current production)
                 │   (HAProxy)  │        │
                 └──────────────┘        │ logical replication
                        ▲                ▼
                        │           GREEN (PG16, new primary, lagging)
                        │
                  cutover flips
                  upstream pool
```
```sql
-- On BLUE (publisher)
CREATE PUBLICATION app_pub FOR ALL TABLES;

-- On GREEN (subscriber, after schema dump/restore)
CREATE SUBSCRIPTION app_sub
  CONNECTION 'host=blue port=5432 dbname=app user=replicator'
  PUBLICATION app_pub
  WITH (copy_data = true, slot_name = 'app_sub_slot');
```
Initial copy of 3.2 TB took 18 hours. Over the following 4 hours, replication lag shrank to within a few seconds.
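To watch the catch-up, a query along these lines on blue (a sketch; `app_sub_slot` is the slot name from the subscription above) reports how much WAL the subscriber still has to consume:

```sql
-- On BLUE: bytes of WAL not yet confirmed by the subscriber's slot
SELECT slot_name,
       pg_size_pretty(
         pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn)
       ) AS replay_lag
FROM pg_replication_slots
WHERE slot_name = 'app_sub_slot';
```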
Gotcha #1: logical replication does not replicate sequences. We snapshotted sequence values and pre-bumped them on green by 100k:
```sql
-- On BLUE: generate the setval statements to run later on GREEN.
-- COALESCE handles sequences that were never used (last_value IS NULL),
-- and schema-qualifying the name avoids search_path surprises.
SELECT format('SELECT setval(%L, %s);',
              schemaname || '.' || sequencename,
              coalesce(last_value, 0) + 100000)
FROM pg_sequences;
```
Then re-ran a final setval script during cutover.
Gotcha #2: tables without a primary key or replica identity won't replicate updates. We had three. Adding REPLICA IDENTITY FULL works but bloats WAL — better to add a primary key first.
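A catalog query along these lines (our sketch, no extensions needed) finds such tables up front, before the subscriber starts throwing errors on the first `UPDATE`:

```sql
-- Tables whose UPDATE/DELETE logical replication cannot encode:
-- no primary key and replica identity still default or none.
SELECT c.oid::regclass AS tbl
FROM pg_class c
JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE c.relkind = 'r'
  AND c.relreplident IN ('d', 'n')   -- default (= PK) or nothing
  AND n.nspname NOT IN ('pg_catalog', 'information_schema')
  AND NOT EXISTS (
    SELECT 1 FROM pg_index i
    WHERE i.indrelid = c.oid AND i.indisprimary
  );
```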
We dual-routed read queries to green and compared results. The diff tool we wrote was simple but invaluable:
```python
def shadow_query(query, params):
    blue_result = blue_pool.execute(query, params)
    try:
        green_result = green_pool.execute(query, params)
        if normalize(blue_result) != normalize(green_result):
            log_diff(query, blue_result, green_result)
    except Exception as e:
        log_shadow_failure(query, e)
    return blue_result  # always serve from blue
```
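`normalize` is doing real work in that snippet. A hypothetical version (the implementation below is our sketch, not the original tool) absorbs harmless differences such as row order and floating-point noise so they don't show up as diffs:

```python
def normalize(rows):
    """Make result sets comparable across the two clusters.

    Sketch: sorts rows (queries without ORDER BY may return rows in a
    different physical order on each cluster) and rounds floats
    (minor precision drift between versions/platforms).
    """
    def norm_value(v):
        if isinstance(v, float):
            return round(v, 9)
        return v

    normed = [tuple(norm_value(v) for v in row) for row in rows]
    return sorted(normed, key=repr)

# Same logical result, different order and float noise:
a = [(1, 0.300000000000001), (2, "x")]
b = [(2, "x"), (1, 0.3)]
```

Anything that survives normalization gets logged as a genuine behavioral difference.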
The shadow flag was off in production by default; we enabled it for 0.1% of traffic, then 1%, then 10% over two weeks.
This caught:
This caught:

- a behavior difference in `json_to_recordset` output between the two versions
- a `SET` default that differed between the two clusters' `postgresql.conf`

We did the full cutover sequence against staging, end to end, twice. Both runs uncovered something: the second found that our PgBouncer reload script had a 4-second pause we hadn't measured.
The cutover window itself totaled 27 seconds of write blackout. This is the actual sequence (timings in seconds from t=0):
```bash
#!/usr/bin/env bash
set -euo pipefail

# t=0: announce
slack_post "🚧 Starting PG cutover. Write blackout begins."

# t=0: stop application writes via PgBouncer PAUSE
psql -h pgbouncer -p 6432 -d pgbouncer -c "PAUSE app;"

# t=2: wait for green to catch up to blue's last LSN
# (-A: unaligned output, so the LSNs compare cleanly as strings)
BLUE_LSN=$(psql -h blue -Atc "SELECT pg_current_wal_lsn();")
while true; do
  GREEN_LSN=$(psql -h green -Atc "SELECT received_lsn FROM pg_stat_subscription;")
  if [[ "$GREEN_LSN" == "$BLUE_LSN" ]]; then break; fi
  sleep 1
done

# t=14: bump sequences on green
psql -h green -f /tmp/sequence-bumps.sql

# t=16: swap PgBouncer upstream
sed -i 's/host=blue/host=green/' /etc/pgbouncer/pgbouncer.ini
psql -h pgbouncer -p 6432 -d pgbouncer -c "RELOAD;"

# t=18: drop subscription on green so it doesn't try to replicate to itself
psql -h green -c "ALTER SUBSCRIPTION app_sub DISABLE;"
psql -h green -c "ALTER SUBSCRIPTION app_sub SET (slot_name = NONE);"
psql -h green -c "DROP SUBSCRIPTION app_sub;"

# t=22: resume traffic
psql -h pgbouncer -p 6432 -d pgbouncer -c "RESUME app;"

# t=27: verify; stop here (rolled back) if the health check fails,
# rather than falling through to the success message
curl -fsS https://app/health/db || { slack_post "🔥 ROLLBACK"; rollback.sh; exit 1; }
slack_post "✅ Cutover complete. New primary: green (PG16)"
```
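String equality on the two LSNs works here because writes are paused, but a numeric comparison is more robust (for instance if green's `received_lsn` ever moves past the snapshot of blue's). A small helper to that effect, our own sketch rather than part of the runbook:

```python
def lsn_to_int(lsn: str) -> int:
    """Convert a pg_lsn string like '16/B374D848' to a byte offset.

    The text form is two hex numbers: the high 32 bits and the low 32 bits.
    """
    hi, lo = lsn.strip().split("/")
    return (int(hi, 16) << 32) | int(lo, 16)

def caught_up(green_lsn: str, blue_lsn: str) -> bool:
    # Green is caught up once it has received at least blue's snapshot LSN.
    return lsn_to_int(green_lsn) >= lsn_to_int(blue_lsn)
```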
At t=12, the LSN check was looping but not converging. Green was 8MB behind blue and it wasn't catching up. We had budgeted 60 seconds in the runbook before triggering rollback.
Cause: a long-running COPY had just started on blue right before the PAUSE. The PAUSE blocked new transactions but let in-flight ones finish. The COPY took 9 more seconds to commit, then green caught up immediately.
Lesson: the PAUSE time depends on your slowest in-flight transaction. We now run a "no slow queries in flight" precondition check before starting the cutover.
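That precondition check can be a simple query against `pg_stat_activity`, run right before the PAUSE; something like this (the 5-second threshold is illustrative):

```sql
-- Abort the cutover if any client transaction has been open for > 5 s:
-- PAUSE will wait for it while the blackout clock is already ticking.
SELECT pid, now() - xact_start AS xact_age, left(query, 60) AS query
FROM pg_stat_activity
WHERE backend_type = 'client backend'
  AND xact_start IS NOT NULL
  AND now() - xact_start > interval '5 seconds';
```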
If the post-cutover health check failed:
```bash
# 1. Pause on green
psql -h pgbouncer -p 6432 -d pgbouncer -c "PAUSE app;"
# 2. Re-create subscription on BLUE pointing at GREEN (reverse direction)
#    This is fast because we kept the WAL on blue with a slot
psql -h blue -c "CREATE SUBSCRIPTION app_rollback ..."
# 3. Wait for blue to catch up (any writes that hit green during the failure window)
# 4. Flip PgBouncer back to blue
# 5. Resume
```
We tested this end-to-end in staging. The reverse cutover took 41 seconds.
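The elided `CREATE SUBSCRIPTION` above would look roughly like the following. The names mirror the forward setup and are our guess, not the exact runbook; the crucial option is `copy_data = false`, since blue already has every row up to the cutover point:

```sql
-- On GREEN (now the writer): publish everything for the reverse path
CREATE PUBLICATION rollback_pub FOR ALL TABLES;

-- On BLUE: subscribe without re-copying 3.2 TB
CREATE SUBSCRIPTION app_rollback
  CONNECTION 'host=green port=5432 dbname=app user=replicator'
  PUBLICATION rollback_pub
  WITH (copy_data = false);
```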
| Metric | Before | After |
|---|---|---|
| Postgres version | 14.10 | 16.2 |
| Primary cutover blackout | n/a | 27s |
| Read traffic blackout | 0s | 0s |
| WAL retained by replication slots | 4 GB | 0 |
| Total project elapsed | n/a | 8 weeks |
| Total engineer-hours | n/a | ~140h |
For databases under ~100 GB with a forgiving maintenance window, plain pg_upgrade --link in 5 minutes is fine. The complexity of logical replication is justified when downtime is more expensive than weeks of engineering.
For us at 3.2 TB and 8k connections, that math was easy. Your numbers may differ — but the questions to ask are the same.