pg_upgrade is fast but takes downtime; logical replication lets you cut over while the old DB still serves traffic. The runbook, the gotchas, and the post-cutover checklist.

Postgres Logical Replication for Zero-Downtime Major Upgrades

pg_upgrade is the official answer for major version upgrades. It works, it's fast, and it takes downtime — your DB is unavailable from "stop the old server" to "start the new one." For a single instance that's minutes. For a busy production DB with replicas and connection pools to drain, it's an hour-plus maintenance window.

Logical replication offers a different path: stand up the new version alongside the old, replicate writes in real time, cut over with seconds of downtime. The mechanism is good enough that we've used it for our last three major upgrades. This is the runbook and what we learned.

The mental model #

Logical replication uses Postgres's WAL but decodes it into row-level changes (INSERT/UPDATE/DELETE) rather than physical block changes. Critically, the decoded changes can be applied to a different major version. The new DB doesn't need to be a binary clone of the old.

The flow:

Stand up an empty Postgres instance at the new version (e.g., 17 if you're on 15).
Copy the schema across (no data yet).
Create a publication on the old DB (the source) and a subscription on the new DB (the target).
Initial sync copies all existing rows; replication then keeps the new DB current with the old.
When the new DB is caught up and verified, point the application at it. Cutover = a few seconds.

Total downtime in our last upgrade: 4 seconds (the time to swap a DNS record and restart connection pools).

Pre-flight: what logical replication doesn't replicate #

The first surprise: not everything is in the replication stream. You have to handle these manually:

Sequences. Sequence values (the nextval state for SERIAL columns) are not replicated. If you cut over and the new DB's sequence is at 1 while the old was at 1,000,000, the next INSERT collides. Solution: bump sequences manually after sync, before cutover.

Schema changes. DDL (CREATE TABLE, ALTER TABLE, etc.) is not replicated. If you ALTER TABLE on the source during replication, you have to apply the same change on the target manually. We freeze schema changes during the cutover window.

Large objects (lo_* API). Not replicated. If you use them (rare), copy them across separately, or move that data into bytea columns or external file storage before the migration.

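Tables without a primary key. Logical replication needs a replica identity to find the row an UPDATE or DELETE applies to. INSERTs replicate fine without one, but once the table is published, UPDATE and DELETE on the source fail with an error. Either add a PK before starting (preferred), point REPLICA IDENTITY at a unique index on NOT NULL columns, or set REPLICA IDENTITY to FULL (writes the entire old row to the WAL, which is heavy).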

For most modern Postgres schemas, only sequences are an issue.
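
To check the primary-key caveat above before starting, a catalog query along these lines lists tables that have no usable replica identity. This is a sketch, not part of the original runbook: the schema filter is an assumption, and audit_log is a hypothetical table name.

```sql
-- Run on the source. Lists permanent tables that cannot replicate
-- UPDATE/DELETE: replica identity NOTHING ('n'), or DEFAULT ('d')
-- with no primary key to fall back on.
SELECT c.oid::regclass AS table_name,
       c.relreplident  AS replica_identity
FROM pg_class c
JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE c.relkind IN ('r', 'p')          -- ordinary and partitioned tables
  AND c.relpersistence = 'p'           -- skip temp and unlogged tables
  AND n.nspname NOT IN ('pg_catalog', 'information_schema')
  AND (
        c.relreplident = 'n'
        OR (c.relreplident = 'd'
            AND NOT EXISTS (
                  SELECT 1
                  FROM pg_index i
                  WHERE i.indrelid = c.oid
                    AND i.indisprimary))
      );

-- Fallback for a table that genuinely cannot get a key
-- (audit_log is a hypothetical name):
ALTER TABLE audit_log REPLICA IDENTITY FULL;
```

An empty result means every table can replicate UPDATE and DELETE as-is.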

Step 1: prep the source #

On the old DB:

sql.sql

-- Enable logical replication
ALTER SYSTEM SET wal_level = logical;
-- Restart required for wal_level
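
After the restart, it is worth confirming the setting actually took effect before going further. The query below is a minimal check. The two extra settings are not mentioned in the runbook; they are included on the assumption that the source needs at least one free replication slot and WAL sender for the subscription.

```sql
-- Run on the source after the restart.
-- wal_level should read 'logical'; the other two must leave room
-- for at least one more slot and sender.
SELECT name, setting, pending_restart
FROM pg_settings
WHERE name IN ('wal_level', 'max_replication_slots', 'max_wal_senders');
```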

sql.sql

-- Create a publication for all tables
CREATE PUBLICATION upgrade_pub FOR ALL TABLES;

You can publish specific tables if you're doing partial migration; for a full upgrade, FOR ALL TABLES is cleanest.

Confirm the publication:

sql.sql

SELECT * FROM pg_publication;
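
The connection string in Step 3 logs in as a user called replicator, so that role has to exist on the source first. A minimal sketch, assuming all application tables live in the public schema and using a placeholder password:

```sql
-- Run on the source. REPLICATION lets the role open a replication
-- connection; SELECT is needed for the initial table copy.
CREATE ROLE replicator WITH LOGIN REPLICATION PASSWORD 'change-me';  -- placeholder password
GRANT USAGE ON SCHEMA public TO replicator;                          -- assumes schema "public"
GRANT SELECT ON ALL TABLES IN SCHEMA public TO replicator;

-- Confirm which tables the publication actually covers.
SELECT schemaname, tablename
FROM pg_publication_tables
WHERE pubname = 'upgrade_pub';
```

The source's pg_hba.conf also has to allow this role to connect from the target host.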

Step 2: stand up the target #

Install Postgres 17 (or your target version) on a new host. Create the database with the same name, owner, and encoding as the source.

Dump the schema (no data):

bash.bash

pg_dump -h source-db --schema-only --no-owner --no-privileges mydb > schema.sql

Apply to target:

bash.bash

psql -h target-db mydb < schema.sql

Apply roles/permissions separately if needed.

Step 3: create the subscription #

On the target:

sql.sql

CREATE SUBSCRIPTION upgrade_sub
  CONNECTION 'host=source-db user=replicator dbname=mydb password=...'
  PUBLICATION upgrade_pub;

This starts the initial sync immediately. For a 500GB DB, initial sync took ~6 hours in our case. During this time the source is under additional load (WAL retention + read load from the sync workers). Plan for it.

Monitor sync progress:

sql.sql

SELECT * FROM pg_stat_subscription;
SELECT * FROM pg_subscription_rel WHERE srsubstate != 'r';

srsubstate = 'r' means "ready" (caught up). All tables at r = initial sync complete.
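
On a schema with many tables, a per-state summary is easier to read than the raw rows. A sketch using the same catalog, with state labels taken from the pg_subscription_rel documentation:

```sql
-- Run on the target. Counts tables by sync state.
SELECT CASE srsubstate
         WHEN 'i' THEN 'initialize'
         WHEN 'd' THEN 'copying data'
         WHEN 'f' THEN 'copy finished'
         WHEN 's' THEN 'synchronized'
         WHEN 'r' THEN 'ready'
         ELSE srsubstate::text
       END      AS state,
       count(*) AS tables
FROM pg_subscription_rel
GROUP BY srsubstate
ORDER BY tables DESC;

-- Names of the tables that are not ready yet.
SELECT srrelid::regclass AS table_name, srsubstate
FROM pg_subscription_rel
WHERE srsubstate <> 'r';
```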

Step 4: monitor lag #

After initial sync, replication continues. Check lag from the source, where pg_stat_replication shows one row per replication connection:

sql.sql

SELECT 
  application_name,
  pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), sent_lsn)) AS sent_lag,
  pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn)) AS replay_lag
FROM pg_stat_replication;

replay_lag should be near zero in steady state. If it's growing, the target can't keep up — usually because the target hardware is undersized or because a slow query is blocking the apply worker. Investigate.
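
The query above measures lag in bytes. pg_stat_replication also exposes time-based lag as built-in interval columns, which may be easier to alert on. A minimal sketch:

```sql
-- Run on the source. write_lag, flush_lag and replay_lag are built-in
-- interval columns; they can show NULL when the connection has been idle.
SELECT application_name,
       state,
       write_lag,
       flush_lag,
       replay_lag
FROM pg_stat_replication;
```

Note that the built-in replay_lag column is an interval, while the replay_lag alias in the earlier query is a byte count; they answer different questions.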

Step 5: pre-cutover verification #

Before cutting over, verify:

Row counts match on a sample of tables. SELECT count(*) FROM big_table on both. Don't do this on every table or you'll be there forever; pick the 10 tables you care most about.
Indexes exist on the target. The schema dump should have them but double-check.

Sequences are bumped. For each sequence:

sql.sql

-- On source:
SELECT last_value FROM my_sequence;
-- On target:
SELECT setval('my_sequence', <last_value + safety_margin>);

We add a buffer of 1,000 to cover values consumed by in-flight transactions between this check and the cutover.

Replication lag is < 1 second. If lag is high at cutover time, the application will see missing recent writes.
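
Bumping sequences one at a time gets tedious on a large schema. One way to do it in bulk is to have the source generate the setval statements. This is a sketch under two assumptions: all sequences ascend, and the 1,000 margin from above fits every one of them.

```sql
-- Run on the source. Emits one setval statement per sequence that has
-- been used, with a 1000 safety margin. Run the output on the target.
SELECT format(
         'SELECT setval(%L, %s, true);',
         format('%I.%I', schemaname, sequencename),
         last_value + 1000
       ) AS setval_statement
FROM pg_sequences
WHERE last_value IS NOT NULL      -- NULL means the sequence was never used
ORDER BY schemaname, sequencename;
```

Running this through psql with -At gives bare statements that can be piped straight into the target.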

Step 6: the cutover #

The downtime window. Roughly 5-30 seconds depending on your fleet.

Drain the application connection pool (or scale to zero if you can).
Verify replication lag is 0 — wait briefly if needed.
DISABLE the subscription on the target: ALTER SUBSCRIPTION upgrade_sub DISABLE; This prevents further replication.
Swap the application's DB connection to point at the target (DNS, config map, secrets, whatever).
Restart connection pools / app pods.
Watch for errors.

If the app uses PgBouncer, change the entry in the [databases] section of PgBouncer's config to point at the target and run RELOAD on the admin console. App connections through PgBouncer don't need to restart; the pooler handles the swap.
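
The "lag is 0, then disable" steps above can be sketched in SQL as follows. It assumes the replication slot has the same name as the subscription, which is the default when CREATE SUBSCRIPTION is run without a slot_name option.

```sql
-- On the source, once the application is drained:
-- how far the subscriber's confirmed position trails the current WAL position.
SELECT slot_name,
       active,
       pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn) AS bytes_behind
FROM pg_replication_slots
WHERE slot_name = 'upgrade_sub';   -- assumes the default slot name

-- On the target, once bytes_behind has settled at or near zero:
ALTER SUBSCRIPTION upgrade_sub DISABLE;
```

Background activity on the source can keep bytes_behind from reading exactly zero, so a small, stable value after writes have stopped is the thing to look for.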

Step 7: post-cutover #

The old DB is now stale; the new DB is authoritative.

Keep the old DB running, read-only, for 24-48 hours as a safety net for fast rollback.
Run vacuum + analyze on the new DB to refresh statistics.
Reset pg_stat_statements if you use it (you start a fresh baseline).
Re-enable monitoring on the new DB (some metrics may need new collector configs).
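
The first three items on this list map to a handful of statements. A sketch, assuming the database is called mydb as in Step 2. The read-only setting is one possible way to guard the old DB, not necessarily how it was done in the upgrades described here.

```sql
-- On the old DB: make new sessions read-only by default.
-- This is a soft guard; a session can still override it explicitly.
ALTER DATABASE mydb SET default_transaction_read_only = on;

-- On the new DB: rebuild planner statistics and the visibility map.
VACUUM (ANALYZE);

-- On the new DB, if the pg_stat_statements extension is installed:
SELECT pg_stat_statements_reset();
```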

If something goes wrong after cutover, you can swap back to the old DB, but any writes that happened on the new DB will be lost, and that cost grows with every minute of traffic. The window for safe rollback is tight; verify aggressively in the first 30 minutes.

What we got wrong #

Underestimating initial sync time. First time we did this, the sync took 14 hours and we'd planned for 4. Sync is bounded by network + target disk write throughput; benchmark on a non-prod copy first.

Forgetting sequences. First cutover: app immediately started failing on INSERT because the sequence was at 1. We rolled back, set the sequences, tried again. Now it's step 5b in our runbook.

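Replicating into a DB with existing data. Don't. Target should be empty. If for some reason it's not, the initial copy and the replicated INSERTs collide with the existing rows on unique constraints, and the worker errors out and keeps retrying until the conflict is removed.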

Big tables without WHERE clauses. Logical replication does the initial sync as one full-table COPY per table, effectively a SELECT * with no WHERE clause. A 500GB table is a long-running query on the source. Consider replicating in batches by partition if your tables are partitioned.
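
One way to stagger the initial sync is to publish tables in batches instead of using FOR ALL TABLES, adding the next batch once the previous one is ready. This sketch is an alternative to the Step 1 publication, not something the runbook above prescribes, and every table name in it is hypothetical.

```sql
-- On the source: publish a first batch only.
CREATE PUBLICATION upgrade_pub FOR TABLE customers, products;

-- Later, on the source: add the next batch (individual partitions work too).
ALTER PUBLICATION upgrade_pub ADD TABLE orders, order_items;

-- On the target: pick up the newly published tables and start their initial copy.
ALTER SUBSCRIPTION upgrade_sub REFRESH PUBLICATION;
```

The trade-off is bookkeeping: with explicit table lists, a table that never gets added is silently left out of replication.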

What we monitor during the migration #

Replication lag (bytes and time).
WAL retention on the source (logical slots prevent WAL recycling; disk can fill).
Subscription apply worker health.
Row counts on critical tables (drift check).
App error rate during cutover.
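
For the WAL retention item, the source can report how much WAL each logical slot is holding back. A minimal sketch:

```sql
-- Run on the source. retained_wal is the WAL the server must keep on disk
-- because the slot has not moved past it yet.
SELECT slot_name,
       active,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
FROM pg_replication_slots
WHERE slot_type = 'logical';
```

A slot that shows active = false while retained_wal keeps growing is the pattern that eventually fills the disk.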

What to read next #

Postgres connection pooling — PgBouncer in front of RDS — what handles the cutover for connection-heavy apps
Postgres replication lag — monitoring and failover practice — adjacent operational discipline
pg_stat_statements — Postgres query analysis without guessing — the baseline you re-establish after upgrade
Database backup restoration — testing restores — the safety net you should already have

Logical replication isn't free — it adds operational complexity for the duration of the migration. But for a busy production DB where a 90-minute maintenance window is unacceptable, it's the path. The runbook above is what we've crystallized after three upgrades; expect to refine it for your shape.

About Kiril Urbonas
