Replication is the foundation of database HA. What we monitor, how we practice failover, and the gotchas that show up only when you actually fail over.
Postgres replication is one of those things that "just works" — until it doesn't. We've had two real failovers in production over the last two years and a handful of practice drills. The story is: replication is fine most of the time, lag occasionally drifts, and failovers are rarely as clean as the docs suggest. This post is what we monitor, what we drill, and what's bitten us.
Postgres streaming replication ships the write-ahead log (WAL) from the primary to one or more replicas. Replicas replay the WAL to stay in sync with the primary's state. There are two replication modes:
Asynchronous (the default): the primary acknowledges a commit without waiting for any replica to confirm it. Low latency, but a failover can lose the most recently committed transactions.
Synchronous: the primary waits for at least one replica to confirm the WAL before acknowledging the commit. No committed-data loss on failover, at the cost of commit latency.
Most teams (including us) run async replication. The trade is real but bounded — you can lose a few seconds of transactions in a worst-case failover. That's usually acceptable; rebuilding 5 seconds of state from upstream sources is cheaper than the latency tax of sync.
Managed services (RDS, Cloud SQL) implement this for you. The principles are the same; just the management is different.
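For self-managed setups, it's worth confirming which mode a cluster is actually in rather than assuming. A minimal sketch, assuming psycopg2 and a monitoring role; the DSN is a placeholder:

```python
import psycopg2

PRIMARY_DSN = "host=db-primary dbname=postgres user=monitor"  # placeholder

with psycopg2.connect(PRIMARY_DSN) as conn:
    with conn.cursor() as cur:
        # Empty synchronous_standby_names means commits don't wait: async.
        cur.execute("SHOW synchronous_standby_names;")
        names = cur.fetchone()[0]
        print("synchronous_standby_names:", names or "(empty -> async)")

        # Per-replica view: sync_state is 'async', 'potential', 'sync', or 'quorum'.
        cur.execute("SELECT application_name, sync_state FROM pg_stat_replication;")
        for app, state in cur.fetchall():
            print(f"{app}: {state}")
```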
Three primary lag metrics. We watch all three.
pg_stat_replication.replay_lag — time delta between when WAL was generated on the primary and replayed on the replica. The one I check first.
replay_lsn distance — the gap in bytes of WAL between the primary's current position and the replica's replay position. Useful when "1 second of lag" is misleading because the workload is bursty. (A query sketch covering both of the first two follows this list.)
Application-observable lag — does the read replica return data we just wrote? We run a synthetic that writes a timestamp to the primary, reads it back from the replica, and measures the gap. End-to-end signal that catches both replication delay AND any caching layers between app and DB.
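A minimal sketch of how the first two can be pulled from the primary, assuming psycopg2 and a placeholder DSN; pg_stat_replication.replay_lag and pg_wal_lsn_diff have been available since Postgres 10:

```python
import psycopg2

PRIMARY_DSN = "host=db-primary dbname=postgres user=monitor"  # placeholder

LAG_SQL = """
SELECT application_name,
       replay_lag,  -- time between WAL generation and replay (interval)
       pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS bytes_behind
FROM pg_stat_replication;
"""

with psycopg2.connect(PRIMARY_DSN) as conn:
    with conn.cursor() as cur:
        cur.execute(LAG_SQL)
        for app, replay_lag, bytes_behind in cur.fetchall():
            print(f"{app}: replay_lag={replay_lag} bytes_behind={bytes_behind}")
```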
Our thresholds:
replay_lag > 5s for > 1 minute → warning
replay_lag > 30s for > 1 minute → page
replay_lsn distance growing without recovering → page

We tuned these from "default thresholds that flagged every routine fluctuation" to "thresholds that fire when something is actually wrong." Took a couple of rounds.
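The "for > 1 minute" hold matters as much as the numbers; a spike that clears in ten seconds shouldn't page anyone. In an alerting system this is a hold duration (Prometheus calls it the `for:` clause); as a sketch of the logic itself, using our thresholds:

```python
from typing import Optional

WARN_S = 5.0      # replay_lag warning threshold (seconds)
PAGE_S = 30.0     # replay_lag page threshold (seconds)
SUSTAIN_S = 60.0  # the breach must hold this long before firing

breach_since: Optional[float] = None  # when lag first crossed WARN_S

def evaluate(replay_lag_s: float, now_s: float) -> Optional[str]:
    """One evaluation tick: returns 'page', 'warning', or None."""
    global breach_since
    if replay_lag_s < WARN_S:
        breach_since = None   # recovered; reset the clock
        return None
    if breach_since is None:
        breach_since = now_s  # first sample over threshold
    if now_s - breach_since < SUSTAIN_S:
        return None           # breached, but not sustained long enough yet
    return "page" if replay_lag_s >= PAGE_S else "warning"
```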
Common causes, in rough order of frequency:
A long-running query on the replica blocking WAL apply. Replicas can serve read queries, but a query holding a lock can stall the WAL replay process. Easy to spot: lag grows monotonically while one query keeps running. Kill the query and the lag clears (a query sketch for finding the culprit follows this list).
Replica disk I/O saturated. WAL replay is I/O-bound. If something else on the replica (a runaway query, a backup) is competing for the disk, replay falls behind. iostat will show it.
Primary write spike. A batch job that writes 10x normal volume — replicas have to keep up. Usually clears as the burst ends.
Network blips between primary and replica. Cross-AZ link issues, mostly. RDS Multi-AZ handles this internally; for cross-region replicas it's a real source of intermittent lag.
hot_standby_feedback side effects. With hot_standby_feedback on, the replica can block the primary's vacuum cleanup, causing bloat on the primary (vacuum_defer_cleanup_age has a similar deferral effect). Worth knowing about but rare for us.
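For the first and most common cause, the offender shows up in pg_stat_activity on the replica. A sketch, with a placeholder DSN:

```python
import psycopg2

REPLICA_DSN = "host=db-replica-1 dbname=postgres user=monitor"  # placeholder

with psycopg2.connect(REPLICA_DSN) as conn:
    with conn.cursor() as cur:
        # Longest-running active queries first: the usual suspects when
        # replay stalls while exactly one query keeps running.
        cur.execute("""
            SELECT pid, now() - query_start AS runtime, left(query, 80) AS query
            FROM pg_stat_activity
            WHERE state = 'active' AND pid <> pg_backend_pid()
            ORDER BY runtime DESC NULLS LAST
            LIMIT 5;
        """)
        for pid, runtime, query in cur.fetchall():
            print(f"pid={pid} runtime={runtime} {query}")
        # Once you've confirmed the culprit: SELECT pg_terminate_backend(pid);
```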
Some patterns that look scary but aren't:
Brief lag spikes during checkpoint. Postgres flushes dirty pages on checkpoint. Brief I/O activity; replicas catch up within seconds.
Constant ~100ms lag. Network round-trip floor. Not a problem.
Lag during backup of primary. Backups create I/O pressure; replicas catch up after.
One replica lagging while others don't. Replica-specific issue. Pull it from read-traffic rotation; investigate; don't panic about replication as a whole.
We drill failover every quarter, following a written protocol: a comms plan, abort conditions agreed up front, and a findings log at the end.
Findings over the past 18 months:
server_check_delay was too high. Cut from 60s to 30s; replicas now get re-evaluated quickly after failover.

Each drill found one or two issues. None were catastrophic; all were fixable. The point of drilling is that you find these in a controlled setting, not at 3am during a real failover.
A few operational realities:
Failover doesn't always make the old primary become a replica. Sometimes it stays in a "DOWN" state. For RDS multi-AZ, AWS handles this; for self-managed setups, you need explicit logic to demote and re-attach.
Connection draining is your responsibility. The database doesn't care about your in-flight requests during failover. Your app's job is to drain gracefully: finish in-flight requests, refuse new ones, exit. Without a graceful drain, in-flight requests fail hard (a sketch of the drain shape follows this list).
Read replicas take minutes to catch up after failover. A freshly promoted primary has to stream WAL to the surviving replicas, and to the old primary once it's demoted and re-attached, all while absorbing the full write load. During that recovery window, replica lag runs higher than normal. Sometimes the right answer is to take a replica out of the read-traffic rotation temporarily.
Synchronous standby promotion is not instant. Even synchronous replicas need a few seconds to be promoted to primary. The "instant failover" pitch is approximate.
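The drain shape itself is simple even though the details are framework-specific. A sketch of the pattern; the request handler body is a stand-in for whatever your framework actually does:

```python
import signal
import threading

draining = threading.Event()

def handle_sigterm(signum, frame):
    # Failover (or any deploy) delivers SIGTERM: flip to drain mode.
    draining.set()

signal.signal(signal.SIGTERM, handle_sigterm)

def handle_request(request: str) -> str:
    # Refuse new work while draining; in-flight requests keep running.
    if draining.is_set():
        return "503 draining"
    return f"200 handled {request}"  # stand-in for the real request path

# Shutdown sequence: once draining is set, wait for in-flight work to
# finish, close DB connections cleanly, then exit.
```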
Of everything above, the synthetic is the most useful piece for confidence: an end-to-end signal that the system is doing what you expect, regardless of which metric a particular failure mode would show up in first.
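A minimal version of the synthetic. The DSNs, table name, and the single-row upsert approach are all assumptions; ours also ships the measured gap to the metrics pipeline:

```python
import time
import psycopg2

PRIMARY_DSN = "host=db-primary dbname=app user=synthetic"    # placeholder
REPLICA_DSN = "host=db-replica-1 dbname=app user=synthetic"  # placeholder

# Assumes: CREATE TABLE replication_heartbeat (id int PRIMARY KEY, ts timestamptz);
with psycopg2.connect(PRIMARY_DSN) as primary, primary.cursor() as cur:
    cur.execute("""
        INSERT INTO replication_heartbeat (id, ts) VALUES (1, now())
        ON CONFLICT (id) DO UPDATE SET ts = EXCLUDED.ts
        RETURNING ts;
    """)
    written_ts = cur.fetchone()[0]

start = time.monotonic()
deadline = start + 30  # give up after 30s; that's page-worthy on its own

with psycopg2.connect(REPLICA_DSN) as replica, replica.cursor() as cur:
    while time.monotonic() < deadline:
        cur.execute("SELECT ts FROM replication_heartbeat WHERE id = 1;")
        row = cur.fetchone()
        if row and row[0] >= written_ts:
            # Approximate: resolution is bounded by the 200ms poll interval.
            print(f"app-observable lag: ~{time.monotonic() - start:.2f}s")
            break
        time.sleep(0.2)
    else:
        print("replica never saw the write within 30s")
```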
A read replica being promoted to primary needs to handle the primary write load. If the replica was sized smaller (cost optimization), it may struggle until you can resize.
We size primary and replica identically. The cost is real (same hardware × 2+), but a failover at 2am that runs out of CPU on the new primary is a much bigger problem. For our highest-traffic database, we also have a cold standby in another region.
Three lag metrics, not just one. Time-based, byte-based, app-observable. Each catches a different failure mode.
Drill failover quarterly. Even if the managed service "handles it for you." Especially then — the app side is your problem.
Size primary and replica identically. Cost optimization is the wrong place to save money.
Synthetic round-trip is your friend. End-to-end signal beats any individual metric.
Document the drill protocol. Including comms, abort conditions, and findings. Drill goes faster on the second run.
Test the unhappy paths. "What if the replica is also down?" "What if the failover IP is stale?" Tabletops are cheaper than real incidents.
Postgres replication is mostly invisible infrastructure that works fine for years between issues. The discipline is in monitoring it accurately, drilling failover regularly, and fixing the gotchas you discover in controlled settings rather than in production at 3am. The cost of that discipline is small; the cost of skipping it is occasionally enormous.