Replication is the foundation of database HA. What we monitor, how we practice failover, and the gotchas that show up only when you actually fail over.
Postgres replication is one of those things that "just works" — until it doesn't. We've had two real failovers in production over the last two years and a handful of practice drills. The story is: replication is fine most of the time, lag occasionally drifts, and failovers are rarely as clean as the docs suggest. This post is what we monitor, what we drill, and what's bitten us.
Postgres streaming replication ships the write-ahead log (WAL) from the primary to one or more replicas. Replicas replay the WAL to stay in sync with the primary's state. There are two replication modes:
Asynchronous (the default): the primary acknowledges a commit without waiting for any replica to confirm it. Low latency, but a failover can lose the most recently committed transactions.
Synchronous: the primary waits for at least one replica to confirm the WAL before acknowledging the commit. No committed-data loss on failover, at the cost of commit latency.
Most teams (including us) run async replication. The trade is real but bounded — you can lose a few seconds of transactions in a worst-case failover. That's usually acceptable; rebuilding 5 seconds of state from upstream sources is cheaper than the latency tax of sync.
Managed services (RDS, Cloud SQL) implement this for you. The principles are the same; just the management is different.
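For self-managed setups, it's worth confirming which mode a cluster is actually in rather than assuming. A minimal sketch, assuming psycopg2 and a monitoring role; the DSN is a placeholder:

```python
import psycopg2

PRIMARY_DSN = "host=db-primary dbname=postgres user=monitor"  # placeholder

with psycopg2.connect(PRIMARY_DSN) as conn:
    with conn.cursor() as cur:
        # Empty synchronous_standby_names means commits don't wait: async.
        cur.execute("SHOW synchronous_standby_names;")
        names = cur.fetchone()[0]
        print("synchronous_standby_names:", names or "(empty -> async)")

        # Per-replica view: sync_state is 'async', 'potential', 'sync', or 'quorum'.
        cur.execute("SELECT application_name, sync_state FROM pg_stat_replication;")
        for app, state in cur.fetchall():
            print(f"{app}: {state}")
```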
Three primary lag metrics. We watch all three.
pg_stat_replication.replay_lag — time delta between when WAL was generated on the primary and replayed on the replica. The one I check first.
replay_lsn distance — the gap in bytes of WAL between the primary's current position and the replica's replay position. Useful when "1 second of lag" is misleading because the workload is bursty. (A query sketch covering both of the first two follows this list.)
Application-observable lag — does the read replica return data we just wrote? We run a synthetic that writes a timestamp to the primary, reads it back from the replica, and measures the gap. End-to-end signal that catches both replication delay AND any caching layers between app and DB.
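A minimal sketch of how the first two can be pulled from the primary, assuming psycopg2 and a placeholder DSN; pg_stat_replication.replay_lag and pg_wal_lsn_diff have been available since Postgres 10:

```python
import psycopg2

PRIMARY_DSN = "host=db-primary dbname=postgres user=monitor"  # placeholder

LAG_SQL = """
SELECT application_name,
       replay_lag,  -- time between WAL generation and replay (interval)
       pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS bytes_behind
FROM pg_stat_replication;
"""

with psycopg2.connect(PRIMARY_DSN) as conn:
    with conn.cursor() as cur:
        cur.execute(LAG_SQL)
        for app, replay_lag, bytes_behind in cur.fetchall():
            print(f"{app}: replay_lag={replay_lag} bytes_behind={bytes_behind}")
```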
Our thresholds:
replay_lag > 5s for > 1 minute → warning
replay_lag > 30s for > 1 minute → page
replay_lsn distance growing without recovering → page

We tuned these from "default thresholds that flagged every routine fluctuation" to "thresholds that fire when something is actually wrong." Took a couple of rounds.
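The "for > 1 minute" hold matters as much as the numbers; a spike that clears in ten seconds shouldn't page anyone. In an alerting system this is a hold duration (Prometheus calls it the `for:` clause); as a sketch of the logic itself, using our thresholds:

```python
from typing import Optional

WARN_S = 5.0      # replay_lag warning threshold (seconds)
PAGE_S = 30.0     # replay_lag page threshold (seconds)
SUSTAIN_S = 60.0  # the breach must hold this long before firing

breach_since: Optional[float] = None  # when lag first crossed WARN_S

def evaluate(replay_lag_s: float, now_s: float) -> Optional[str]:
    """One evaluation tick: returns 'page', 'warning', or None."""
    global breach_since
    if replay_lag_s < WARN_S:
        breach_since = None   # recovered; reset the clock
        return None
    if breach_since is None:
        breach_since = now_s  # first sample over threshold
    if now_s - breach_since < SUSTAIN_S:
        return None           # breached, but not sustained long enough yet
    return "page" if replay_lag_s >= PAGE_S else "warning"
```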
Common causes, in rough order of frequency:
A long-running query on the replica blocking WAL apply. Replicas can serve read queries, but a query holding a lock can stall the WAL replay process. Easy to spot: lag grows monotonically while one query keeps running. Kill the query and the lag clears (a query sketch for finding the culprit follows this list).
Replica disk I/O saturated. WAL replay is I/O-bound. If something else on the replica (a runaway query, a backup) is competing for the disk, replay falls behind. iostat will show it.
Primary write spike. A batch job that writes 10x normal volume — replicas have to keep up. Usually clears as the burst ends.
Network blips between primary and replica. Cross-AZ link issues, mostly. RDS Multi-AZ handles this internally; for cross-region replicas it's a real source of intermittent lag.
hot_standby_feedback side effects. With hot_standby_feedback on, the replica can block the primary's vacuum cleanup, causing bloat on the primary (vacuum_defer_cleanup_age has a similar deferral effect). Worth knowing about but rare for us.
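For the first and most common cause, the offender shows up in pg_stat_activity on the replica. A sketch, with a placeholder DSN:

```python
import psycopg2

REPLICA_DSN = "host=db-replica-1 dbname=postgres user=monitor"  # placeholder

with psycopg2.connect(REPLICA_DSN) as conn:
    with conn.cursor() as cur:
        # Longest-running active queries first: the usual suspects when
        # replay stalls while exactly one query keeps running.
        cur.execute("""
            SELECT pid, now() - query_start AS runtime, left(query, 80) AS query
            FROM pg_stat_activity
            WHERE state = 'active' AND pid <> pg_backend_pid()
            ORDER BY runtime DESC NULLS LAST
            LIMIT 5;
        """)
        for pid, runtime, query in cur.fetchall():
            print(f"pid={pid} runtime={runtime} {query}")
        # Once you've confirmed the culprit: SELECT pg_terminate_backend(pid);
```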
Some patterns that look scary but aren't:
Brief lag spikes during checkpoint. Postgres flushes dirty pages on checkpoint. Brief I/O activity; replicas catch up within seconds.
Constant ~100ms lag. Network round-trip floor. Not a problem.
Lag during backup of primary. Backups create I/O pressure; replicas catch up after.
One replica lagging while others don't. Replica-specific issue. Pull it from read-traffic rotation; investigate; don't panic about replication as a whole.
We drill failover every quarter, following a written protocol: a comms plan, abort conditions agreed up front, and a findings log at the end.
Findings over the past 18 months:
server_check_delay was too high. Cut from 60s to 30s; replicas now get re-evaluated quickly after failover.

Each drill found one or two issues. None were catastrophic; all were fixable. The point of drilling is that you find these in a controlled setting, not at 3am during a real failover.
A few operational realities:
Failover doesn't always make the old primary become a replica. Sometimes it stays in a "DOWN" state. For RDS multi-AZ, AWS handles this; for self-managed setups, you need explicit logic to demote and re-attach.
Connection draining is your responsibility. The database doesn't care about your in-flight requests during failover. Your app's job is to drain gracefully: finish in-flight requests, refuse new ones, exit. Without a graceful drain, in-flight requests fail hard (a sketch of the drain shape follows this list).
Read replicas take minutes to catch up after failover. A freshly promoted primary has to stream WAL to the surviving replicas, and to the old primary once it's demoted and re-attached, all while absorbing the full write load. During that recovery window, replica lag runs higher than normal. Sometimes the right answer is to take a replica out of the read-traffic rotation temporarily.
Synchronous standby promotion is not instant. Even synchronous replicas need a few seconds to be promoted to primary. The "instant failover" pitch is approximate.
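The drain shape itself is simple even though the details are framework-specific. A sketch of the pattern; the request handler body is a stand-in for whatever your framework actually does:

```python
import signal
import threading

draining = threading.Event()

def handle_sigterm(signum, frame):
    # Failover (or any deploy) delivers SIGTERM: flip to drain mode.
    draining.set()

signal.signal(signal.SIGTERM, handle_sigterm)

def handle_request(request: str) -> str:
    # Refuse new work while draining; in-flight requests keep running.
    if draining.is_set():
        return "503 draining"
    return f"200 handled {request}"  # stand-in for the real request path

# Shutdown sequence: once draining is set, wait for in-flight work to
# finish, close DB connections cleanly, then exit.
```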
Of everything above, the synthetic is the most useful piece for confidence: an end-to-end signal that the system is doing what you expect, regardless of which metric a particular failure mode would show up in first.
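A minimal version of the synthetic. The DSNs, table name, and the single-row upsert approach are all assumptions; ours also ships the measured gap to the metrics pipeline:

```python
import time
import psycopg2

PRIMARY_DSN = "host=db-primary dbname=app user=synthetic"    # placeholder
REPLICA_DSN = "host=db-replica-1 dbname=app user=synthetic"  # placeholder

# Assumes: CREATE TABLE replication_heartbeat (id int PRIMARY KEY, ts timestamptz);
with psycopg2.connect(PRIMARY_DSN) as primary, primary.cursor() as cur:
    cur.execute("""
        INSERT INTO replication_heartbeat (id, ts) VALUES (1, now())
        ON CONFLICT (id) DO UPDATE SET ts = EXCLUDED.ts
        RETURNING ts;
    """)
    written_ts = cur.fetchone()[0]

start = time.monotonic()
deadline = start + 30  # give up after 30s; that's page-worthy on its own

with psycopg2.connect(REPLICA_DSN) as replica, replica.cursor() as cur:
    while time.monotonic() < deadline:
        cur.execute("SELECT ts FROM replication_heartbeat WHERE id = 1;")
        row = cur.fetchone()
        if row and row[0] >= written_ts:
            # Approximate: resolution is bounded by the 200ms poll interval.
            print(f"app-observable lag: ~{time.monotonic() - start:.2f}s")
            break
        time.sleep(0.2)
    else:
        print("replica never saw the write within 30s")
```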
A read replica being promoted to primary needs to handle the primary write load. If the replica was sized smaller (cost optimization), it may struggle until you can resize.
We size primary and replica identically. The cost is real (same hardware × 2+), but a failover at 2am that runs out of CPU on the new primary is a much bigger problem. For our highest-traffic database, we also have a cold standby in another region.
Three lag metrics, not just one. Time-based, byte-based, app-observable. Each catches a different failure mode.
Drill failover quarterly. Even if the managed service "handles it for you." Especially then — the app side is your problem.
Size primary and replica identically. Cost optimization is the wrong place to save money.
Synthetic round-trip is your friend. End-to-end signal beats any individual metric.
Document the drill protocol. Including comms, abort conditions, and findings. Drill goes faster on the second run.
Test the unhappy paths. "What if the replica is also down?" "What if the failover IP is stale?" Tabletops are cheaper than real incidents.
Postgres replication is mostly invisible infrastructure that works fine for years between issues. The discipline is in monitoring it accurately, drilling failover regularly, and fixing the gotchas you discover in controlled settings rather than in production at 3am. The cost of that discipline is small; the cost of skipping it is occasionally enormous.