A 2 AM incident, the autovacuum settings that caused it, and the parameter changes that prevented the next one. The discipline that took our biggest Postgres host from periodic stalls to steady performance.
Postgres autovacuum is one of those background subsystems you don't notice until it bites you. Our biggest write-heavy database stalled hard one night — query p99 went from 30 ms to 4 s for about twelve minutes — and the culprit was a single table autovacuum had been losing ground on for weeks. The next morning we changed our defaults. This post is what we changed, why, and the dashboards that now catch the same issue earlier.
Postgres uses MVCC: every UPDATE leaves a "dead" row version behind. DELETE is the same — the row is marked dead, not removed. Autovacuum is the background process that reclaims those dead rows so the table doesn't grow unboundedly, the indexes stay efficient, and the visibility map stays accurate for index-only scans.
It also runs ANALYZE periodically to refresh planner statistics, and it does a special "freeze" pass on old rows to prevent transaction ID wraparound. Three jobs, one daemon, mostly invisible.
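A minimal sketch of what the dead-row mechanics look like from SQL — the demo table here is hypothetical, not one from the incident:
-- Hypothetical demo table, purely for illustration
CREATE TABLE demo (id int PRIMARY KEY, payload text);
INSERT INTO demo SELECT g, 'x' FROM generate_series(1, 10000) g;
UPDATE demo SET payload = 'y';  -- leaves 10,000 dead row versions behind

-- The stats subsystem tracks the dead-tuple count (it can lag a moment)
SELECT relname, n_live_tup, n_dead_tup, last_autovacuum
FROM pg_stat_user_tables
WHERE relname = 'demo';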
When autovacuum is keeping up, you never think about it. When it isn't, several things start going wrong at once: table bloat grows, index efficiency drops, planner stats go stale, and eventually one of three failure modes happens — a long ACCESS EXCLUSIVE lock, an anti-wraparound emergency vacuum, or a runaway query plan that scans the bloated heap.
The table in question — a high-write event stream — was getting ~5,000 UPDATEs per second across a couple of hot rows during peak load. Two settings combined to cause the stall:
autovacuum_vacuum_scale_factor = 0.2 — autovacuum waits until 20% of the table is dead tuples before kicking in. On a 200M-row table, that's 40M dead rows of work before the daemon even starts (the trigger math is sketched after this list).
autovacuum_naptime = 1 min — the launcher checks every minute whether any table needs vacuuming. Fine for a quiet database, way too lazy for a hot one.
By the time autovacuum finally fired on that table, it had hours of accumulated work to do. The vacuum took 20+ minutes, and its locking and I/O competed with foreground queries. Latency spiked. We paged.
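For intuition, the trigger condition is roughly dead tuples > autovacuum_vacuum_threshold + autovacuum_vacuum_scale_factor × reltuples. A sketch of where tables sit against the old defaults — the 50 and 0.2 are the stock values, hard-coded here purely for illustration:
-- Trigger condition (simplified):
--   n_dead_tup > autovacuum_vacuum_threshold + autovacuum_vacuum_scale_factor * reltuples
-- Old defaults on our ~200M-row table: 50 + 0.2 * 200,000,000 ≈ 40M dead rows
SELECT c.relname,
       s.n_dead_tup,
       round(50 + 0.2 * c.reltuples) AS dead_rows_needed_at_old_defaults
FROM pg_class c
JOIN pg_stat_user_tables s ON s.relid = c.oid
ORDER BY s.n_dead_tup DESC
LIMIT 10;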
The most important lesson: don't tune autovacuum globally to fix one table. Hot tables and quiet tables want different schedules. Use ALTER TABLE ... SET for per-table overrides.
For the hot table:
ALTER TABLE events SET (
autovacuum_vacuum_scale_factor = 0.02,
autovacuum_vacuum_threshold = 1000,
autovacuum_analyze_scale_factor = 0.01,
autovacuum_vacuum_cost_limit = 2000
);
What changed:
vacuum_scale_factor = 0.02 (was 0.2): vacuum when 2% of rows are dead instead of 20%. Tighter, more frequent, smaller passes.
vacuum_threshold = 1000: ignore the percentage on small tables and start vacuuming after 1,000 dead tuples. It doesn't matter for the big table; it defends against this template being reused on small tables.
analyze_scale_factor = 0.01: refresh stats more often, so the planner stays accurate as the data distribution changes rapidly.
vacuum_cost_limit = 2000 (was 200): autovacuum throttles itself against a cost budget; raising the limit lets it do more work per pass. The trade-off is more I/O while the vacuum runs.
After this, autovacuum on that table runs every 5–10 minutes, each pass takes 30–90 seconds, and its locks barely register on foreground queries.
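To confirm the overrides actually took, the per-table settings show up in pg_class.reloptions — a quick sanity check, not part of the original change:
-- Per-table storage parameters live in pg_class.reloptions
SELECT relname, reloptions
FROM pg_class
WHERE relname = 'events';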
A smaller set of changes applied globally:
# postgresql.conf
autovacuum_naptime = 15s
autovacuum_max_workers = 6
autovacuum_vacuum_cost_delay = 2ms
naptime = 15s is the big one. The daemon now considers vacuuming every 15 seconds instead of every minute — small change, big effect on hot tables that hit their threshold between checks.
max_workers = 6 (default 3) lets multiple tables vacuum in parallel. With many active tables, the default 3 workers serialize too much.
cost_delay = 2ms (default 2ms, sometimes 20ms on older configs) is the small pause autovacuum inserts to throttle its I/O. We left this at the modern default but explicitly call it out — older clusters often have it at 20ms, which makes vacuum painfully slow under load.
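A quick way to see what a running cluster actually uses and where each value was set — handy when auditing older clusters:
-- Effective values and their source (default, config file, etc.)
SELECT name, setting, unit, source
FROM pg_settings
WHERE name IN ('autovacuum_naptime',
               'autovacuum_max_workers',
               'autovacuum_vacuum_cost_delay');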
Three dashboards, checked weekly:
Per-table dead tuple ratio. Query pg_stat_all_tables for n_dead_tup / (n_live_tup + n_dead_tup) — flags any table over 10%. New version of the old "autovacuum is keeping up" question, with hard numbers.
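A sketch of the underlying query — the 10% cutoff matches the alert threshold above:
-- Tables where more than 10% of row versions are dead
SELECT relname,
       n_live_tup,
       n_dead_tup,
       round(100.0 * n_dead_tup / nullif(n_live_tup + n_dead_tup, 0), 1) AS dead_pct
FROM pg_stat_all_tables
WHERE n_dead_tup > 0
  AND n_dead_tup > 0.10 * (n_live_tup + n_dead_tup)
ORDER BY dead_pct DESC;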
Last autovacuum age per table. now() - last_autovacuum for tables with high write volume. Anything older than ~6 hours on a hot table is suspicious.
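A sketch of the query behind it — the write-volume cutoff here is arbitrary and worth tuning per cluster:
-- Hot tables that haven't been autovacuumed recently (NULL = never since stats reset)
SELECT relname,
       last_autovacuum,
       now() - last_autovacuum AS time_since_autovacuum,
       n_tup_upd + n_tup_del AS writes_since_stats_reset
FROM pg_stat_user_tables
WHERE n_tup_upd + n_tup_del > 100000   -- "high write volume" cutoff, adjust to taste
ORDER BY last_autovacuum NULLS FIRST;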
Autovacuum worker concurrency. pg_stat_activity filtered for backend_type = 'autovacuum worker'. Should always be < autovacuum_max_workers. If we frequently hit the cap, that's a signal we need more workers or more aggressive per-table thresholds.
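The concurrency check is a one-liner against pg_stat_activity:
-- Running autovacuum workers vs. the configured cap
SELECT count(*) AS active_workers,
       current_setting('autovacuum_max_workers') AS max_workers
FROM pg_stat_activity
WHERE backend_type = 'autovacuum worker';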
We graph all three in the database overview dashboard. Single-pane visibility makes drift obvious instead of having to remember to check.
A few patterns that survived this and a couple of subsequent incidents:
Hot tables need explicit attention. The defaults are designed for "average" tables. Production workloads have outliers. Audit the top 10 by write volume; tune them by hand.
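A starting point for that audit — write volume since the last stats reset, straight from pg_stat_user_tables:
-- Top 10 tables by write volume since the last stats reset
SELECT relname,
       n_tup_ins + n_tup_upd + n_tup_del AS total_writes,
       n_dead_tup
FROM pg_stat_user_tables
ORDER BY total_writes DESC
LIMIT 10;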
Bloat compounds. A table that's 20% dead today is 40% dead next week if autovacuum can't keep up. The work to clean it up grows faster than linearly because the indexes degrade along with the heap.
Vacuum during traffic, not after. The instinct to "run a manual vacuum at 3 AM when traffic is low" backfires — vacuum during peak is mostly fine if it's small and frequent; vacuum after-the-fact takes an hour and locks things up. Better to bleed the dead rows continuously.
Watch for anti-wraparound vacuums. If a table's age crosses autovacuum_freeze_max_age, Postgres launches an emergency vacuum that won't cancel itself for conflicting lock requests, so anything that needs a stronger lock on the table queues behind it until it finishes. We've never hit this in prod, but the post-mortems we've read about it are grim. The dashboards above catch the conditions long before this kicks in.
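The condition is easy to watch directly; a sketch of the per-database check:
-- Oldest unfrozen transaction ID per database vs. the emergency threshold
SELECT datname,
       age(datfrozenxid) AS xid_age,
       current_setting('autovacuum_freeze_max_age')::int AS freeze_max_age
FROM pg_database
ORDER BY xid_age DESC;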
Cluster-wide settings are a backstop, not the answer. Per-table overrides are how you actually solve this. Inheritance from global defaults is fine; the global defaults should be conservative.
A few settings we left at defaults:
maintenance_work_mem stays at the cluster's general value (we use 256 MB). Bumping it speeds up vacuum, but every autovacuum worker can allocate that much at once, so it eats RAM the rest of the workload needs.
autovacuum_freeze_max_age stays at the default 200M. The anti-wraparound emergency only fires if vacuum has been failing completely for a long time, and the dashboards catch the conditions much earlier.
Scheduled manual VACUUM ANALYZE jobs. Decided against — correctly tuned autovacuum handles it, and manual jobs are a leaky abstraction over the daemon's work.
Autovacuum is one of those subsystems where the default settings work for the median case and quietly break for the long tail. Tuning takes an hour per important table and pays off for years. The dashboards keep you honest about whether the tuning is actually holding.