Backups are easy. Restores are hard. The quarterly drill we run, what's failed during it, and the discipline that makes "we have backups" actually mean something.
We take backups religiously. Every team does. The harder question is: when did you last successfully restore one? For a long time our answer was "uh, we tested it once, when we set it up." After a few near-misses we now drill restores every quarter. This post is what the drills found and the discipline that makes "we have backups" mean something.
Three common ways backups silently fail:
The backup runs but the data is corrupt. A bug in your backup tool, a half-applied snapshot, a stale replica being backed up. The backup file exists; the data inside is unusable. You don't find out until you restore.
The backup runs but the restore path doesn't work. Permissions broken; tooling drifted; the restore script depends on a tool that's no longer installed. The backups are good; you can't get the data back out.
The backup runs, the restore works, but the data is too old. The backup ran nightly; the problem happened at 11 PM; you've lost nearly a full day of writes. The recovery point objective (RPO) doesn't match what the business expects.
All three are real. All three look fine from the outside until you actually need to restore.
The first time we ran a real restore drill (about two years ago), we restored a snapshot of one of our larger Postgres databases to a temporary instance and tried to validate it. Three things went wrong:
The IAM role doing the restore lacked permission to decrypt the KMS key the backup was encrypted with. Took 40 minutes to debug. Fix: documented the exact IAM permissions needed for restoration and pre-attached them to a "DR role" used for drills (a preflight check for this is sketched after this list).
The restore took 4 hours, not the 15 minutes we'd guessed. The database was bigger than we'd pictured, and restore time scales with size. Fix: documented a realistic recovery time objective (RTO) based on the actual measured restore time and communicated it to product teams (their previous "1 hour RTO" expectation was unrealistic).
The restored database was missing data from the last 6 hours because we'd been relying on the daily snapshot, not on point-in-time recovery (PITR). The PITR option was enabled but nobody had practiced using it. Fix: drill the PITR path specifically.
Each finding was an actionable improvement. None would have been found without actually restoring something.
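One cheap way to act on the permissions finding is a preflight check: before any restore is attempted, assume the DR role and confirm it can at least see the snapshots and the KMS key they're encrypted with. A rough sketch of the idea in Python with boto3; the role ARN, instance identifier, and the exact permission set are placeholders and will differ per setup.

```python
"""Preflight for a restore drill: can the DR role see the snapshots and their KMS key?

A sketch, not a drop-in script. The role ARN and DB identifier are placeholders;
the real test of kms:Decrypt is the restore itself.
"""
import boto3

DR_ROLE_ARN = "arn:aws:iam::123456789012:role/dr-restore-drill"  # placeholder
DB_INSTANCE = "prod-main"                                         # placeholder

# Assume the DR role so the checks run with the permissions a real restore
# would use; testing with an admin user proves nothing.
sts = boto3.client("sts")
creds = sts.assume_role(
    RoleArn=DR_ROLE_ARN, RoleSessionName="restore-drill-preflight"
)["Credentials"]
session = boto3.Session(
    aws_access_key_id=creds["AccessKeyId"],
    aws_secret_access_key=creds["SecretAccessKey"],
    aws_session_token=creds["SessionToken"],
)
rds = session.client("rds")
kms = session.client("kms")

# Can we list the automated snapshots for the instance at all?
snaps = rds.describe_db_snapshots(
    DBInstanceIdentifier=DB_INSTANCE, SnapshotType="automated"
)["DBSnapshots"]
latest = max(snaps, key=lambda s: s["SnapshotCreateTime"])
print("latest snapshot:", latest["DBSnapshotIdentifier"])

# Can we at least describe the KMS key the snapshot is encrypted with?
# (describe_key is a cheap proxy; the restore itself also needs kms:Decrypt
# and, for cross-account keys, kms:CreateGrant.)
if latest.get("Encrypted"):
    key = kms.describe_key(KeyId=latest["KmsKeyId"])["KeyMetadata"]
    print("KMS key reachable:", key["Arn"])
```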
Every quarter, one engineer:
Restores the most recent snapshot of a production database to a temporary instance.
Runs a point-in-time (PITR) restore as well, so that path gets exercised, not just the snapshot path.
Validates the restored data (the checks are described below).
Records how long each step actually took.
Tears down the temporary instances and writes a short summary of what was found.
The whole drill is half a day of one engineer's time. The summary feeds the disaster-recovery runbook and informs RTO/RPO discussions.
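If the databases live on RDS, the snapshot-restore step of the drill is scriptable end to end, which also produces the timing numbers for the runbook as a side effect. A rough sketch with boto3; the instance names and instance class are placeholders, and a real script would also set subnet groups, security groups, and handle teardown.

```python
"""Restore the latest automated snapshot to a throwaway instance and time it.

A sketch with placeholder identifiers; not a complete drill script.
"""
import time
import boto3

SOURCE_DB = "prod-main"        # placeholder: the production instance
DRILL_DB = "prod-main-drill"   # placeholder: temporary instance, deleted after the drill

rds = boto3.client("rds")

# Pick the most recent automated snapshot of the source instance.
snaps = rds.describe_db_snapshots(
    DBInstanceIdentifier=SOURCE_DB, SnapshotType="automated"
)["DBSnapshots"]
latest = max(snaps, key=lambda s: s["SnapshotCreateTime"])

started = time.monotonic()
rds.restore_db_instance_from_db_snapshot(
    DBInstanceIdentifier=DRILL_DB,
    DBSnapshotIdentifier=latest["DBSnapshotIdentifier"],
    DBInstanceClass="db.r6g.large",  # placeholder; match prod for realistic timing
    PubliclyAccessible=False,
)

# Wait until the instance is actually usable. This is the number that matters
# for RTO, not "the API call returned". The generous timeout is deliberate:
# big databases can take hours.
rds.get_waiter("db_instance_available").wait(
    DBInstanceIdentifier=DRILL_DB,
    WaiterConfig={"Delay": 60, "MaxAttempts": 360},
)
print(f"restore took {(time.monotonic() - started) / 60:.0f} minutes")
```

The measured wall-clock time is exactly the number that ended up correcting the "15 minutes" guess in our first drill.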
Beyond drills, continuous signals:
An alert when a backup job fails or doesn't run at all.
An alert when the newest backup is older than the schedule says it should be.
An alert when backup size deviates sharply from the recent trend.
These are basic. They catch the easy failure modes — the harder ones (corrupt data, broken restore path) only the drills catch.
The "validate the restored data" step is the part most teams skim. Specifically what we check:
Row counts on key tables. Compared to a same-time snapshot of production. Numbers within ~0.1% (allowing for in-flight writes).
Specific known records. A handful of "canary" records that we know exist with specific values. Run a few queries against the restored DB; expected results come back.
Foreign key integrity. ALTER TABLE ... VALIDATE CONSTRAINT on key relationships. Catches "the backup is consistent but referential integrity broke during the dump" cases.
Indexes are present. SELECT count(*) FROM pg_indexes matches production. Sometimes restores skip index rebuilds.
Vacuum / autovacuum has run since restore. Otherwise the restored DB's stats are stale and queries plan badly. We run VACUUM ANALYZE as part of the drill validation.
The validations take ~20 minutes. They've caught issues twice in 18 months — once a partial restore that had right row counts but stale data, once an index that hadn't been rebuilt cleanly.
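Scripted, the checks above look roughly like the sketch below, here with psycopg2 against the restored instance. The table names, the canary record, and the constraint name are placeholders for whatever matters in your schema.

```python
"""Validation checks against a restored database. A sketch; names are placeholders."""
import psycopg2

RESTORED_DSN = "host=prod-main-drill.internal dbname=app user=drill"  # placeholder
KEY_TABLES = ["users", "orders", "payments"]                          # placeholders

conn = psycopg2.connect(RESTORED_DSN)
conn.autocommit = True  # VACUUM cannot run inside a transaction block
cur = conn.cursor()

# 1. Row counts on key tables; compare against counts taken from production
#    at roughly the same time, allowing ~0.1% drift for in-flight writes.
for table in KEY_TABLES:
    cur.execute(f"SELECT count(*) FROM {table}")
    print(table, cur.fetchone()[0])

# 2. Canary records: rows we know exist with known values.
cur.execute("SELECT email FROM users WHERE id = %s", (42,))  # placeholder canary
assert cur.fetchone() is not None, "canary record missing from restored DB"

# 3. Foreign key integrity on a key relationship.
cur.execute("ALTER TABLE orders VALIDATE CONSTRAINT orders_user_id_fkey")  # placeholder name

# 4. Index count, to compare against the same query run on production.
cur.execute("SELECT count(*) FROM pg_indexes WHERE schemaname = 'public'")
print("indexes:", cur.fetchone()[0])

# 5. Fresh statistics so query plans on the restored DB are sane.
cur.execute("VACUUM ANALYZE")

cur.close()
conn.close()
```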
Point-in-time recovery for managed databases (RDS, Cloud SQL) is a standard feature. You restore a database to any point within the retention window — usually 7-35 days back.
We always include a PITR-based restore in our drills. The mechanics differ slightly from snapshot restore:
You give a target timestamp instead of a snapshot identifier.
The restore always produces a new instance; you don't roll back the existing one in place.
The latest restorable time lags the present by a few minutes, so "restore to right now" isn't quite possible.
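On RDS, for example, PITR is a separate API call from the snapshot restore, which is part of why the two paths can fail independently. A minimal sketch with boto3; the identifiers and instance class are placeholders, and a real drill would pick a target time that matters rather than "an hour ago".

```python
"""Point-in-time restore of an RDS instance to a new, temporary instance.

A sketch with placeholder identifiers.
"""
from datetime import datetime, timedelta, timezone

import boto3

SOURCE_DB = "prod-main"            # placeholder
DRILL_DB = "prod-main-pitr-drill"  # placeholder

rds = boto3.client("rds")

# RDS reports the latest point you can restore to; it lags "now" by a few minutes.
info = rds.describe_db_instances(DBInstanceIdentifier=SOURCE_DB)["DBInstances"][0]
print("latest restorable time:", info.get("LatestRestorableTime"))

# Restore to a specific timestamp (here: one hour ago) as a brand-new instance.
target = datetime.now(timezone.utc) - timedelta(hours=1)
rds.restore_db_instance_to_point_in_time(
    SourceDBInstanceIdentifier=SOURCE_DB,
    TargetDBInstanceIdentifier=DRILL_DB,
    RestoreTime=target,
    DBInstanceClass="db.r6g.large",  # placeholder
    PubliclyAccessible=False,
)
rds.get_waiter("db_instance_available").wait(
    DBInstanceIdentifier=DRILL_DB,
    WaiterConfig={"Delay": 60, "MaxAttempts": 360},  # PITR replays WAL and can take hours
)
```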
First time we did PITR, we hit a "WAL gap" warning we didn't expect — turned out the WAL retention window had been set too short for our snapshot interval. Fix was an RDS parameter change. Would have found out the hard way during a real incident.
For DR purposes, we replicate snapshots to a second region. The drill includes a cross-region restore once a year (not every quarter — the cross-region transfer is slow and expensive).
What we found:
This is exactly the value of drills: surface assumptions you didn't know you were making.
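Mechanically, the cross-region path starts with copying a snapshot into the DR region, and for encrypted snapshots the copy must name a KMS key that lives in that region, which is exactly the kind of assumption a drill surfaces. A hedged sketch with boto3; regions, ARNs, and the key are placeholders.

```python
"""Copy an RDS snapshot into the DR region so it can be restored there.

A sketch with placeholder identifiers.
"""
import boto3

SOURCE_REGION = "eu-west-1"   # placeholder
DR_REGION = "eu-central-1"    # placeholder
SNAPSHOT_ARN = (
    "arn:aws:rds:eu-west-1:123456789012:snapshot:rds:prod-main-2024-01-01-00-05"  # placeholder
)
DR_KMS_KEY = "arn:aws:kms:eu-central-1:123456789012:key/11111111-2222-3333-4444-555555555555"  # placeholder

# The copy is issued against the *destination* region's RDS endpoint.
rds_dr = boto3.client("rds", region_name=DR_REGION)
copy = rds_dr.copy_db_snapshot(
    SourceDBSnapshotIdentifier=SNAPSHOT_ARN,
    TargetDBSnapshotIdentifier="prod-main-dr-drill",
    SourceRegion=SOURCE_REGION,  # boto3 uses this to sign the cross-region copy request
    KmsKeyId=DR_KMS_KEY,         # encrypted snapshots are re-encrypted with a destination-region key
)

# Cross-region copies are slow; the transfer itself is a large part of the real RTO.
rds_dr.get_waiter("db_snapshot_available").wait(
    DBSnapshotIdentifier=copy["DBSnapshot"]["DBSnapshotIdentifier"],
    WaiterConfig={"Delay": 60, "MaxAttempts": 720},
)
```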
What we keep:
Storage costs are real but small relative to the value. Older snapshots get moved to colder storage tiers (S3 Glacier for archive) to reduce per-GB cost.
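How the tiering happens depends on where the long-term copies live. If they are exports sitting in an S3 bucket, it is a one-time lifecycle rule; a sketch below, with the bucket name, prefix, and the 90-day threshold all assumed for illustration.

```python
"""Lifecycle rule that moves old database exports in S3 to Glacier.

A sketch; the bucket, prefix, and 90-day threshold are assumptions.
"""
import boto3

s3 = boto3.client("s3")
s3.put_bucket_lifecycle_configuration(
    Bucket="db-archive",  # placeholder
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "tier-old-db-exports",
                "Status": "Enabled",
                "Filter": {"Prefix": "exports/"},  # placeholder prefix
                # After 90 days an archive is unlikely to be needed in a hurry,
                # so Glacier's slower retrieval is an acceptable trade for the
                # lower per-GB cost.
                "Transitions": [{"Days": 90, "StorageClass": "GLACIER"}],
            }
        ]
    },
)
```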
A common confusion: replicas are not backups. Replicas mirror the primary in real time, including bugs and mistakes. If someone runs DELETE FROM users WHERE 1=1 on the primary, the replicas immediately delete everything too.
You need both:
Replication, for availability: when the primary machine dies, a replica takes over without anyone copying data around.
Backups, for when the data itself is wrong: a bad deploy, a fat-fingered DELETE, a corrupting bug. For those you need to go back to a point before the damage.
We've used backups for the second case twice in the last two years. Both were "a bug or human error deleted/corrupted recent data; we need to selectively restore." The drills made the procedure routine instead of panicky.
A quarterly drill is the minimum. Any less frequent and each drill becomes a project in its own right; any more frequent and you don't leave time to actually fix the findings between drills.
Validate the restored data, don't just verify the restore completes. Row counts, known records, FK integrity, indexes.
Drill PITR specifically. The snapshot path and the PITR path can fail independently.
Document the actual times. Restores take longer than your intuition says. Tell stakeholders the real numbers.
Cross-region drill once a year. Bigger investment; finds bigger surprises.
Backups != replication. You need both, for different failure modes.
Test the IAM/permissions path. Half the failures are "we tried to restore and didn't have the right permissions."
"We have backups" is a sentence many teams say with false confidence. "We restored backup X to instance Y on date Z and validated the data" is the version that means something. The discipline to do the latter is small and pays off the first time you actually need it. Which will happen, eventually.