Backups are easy. Restores are hard. The quarterly drill we run, what's failed during it, and the discipline that makes "we have backups" actually mean something.
We take backups religiously. Every team does. The harder question is: when did you last successfully restore one? For a long time our answer was "uh, we tested it once, when we set it up." After a few near-misses we now drill restores every quarter. This post is what the drills found and the discipline that makes "we have backups" mean something.
Three common ways backups silently fail:
The backup runs but the data is corrupt. A bug in your backup tool, a half-applied snapshot, a stale replica being backed up. The backup file exists; the data inside is unusable. You don't find out until you restore.
The backup runs but the restore path doesn't work. Permissions broken; tooling drifted; the restore script depends on a tool that's no longer installed. The backups are good; you can't get the data back out.
The backup runs, the restore works, but the data is too old. The backup ran nightly; the problem happened at 11 PM; you've lost nearly a full day of writes. The recovery point objective (RPO) doesn't match what the business expects.
All three are real. All three look fine from the outside until you actually need to restore.
The first time we ran a real restore drill (about two years ago), we restored a snapshot of one of our larger Postgres databases to a temporary instance and tried to validate it. Three things went wrong:
The IAM role doing the restore lacked permission to decrypt the KMS key the backup was encrypted with. Took 40 minutes to debug. Fix: documented the exact IAM permissions needed for restoration and pre-attached them to a "DR role" used for drills (a preflight check for this is sketched after this list).
The restore took 4 hours, not the 15 minutes we'd guessed. The database was bigger than we'd pictured, and restore time scales with size. Fix: documented a realistic recovery time objective (RTO) based on the actual measured restore time and communicated it to product teams (their previous "1 hour RTO" expectation was unrealistic).
The restored database was missing data from the last 6 hours because we'd been relying on the daily snapshot, not on point-in-time recovery (PITR). The PITR option was enabled but nobody had practiced using it. Fix: drill the PITR path specifically.
Each finding was an actionable improvement. None would have been found without actually restoring something.
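One cheap way to act on the permissions finding is a preflight check: before any restore is attempted, assume the DR role and confirm it can at least see the snapshots and the KMS key they're encrypted with. A rough sketch of the idea in Python with boto3; the role ARN, instance identifier, and the exact permission set are placeholders and will differ per setup.

```python
"""Preflight for a restore drill: can the DR role see the snapshots and their KMS key?

A sketch, not a drop-in script. The role ARN and DB identifier are placeholders;
the real test of kms:Decrypt is the restore itself.
"""
import boto3

DR_ROLE_ARN = "arn:aws:iam::123456789012:role/dr-restore-drill"  # placeholder
DB_INSTANCE = "prod-main"                                         # placeholder

# Assume the DR role so the checks run with the permissions a real restore
# would use; testing with an admin user proves nothing.
sts = boto3.client("sts")
creds = sts.assume_role(
    RoleArn=DR_ROLE_ARN, RoleSessionName="restore-drill-preflight"
)["Credentials"]
session = boto3.Session(
    aws_access_key_id=creds["AccessKeyId"],
    aws_secret_access_key=creds["SecretAccessKey"],
    aws_session_token=creds["SessionToken"],
)
rds = session.client("rds")
kms = session.client("kms")

# Can we list the automated snapshots for the instance at all?
snaps = rds.describe_db_snapshots(
    DBInstanceIdentifier=DB_INSTANCE, SnapshotType="automated"
)["DBSnapshots"]
latest = max(snaps, key=lambda s: s["SnapshotCreateTime"])
print("latest snapshot:", latest["DBSnapshotIdentifier"])

# Can we at least describe the KMS key the snapshot is encrypted with?
# (describe_key is a cheap proxy; the restore itself also needs kms:Decrypt
# and, for cross-account keys, kms:CreateGrant.)
if latest.get("Encrypted"):
    key = kms.describe_key(KeyId=latest["KmsKeyId"])["KeyMetadata"]
    print("KMS key reachable:", key["Arn"])
```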
Every quarter, one engineer:
Restores the most recent snapshot of a production database to a temporary instance.
Runs a point-in-time (PITR) restore as well, so that path gets exercised, not just the snapshot path.
Validates the restored data (the checks are described below).
Records how long each step actually took.
Tears down the temporary instances and writes a short summary of what was found.
The whole drill is half a day of one engineer's time. The summary feeds the disaster-recovery runbook and informs RTO/RPO discussions.
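If the databases live on RDS, the snapshot-restore step of the drill is scriptable end to end, which also produces the timing numbers for the runbook as a side effect. A rough sketch with boto3; the instance names and instance class are placeholders, and a real script would also set subnet groups, security groups, and handle teardown.

```python
"""Restore the latest automated snapshot to a throwaway instance and time it.

A sketch with placeholder identifiers; not a complete drill script.
"""
import time
import boto3

SOURCE_DB = "prod-main"        # placeholder: the production instance
DRILL_DB = "prod-main-drill"   # placeholder: temporary instance, deleted after the drill

rds = boto3.client("rds")

# Pick the most recent automated snapshot of the source instance.
snaps = rds.describe_db_snapshots(
    DBInstanceIdentifier=SOURCE_DB, SnapshotType="automated"
)["DBSnapshots"]
latest = max(snaps, key=lambda s: s["SnapshotCreateTime"])

started = time.monotonic()
rds.restore_db_instance_from_db_snapshot(
    DBInstanceIdentifier=DRILL_DB,
    DBSnapshotIdentifier=latest["DBSnapshotIdentifier"],
    DBInstanceClass="db.r6g.large",  # placeholder; match prod for realistic timing
    PubliclyAccessible=False,
)

# Wait until the instance is actually usable. This is the number that matters
# for RTO, not "the API call returned". The generous timeout is deliberate:
# big databases can take hours.
rds.get_waiter("db_instance_available").wait(
    DBInstanceIdentifier=DRILL_DB,
    WaiterConfig={"Delay": 60, "MaxAttempts": 360},
)
print(f"restore took {(time.monotonic() - started) / 60:.0f} minutes")
```

The measured wall-clock time is exactly the number that ended up correcting the "15 minutes" guess in our first drill.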
Beyond drills, continuous signals:
An alert when a backup job fails or doesn't run at all.
An alert when the newest backup is older than the schedule says it should be.
An alert when backup size deviates sharply from the recent trend.
These are basic. They catch the easy failure modes — the harder ones (corrupt data, broken restore path) only the drills catch.
The "validate the restored data" step is the part most teams skim. Specifically what we check:
Row counts on key tables. Compared to a same-time snapshot of production. Numbers within ~0.1% (allowing for in-flight writes).
Specific known records. A handful of "canary" records that we know exist with specific values. Run a few queries against the restored DB; expected results come back.
Foreign key integrity. ALTER TABLE ... VALIDATE CONSTRAINT on key relationships. Catches "the backup is consistent but referential integrity broke during the dump" cases.
Indexes are present. SELECT count(*) FROM pg_indexes matches production. Sometimes restores skip index rebuilds.
Vacuum / autovacuum has run since restore. Otherwise the restored DB's stats are stale and queries plan badly. We run VACUUM ANALYZE as part of the drill validation.
The validations take ~20 minutes. They've caught issues twice in 18 months — once a partial restore that had right row counts but stale data, once an index that hadn't been rebuilt cleanly.
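Scripted, the checks above look roughly like the sketch below, here with psycopg2 against the restored instance. The table names, the canary record, and the constraint name are placeholders for whatever matters in your schema.

```python
"""Validation checks against a restored database. A sketch; names are placeholders."""
import psycopg2

RESTORED_DSN = "host=prod-main-drill.internal dbname=app user=drill"  # placeholder
KEY_TABLES = ["users", "orders", "payments"]                          # placeholders

conn = psycopg2.connect(RESTORED_DSN)
conn.autocommit = True  # VACUUM cannot run inside a transaction block
cur = conn.cursor()

# 1. Row counts on key tables; compare against counts taken from production
#    at roughly the same time, allowing ~0.1% drift for in-flight writes.
for table in KEY_TABLES:
    cur.execute(f"SELECT count(*) FROM {table}")
    print(table, cur.fetchone()[0])

# 2. Canary records: rows we know exist with known values.
cur.execute("SELECT email FROM users WHERE id = %s", (42,))  # placeholder canary
assert cur.fetchone() is not None, "canary record missing from restored DB"

# 3. Foreign key integrity on a key relationship.
cur.execute("ALTER TABLE orders VALIDATE CONSTRAINT orders_user_id_fkey")  # placeholder name

# 4. Index count, to compare against the same query run on production.
cur.execute("SELECT count(*) FROM pg_indexes WHERE schemaname = 'public'")
print("indexes:", cur.fetchone()[0])

# 5. Fresh statistics so query plans on the restored DB are sane.
cur.execute("VACUUM ANALYZE")

cur.close()
conn.close()
```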
Point-in-time recovery for managed databases (RDS, Cloud SQL) is a standard feature. You restore a database to any point within the retention window — usually 7-35 days back.
We always include a PITR-based restore in our drills. The mechanics differ slightly from snapshot restore:
You give a target timestamp instead of a snapshot identifier.
The restore always produces a new instance; you don't roll back the existing one in place.
The latest restorable time lags the present by a few minutes, so "restore to right now" isn't quite possible.
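On RDS, for example, PITR is a separate API call from the snapshot restore, which is part of why the two paths can fail independently. A minimal sketch with boto3; the identifiers and instance class are placeholders, and a real drill would pick a target time that matters rather than "an hour ago".

```python
"""Point-in-time restore of an RDS instance to a new, temporary instance.

A sketch with placeholder identifiers.
"""
from datetime import datetime, timedelta, timezone

import boto3

SOURCE_DB = "prod-main"            # placeholder
DRILL_DB = "prod-main-pitr-drill"  # placeholder

rds = boto3.client("rds")

# RDS reports the latest point you can restore to; it lags "now" by a few minutes.
info = rds.describe_db_instances(DBInstanceIdentifier=SOURCE_DB)["DBInstances"][0]
print("latest restorable time:", info.get("LatestRestorableTime"))

# Restore to a specific timestamp (here: one hour ago) as a brand-new instance.
target = datetime.now(timezone.utc) - timedelta(hours=1)
rds.restore_db_instance_to_point_in_time(
    SourceDBInstanceIdentifier=SOURCE_DB,
    TargetDBInstanceIdentifier=DRILL_DB,
    RestoreTime=target,
    DBInstanceClass="db.r6g.large",  # placeholder
    PubliclyAccessible=False,
)
rds.get_waiter("db_instance_available").wait(
    DBInstanceIdentifier=DRILL_DB,
    WaiterConfig={"Delay": 60, "MaxAttempts": 360},  # PITR replays WAL and can take hours
)
```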
First time we did PITR, we hit a "WAL gap" warning we didn't expect — turned out the WAL retention window had been set too short for our snapshot interval. Fix was an RDS parameter change. Would have found out the hard way during a real incident.
For DR purposes, we replicate snapshots to a second region. The drill includes a cross-region restore once a year (not every quarter — the cross-region transfer is slow and expensive).
What we found:
This is exactly the value of drills: surface assumptions you didn't know you were making.
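Mechanically, the cross-region path starts with copying a snapshot into the DR region, and for encrypted snapshots the copy must name a KMS key that lives in that region, which is exactly the kind of assumption a drill surfaces. A hedged sketch with boto3; regions, ARNs, and the key are placeholders.

```python
"""Copy an RDS snapshot into the DR region so it can be restored there.

A sketch with placeholder identifiers.
"""
import boto3

SOURCE_REGION = "eu-west-1"   # placeholder
DR_REGION = "eu-central-1"    # placeholder
SNAPSHOT_ARN = (
    "arn:aws:rds:eu-west-1:123456789012:snapshot:rds:prod-main-2024-01-01-00-05"  # placeholder
)
DR_KMS_KEY = "arn:aws:kms:eu-central-1:123456789012:key/11111111-2222-3333-4444-555555555555"  # placeholder

# The copy is issued against the *destination* region's RDS endpoint.
rds_dr = boto3.client("rds", region_name=DR_REGION)
copy = rds_dr.copy_db_snapshot(
    SourceDBSnapshotIdentifier=SNAPSHOT_ARN,
    TargetDBSnapshotIdentifier="prod-main-dr-drill",
    SourceRegion=SOURCE_REGION,  # boto3 uses this to sign the cross-region copy request
    KmsKeyId=DR_KMS_KEY,         # encrypted snapshots are re-encrypted with a destination-region key
)

# Cross-region copies are slow; the transfer itself is a large part of the real RTO.
rds_dr.get_waiter("db_snapshot_available").wait(
    DBSnapshotIdentifier=copy["DBSnapshot"]["DBSnapshotIdentifier"],
    WaiterConfig={"Delay": 60, "MaxAttempts": 720},
)
```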
What we keep:
Storage costs are real but small relative to the value. Older snapshots get moved to colder storage tiers (S3 Glacier for archive) to reduce per-GB cost.
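How the tiering happens depends on where the long-term copies live. If they are exports sitting in an S3 bucket, it is a one-time lifecycle rule; a sketch below, with the bucket name, prefix, and the 90-day threshold all assumed for illustration.

```python
"""Lifecycle rule that moves old database exports in S3 to Glacier.

A sketch; the bucket, prefix, and 90-day threshold are assumptions.
"""
import boto3

s3 = boto3.client("s3")
s3.put_bucket_lifecycle_configuration(
    Bucket="db-archive",  # placeholder
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "tier-old-db-exports",
                "Status": "Enabled",
                "Filter": {"Prefix": "exports/"},  # placeholder prefix
                # After 90 days an archive is unlikely to be needed in a hurry,
                # so Glacier's slower retrieval is an acceptable trade for the
                # lower per-GB cost.
                "Transitions": [{"Days": 90, "StorageClass": "GLACIER"}],
            }
        ]
    },
)
```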
A common confusion: replicas are not backups. Replicas mirror the primary in real time, including bugs and mistakes. If someone runs DELETE FROM users WHERE 1=1 on the primary, the replicas immediately delete everything too.
You need both:
Replication, for availability: when the primary machine dies, a replica takes over without anyone copying data around.
Backups, for when the data itself is wrong: a bad deploy, a fat-fingered DELETE, a corrupting bug. For those you need to go back to a point before the damage.
We've used backups for the second case twice in the last two years. Both were "a bug or human error deleted/corrupted recent data; we need to selectively restore." The drills made the procedure routine instead of panicky.
A quarterly drill is the minimum. Any less frequent and each drill becomes a project in its own right; any more frequent and you don't leave time to actually fix the findings between drills.
Validate the restored data, don't just verify the restore completes. Row counts, known records, FK integrity, indexes.
Drill PITR specifically. The snapshot path and the PITR path can fail independently.
Document the actual times. Restores take longer than your intuition says. Tell stakeholders the real numbers.
Cross-region drill once a year. Bigger investment; finds bigger surprises.
Backups != replication. You need both, for different failure modes.
Test the IAM/permissions path. Half the failures are "we tried to restore and didn't have the right permissions."
"We have backups" is a sentence many teams say with false confidence. "We restored backup X to instance Y on date Z and validated the data" is the version that means something. The discipline to do the latter is small and pays off the first time you actually need it. Which will happen, eventually.