We've executed real disaster recoveries twice. This is the plan that survived contact with reality, and what was wrong with the plans we had before it.
We've executed real disaster recoveries twice. Once it went well. Once it didn't. The difference between the two outcomes was the realism of our preparation. This post is the DR strategy we run now — what we back up, how we test, and what we'd do first if we were starting over.
Three categories of disaster: account compromise, regional outage, and data corruption.
These are different problems with different responses. Most "DR plans" assume one and miss the others.
For our customer-facing workloads:
These targets drive the backup strategy. Tighter RPOs cost more (more frequent / synchronous replication). Tighter RTOs cost more (warmer standby infrastructure).
For internal tools and non-critical features, the targets are looser. We don't pretend everything has the same recoverability requirements.
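Per-class targets are easier to enforce when they live somewhere a drill or incident review can check against. A minimal sketch of that idea follows; the class names and the numbers in `TARGETS` are hypothetical placeholders, not our real targets.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RecoveryTarget:
    rpo_seconds: int  # maximum tolerable data loss
    rto_seconds: int  # maximum tolerable time to recover


# Hypothetical targets for illustration only; substitute real numbers per class.
TARGETS = {
    "customer_facing": RecoveryTarget(rpo_seconds=60, rto_seconds=60 * 60),
    "internal_tools": RecoveryTarget(rpo_seconds=24 * 3600, rto_seconds=24 * 3600),
}


def within_targets(workload_class: str, data_loss_seconds: float, recovery_seconds: float) -> bool:
    """Return True if an incident (or drill) met both targets for its class."""
    target = TARGETS[workload_class]
    return (
        data_loss_seconds <= target.rpo_seconds
        and recovery_seconds <= target.rto_seconds
    )
```

A drill report could then record `within_targets("customer_facing", data_loss_seconds=10, recovery_seconds=30 * 60)` alongside the raw timings, so a miss is explicit rather than a matter of opinion.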
Per category:
Database state: RDS automated backups, daily snapshots, point-in-time recovery (PITR) up to 7 days, monthly snapshots retained 1 year, cross-region snapshot copies for prod.
Object storage state: S3 versioning + cross-region replication for critical buckets. Object Lock on the most important snapshots.
Cluster state: GitOps repo IS the backup for cluster configuration. Velero backs up persistent volumes and any in-cluster state not in Git.
Configuration state: Terraform state (in S3 with versioning, cross-region replicated). Ansible playbooks (in Git).
Secrets: AWS Secrets Manager has built-in versioning. We additionally export an encrypted snapshot weekly to a separate account.
Identity state: Okta is our IdP. Their disaster recovery is theirs to manage; we have offline access codes to recover access if Okta itself is down.
The discipline: every category has an explicit backup strategy. Nothing is "I assume this is fine."
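One way to keep that discipline honest is a small inventory check that fails when a category has no written strategy. The sketch below paraphrases the categories above; the dictionary keys and the `missing_strategies` helper are illustrative names, not something we claim to run in this exact form.

```python
# Categories and strategies paraphrased from the list above.
BACKUP_INVENTORY = {
    "database": "RDS automated backups, PITR, cross-region snapshot copies",
    "object_storage": "S3 versioning, cross-region replication, Object Lock",
    "cluster": "GitOps repo plus Velero for persistent volumes",
    "configuration": "Terraform state in versioned S3, Ansible playbooks in Git",
    "secrets": "Secrets Manager versioning plus weekly encrypted export",
    "identity": "Okta-managed DR plus offline access codes",
}


def missing_strategies(inventory: dict[str, str], required: set[str]) -> list[str]:
    """Return required categories with no strategy or a blank one, sorted."""
    return sorted(c for c in required if not inventory.get(c, "").strip())
```

Adding a new kind of state then means adding it to the required set first, which forces the "how is this backed up?" conversation before the state exists in production.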
Backups go to a separate AWS account ("DR account") with restricted access:
The reasoning: if the source account is compromised, the attacker can't delete the backups too. We've seen this attack pattern in industry incidents — attackers compromise an account, find and delete backups, then trigger ransomware. Cross-account isolation prevents this.
The DR account costs ~$200/month in storage. Cheap insurance.
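The isolation described above can be reinforced at the bucket level. The following is a rough sketch of a deny-delete bucket policy built in Python; the bucket name, the `Sid`, and the choice of actions are assumptions for illustration, and this is not a copy of our actual policy.

```python
import json


def deny_delete_policy(bucket_name: str) -> dict:
    """Build a bucket policy that denies object and bucket deletion to all principals."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyBackupDeletion",  # hypothetical statement ID
                "Effect": "Deny",
                "Principal": "*",
                "Action": [
                    "s3:DeleteObject",
                    "s3:DeleteObjectVersion",
                    "s3:DeleteBucket",
                ],
                "Resource": [
                    f"arn:aws:s3:::{bucket_name}",
                    f"arn:aws:s3:::{bucket_name}/*",
                ],
            }
        ],
    }


# Hypothetical bucket name; print the JSON that would be attached to the bucket.
print(json.dumps(deny_delete_policy("example-dr-backups"), indent=2))
```

A blanket deny like this also blocks legitimate cleanup by administrators, so a real policy would probably carve out a tightly controlled break-glass role.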
Every quarter, we run one DR drill. The pattern:
Recent drill findings:
Each drill finds at least one issue. None has been catastrophic; all needed to be fixed before a real disaster happened.
Real-world recoveries:
us-east-1 had an elevated-error-rate incident affecting RDS specifically. Our writes started timing out. The plan: fail over to the us-west-2 read replica.
What happened:
Total: about 30 minutes from detection to recovery. Within RTO. Some data loss in the failover (replica was ~10s behind when promoted) — within RPO.
What worked: we had drilled this. The procedure was documented. People knew what to do.
A Terraform PR was merged with a typo that destroyed an EKS cluster (an incorrect count value zeroed out the node groups). The cluster came back with 0 nodes; pods failed to schedule.
What happened:
What went wrong:
Most of these issues were found during subsequent drills. At the time, they were all surprises that compounded. Lesson: drilling matters; lessons from drills compound.
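For the Terraform incident specifically, one guard that could catch this class of change is a CI step that inspects the machine-readable plan and flags anything that deletes a resource. This is a sketch of the idea, not a description of our pipeline; the resource addresses in `example_plan` are made up, and the structure follows the `resource_changes` section of `terraform show -json` output.

```python
import json


def destructive_changes(plan: dict) -> list[str]:
    """Return addresses of resources the plan would delete or replace."""
    flagged = []
    for rc in plan.get("resource_changes", []):
        actions = rc.get("change", {}).get("actions", [])
        # A replace shows up as a delete paired with a create, so it is caught too.
        if "delete" in actions:
            flagged.append(rc["address"])
    return flagged


# Hand-written fragment shaped like `terraform show -json <planfile>` output.
example_plan = json.loads("""
{
  "resource_changes": [
    {"address": "aws_eks_node_group.workers[0]", "change": {"actions": ["delete"]}},
    {"address": "aws_s3_bucket.logs", "change": {"actions": ["update"]}}
  ]
}
""")
```

A pipeline could require an explicit human approval whenever `destructive_changes` returns a non-empty list, which turns a one-character typo into a visible review question.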
The runbooks for common DR scenarios:
Total: ~60-90 min for a full cluster recovery, assuming no drama.
For data corruption:
Total: ~30-60 min depending on database size.
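For the data-corruption case, the key decision is the restore time: just before the corruption, and inside the 7-day PITR window. A minimal sketch of validating that choice and building the arguments one might pass to boto3's `restore_db_instance_to_point_in_time` follows. The instance identifiers are hypothetical, and the helper itself makes no AWS call.

```python
from datetime import datetime, timedelta, timezone

PITR_WINDOW = timedelta(days=7)  # matches the 7-day PITR retention mentioned earlier


def pitr_restore_args(source_id: str, target_id: str,
                      restore_time: datetime, now: datetime) -> dict:
    """Build keyword arguments for a point-in-time restore into a new instance."""
    if restore_time.tzinfo is None or now.tzinfo is None:
        raise ValueError("use timezone-aware datetimes")
    if restore_time > now:
        raise ValueError("restore time is in the future")
    if now - restore_time > PITR_WINDOW:
        raise ValueError("restore time is outside the PITR window")
    return {
        "SourceDBInstanceIdentifier": source_id,
        "TargetDBInstanceIdentifier": target_id,
        "RestoreTime": restore_time,
    }


# Hypothetical usage: restore to two hours before "now" into a new instance.
now = datetime.now(timezone.utc)
args = pitr_restore_args("prod-db", "prod-db-restored", now - timedelta(hours=2), now)
```

The result could be passed as `boto3.client("rds").restore_db_instance_to_point_in_time(**args)`. A point-in-time restore creates a new instance rather than overwriting the source, so the corrupted database stays available for comparison.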
For full instance loss:
For accidentally deleted objects:
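With versioning enabled, a plain delete adds a delete marker on top of the object rather than removing its data, so recovery usually means removing that marker. The sketch below finds the markers to remove from a response shaped like boto3's `list_object_versions`; the sample keys and version IDs are invented, and the helper name is illustrative.

```python
def markers_to_remove(versions_response: dict, prefix: str = "") -> list[tuple[str, str]]:
    """Return (key, version_id) for each delete marker that is the latest version."""
    return [
        (m["Key"], m["VersionId"])
        for m in versions_response.get("DeleteMarkers", [])
        if m.get("IsLatest") and m["Key"].startswith(prefix)
    ]


# Hand-written sample shaped like a boto3 `list_object_versions` response.
sample = {
    "DeleteMarkers": [
        {"Key": "reports/q1.csv", "VersionId": "m1", "IsLatest": True},
        {"Key": "reports/q2.csv", "VersionId": "m2", "IsLatest": False},
    ]
}
```

Each returned pair can be passed to `delete_object` with its `VersionId`; deleting the marker makes the previous version current again.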
For bucket-level issues:
For full account loss:
This is the slowest recovery: estimated 8-24 hours from scratch. We've never done it for real; only drilled.
Our DR investment:
Total: ~$3,000-5,000/month + engineer time. Real money. Compared to the cost of an unrecoverable failure (loss of customer data, trust, and business), the math works.
Plan for three disasters: account compromise, regional outage, data corruption. Each needs different controls.
Cross-account isolation for backups. Single-account backups are vulnerable to account compromise.
Quarterly drills, not just "we have backups." Untested backups are aspirations.
Set RPO and RTO per workload class. Not everything needs the same recoverability.
GitOps for cluster state. The best backup for cluster configuration is Git.
Document restore order. The first time you face a real disaster is not the time to figure it out.
Multi-factor delete on critical resources. Reduces the chance of "rm -rf prod" being instantly catastrophic.
DR is one of those investments where the ROI is invisible until it's not. The teams that have practiced come through real incidents in hours; the teams that haven't come through them in days. We've been on both sides; the difference is meaningful. Build the discipline early; the cost of practice is much lower than the cost of unpracticed recovery.