Disaster Recovery in the Cloud: A Tested Plan

We've executed real disaster recoveries twice. Once it went well. Once it didn't. The difference between the two outcomes was the realism of our preparation. This post is the DR strategy we run now — what we back up, how we test, and what we'd do first if we were starting over.

What "disaster recovery" actually means here

Three categories of disaster:

Account compromise. Someone (us or attackers) destroys our cloud resources. We need to rebuild in a clean account.
Regional outage. AWS us-east-1 has a multi-hour outage. We need to keep running from us-west-2.
Data corruption. A bug or misuse corrupted data. We need to restore from a clean backup.

These are different problems with different responses. Most "DR plans" assume one and miss the others.

RPO and RTO targets

Our targets, by workload class:

RPO (Recovery Point Objective): max acceptable data loss. We target ≤ 15 minutes for transactional data, ≤ 24 hours for analytical data.
RTO (Recovery Time Objective): max acceptable downtime. We target ≤ 30 minutes for the customer-facing app, ≤ 4 hours for internal tools.

These targets drive the backup strategy. Tighter RPOs cost more (more frequent / synchronous replication). Tighter RTOs cost more (warmer standby infrastructure).

For internal tools and non-critical features, the targets are looser. We don't pretend everything has the same recoverability requirements.
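
Targets like these are easier to hold yourself to when they exist as data rather than prose. A minimal sketch, assuming nothing beyond the numbers above; the dictionary keys and the `within_target` helper are illustrative names, not part of any tool we run:

```python
from datetime import timedelta

# Targets from this section. The dictionary keys are illustrative names.
RPO_TARGETS = {
    "transactional": timedelta(minutes=15),
    "analytical": timedelta(hours=24),
}
RTO_TARGETS = {
    "customer_facing_app": timedelta(minutes=30),
    "internal_tools": timedelta(hours=4),
}


def within_target(targets: dict[str, timedelta], name: str, measured: timedelta) -> bool:
    """Return True if a measured data loss or downtime is within the named target."""
    return measured <= targets[name]


# A failover that lost 10 seconds of writes and took 30 minutes end to end.
print(within_target(RPO_TARGETS, "transactional", timedelta(seconds=10)))
print(within_target(RTO_TARGETS, "customer_facing_app", timedelta(minutes=30)))
# A 14-hour recovery measured against the customer-facing RTO.
print(within_target(RTO_TARGETS, "customer_facing_app", timedelta(hours=14)))
```

Feeding drill timings and real incident numbers through the same check keeps the targets honest: either the measurement fits, or the target or the plan has to change.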

What we back up

Per category:

Database state: RDS automated backups, daily snapshots, point-in-time recovery (PITR) up to 7 days, monthly snapshots retained 1 year, cross-region snapshot copies for prod.

Object storage state: S3 versioning + cross-region replication for critical buckets. Object Lock on the most important snapshots.

Cluster state: GitOps repo IS the backup for cluster configuration. Velero backs up persistent volumes and any in-cluster state not in Git.

Configuration state: Terraform state (in S3 with versioning, cross-region replicated). Ansible playbooks (in Git).

Secrets: AWS Secrets Manager has built-in versioning. We additionally export an encrypted snapshot weekly to a separate account.

Identity state: Okta is our IdP. Their disaster recovery is theirs to manage; we have offline access codes to recover access if Okta itself is down.

The discipline: every category has an explicit backup strategy. Nothing is "I assume this is fine."
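
One way to enforce "nothing is assumed fine" is a freshness check per category. The sketch below is hypothetical: the `BackupCheck` type, the one-day threshold for Velero, and the timestamps are assumptions made up for the example; only the daily-snapshot and weekly-export cadences come from the list above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass
class BackupCheck:
    category: str
    strategy: str                      # how this category is backed up
    max_age: timedelta                 # assumed freshness threshold
    last_success: Optional[datetime]   # None means no backup has been seen


def stale(checks: list[BackupCheck], now: datetime) -> list[str]:
    """Return categories with no backup, or a backup older than max_age."""
    return [
        c.category
        for c in checks
        if c.last_success is None or now - c.last_success > c.max_age
    ]


now = datetime(2024, 1, 10, 12, 0, tzinfo=timezone.utc)  # fixed time for the example
checks = [
    BackupCheck("database", "RDS daily snapshot", timedelta(days=1), now - timedelta(hours=6)),
    BackupCheck("secrets", "weekly encrypted export", timedelta(days=7), now - timedelta(days=9)),
    BackupCheck("cluster volumes", "Velero", timedelta(days=1), None),
]
print(stale(checks, now))  # ['secrets', 'cluster volumes']
```

A category with no recorded backup is treated the same as a stale one, so a backup that silently stops running shows up instead of disappearing from the report.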

Cross-account isolation

Backups go to a separate AWS account ("DR account") with restricted access:

Read access only via specific cross-account roles
No write access from the source accounts
Multi-factor delete (S3 MFA Delete) on critical S3 buckets
Account is otherwise unused (no production workloads)

The reasoning: if the source account is compromised, the attacker can't delete the backups too. We've seen this attack pattern in industry incidents — attackers compromise an account, find and delete backups, then trigger ransomware. Cross-account isolation prevents this.

The DR account costs ~$200/month in storage. Cheap insurance.
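
The access rules above map naturally onto an S3 bucket policy. This is a sketch under assumptions, not our actual policy: the account IDs, role ARN, and bucket name are placeholders, and the policy has not been validated against a live account. It allows reads for one named cross-account role and explicitly denies write, delete, and configuration changes to any principal outside the DR account.

```python
import json

DR_ACCOUNT_ID = "111111111111"  # placeholder DR account ID
RESTORE_ROLE_ARN = "arn:aws:iam::222222222222:role/dr-restore-reader"  # placeholder
BUCKET = "example-dr-backups"  # placeholder bucket name


def backup_bucket_policy() -> dict:
    """Build a bucket policy: cross-account read for one role, no outside writes."""
    bucket_arn = f"arn:aws:s3:::{BUCKET}"
    resources = [bucket_arn, f"{bucket_arn}/*"]
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "AllowRestoreRoleRead",
                "Effect": "Allow",
                "Principal": {"AWS": RESTORE_ROLE_ARN},
                "Action": ["s3:GetObject", "s3:GetObjectVersion", "s3:ListBucket"],
                "Resource": resources,
            },
            {
                # An explicit Deny wins over any Allow added later by mistake.
                "Sid": "DenyWritesFromOutsideDrAccount",
                "Effect": "Deny",
                "Principal": "*",
                "Action": [
                    "s3:PutObject",
                    "s3:DeleteObject",
                    "s3:DeleteObjectVersion",
                    "s3:PutLifecycleConfiguration",
                    "s3:PutBucketVersioning",
                ],
                "Resource": resources,
                "Condition": {
                    "StringNotEquals": {"aws:PrincipalAccount": DR_ACCOUNT_ID}
                },
            },
        ],
    }


print(json.dumps(backup_bucket_policy(), indent=2))
```

The deny list includes lifecycle and versioning changes on purpose: an attacker who cannot delete objects directly may still try to expire them with a lifecycle rule or suspend versioning.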

Quarterly DR drills

Every quarter, we run one DR drill. The pattern:

Pick a category. "This quarter we test region failover." Or "this quarter we test data restoration."
Pick a target. A non-critical service or test environment.
Pretend the disaster has happened. Don't use the original — only the backups.
Time it. Record how long each step takes.
Document what went wrong. Inevitably something does.
Update the runbook.
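
The "time it" step is the one most often done from memory afterwards, which makes the numbers useless. A small sketch of recording step durations as the drill runs; `DrillLog` is a hypothetical helper, and the step names are examples:

```python
import time
from contextlib import contextmanager


class DrillLog:
    """Records how long each drill step takes, in seconds."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self.steps: list[tuple[str, float]] = []

    @contextmanager
    def step(self, name: str):
        start = self._clock()
        try:
            yield
        finally:
            # Record the duration even if the step raised, so failures are timed too.
            self.steps.append((name, self._clock() - start))

    def total(self) -> float:
        return sum(duration for _, duration in self.steps)


log = DrillLog()
with log.step("promote replica"):
    pass  # run the real step here
with log.step("repoint services"):
    pass

for name, seconds in log.steps:
    print(f"{name}: {seconds:.1f}s")
print(f"total: {log.total():.1f}s")
```

The clock is injectable so the log itself can be tested, and the per-step timings go straight into the runbook as the next drill's baseline.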

Recent drill findings:

A new operator we'd installed wasn't documented in the restore-order doc; restoration order broke without it
A service team had stopped backing up an important volume after a refactor (no one noticed for 4 months)
Our restore script had a bug in the cross-region path (we'd only ever tested in-region restore)
Cross-account IAM policies needed updates to allow restoration

Each drill finds at least one issue. None has been catastrophic; all needed to be fixed before a real disaster happened.
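
The dropped-volume finding is the kind of gap a mechanical comparison can catch between drills. A minimal sketch, assuming you can produce two sets of names: what should be protected and what the latest backup actually contains. The volume names and the `unprotected_volumes` helper are hypothetical.

```python
def unprotected_volumes(expected: set[str], backed_up: set[str]) -> set[str]:
    """Volumes that should be backed up but are missing from the latest backup."""
    return expected - backed_up


# Hypothetical names. In practice `expected` might come from an inventory file
# and `backed_up` from the backup tool's report.
expected = {"orders-db-data", "uploads", "search-index"}
backed_up = {"orders-db-data", "search-index"}

missing = unprotected_volumes(expected, backed_up)
print(sorted(missing))  # ['uploads']
```

Run on a schedule, a check like this turns a four-month blind spot into an alert the day the backup stops.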

The two real disasters we've had

Real-world recoveries:

Disaster 1: regional latency event (2023)

us-east-1 had an elevated-error-rate incident affecting RDS specifically. Our writes started timing out. The plan: fail over to the us-west-2 read replica.

What happened:

We promoted the cross-region read replica to primary (~12 minutes).
Updated services to point at the new endpoint (~3 minutes for DNS propagation + service connection refresh).
~20 minutes of partial degradation (some services were faster to switch than others).
~95% of customer functionality restored within 25 minutes; the rest within an hour.

Total: about 30 minutes from detection to recovery of most customer functionality, which is within RTO. We lost a small amount of data in the failover (the replica was ~10s behind when promoted), which is within RPO.

What worked: we had drilled this. The procedure was documented. People knew what to do.

Disaster 2: accidental destruction (2022)

A Terraform PR was merged with a typo that effectively destroyed an EKS cluster: an incorrect count value zeroed out the node groups. The cluster was left with 0 nodes, and pods failed to schedule.

What happened:

We noticed within ~5 minutes (alerts firing).
Reverted the Terraform PR; ran apply.
Nodes started coming back; pods started scheduling.
~14 hours total to fully restore.

What went wrong:

Some PVs got orphaned during the destruction; reattaching them was manual
Argo CD's state diverged from cluster reality; took time to reconcile
A new CRD wasn't in our restore-order doc; manually fixed
Sealed-secrets controller hadn't restored its key in time; secrets were unreadable

Most of these issues are the kind that our subsequent drills went on to find. At the time, they were all surprises, and they compounded. Lesson: drilling matters, and the lessons from drills compound too.
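
A typo that deletes node groups is visible in the Terraform plan before it is applied. The sketch below shows one possible guard, not something this post claims we had in place: it reads the JSON form of a plan (the output of `terraform show -json` on a saved plan file) and flags deletions of protected resource types. The `PROTECTED_TYPES` list and the resource addresses are assumptions for the example.

```python
import json

# Resource types that should never be deleted without explicit sign-off.
# This list is an assumption for the example.
PROTECTED_TYPES = {"aws_eks_cluster", "aws_eks_node_group", "aws_db_instance"}


def destructive_changes(plan: dict) -> list[str]:
    """Return addresses of protected resources that the plan would delete."""
    flagged = []
    for rc in plan.get("resource_changes", []):
        actions = rc.get("change", {}).get("actions", [])
        # A replacement shows up as both "delete" and "create", so it is flagged too.
        if "delete" in actions and rc.get("type") in PROTECTED_TYPES:
            flagged.append(rc["address"])
    return flagged


# Trimmed example in the shape of the JSON plan output.
plan = json.loads("""
{
  "resource_changes": [
    {"address": "aws_eks_node_group.workers[0]", "type": "aws_eks_node_group",
     "change": {"actions": ["delete"]}},
    {"address": "aws_s3_bucket.logs", "type": "aws_s3_bucket",
     "change": {"actions": ["update"]}}
  ]
}
""")
print(destructive_changes(plan))  # ['aws_eks_node_group.workers[0]']
```

Wired into CI, a non-empty result would block the merge until someone confirms the deletion is intended.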

Specific recovery procedures

The runbooks for common DR scenarios:

EKS cluster recovery

Provision a new cluster via Terraform (15 min)
Install Argo CD (5 min)
Point Argo at the GitOps repo (1 min)
Wait for Argo to apply CRDs first (5 min)
Wait for sealed-secrets controller to be ready and have its key (5 min)
Wait for application Deployments to roll out (15 min)
Restore PVs from Velero snapshots (10-30 min depending on volume)
Verify services are healthy

Total: ~60-90 min for a full cluster recovery, assuming no drama.
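
The restore order can also live as a dependency graph rather than a numbered list, which makes it harder for a new component to slip in without a declared place. A sketch using Python's standard `graphlib`; the step names paraphrase the runbook above, and the dependency edges simply follow its listed order, which is an assumption about what each step really requires.

```python
from graphlib import TopologicalSorter

# Each step maps to the steps it depends on (edges follow the runbook order).
DEPENDS_ON = {
    "provision cluster": set(),
    "install Argo CD": {"provision cluster"},
    "point Argo at GitOps repo": {"install Argo CD"},
    "apply CRDs": {"point Argo at GitOps repo"},
    "sealed-secrets ready": {"apply CRDs"},
    "roll out Deployments": {"sealed-secrets ready"},
    "restore PVs from Velero": {"roll out Deployments"},
    "verify health": {"restore PVs from Velero"},
}

# (min, max) minutes per step, from the estimates above.
# Verification has no stated estimate, so it is left out of the sum.
MINUTES = {
    "provision cluster": (15, 15),
    "install Argo CD": (5, 5),
    "point Argo at GitOps repo": (1, 1),
    "apply CRDs": (5, 5),
    "sealed-secrets ready": (5, 5),
    "roll out Deployments": (15, 15),
    "restore PVs from Velero": (10, 30),
}

order = list(TopologicalSorter(DEPENDS_ON).static_order())
low = sum(lo for lo, _ in MINUTES.values())
high = sum(hi for _, hi in MINUTES.values())

for step in order:
    print(step)
print(f"estimated: {low}-{high} min before verification")  # 56-76
```

The listed estimates sum to 56-76 minutes, so the ~60-90 figure presumably leaves room for verification and slack. A circular dependency raises `graphlib.CycleError`, which is a useful failure to hit in review rather than mid-recovery.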

RDS database recovery

For data corruption:

Identify the timestamp before corruption
Run a point-in-time restore to a new RDS instance at that timestamp
Update services to use new instance (DNS or config change)
Decommission corrupted instance (or keep for forensics)

Total: ~30-60 min depending on database size.
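
The restore itself is a single API call; the part worth scripting is validating the timestamp before making it. A sketch that builds the arguments for boto3's RDS `restore_db_instance_to_point_in_time` call and checks them against the 7-day PITR window mentioned earlier. The instance identifiers are placeholders, and the call itself is left as a comment so the example runs without AWS credentials.

```python
from datetime import datetime, timedelta, timezone

PITR_WINDOW = timedelta(days=7)  # retention stated earlier in this post


def pitr_restore_params(source_id: str, target_id: str,
                        restore_time: datetime, now: datetime) -> dict:
    """Build keyword arguments for RDS restore_db_instance_to_point_in_time."""
    if restore_time.tzinfo is None or now.tzinfo is None:
        raise ValueError("use timezone-aware datetimes")
    if restore_time > now:
        raise ValueError("restore time is in the future")
    if now - restore_time > PITR_WINDOW:
        raise ValueError("restore time is outside the 7-day PITR window")
    return {
        "SourceDBInstanceIdentifier": source_id,
        "TargetDBInstanceIdentifier": target_id,
        "RestoreTime": restore_time,
    }


now = datetime(2024, 3, 10, 12, 0, tzinfo=timezone.utc)  # fixed time for the example
params = pitr_restore_params(
    "prod-db",                 # placeholder source instance
    "prod-db-restored",        # placeholder target instance
    now - timedelta(hours=2),  # last known-good moment before the corruption
    now,
)
print(params)
# With credentials configured, the call would look like:
#   boto3.client("rds").restore_db_instance_to_point_in_time(**params)
```

Restoring to a new instance, rather than in place, is what keeps the corrupted instance available for forensics.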

For full instance loss:

Restore from latest snapshot (15-30 min)
Update services to use new instance
Replay any captured WAL beyond the snapshot if possible

S3 data recovery

For accidentally deleted objects:

Identify deleted objects via CloudTrail
Restore from S3 versioning
Verify integrity
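
In a versioned bucket, a deleted object is hidden behind a delete marker, and removing that marker by its version ID makes the previous version current again. The sketch below picks out the markers to remove from a listing shaped like an S3 ListObjectVersions response. The keys and version IDs are made up, and `delete_markers_to_remove` is a hypothetical helper.

```python
def delete_markers_to_remove(listing: dict, keys: set[str]) -> list[dict]:
    """Pick the current delete markers for the given keys.

    Removing a delete marker by version ID makes the previous version
    current again, which is how an object is undeleted in a versioned bucket.
    """
    return [
        {"Key": m["Key"], "VersionId": m["VersionId"]}
        for m in listing.get("DeleteMarkers", [])
        if m["IsLatest"] and m["Key"] in keys
    ]


# Trimmed example in the shape of an S3 ListObjectVersions response.
listing = {
    "Versions": [
        {"Key": "reports/q1.csv", "VersionId": "v1", "IsLatest": False},
        {"Key": "reports/q2.csv", "VersionId": "v7", "IsLatest": True},
    ],
    "DeleteMarkers": [
        {"Key": "reports/q1.csv", "VersionId": "dm1", "IsLatest": True},
    ],
}
print(delete_markers_to_remove(listing, {"reports/q1.csv"}))
# [{'Key': 'reports/q1.csv', 'VersionId': 'dm1'}]
```

The returned list has the shape that the S3 batch delete call expects for its `Objects` field, so the list of affected keys from CloudTrail can drive the restore directly.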

For bucket-level issues:

Restore from cross-region replicated copy
Update applications to use the new region's bucket if needed

IAM / account recovery

For full account loss:

New AWS account
Re-run Terraform from scratch (with state backups)
Re-bootstrap Argo CD
Re-create users via SSO
Restore data from cross-account backups

This is the slowest recovery: estimated 8-24 hours from scratch. We've never done it for real; only drilled.

Cost reality

Our DR investment:

Cross-region replication for data: ~$1,200/month
Cross-region snapshots: ~$200/month
DR account (storage + Object Lock): ~$200/month
Velero infrastructure (compute + storage): ~$80/month
Hot standby infrastructure (us-west-2): ~25% of primary cost (sized to handle 100% during failover, idle most of the time)
Engineer time on drills: ~16 hours/quarter

Total: ~$3,000-5,000/month + engineer time. Real money. Compared to the cost of an unrecoverable failure (loss of customer data, trust, and business), the math works.
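
The arithmetic behind that total is worth spelling out, since the hot standby is the only line item given as a percentage. A quick sketch, assuming the listed items are the only components of the monthly total (the post does not state that explicitly):

```python
# Monthly line items from the list above, in USD.
FIXED = {
    "cross-region replication": 1200,
    "cross-region snapshots": 200,
    "DR account": 200,
    "Velero": 80,
}
TOTAL_RANGE = (3000, 5000)  # stated monthly total
STANDBY_SHARE = 0.25        # standby is ~25% of primary cost

fixed_total = sum(FIXED.values())
# Whatever is left of the total after the fixed items is the standby.
standby = tuple(t - fixed_total for t in TOTAL_RANGE)
# The primary cost implied by a standby that is 25% of it.
primary = tuple(round(s / STANDBY_SHARE) for s in standby)

print(fixed_total)  # 1680
print(standby)      # (1320, 3320)
print(primary)      # (5280, 13280)
```

Under that assumption, the standby accounts for somewhere between 44% and 66% of the DR bill, so it is the first place to look when tuning cost against RTO.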

What I'd tell a team starting

Plan for three disasters: account compromise, regional outage, data corruption. Each needs different controls.

Cross-account isolation for backups. Single-account backups are vulnerable to account compromise.

Quarterly drills, not just "we have backups." Untested backups are aspirations.

Set RPO and RTO per workload class. Not everything needs the same recoverability.

GitOps for cluster state. The best backup for cluster configuration is Git.

Document restore order. The first time you face a real disaster is not the time to figure it out.

Multi-factor delete (S3 MFA Delete) on critical resources. It reduces the chance of an "rm -rf prod" moment being instantly catastrophic.

DR is one of those investments where the ROI is invisible until it's not. The teams that have practiced come through real incidents in hours; the teams that haven't come through them in days. We've been on both sides; the difference is meaningful. Build the discipline early; the cost of practice is much lower than the cost of unpracticed recovery.

About Kiril Urbonas
