A different angle on DR: the planning process — RTO/RPO conversations, dependency mapping, and what we learned about prioritizing what to recover.
Most DR content focuses on backup mechanics (how to snapshot RDS, how to set up Velero). The harder part is the planning: deciding what's critical, what trade-offs you're willing to make, and how to communicate those decisions ahead of time, while nobody is panicking. After running real DR exercises and a couple of actual incidents, this post covers the planning side: the conversations and frameworks that come before any technical implementation.
DR planning has three parts:
Most teams skip #1 and #2 and jump to #3, which produces over-engineered solutions for unlikely threats and under-protected critical workloads.
The threat model. Categorize:
Cloud-provider failures:
Application-level failures:
Operational failures:
External failures:
Each category has different probabilities and different responses. A single DR plan can't cover all of them; you need a portfolio of responses.
RPO (Recovery Point Objective): how much data can we afford to lose? RTO (Recovery Time Objective): how long can we afford to be down?
These shouldn't be the same across all workloads. Specific examples from our system:
| Workload | RPO | RTO |
|---|---|---|
| Customer-facing API | 15 min | 30 min |
| Billing system | 5 min | 1 hour |
| Internal admin tools | 24 hours | 4 hours |
| Analytics dashboards | 24 hours | 24 hours |
| ML training pipelines | 1 week | 1 week |
The customer-facing API and billing have tight targets because downtime or data loss there has significant business impact. Analytics and ML are looser: a short outage there has little business impact.
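One way to keep per-workload targets from drifting into a wiki-only artifact is to store them as data and check them against reality. The sketch below is a minimal illustration, not our actual tooling: the targets come from the table above, while the dictionary keys, the `RecoveryTarget` class, and the `rpo_violations` helper are names invented for this example.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class RecoveryTarget:
    rpo: timedelta  # maximum tolerable data loss
    rto: timedelta  # maximum tolerable downtime


# Targets copied from the table above; the keys are illustrative names.
TARGETS = {
    "customer-api": RecoveryTarget(rpo=timedelta(minutes=15), rto=timedelta(minutes=30)),
    "billing": RecoveryTarget(rpo=timedelta(minutes=5), rto=timedelta(hours=1)),
    "admin-tools": RecoveryTarget(rpo=timedelta(hours=24), rto=timedelta(hours=4)),
    "analytics": RecoveryTarget(rpo=timedelta(hours=24), rto=timedelta(hours=24)),
    "ml-training": RecoveryTarget(rpo=timedelta(weeks=1), rto=timedelta(weeks=1)),
}


def rpo_violations(recovery_point_age: dict[str, timedelta]) -> list[str]:
    """Return workloads whose newest recoverable copy is older than their RPO."""
    return sorted(
        name
        for name, age in recovery_point_age.items()
        if age > TARGETS[name].rpo
    )


if __name__ == "__main__":
    # Hypothetical ages of the latest backup or replica position per workload.
    ages = {
        "billing": timedelta(minutes=7),
        "analytics": timedelta(hours=6),
    }
    print(rpo_violations(ages))
```

Feeding this from whatever reports backup age or replication lag turns the RPO column into an alertable check rather than a stated intention.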
The conversation isn't "what's the best RPO/RTO we can hit?" It's "what's the worst we can tolerate, and what does it cost to do better?"
For each workload, tighter targets cost more:
The sweet spot is usually 15-minute RPO for critical workloads and 24-hour RPO for everything else. Going tighter than 15 minutes adds significant infrastructure cost; usually not worth it unless there's a specific business requirement.
For RTO:
We use cold standby for non-critical, warm for moderately critical, hot for the customer-facing services. Active-active only where the workload pattern fits.
Before you can plan recovery, you need to know what depends on what.
For each critical service:
Visualize as a dependency graph. The "must work for X to work" set is what you need to recover before X.
A common failure: assuming a service is recoverable when its dependencies aren't. "We can fail over the API to us-west-2" is true only if the database, secrets, DNS, and so on are also available there.
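The dependency graph can be kept as plain data and queried. The following is a rough sketch under assumptions: the service names and edges are hypothetical (loosely based on the API example above), and `recovery_order` and `missing_for_failover` are illustrative helpers. It uses `graphlib.TopologicalSorter` from the Python standard library (3.9+).

```python
from graphlib import TopologicalSorter

# Hypothetical dependency map: each service lists what must work before it can.
DEPENDS_ON = {
    "api": {"database", "secrets", "dns"},
    "database": {"secrets"},
    "secrets": set(),
    "dns": set(),
}


def recovery_order(depends_on: dict[str, set[str]]) -> list[str]:
    """Order services so every dependency is recovered before its dependents."""
    return list(TopologicalSorter(depends_on).static_order())


def missing_for_failover(
    service: str,
    depends_on: dict[str, set[str]],
    available_in_region: set[str],
) -> set[str]:
    """Return transitive dependencies of `service` not available in the target region."""
    needed: set[str] = set()
    stack = [service]
    while stack:
        for dep in depends_on.get(stack.pop(), set()):
            if dep not in needed:
                needed.add(dep)
                stack.append(dep)
    return needed - available_in_region


if __name__ == "__main__":
    print(recovery_order(DEPENDS_ON))
    # Is "we can fail over the API" actually true if secrets are not replicated?
    print(missing_for_failover("api", DEPENDS_ON, {"database", "dns"}))
```

A non-empty result from `missing_for_failover` is exactly the failure described above: the service looks recoverable until you ask what it needs.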
In a real incident, you can't recover everything simultaneously. Order matters.
Our recovery priorities:
Tier 1 (recover immediately, within RTO):
Tier 2 (recover within 4 hours):
Tier 3 (recover within 24 hours):
Tier 4 (best-effort):
This ordering is communicated upfront. When the incident hits, the team doesn't argue about what to recover first — the priority is established.
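A pre-agreed ordering is easy to encode so that nobody has to reconstruct it from memory. The sketch below is illustrative only: the tier deadlines come from the list above, but the workload-to-tier assignments are hypothetical placeholders, and `Tier`, `TIER_DEADLINE`, and `recovery_queue` are names made up for this example.

```python
from datetime import timedelta
from enum import IntEnum
from typing import Optional


class Tier(IntEnum):
    TIER_1 = 1  # recover immediately, within the workload's own RTO
    TIER_2 = 2  # recover within 4 hours
    TIER_3 = 3  # recover within 24 hours
    TIER_4 = 4  # best-effort


# Fixed deadlines per tier. Tier 1 defers to each workload's RTO and
# Tier 4 is best-effort, so neither has a single deadline here.
TIER_DEADLINE: dict[Tier, Optional[timedelta]] = {
    Tier.TIER_1: None,
    Tier.TIER_2: timedelta(hours=4),
    Tier.TIER_3: timedelta(hours=24),
    Tier.TIER_4: None,
}

# Hypothetical assignments, for illustration only.
ASSIGNMENTS = {
    "analytics-dashboards": Tier.TIER_3,
    "customer-api": Tier.TIER_1,
    "ml-training": Tier.TIER_4,
    "admin-tools": Tier.TIER_2,
    "billing": Tier.TIER_1,
}


def recovery_queue(assignments: dict[str, Tier]) -> list[str]:
    """Sort workloads by tier, then by name so the order is deterministic."""
    return sorted(assignments, key=lambda name: (assignments[name], name))


if __name__ == "__main__":
    for name in recovery_queue(ASSIGNMENTS):
        print(ASSIGNMENTS[name].name, name)
```

Within a tier, the dependency order still applies; this only fixes which tier gets attention first.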
Before any real DR drill, we do a tabletop:
Tabletop exercises take 2-3 hours. They surface:
Tabletops are much cheaper than real drills and surface most of the same issues. We run a tabletop every quarter and real drills less often.
Each priority-tier scenario has a runbook. Structure:
The runbook is concrete enough that someone who's never run it before can execute it. We test this — periodically, someone unfamiliar with a runbook attempts it during a drill.
Runbook example, abbreviated:
Title: Region Failover - us-east-1 to us-west-2
Detection: Multiple us-east-1 service health checks failing for >5 min,
  AWS Health Dashboard shows event affecting our services
Decision: Engineering manager + platform lead decide. Trigger: above detection
  + decision that recovery is faster than waiting.
Communication:
  - Internal: #incidents channel; @here ping
  - Customers: status page update; "We're investigating issues..."
Execution:
  1. Promote RDS read replica in us-west-2 to primary (~10 min)
     Command: ./scripts/promote-replica.sh us-west-2 prod-db
  2. Update Route 53 to direct traffic to us-west-2 (~3 min)
     Command: ./scripts/dns-failover.sh us-west-2
  3. Scale up us-west-2 EKS node groups (~5 min)
     Command: ./scripts/scale-region.sh us-west-2 100%
  4. Verify health checks pass; confirm traffic flowing
Verification:
  - Customer-facing endpoint returns 200 from us-west-2
  - Error rate < 1% for 15 min
  - No customer-reported issues in support channel
Post-incident:
  - Update status page
  - Schedule post-mortem
  - Plan failback timeline (typically next maintenance window)
This is the kind of detail that plays out very differently when it's written down than when it's improvised at 3 AM.
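Verification criteria like "error rate < 1% for 15 min" are worth making mechanical, so that "looks fine" at 3 AM is not a judgment call. Here is a minimal sketch, assuming per-minute error-rate samples are already available from your monitoring; the `failover_verified` function and its input format are assumptions for illustration, not part of the runbook.

```python
from typing import Sequence


def failover_verified(
    error_rates: Sequence[float],
    threshold: float = 0.01,
    window: int = 15,
) -> bool:
    """Check the runbook's error-rate gate.

    `error_rates` holds one error-rate sample per minute (as a fraction,
    so 0.01 means 1%), oldest first. The gate passes only when at least
    `window` samples exist and the most recent `window` samples are all
    strictly below `threshold`.
    """
    if len(error_rates) < window:
        return False  # not enough history yet; keep waiting
    return all(rate < threshold for rate in error_rates[-window:])


if __name__ == "__main__":
    # A spike right after cutover, followed by 15 quiet minutes.
    samples = [0.05] + [0.002] * 15
    print(failover_verified(samples))
```

Requiring a full window before passing is deliberate: a few good minutes right after cutover should not close out the incident.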
The one we hope never happens but plan for:
For these, the response isn't "fail over to a hot standby." It's "this will take hours; let's communicate clearly with stakeholders and execute methodically."
We have a "worst-case" runbook that documents:
This runbook is offline (printed copy in a safe, also encrypted on engineers' laptops). If the cloud is gone, our wiki might be gone too.
Mistakes we made on early DR planning:
Same RPO/RTO for everything. Set 15-min RPO across the board. Realized later that internal analytics didn't need that; we were paying for replication we didn't need.
Tied DR to specific tools rather than capabilities. Plans referred to specific scripts and commands; when scripts evolved, plans became wrong. Now plans describe outcomes ("promote the read replica") and link to the current tool.
Skipped tabletops for "quick" scenarios. Some scenarios seemed obvious; we didn't run tabletops. When we hit them for real, the obvious plan had gaps. Now: tabletop everything at least once.
Underestimated communication overhead. Technical recovery is only half the work; communication (internal, customer, leadership) is the other half. Plans now explicitly include comms steps.
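The second lesson above (plans describe outcomes and link to the current tool) can be sketched as a small indirection layer. This is an illustration under assumptions, not our real setup: the script paths are the ones from the runbook example, while `Step`, `TOOLS`, `REGION_FAILOVER`, and `render` are hypothetical names.

```python
import shlex
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    outcome: str  # what must be true afterwards; the stable part of the plan
    tool: str     # key into the tool registry, not a literal command


# Registry maintained next to the scripts themselves. When a script is
# renamed or replaced, only this mapping changes; the plan text does not.
TOOLS = {
    "promote-replica": "./scripts/promote-replica.sh {region} {db}",
    "dns-failover": "./scripts/dns-failover.sh {region}",
}

REGION_FAILOVER = [
    Step("Promote the read replica to primary", "promote-replica"),
    Step("Direct traffic to the recovery region", "dns-failover"),
]


def render(step: Step, **params: str) -> list[str]:
    """Resolve a step to the current command as an argument list."""
    return shlex.split(TOOLS[step.tool].format(**params))


if __name__ == "__main__":
    for step in REGION_FAILOVER:
        command = render(step, region="us-west-2", db="prod-db")
        print(f"{step.outcome}: {' '.join(command)}")
```

The plan stays readable as a list of outcomes, and a stale command becomes a one-line registry fix instead of a runbook rewrite.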
Cost of the planning side:
This is small relative to the technical infrastructure cost. The technical side (warm standby, cross-region replication, etc.) is where the real money goes.
Threat model first; mitigations second. Without knowing what you're protecting against, you'll over-engineer some things and under-protect others.
RPO/RTO per workload, not org-wide. Different workloads, different targets.
Dependency map. What depends on what? Recovery order follows.
Tabletop before drilling. Catches most of the issues at much lower cost.
Runbooks describe outcomes, not specific commands. Commands change; outcomes are stable.
Plan for communication, not just technical recovery. The hard part is often who tells whom what.
Practice the worst-case scenario. "Everything is bad" is different from "us-east-1 is down."
DR planning is one of those investments that's invisible until needed. The teams that have planned well come through real incidents calmly; the teams that haven't experience them as chaos. The technical infrastructure (backups, replicas, etc.) is the easy part to talk about; the planning discipline is what determines whether the infrastructure actually saves you when called upon.