A different angle on DR: the planning process — RTO/RPO conversations, dependency mapping, and what we learned about prioritizing what to recover.

Disaster Recovery Planning: The Process, Not Just the Tools

Most DR content focuses on backup mechanics (how to snapshot RDS, how to set up Velero). The harder part is the planning: deciding what's critical, what trade-offs you're willing to make, and how to communicate those decisions while nobody is panicking. Having run real DR exercises and been through a couple of actual incidents, we cover the planning side here: the conversations and frameworks that come before any technical implementation.

The planning conversation #

DR planning has three parts:

1. What can fail? The threat model.
2. What can we accept losing? RPO and RTO targets per workload.
3. What's it worth to prevent? The cost of mitigations vs the cost of disasters.

Most teams skip #1 and #2 and jump to #3, which produces over-engineered solutions for unlikely threats and under-protected critical workloads.

What can fail #

Start with the threat model. Categorize failures by where they originate:

Cloud-provider failures:

Single AZ: relatively common, multi-AZ design protects.
Single region: rare but happens. Multi-region or accept-the-downtime.
Multi-region simultaneous: very rare. Prepare for "long downtime" rather than "perfect failover."
Account-level: account compromise, account suspension. Cross-account isolation.

Application-level failures:

Bad deploy that crashes the service: rollback path.
Data corruption from bad code: restore from backup.
Cascading dependency failure: degradation modes.

Operational failures:

Human error (someone runs the wrong command): undo path.
Lost access (admins leave, MFA tokens lost): break-glass procedures.
Misconfigured policy (overly aggressive security policy locks us out): emergency override.

External failures:

Third-party service outage: fallback paths, ideally.
DNS provider issues: secondary DNS.
Major dependency CVE requiring rapid patching: ability to do emergency deploys.

Each category has different probabilities and different responses. A single DR plan can't cover all of them; you need a portfolio of responses.
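One way to keep that portfolio honest is to record each threat with its category and planned response, then check that nothing is left without a response. The following is a minimal sketch, not a tool we use: the `Threat` class and both helper functions are illustrative names, and the entries are a small subset copied from the lists above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Threat:
    category: str   # e.g. "cloud-provider", "operational"
    name: str
    response: str   # planned response; empty string means "not decided yet"


# A few entries taken from the lists above; a real register would hold all of them.
THREATS = [
    Threat("cloud-provider", "single AZ failure", "multi-AZ design"),
    Threat("cloud-provider", "account compromise", "cross-account isolation"),
    Threat("application", "data corruption from bad code", "restore from backup"),
    Threat("operational", "lost admin access", "break-glass procedures"),
    Threat("external", "DNS provider issues", "secondary DNS"),
]


def unaddressed(threats):
    """Return the threats that have no planned response yet."""
    return [t for t in threats if not t.response.strip()]


def by_category(threats):
    """Group threat names by category, preserving input order."""
    grouped = {}
    for t in threats:
        grouped.setdefault(t.category, []).append(t.name)
    return grouped
```

Reviewing the output of `unaddressed(...)` during the annual threat-model review is one cheap way to spot categories that were discussed but never given a response.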

RPO and RTO per workload #

RPO (Recovery Point Objective): how much data can we afford to lose? RTO (Recovery Time Objective): how long can we afford to be down?

These shouldn't be the same across all workloads. Specific examples from our system:

Workload	RPO	RTO
Customer-facing API	15 min	30 min
Billing system	5 min	1 hour
Internal admin tools	24 hours	4 hours
Analytics dashboards	24 hours	24 hours
ML training pipelines	1 week	1 week

The customer-facing API and billing have tight targets — significant business impact from downtime or data loss. Analytics and ML are looser — short outages don't really hurt.

The conversation isn't "what's the best RPO/RTO we can hit?" It's "what's the worst we can tolerate, and what's it cost to do better?"
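Once the targets are agreed, it helps to have them in a form that drills and post-mortems can be checked against. The sketch below encodes the table above and compares an observed data-loss window and downtime with the targets. The workload keys and the `check_targets` function are hypothetical names chosen for this example.

```python
from datetime import timedelta

# (RPO, RTO) per workload, copied from the table above.
TARGETS = {
    "customer-facing-api": (timedelta(minutes=15), timedelta(minutes=30)),
    "billing": (timedelta(minutes=5), timedelta(hours=1)),
    "internal-admin-tools": (timedelta(hours=24), timedelta(hours=4)),
    "analytics-dashboards": (timedelta(hours=24), timedelta(hours=24)),
    "ml-training-pipelines": (timedelta(weeks=1), timedelta(weeks=1)),
}


def check_targets(workload, data_loss, downtime):
    """Compare an observed (or drilled) data-loss window and downtime
    against the workload's targets. Returns one boolean per objective."""
    rpo, rto = TARGETS[workload]
    return {"rpo_met": data_loss <= rpo, "rto_met": downtime <= rto}
```

For example, a billing drill that lost 8 minutes of data and took 40 minutes to recover, `check_targets("billing", timedelta(minutes=8), timedelta(minutes=40))`, misses the RPO but meets the RTO.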

The cost-vs-RPO/RTO curve #

For each workload, tighter targets cost more:

24-hour RPO: daily backups. Cheap.
1-hour RPO: continuous replication. Significantly more expensive (compute, storage, network).
5-minute RPO: synchronous multi-region replication. Very expensive (especially network).
Zero RPO: synchronous-multi-AZ + active-active replication. Most expensive; not always achievable depending on consistency requirements.

The sweet spot is usually 15-minute RPO for critical workloads and 24-hour RPO for everything else. Going tighter than 15 minutes adds significant infrastructure cost; usually not worth it unless there's a specific business requirement.

For RTO:

24-hour RTO: cold standby (rebuild from scratch). Cheap.
4-hour RTO: warm standby (infrastructure ready, data current). Moderate cost.
30-minute RTO: hot standby with auto-failover. Expensive.
5-minute RTO: active-active with traffic split. Most expensive.

We use cold standby for non-critical, warm for moderately critical, hot for the customer-facing services. Active-active only where the workload pattern fits.
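The RTO list above can be read as a lookup: given a target, pick the cheapest standby mode that still meets it. A minimal sketch of that lookup follows. It assumes the figures in the list are what each mode actually achieves, which in practice depends on the system; `RTO_MODES` and `cheapest_mode` are names invented for this example.

```python
from datetime import timedelta

# (RTO the mode is listed for, mode), ordered cheapest first.
# Figures are taken from the RTO list above.
RTO_MODES = [
    (timedelta(hours=24), "cold standby"),
    (timedelta(hours=4), "warm standby"),
    (timedelta(minutes=30), "hot standby with auto-failover"),
    (timedelta(minutes=5), "active-active"),
]


def cheapest_mode(rto_target):
    """Return the cheapest standby mode whose listed RTO fits within the target."""
    for achievable, mode in RTO_MODES:
        if achievable <= rto_target:
            return mode
    raise ValueError(f"no listed mode achieves an RTO of {rto_target}")
```

With these thresholds, a 1-hour RTO target lands on hot standby, because warm standby is only listed for 4 hours. That kind of step in the cost curve is worth surfacing in the targets conversation: relaxing a target slightly can move a workload to a much cheaper mode.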

Dependency mapping #

Before you can plan recovery, you need to know what depends on what.

For each critical service:

What databases does it read/write?
What downstream services does it call?
What upstream services depend on it?
What infrastructure does it need (DNS, secrets, IAM)?
What cloud-provider services does it use?

Visualize as a dependency graph. The "must work for X to work" set is what you need to recover before X.

A common failure: assuming a service is recoverable when its dependencies aren't. "We can fail over the API to us-west-2" is true only if the database, secrets, DNS, and so on are also available there.
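The "must work for X to work" set and the resulting recovery order can both be computed from the dependency graph. The sketch below uses Python's standard `graphlib` module (Python 3.9+). The graph contents are a hypothetical example, and `must_work_for`, `missing_in`, and `recovery_order` are illustrative helper names.

```python
from graphlib import TopologicalSorter

# service -> things that must work before it can work (hypothetical example graph)
DEPENDS_ON = {
    "api": {"database", "secrets", "dns"},
    "database": {"iam"},
    "secrets": {"iam"},
    "dns": set(),
    "iam": set(),
}


def must_work_for(service, graph):
    """Return every direct and transitive dependency of `service`."""
    seen = set()
    stack = list(graph.get(service, ()))
    while stack:
        dep = stack.pop()
        if dep not in seen:
            seen.add(dep)
            stack.extend(graph.get(dep, ()))
    return seen


def missing_in(service, graph, available):
    """Dependencies of `service` that are not available in the target region."""
    return must_work_for(service, graph) - set(available)


def recovery_order(graph):
    """Order in which to recover things: dependencies before dependents."""
    return list(TopologicalSorter(graph).static_order())
```

`missing_in("api", DEPENDS_ON, {"database", "dns"})` returns `{"secrets", "iam"}`, which is exactly the kind of gap that makes "we can fail over the API" untrue. `TopologicalSorter` also raises `CycleError` on circular dependencies, which are worth knowing about before an incident.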

Prioritization: what gets recovered first #

In a real incident, you can't recover everything simultaneously. Order matters.

Our recovery priorities:

Tier 1 (recover immediately, within RTO):

Customer-facing critical paths (sign-in, the main app)
Payment processing
Authentication / authorization

Tier 2 (recover within 4 hours):

Customer-facing non-critical features
Internal operations tools
Background processing

Tier 3 (recover within 24 hours):

Analytics
Reporting
Internal admin tooling that isn't needed for day-to-day operations

Tier 4 (best-effort):

Stale-data dashboards
Historical analytics
Nice-to-have integrations

This ordering is communicated upfront. When the incident hits, the team doesn't argue about what to recover first — the priority is established.
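A tier list is only usable if no service depends on something scheduled for a later tier. The check below is a sketch of that sanity test, under the assumption that tiers and dependencies are recorded somewhere machine-readable. The service names, the `TIER` and `NEEDS` mappings, and `tier_inversions` are all hypothetical.

```python
# Tier per service (1 = recover first). Names are illustrative.
TIER = {
    "sign-in": 1,
    "payments": 1,
    "auth": 1,
    "background-jobs": 2,
    "analytics": 3,
}

# service -> services it needs in order to work (hypothetical).
NEEDS = {
    "sign-in": ["auth"],
    "payments": ["auth", "background-jobs"],
    "analytics": [],
}


def tier_inversions(tier, needs):
    """Find (service, dependency) pairs where the dependency is recovered
    in a later tier than the service that needs it."""
    problems = []
    for service, deps in needs.items():
        for dep in deps:
            if tier[dep] > tier[service]:
                problems.append((service, dep))
    return problems
```

Here `tier_inversions(TIER, NEEDS)` returns `[("payments", "background-jobs")]`: either background processing moves up to Tier 1, or payments needs a degraded mode that works without it.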

The tabletop exercise #

Before any real DR drill, we do a tabletop:

Pick a scenario ("us-east-1 is down for 6 hours")
Walk through what we'd do, step by step
Identify gaps in the plan
Document who decides what

Tabletop exercises take 2-3 hours. They surface:

Missing runbooks for specific scenarios
Decisions that nobody is empowered to make ("does the CTO need to approve a region failover?")
Communication gaps ("who tells the customer support team?")
Technical gaps ("we don't have a runbook for promoting the cross-region replica")

Tabletops are much cheaper than real drills and surface most of the same issues. We run a tabletop every quarter; real drills less often.

The runbooks #

Each priority-tier scenario has a runbook. Structure:

Detection: how do we know this is happening?
Decision: who declares it; what are the criteria?
Communication: who needs to know? What channel?
Execution: step-by-step recovery actions.
Verification: how do we know recovery worked?
Post-incident: what to do after.

The runbook is concrete enough that someone who's never run it before can execute it. We test this — periodically, someone unfamiliar with a runbook attempts it during a drill.
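The six-part structure is easy to enforce mechanically during runbook review. The following sketch assumes runbooks are plain text with one `Name:` header per section starting at the beginning of a line and indented content underneath; that format and the `missing_sections` helper are assumptions for illustration, not a description of our tooling.

```python
REQUIRED_SECTIONS = (
    "Detection",
    "Decision",
    "Communication",
    "Execution",
    "Verification",
    "Post-incident",
)


def missing_sections(runbook_text):
    """Return the required sections that have no 'Name:' header line."""
    present = set()
    for line in runbook_text.splitlines():
        # Section headers start at column 0; indented lines are section content.
        if line and not line[0].isspace():
            present.add(line.split(":", 1)[0].strip())
    return [s for s in REQUIRED_SECTIONS if s not in present]
```

A check like this only proves the sections exist. Whether someone unfamiliar with the runbook can actually execute it still has to be tested in a drill.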

Runbook example, abbreviated:

Title: Region Failover - us-east-1 to us-west-2

Detection: Multiple us-east-1 service health checks failing for >5 min,
           AWS Health Dashboard shows event affecting our services

Decision: Engineering manager + platform lead decide. Trigger: above detection
          + decision that recovery is faster than waiting.

Communication: 
  - Internal: #incidents channel; @here ping
  - Customers: status page update; "We're investigating issues..."

Execution:
  1. Promote RDS read replica in us-west-2 to primary (~10 min)
     Command: ./scripts/promote-replica.sh us-west-2 prod-db
  2. Update Route 53 to direct traffic to us-west-2 (~3 min)
     Command: ./scripts/dns-failover.sh us-west-2
  3. Scale up us-west-2 EKS node groups (~5 min)
     Command: ./scripts/scale-region.sh us-west-2 100%
  4. Verify health checks pass; confirm traffic flowing
  
Verification:
  - Customer-facing endpoint returns 200 from us-west-2
  - Error rate < 1% for 15 min
  - No customer-reported issues in support channel
  
Post-incident:
  - Update status page
  - Schedule post-mortem
  - Plan failback timeline (typically next maintenance window)

This level of detail plays out very differently when it is written down in advance than when it is improvised at 3 AM.
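The time estimates in the runbook also let you check it against the RTO before a drill. The arithmetic below is a sketch under two assumptions that the runbook itself does not state: the steps run sequentially, and the RTO clock starts when the failure begins. The constant and function names are made up for this example, and step 4 is left out because it has no estimate.

```python
# Estimates from the runbook above, in minutes.
DETECTION_MIN = 5  # "failing for >5 min" is a lower bound
EXECUTION_STEPS_MIN = {
    "promote RDS replica": 10,
    "update Route 53": 3,
    "scale up EKS node groups": 5,
}
RTO_MIN = 30  # customer-facing API target from the RPO/RTO table


def slack_minutes(rto, detection, steps):
    """Minutes left for deciding and verifying, assuming sequential steps."""
    return rto - detection - sum(steps.values())


slack = slack_minutes(RTO_MIN, DETECTION_MIN, EXECUTION_STEPS_MIN)
```

The execution steps sum to 18 minutes, so `slack` comes out at 7 minutes. That is all the time available for making the failover decision and confirming recovery, which is a strong argument for settling who decides, and on what criteria, before the incident.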

The "what if everything is bad" scenario #

The one we hope never happens but plan for:

Multi-region outage
Account compromise + data destruction
Major external dependency (e.g., DNS provider, identity provider) is down

For these, the response isn't "fail over to a hot standby." It's "this will take hours; let's communicate clearly with stakeholders and execute methodically."

We have a "worst-case" runbook that documents:

What to do when normal recovery is impossible
How to access offline credentials and procedures
Who to contact for what
Order of operations to minimize chaos

This runbook is kept offline (a printed copy in a safe, plus encrypted copies on engineers' laptops). If the cloud is gone, our wiki might be gone too.

What we got wrong initially #

Mistakes we made in our early DR planning:

Same RPO/RTO for everything. We set a 15-minute RPO across the board, then realized that internal analytics didn't need it; we were paying for replication we didn't need.

Tied DR to specific tools rather than capabilities. Plans referred to specific scripts and commands; when scripts evolved, plans became wrong. Now plans describe outcomes ("promote the read replica") and link to the current tool.

Skipped tabletops for "quick" scenarios. Some scenarios seemed obvious; we didn't run tabletops. When we hit them for real, the obvious plan had gaps. Now: tabletop everything at least once.

Underestimated communication overhead. Technical recovery is half the work; communication (internal, customer, leadership) is the other half. Plans now explicitly include comms steps.

Cost of DR planning #

Cost of the planning side:

Tabletop exercises: ~8 person-hours per quarter
Runbook authoring and review: ongoing, ~2 hours/week across the team
Documentation maintenance: ~1 hour/week
Annual major review (RPO/RTO targets, threat model): ~1 day for the senior team

This is small relative to the technical infrastructure cost. The technical side (warm standby, cross-region replication, etc.) is where the real money goes.
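To put the planning figures on one scale, they can be annualized. This is a rough sketch: it assumes 52 working weeks, an 8-hour day, and that the annual review costs one day per senior team member. The size of the senior team is not stated here, so it is left as a parameter; `annual_planning_hours` is a name invented for this example.

```python
WEEKS_PER_YEAR = 52
HOURS_PER_DAY = 8  # assumption: one working day = 8 hours


def annual_planning_hours(senior_team_size):
    """Approximate person-hours per year spent on the planning side."""
    tabletops = 8 * 4                           # ~8 person-hours per quarter
    runbooks = 2 * WEEKS_PER_YEAR               # ~2 hours/week across the team
    docs = 1 * WEEKS_PER_YEAR                   # ~1 hour/week
    review = HOURS_PER_DAY * senior_team_size   # ~1 day per senior team member
    return tabletops + runbooks + docs + review
```

Excluding the annual review, this comes to 188 person-hours a year; with a hypothetical senior team of four it is 220.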

What I'd tell a team starting #

Threat model first; mitigations second. Without knowing what you're protecting against, you'll over-engineer some things and under-protect others.

RPO/RTO per workload, not org-wide. Different workloads, different targets.

Dependency map. What depends on what? Recovery order follows.

Tabletop before drilling. Catches most of the issues at much lower cost.

Runbooks describe outcomes, not specific commands. Commands change; outcomes are stable.

Plan for communication, not just technical recovery. The hard part is often who tells whom what.

Practice the worst-case scenario. "Everything is bad" is different from "us-east-1 is down."

DR planning is one of those investments that's invisible until needed. The teams that have planned well come through real incidents calmly; the teams that haven't experience them as chaos. The technical infrastructure (backups, replicas, etc.) is the easy part to talk about; the planning discipline is what determines whether the infrastructure actually saves you when called upon.

About Kiril Urbonas