A DR runbook nobody reads is worse than no runbook. The shape that finally got ours executed correctly under pressure.
The first time we ran a real region-failover drill, the runbook was useless. Not because it was wrong — it was technically correct — but because it was written like a reference manual. Nobody could execute it under stress. The on-call engineer spent the first 12 minutes scrolling through 14 pages trying to figure out what to do first.
Six months and four drills later, our DR runbook fits on roughly two screens of scrollable content and the engineer who's never seen it before can start executing inside 90 seconds. This post is about how we got there.
Before any redesign, we listed what a DR runbook actually needs to be:
Our previous runbook failed on every one of these. Restructuring around them is what made the difference.
Each runbook follows a fixed three-section structure:
The decision section forces the engineer to pause and confirm before acting. The execution section is the actual work. The verification section closes the loop.
Below is the actual structure for our primary database failover runbook, abbreviated.
## Decision: is this a DR event?
Run all THREE of these. If 2+ are true, this is a DR event:
[ ] /api/health on primary returns 5xx for >5 min in the prod-us-east-1 region
(one-liner: curl -fsS https://prod.example.com/api/health || echo BAD)
[ ] AWS Health Dashboard shows an active issue in us-east-1 for RDS or networking
https://health.aws.amazon.com/health/status
[ ] Datadog dashboard "us-east-1 health" shows error rate >50% across services
https://app.datadoghq.com/dash/integration/aws-east-1-health
If <2 are true, STOP. This is probably a partial outage, not a DR event.
Page the responsible service team and continue normal incident response.
If 2+ are true, continue to Section 2.
The cost of this section is a minute of the engineer's time. The benefit is preventing failover-when-you-shouldn't-have, which is the worst possible mistake.
The execution section is the bulk of the runbook. Every step has:
Example, abbreviated, three of the actual ~12 steps:
## Execution
### Step 1: Acquire failover lock
Run on your laptop:
aws --profile prod-dr ssm put-parameter \
--name /dr/failover-lock \
--value "$(whoami)-$(date +%s)" \
--overwrite
Expected: returns the parameter version number.
If it errors with AccessDenied: you don't have DR-execute permissions.
Page #sec-emergency for break-glass.
### Step 2: Promote read replica in us-west-2
Run:
aws --profile prod-dr rds promote-read-replica \
--region us-west-2 \
--db-instance-identifier app-prod-replica-uw2
Expected: response shows "DBInstanceStatus": "modifying".
Wait until status becomes "available" (poll every 30s):
while true; do
status=$(aws --profile prod-dr rds describe-db-instances \
--region us-west-2 \
--db-instance-identifier app-prod-replica-uw2 \
--query 'DBInstances[0].DBInstanceStatus' --output text)
echo "$(date +%H:%M:%S) $status"
[ "$status" = "available" ] && break
sleep 30
done
Expected duration: 4-7 minutes.
### Step 3: Update Route53 to point app to us-west-2 ALB
Run:
aws --profile prod-dr route53 change-resource-record-sets \
--hosted-zone-id Z123456789ABCD \
--change-batch file://dr/route53-failover-uswest2.json
The change batch file is committed at infra/dr/route53-failover-uswest2.json.
Expected: returns ChangeInfo with Status PENDING.
DNS propagation: ~60 seconds with our 30s TTL.
The pattern is the same for every step. No prose explaining "why" — that's elsewhere. No "you may need to" — every step is definite. No conditionals more complex than "if X then Y else Z."
After the last execution step, verification is a short checklist:
## Verification
[ ] Run `curl -fsS https://prod.example.com/api/health` — expect 200 from us-west-2
(verify by inspecting response header `x-served-by`)
[ ] Datadog dashboard "post-DR health" — error rate < 1% for 5 min
https://...
[ ] DBA confirms write traffic appearing in us-west-2 RDS
(post in #sre-oncall, tag @dba-rotation)
[ ] Announce in #all-engineering and statuspage.io that DR failover is complete
[ ] Schedule post-DR review for next business day morning
The four checkboxes plus the announce step are the entire verification. If you check all five, the failover is done. If you can't check one, escalate.
Our previous runbook had three things that we deleted:
Two things keep the runbooks from rotting:
Quarterly drills. Once a quarter we run the full DR procedure end-to-end against staging (not production — but the runbook itself is identical between staging and prod, so the validation transfers). Anything that doesn't work in the drill becomes a blocking ticket.
Last-tested timestamp at the top. Every runbook starts with:
Last successfully executed end-to-end: 2026-03-12 (drill in staging)
Last reviewed: 2026-04-15
If "last executed" is more than 90 days old, the on-call rotation lead is responsible for scheduling a drill. We track drift.
The runbook quality directly correlates with the quality of the post-incident review. When the runbook is bad, the review is full of "we should have done X" with no clear assignment. When the runbook is good, the review focuses on whether the runbook itself needs updating — and the change lands as a PR before the review meeting ends.
The other surprise: writing a runbook this concretely is harder than writing the meandering reference version. Forcing yourself to specify exactly what command to run and exactly what output to expect requires actually understanding the system. We discovered three configuration assumptions during the rewrite that turned out to be wrong, and fixed them as part of the runbook update.
Pick the scenario that scares you most. Write the runbook for that one specifically. Don't try to write a "general DR runbook" that handles everything; you'll end up with the document you can't execute under pressure.
Test it. Actually run it, in staging if you can, on a real Tuesday afternoon. The first run will reveal everything wrong with the runbook. Fix what's wrong, run it again the following month. Two clean drills and you have a real runbook.
Then write the next one.
Get the latest tutorials, guides, and insights on AI, DevOps, Cloud, and Infrastructure delivered directly to your inbox.
We replaced 47 percentile threshold alerts with 3 SLO burn-rate alerts. The on-call rotation gets paged less and catches more.
We changed a system prompt for what we thought was a tone improvement and broke a customer-critical extraction overnight. The version control and regression tests we built next.
Explore more articles in this category
There are two hard problems in computer science." We've worked on the cache-invalidation one for a while. The patterns that hold up at scale and the ones that look clean and aren't.
We use Step Functions for batch processing, document ingestion, and a few agentic workflows. The patterns that work, the limits we hit, and where we'd reach for something else.
After two years of running Karpenter on production EKS clusters, the NodePool patterns that survived, the ones we replaced, and the tuning that matters.