A DR runbook nobody reads is worse than no runbook. The shape that finally got ours executed correctly under pressure.

On this page

Cloud Disaster Recovery Runbook Design — Production Playbook

The first time we ran a real region-failover drill, the runbook was useless. Not because it was wrong — it was technically correct — but because it was written like a reference manual. Nobody could execute it under stress. The on-call engineer spent the first 12 minutes scrolling through 14 pages trying to figure out what to do first.

Six months and four drills later, our DR runbook fits on roughly two screens of scrollable content and the engineer who's never seen it before can start executing inside 90 seconds. This post is about how we got there.

The criteria that ended up mattering #

Before any redesign, we listed what a DR runbook actually needs to be:

Executable by anyone on rotation, not just the author. The author has context the reader doesn't.
Linear under stress. Branching trees are unreadable when adrenaline is up.
Specific commands, not "verify the database is healthy." What does "healthy" mean exactly?
Tested, with a date stamp on the last test. A runbook that hasn't been executed end-to-end in 6 months is not a runbook; it's a wish.
Self-contained. No "see the architecture diagram in Confluence." If the network is down, you can't see Confluence.

Our previous runbook failed on every one of these. Restructuring around them is what made the difference.

The shape we use now #

Each runbook follows a fixed three-section structure:

Decision (60 seconds): is this actually a DR event?
Execution (linear steps, 8-15 of them, with copy-pasteable commands)
Verification + announce

The decision section forces the engineer to pause and confirm before acting. The execution section is the actual work. The verification section closes the loop.

Below is the actual structure for our primary database failover runbook, abbreviated.

Section 1: Decision (60 seconds)#

code

## Decision: is this a DR event?

Run all THREE of these. If 2+ are true, this is a DR event:

[ ] /api/health on primary returns 5xx for >5 min in the prod-us-east-1 region
    (one-liner: curl -fsS https://prod.example.com/api/health || echo BAD)

[ ] AWS Health Dashboard shows an active issue in us-east-1 for RDS or networking
    https://health.aws.amazon.com/health/status

[ ] Datadog dashboard "us-east-1 health" shows error rate >50% across services
    https://app.datadoghq.com/dash/integration/aws-east-1-health

If <2 are true, STOP. This is probably a partial outage, not a DR event.
Page the responsible service team and continue normal incident response.

If 2+ are true, continue to Section 2.

The cost of this section is a minute of the engineer's time. The benefit is preventing failover-when-you-shouldn't-have, which is the worst possible mistake.

Section 2: Execution #

The execution section is the bulk of the runbook. Every step has:

A number (so people can say "I'm at step 7" on a call)
A single action
The exact command(s) to run
The expected output
What to do if the output doesn't match

Example, abbreviated, three of the actual ~12 steps:

code

## Execution

### Step 1: Acquire failover lock
Run on your laptop:

    aws --profile prod-dr ssm put-parameter \
      --name /dr/failover-lock \
      --value "$(whoami)-$(date +%s)" \
      --overwrite

Expected: returns the parameter version number.

If it errors with AccessDenied: you don't have DR-execute permissions.
Page #sec-emergency for break-glass.

### Step 2: Promote read replica in us-west-2
Run:

    aws --profile prod-dr rds promote-read-replica \
      --region us-west-2 \
      --db-instance-identifier app-prod-replica-uw2

Expected: response shows "DBInstanceStatus": "modifying".

Wait until status becomes "available" (poll every 30s):

    while true; do
      status=$(aws --profile prod-dr rds describe-db-instances \
        --region us-west-2 \
        --db-instance-identifier app-prod-replica-uw2 \
        --query 'DBInstances[0].DBInstanceStatus' --output text)
      echo "$(date +%H:%M:%S) $status"
      [ "$status" = "available" ] && break
      sleep 30
    done

Expected duration: 4-7 minutes.

### Step 3: Update Route53 to point app to us-west-2 ALB
Run:

    aws --profile prod-dr route53 change-resource-record-sets \
      --hosted-zone-id Z123456789ABCD \
      --change-batch file://dr/route53-failover-uswest2.json

The change batch file is committed at infra/dr/route53-failover-uswest2.json.

Expected: returns ChangeInfo with Status PENDING.
DNS propagation: ~60 seconds with our 30s TTL.

The pattern is the same for every step. No prose explaining "why" — that's elsewhere. No "you may need to" — every step is definite. No conditionals more complex than "if X then Y else Z."

Section 3: Verification #

After the last execution step, verification is a short checklist:

code

## Verification

[ ] Run `curl -fsS https://prod.example.com/api/health` — expect 200 from us-west-2
    (verify by inspecting response header `x-served-by`)

[ ] Datadog dashboard "post-DR health" — error rate < 1% for 5 min
    https://...

[ ] DBA confirms write traffic appearing in us-west-2 RDS
    (post in #sre-oncall, tag @dba-rotation)

[ ] Announce in #all-engineering and statuspage.io that DR failover is complete

[ ] Schedule post-DR review for next business day morning

The four checkboxes plus the announce step are the entire verification. If you check all five, the failover is done. If you can't check one, escalate.

What we cut #

Our previous runbook had three things that we deleted:

Long architectural preamble. "The reason we use cross-region replicas is..." — irrelevant during a failover. Lives in a separate architecture document.
Decision trees. "If the issue is X, follow path A. If Y, follow path B. Within path A, if Z..." — under stress, branching becomes unreadable. We collapsed branches by writing separate runbooks for each major scenario instead of one mega-runbook.
Apology language. "Note: this step may take some time" / "Please ensure you have permissions before proceeding." — the runbook isn't being polite, it's being executed. Plain instructions.

How we keep them honest #

Two things keep the runbooks from rotting:

Quarterly drills. Once a quarter we run the full DR procedure end-to-end against staging (not production — but the runbook itself is identical between staging and prod, so the validation transfers). Anything that doesn't work in the drill becomes a blocking ticket.

Last-tested timestamp at the top. Every runbook starts with:

code

Last successfully executed end-to-end: 2026-03-12 (drill in staging)
Last reviewed: 2026-04-15

If "last executed" is more than 90 days old, the on-call rotation lead is responsible for scheduling a drill. We track drift.

What surprised us #

The runbook quality directly correlates with the quality of the post-incident review. When the runbook is bad, the review is full of "we should have done X" with no clear assignment. When the runbook is good, the review focuses on whether the runbook itself needs updating — and the change lands as a PR before the review meeting ends.

The other surprise: writing a runbook this concretely is harder than writing the meandering reference version. Forcing yourself to specify exactly what command to run and exactly what output to expect requires actually understanding the system. We discovered three configuration assumptions during the rewrite that turned out to be wrong, and fixed them as part of the runbook update.

What I'd give a team starting from a bad runbook #

Pick the scenario that scares you most. Write the runbook for that one specifically. Don't try to write a "general DR runbook" that handles everything; you'll end up with the document you can't execute under pressure.

Test it. Actually run it, in staging if you can, on a real Tuesday afternoon. The first run will reveal everything wrong with the runbook. Fix what's wrong, run it again the following month. Two clean drills and you have a real runbook.

Then write the next one.

Production Playbook: Cloud Disaster Recovery Runbook Design

Cloud Disaster Recovery Runbook Design — Production Playbook

The criteria that ended up mattering #

The shape we use now #

Section 1: Decision (60 seconds)#

Section 2: Execution #

Section 3: Verification #

What we cut #

How we keep them honest #

What surprised us #

What I'd give a team starting from a bad runbook #

Stay Updated

Deep Dive: SLO-Based Monitoring for APIs

Field Notes: Prompt Versioning and Regression Testing

More from Cloud

External Secrets Operator: One Secrets Workflow Across Clouds

AWS Graviton Migration: What Broke and What We Saved

Serverless Cold Starts: Measuring and Fixing Them on Lambda

External Secrets Operator: One Secrets Workflow Across Clouds

AWS Graviton Migration: What Broke and What We Saved

Serverless Cold Starts: Measuring and Fixing Them on Lambda

Multi-Region Failover with Route 53: Health Checks and Gotchas

Four Signals That Matter: Choosing SLIs Users Actually Feel

NAT Gateway Costs: The Silent Line Item and How to Cut It

You might have missed

GitOps with Argo CD: Best Practices for 2025

Prompt Engineering Best Practices: Maximizing LLM Performance

Process Management and Monitoring in Linux

About Kiril Urbonas