We run our app in two AWS regions for failover. The hard parts aren't the deployment — they're data consistency, traffic shifting, and the assumptions that break when "primary" is suddenly the wrong region.
We run our user-facing app in two AWS regions (us-east-1 and us-west-2) for failover. The deployment story is straightforward; the rest of multi-region is where the real complexity lives. This post is what we've learned in roughly two years of running it.
Two real things:
Things multi-region does NOT buy you:
Our setup:
In normal operation, ~95% of traffic goes to us-east-1 (closer to most users). 5% goes to us-west-2 to keep it warm and to detect issues.
Multi-region compute is easy. Multi-region data is the actual problem.
Three approaches we considered:
Single-region primary, read replicas elsewhere. What we run. Writes go to the primary; reads can use the replica. During a primary failure, we promote the replica.
Active-active multi-master. Both regions write. Conflict resolution required. We don't do this — for our shape of data (mostly user-owned, low cross-region conflict potential), the operational complexity wasn't worth it.
Per-region partitioned data. Users sharded by region (e.g., EU users in eu-west-1). Different problem entirely. We don't have meaningful regional partitioning to exploit.
The trade-off with single-primary: writes always go to one region. From us-west-2, a write incurs the cross-region latency (~70ms). For our workload, perceived write latency is dominated by the user-facing flow (post-and-redirect) rather than by raw write speed, so the extra 70ms is acceptable.
The failover plan:
We've practiced this in game days. Each step has gotchas:
Replica promotion takes time. RDS cross-region read replica promotion is ~5-15 minutes. During this window, writes are unavailable.
Endpoint switching. Services connect via a Postgres endpoint. We use a DNS-based endpoint that we update during failover. Most services pick up the change within 30s (DNS TTL), but some have stale connections that hang. We added connection-recycling on errors.
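A minimal sketch of what connection recycling on errors can look like. The `connect` factory and the retry parameters are hypothetical names, not taken from any specific driver. It assumes the factory resolves the endpoint's DNS name on every call and that connections carry timeouts, so a hung connection surfaces as an error instead of blocking forever.

```python
import time
from typing import Callable, Optional


class RecyclingConnection:
    """Holds one connection and drops it on any error so the next attempt reconnects.

    `connect` is a hypothetical factory: it should resolve the database
    endpoint afresh and return an object with an `execute(sql)` method.
    """

    def __init__(self, connect: Callable[[], object], max_attempts: int = 3,
                 backoff_seconds: float = 0.5):
        if max_attempts < 1:
            raise ValueError("max_attempts must be at least 1")
        self._connect = connect
        self._max_attempts = max_attempts
        self._backoff_seconds = backoff_seconds
        self._conn: Optional[object] = None

    def execute(self, sql: str):
        last_error: Optional[Exception] = None
        for attempt in range(self._max_attempts):
            try:
                if self._conn is None:
                    # Reconnecting is where a fresh DNS lookup would happen.
                    self._conn = self._connect()
                return self._conn.execute(sql)
            except Exception as exc:  # narrow this to driver errors in real code
                last_error = exc
                self._discard()
                if attempt + 1 < self._max_attempts:
                    time.sleep(self._backoff_seconds * (attempt + 1))
        raise last_error

    def _discard(self) -> None:
        """Forget the current connection and close it on a best-effort basis."""
        conn, self._conn = self._conn, None
        close = getattr(conn, "close", None)
        if close is not None:
            try:
                close()
            except Exception:
                pass  # the connection is already broken; nothing to salvage
```

Blind retries like this are only safe for reads and idempotent writes. A non-idempotent write that failed mid-flight may already have been applied.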
Replica lag at the moment of failure. If the replica was 15s behind when we cut over, those 15 seconds of writes are lost. For our app this is acceptable (no payments, no critical state at that level); for a bank it wouldn't be.
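The lag-equals-loss relationship can be turned into a simple pre-promotion guard. This is a sketch under stated assumptions: `decide_promotion` and the 30-second loss budget are hypothetical, and the lag value is assumed to come from whatever replica-lag metric you already monitor.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PromotionDecision:
    promote: bool
    expected_loss_seconds: float
    reason: str


def decide_promotion(replica_lag_seconds: float, planned: bool,
                     max_loss_seconds: float = 30.0) -> PromotionDecision:
    """Decide whether to promote, given the last observed replica lag.

    Writes committed on the primary inside the lag window have not reached
    the replica, so promoting now discards roughly that many seconds of writes.
    """
    if replica_lag_seconds < 0:
        raise ValueError("replica lag cannot be negative")
    if planned and replica_lag_seconds > 0:
        # In a planned failover the primary is still up: stop writes and wait.
        return PromotionDecision(False, replica_lag_seconds,
                                 "wait for the replica to catch up")
    if replica_lag_seconds > max_loss_seconds:
        # Hypothetical budget: beyond it, a human should make the call.
        return PromotionDecision(False, replica_lag_seconds,
                                 "lag exceeds loss budget")
    return PromotionDecision(True, replica_lag_seconds, "within loss budget")
```

With the 15-second example above, an unplanned failover would proceed and report 15 seconds of expected loss, while a planned one would wait for the lag to drain first.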
Failback. Going back to the original primary after recovery is its own dance. We typically don't fail back immediately — we run on the secondary until a planned maintenance window.
We can do a planned failover in ~10 minutes. An unplanned one takes longer because of the time to confirm "yes, this is real, the region is down" before triggering the switch.
Three approaches available:
Latency-based routing. Route 53 sends users to the lowest-latency healthy region. Good for normal operation. Doesn't load-balance — if the closer region is healthy but overloaded, users still go there.
Failover routing. Primary region is preferred; secondary used only when primary fails health checks. Simpler to reason about. We used this for the first year.
Geo-routing. Specific countries → specific regions. Useful if you have data residency requirements or distinct user populations.
We're now on a hybrid: latency-based for active-active routing during normal operation, with failover-style behavior baked into the health checks. If a region goes red, all traffic shifts to the other.
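As a rough illustration of this hybrid, the sketch below builds the change batch for two latency-routed records that each carry a health check, in the shape that boto3's Route 53 `change_resource_record_sets` call accepts. The hostnames, health check IDs, and the choice of CNAME records are placeholders, not our actual configuration.

```python
from typing import Dict, List


def latency_record(name: str, region: str, target: str,
                   health_check_id: str, ttl: int = 60) -> Dict:
    """One latency-routed CNAME record tied to a health check."""
    return {
        "Action": "UPSERT",
        "ResourceRecordSet": {
            "Name": name,
            "Type": "CNAME",
            "SetIdentifier": region,  # must be unique among records sharing this name
            "Region": region,         # presence of Region selects latency-based routing
            "TTL": ttl,
            "ResourceRecords": [{"Value": target}],
            "HealthCheckId": health_check_id,  # unhealthy records are left out of answers
        },
    }


def build_change_batch(name: str, regions: Dict[str, Dict[str, str]]) -> Dict:
    """Build one change batch covering every region, in a stable order."""
    changes: List[Dict] = [
        latency_record(name, region, cfg["target"], cfg["health_check_id"])
        for region, cfg in sorted(regions.items())
    ]
    return {"Comment": "latency routing with health-check failover",
            "Changes": changes}


# Usage (placeholders throughout; requires boto3 and AWS credentials):
# import boto3
# batch = build_change_batch("app.example.com", {...})
# boto3.client("route53").change_resource_record_sets(
#     HostedZoneId="ZONE_ID_PLACEHOLDER", ChangeBatch=batch)
```

Because each record has a health check attached, Route 53 stops returning a region whose check fails, which gives the failover-style behavior without a separate failover routing policy.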
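The health check is what triggers failover. Get it wrong and you either fail over when you shouldn't (a false alarm pulls traffic away from a healthy region) or stay put when you should have failed over (a broken region keeps receiving traffic).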
Our health check:
Endpoint: https://app.example.com/health
Expected status: 200
Body match: "ok"
Interval: 30s
Failure threshold: 3
The /health endpoint hits a deep-stack check (database, cache, downstream service). If any subsystem is unhealthy, /health returns 503. Fast checks are tempting but a "healthy load balancer with broken backend" is the worst kind of red region.
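A deep check of this kind might be structured roughly as follows. The probe names and the `health_response` function are hypothetical. Each probe is assumed to enforce its own timeout so that a slow dependency cannot stall the health endpoint.

```python
from typing import Callable, Dict, Tuple


def health_response(checks: Dict[str, Callable[[], bool]]) -> Tuple[int, str]:
    """Return (status_code, body) for /health.

    Each check is a zero-argument probe (database, cache, downstream
    service) that returns True when healthy. A probe that raises is
    treated as unhealthy.
    """
    failed = []
    for name, probe in checks.items():
        try:
            healthy = probe()
        except Exception:
            healthy = False
        if not healthy:
            failed.append(name)
    if failed:
        return 503, "unhealthy: " + ", ".join(sorted(failed))
    return 200, "ok"
```

Listing the failed subsystems in the 503 body costs nothing and makes it faster to tell a real regional problem from a single bad dependency.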
We tested the failover by intentionally breaking /health on us-east-1 (a flag to override the response). Route 53 detected within 90 seconds, traffic shifted in another 60 seconds (DNS TTL). 2.5 minutes total to fail over from a region that was lying about being healthy. Acceptable.
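The 2.5-minute figure follows directly from the health check settings. As a small worked estimate (the function name is ours for illustration):

```python
def worst_case_failover_seconds(interval_s: int, failure_threshold: int,
                                dns_ttl_s: int) -> int:
    """Rough upper bound: consecutive failed checks, then cached DNS answers expire."""
    detection = interval_s * failure_threshold
    return detection + dns_ttl_s


# 30s interval x 3 failures = 90s to detect, plus a 60s DNS TTL = 150s.
print(worst_case_failover_seconds(30, 3, 60))  # 150
```

The same formula shows where the time goes: detection is the larger share, so shortening the check interval buys more than shortening the TTL.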
Multi-region adds real costs:
Cross-region data transfer. $0.02/GB AWS → AWS cross-region. At our scale (~5TB/month of cross-region replication and replica reads), ~$1,200/month. Real money.
Duplicate compute. Most of us-west-2 is idle during normal operation but paid for. Our us-west-2 footprint is about 30% of us-east-1 (sized to handle ~100% of traffic during failover, but most of the time running at 5% utilization). Cost is real.
Duplicate operational state. Logs, metrics, dashboards, runbooks all need region-aware versions. Operationally, this is more annoying than expensive — every dashboard has region as a filter, every alert has region in its title.
The total multi-region tax for us is ~25% of our infrastructure bill. Worth it for the reliability win, but the bill is non-trivial.
A regional service we depended on going down even though our region was up. We use a SaaS for email; their us-east-1 endpoint went down once. Our app in us-east-1 was healthy but couldn't send emails. Failing over to us-west-2 didn't help (the SaaS endpoint was the issue, not our region). Lesson: multi-region only helps when the failure is regional. SaaS dependencies have their own failure modes.
Cross-region replication delays during major events. During an AWS-wide elevated-error-rate incident, RDS replica lag spiked to several minutes. We hadn't seen lag like that in normal operation. Made us nervous about how much data we'd lose if we had to fail over right then.
DNS caching by clients beyond our control. Some corporate networks cache DNS aggressively (10+ minutes). When we fail over, those users keep hitting the old region. There is not much we can do about it: we recommend retries, and mobile apps can implement their own DNS resolution.
Subtle data inconsistencies during failover game days. Idempotency keys for some background jobs were stored only in the primary. After failover and failback, some jobs ran twice. We migrated all idempotency keys to a multi-region store (DynamoDB global tables).
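The pattern behind the fix is a claim-before-run check against a store that both regions can see. The sketch below uses an in-memory stand-in. `KeyStore`, `InMemoryKeyStore`, and `run_once` are hypothetical names, and a real implementation would replace the stand-in with a conditional write to the shared store.

```python
from typing import Callable, Dict, Protocol


class KeyStore(Protocol):
    def put_if_absent(self, key: str) -> bool:
        """Store the key and return True, or return False if it already exists."""
        ...


class InMemoryKeyStore:
    """Stand-in for a store that is visible from both regions."""

    def __init__(self) -> None:
        self._keys: Dict[str, bool] = {}

    def put_if_absent(self, key: str) -> bool:
        if key in self._keys:
            return False
        self._keys[key] = True
        return True


def run_once(store: KeyStore, idempotency_key: str,
             job: Callable[[], None]) -> bool:
    """Run `job` only if this key has not been claimed. Returns True if it ran."""
    if not store.put_if_absent(idempotency_key):
        return False
    job()
    return True
```

Two caveats on this sketch. It claims the key before running, so a crash mid-job leaves the key claimed and the job undone (at-most-once semantics). And a store that replicates asynchronously between regions can briefly miss a key written in the other region, so the guarantee is only as strong as the replication lag at the moment of failover.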
Decisions we'd revisit:
Started with single-region, made it multi-region later. Painful retrofit. If we'd been multi-region from the start, the data layer would be different (probably DynamoDB global tables instead of RDS). Going from "RDS primary in one region" to "we should be multi-region" was a multi-quarter migration.
Treating the secondary region as warm standby vs hot. We keep it warm: it receives 5% of traffic. A truly hot multi-region setup with active-active load balancing would be more resilient (and more complex). For us, warm is the right balance.
Game-day frequency. We do a multi-region failover game day twice a year. That's not enough — every game day reveals something we'd forgotten. Quarterly would be better.
First, are you really multi-AZ? Many teams skip multi-AZ and jump straight to multi-region. Multi-AZ is the bigger resilience win and is much simpler. Get multi-AZ right before multi-region.
Pick a data strategy first. The data layer determines feasibility. Single-primary is simplest; active-active is hardest; per-region-sharded is somewhere in between.
Practice failover. A multi-region setup that's never failed over is theater, not resilience. Game days, twice a year minimum.
Account for the cost. ~25% premium on infrastructure is realistic. The benefit (resilience to regional outages) is real but the cost is also real.
Don't skip cross-region observability. Every dashboard, every alert, every runbook needs regional awareness from the start. Retrofitting this is annoying.
Multi-region is a real reliability win, but the work is mostly outside the deployment pipeline. Compute is the easy part. Data, traffic, state coherence, and operational discipline are where the actual investment happens.