We run our app in two AWS regions for failover. The hard parts aren't the deployment — they're data consistency, traffic shifting, and the assumptions that break when "primary" is suddenly the wrong region.

Multi-Region Deployment: What Actually Matters

We run our user-facing app in two AWS regions (us-east-1 and us-west-2) for failover. The deployment story is straightforward; the rest of multi-region is where the real complexity lives. This post is what we've learned in roughly two years of running it.

What multi-region buys you

Two real things:

Resilience to a regional outage. AWS has had region-wide events. If your app is single-region, you're down for the duration. Multi-region means you can keep serving traffic.
Lower latency for geographically distant users. A request from Singapore to us-east-1 is ~250ms RTT. To ap-southeast-1 it's ~10ms. Multi-region with geo-routing reduces tail latencies for far-away users.

Things multi-region does NOT buy you:

Better individual-AZ resilience (you should already be multi-AZ within a region).
Higher availability for individual pod / service crashes (that's a single-region problem).
Automatic disaster recovery without practice. We'll come back to this.

What we deploy where

Our setup:

Compute runs in both regions. EKS clusters in us-east-1 and us-west-2, identical service deployments, GitOps pipeline pushes to both.
Primary database: RDS Postgres in us-east-1 with a synchronous standby in a second AZ (Multi-AZ); cross-region read replica in us-west-2.
Object storage: S3 with cross-region replication for the most-important buckets.
DNS routing: Route 53 with health checks; healthy region gets traffic.

In normal operation, ~95% of traffic goes to us-east-1 (closer to most users). 5% goes to us-west-2 to keep it warm and to detect issues.

The hard part: data consistency

Multi-region compute is easy. Multi-region data is the actual problem.

Three approaches we considered:

Single-region primary, read replicas elsewhere. What we run. Writes go to the primary; reads can use the replica. During a primary failure, we promote the replica.

Active-active multi-master. Both regions write. Conflict resolution required. We don't do this — for our shape of data (mostly user-owned, low cross-region conflict potential), the operational complexity wasn't worth it.

Per-region partitioned data. Users sharded by region (e.g., EU users in eu-west-1). Different problem entirely. We don't have meaningful regional partitioning to exploit.

The trade-off with single-primary: writes always go to one region. From us-west-2, a write incurs the cross-region latency (~70ms). For our workload, perceived write latency is dominated by post-and-redirect flows rather than by raw write speed, so the extra 70ms is acceptable.

Failover: harder than it sounds

The failover plan:

1. Promote the cross-region read replica to primary
2. Update all services to point to the new primary's endpoint
3. Drain traffic from the failed region
4. Continue serving from the surviving region
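
The ordering of these steps matters, so it can help to encode the plan as something executable. The following is a minimal Python sketch, not a description of any real tooling: `FailoverRunbook`, `build_runbook`, and the placeholder lambdas are all hypothetical, and a real action would call whatever promotion, config, and DNS tooling you already have.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class FailoverRunbook:
    """Runs failover steps strictly in order and stops at the first failure."""

    steps: List[Tuple[str, Callable[[], bool]]] = field(default_factory=list)
    log: List[str] = field(default_factory=list)

    def add(self, name: str, action: Callable[[], bool]) -> None:
        self.steps.append((name, action))

    def run(self) -> bool:
        for name, action in self.steps:
            ok = action()
            self.log.append(f"{name}: {'ok' if ok else 'FAILED'}")
            if not ok:
                # Later steps assume earlier ones succeeded, so stop here.
                return False
        return True


def build_runbook() -> FailoverRunbook:
    # Each lambda is a placeholder (assumption); swap in real actions.
    rb = FailoverRunbook()
    rb.add("promote cross-region replica", lambda: True)
    rb.add("repoint services at new primary", lambda: True)
    rb.add("drain traffic from failed region", lambda: True)
    rb.add("serve from surviving region", lambda: True)
    return rb
```

Stopping at the first failed step is a deliberate choice in this sketch: repointing services at a replica that never finished promotion would make things worse, not better.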

We've practiced this in game days. Each step has gotchas:

Replica promotion takes time. RDS cross-region read replica promotion is ~5-15 minutes. During this window, writes are unavailable.

Endpoint switching. Services connect via a Postgres endpoint. We use a DNS-based endpoint that we update during failover. Most services pick up the change within 30s (DNS TTL), but some have stale connections that hang. We added connection-recycling on errors.
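
As a rough illustration of connection recycling on errors, here is a Python sketch. It makes several assumptions: `RecyclingConnection` and the `connect` factory are hypothetical names, the factory is assumed to re-resolve the DNS endpoint on every call, and stale connections are assumed to surface as `ConnectionError`. A real service would more likely use its driver's or pool's own reconnect hooks.

```python
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class RecyclingConnection:
    """Wraps a connection factory and reconnects after a connection error.

    `connect` is a hypothetical factory that resolves the DNS endpoint and
    returns a fresh connection object each time it is called.
    """

    def __init__(self, connect: Callable[[], object], max_attempts: int = 3):
        self._connect = connect
        self._max_attempts = max_attempts
        self._conn = connect()
        self.reconnects = 0

    def run(self, operation: Callable[[object], T]) -> T:
        last_error: Optional[Exception] = None
        for _ in range(self._max_attempts):
            try:
                return operation(self._conn)
            except ConnectionError as exc:
                # Drop the stale connection; the next connect() call is
                # assumed to pick up the updated endpoint.
                last_error = exc
                self._conn = self._connect()
                self.reconnects += 1
        raise ConnectionError("all attempts failed") from last_error
```

Retrying blindly is only safe for operations that are idempotent or that failed before reaching the database, so a wrapper like this belongs around reads and carefully chosen writes.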

Replica lag at the moment of failure. If the replica was 15s behind when we cut over, those 15 seconds of writes are lost. For our app this is acceptable (no payments, no critical state at that level); for a bank it wouldn't be.
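
To make the trade concrete, the exposure can be estimated from lag and write rate. This is a back-of-the-envelope sketch: the function names are hypothetical, and the write rate and tolerated loss window in the usage example are invented numbers, not figures from our system.

```python
def writes_at_risk(replica_lag_seconds: float, writes_per_second: float) -> float:
    """Rough count of committed writes the replica has not yet applied."""
    return replica_lag_seconds * writes_per_second


def within_rpo(replica_lag_seconds: float, rpo_seconds: float) -> bool:
    """True if promoting now would lose no more than the tolerated window."""
    return replica_lag_seconds <= rpo_seconds


# Hypothetical numbers: 15s of lag at 40 writes/s, with a 60s tolerated window.
lost = writes_at_risk(15, 40)   # 600 writes
ok_to_promote = within_rpo(15, 60)
```

Checking lag against an agreed loss window before promoting turns "is this acceptable?" into a decision made ahead of time rather than during the incident.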

Failback. Going back to the original primary after recovery is its own dance. We typically don't fail back immediately — we run on the secondary until a planned maintenance window.

We can do a planned failover in ~10 minutes. An unplanned one takes longer because of the time to confirm "yes, this is real, the region is down" before triggering the switch.

Traffic routing

Three approaches available:

Latency-based routing. Route 53 sends users to the lowest-latency healthy region. Good for normal operation. Doesn't load-balance — if the closer region is healthy but overloaded, users still go there.

Failover routing. Primary region is preferred; secondary used only when primary fails health checks. Simpler to reason about. We used this for the first year.

Geo-routing. Specific countries → specific regions. Useful if you have data residency requirements or distinct user populations.

We're now on a hybrid: latency-based for active-active routing during normal operation, with failover-style behavior baked into the health checks. If a region goes red, all traffic shifts to the other.
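
The decision logic of that hybrid can be sketched in a few lines of Python. This is a toy model of the behavior described here, not of Route 53 internals: `pick_region` and its inputs are hypothetical, and what a real resolver returns when no region is healthy may differ from the `None` used below.

```python
from typing import Dict, Optional


def pick_region(latency_ms: Dict[str, float], healthy: Dict[str, bool]) -> Optional[str]:
    """Return the lowest-latency region among those passing health checks."""
    candidates = [r for r in latency_ms if healthy.get(r, False)]
    if not candidates:
        return None  # nothing healthy: this sketch gives no answer
    return min(candidates, key=lambda r: latency_ms[r])


# Illustrative latencies for one user (assumed values).
latencies = {"us-east-1": 20.0, "us-west-2": 80.0}
normal = pick_region(latencies, {"us-east-1": True, "us-west-2": True})
failed_over = pick_region(latencies, {"us-east-1": False, "us-west-2": True})
```

In the normal case the user lands on us-east-1; once us-east-1 fails its health check, the same call returns us-west-2, which is the failover-style behavior.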

Health checks: the linchpin

The health check is what triggers failover. Get it wrong and you either:

Fail over too eagerly (a transient blip routes everyone to the secondary, which gets overloaded)
Fail over too late (you're down for 10 minutes before the check fires)

Our health check:

Endpoint: https://app.example.com/health
Expected status: 200
Body match: "ok"
Interval: 30s
Failure threshold: 3

The /health endpoint hits a deep-stack check (database, cache, downstream service). If any subsystem is unhealthy, /health returns 503. Fast checks are tempting but a "healthy load balancer with broken backend" is the worst kind of red region.
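
A deep check of this kind boils down to "run every subsystem probe, fail if any fails". Here is a minimal Python sketch under assumptions: `deep_health` is a hypothetical helper, the probes are stand-in callables, and a real endpoint would also need timeouts so a hung dependency cannot hang the check itself.

```python
from typing import Callable, Dict, Tuple


def deep_health(checks: Dict[str, Callable[[], bool]]) -> Tuple[int, str]:
    """Run every subsystem check; return (status_code, body)."""
    failed = []
    for name, check in checks.items():
        try:
            if not check():
                failed.append(name)
        except Exception:
            # A check that raises is treated the same as one that fails.
            failed.append(name)
    if failed:
        return 503, "failing: " + ", ".join(sorted(failed))
    return 200, "ok"


# Stand-in probes (assumptions); real ones would query each dependency.
status, body = deep_health({
    "database": lambda: True,
    "cache": lambda: False,
    "downstream": lambda: True,
})
```

With the cache probe failing, the result is a 503 that names the broken subsystem, which is also handy when someone is reading the response during an incident.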

We tested the failover by intentionally breaking /health on us-east-1 (a flag to override the response). Route 53 detected within 90 seconds, traffic shifted in another 60 seconds (DNS TTL). 2.5 minutes total to fail over from a region that was lying about being healthy. Acceptable.
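
Those numbers fall straight out of the health check settings. A small Python sketch of the arithmetic, with a hypothetical function name; it ignores clients that cache DNS longer than the TTL:

```python
def approx_failover_seconds(interval_s: int, failure_threshold: int, dns_ttl_s: int) -> int:
    """Detection time (interval x threshold) plus one DNS TTL."""
    detection = interval_s * failure_threshold
    return detection + dns_ttl_s


# 30s interval, 3 consecutive failures, 60s DNS TTL.
total = approx_failover_seconds(30, 3, 60)  # 90s detection + 60s TTL = 150s
```

Shortening the interval or the threshold cuts detection time, at the price of failing over more eagerly on transient blips.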

What costs more than expected

Multi-region adds real costs:

Cross-region data transfer. $0.02/GB for AWS-to-AWS cross-region traffic. At our scale (~5TB/month of cross-region replication and replica reads), that works out to ~$100/month, or ~$1,200/year. Real money.
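
As a sanity check, the transfer line can be worked through from the rate and volume quoted above. This Python sketch assumes a flat per-GB rate and 1 TB = 1,000 GB; the function name is hypothetical.

```python
def monthly_transfer_cost(gb_per_month: float, usd_per_gb: float) -> float:
    """Flat-rate cross-region transfer cost for one month, in USD."""
    return gb_per_month * usd_per_gb


monthly = monthly_transfer_cost(5_000, 0.02)  # 5 TB taken as 5,000 GB
annual = monthly * 12
print(f"${monthly:,.0f}/month, ${annual:,.0f}/year")
```

At $0.02/GB, 5TB/month comes to about $100 a month, or about $1,200 over a year. The cost scales linearly with replication volume, so it is worth re-running the numbers whenever write traffic grows.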

Duplicate compute. Most of us-west-2 sits idle during normal operation but is still paid for. Our us-west-2 footprint is about 30% of us-east-1's. It has to be able to absorb ~100% of traffic during a failover, yet most of the time it serves only ~5%. You pay for that headroom whether you use it or not.

Duplicate operational state. Logs, metrics, dashboards, runbooks all need region-aware versions. Operationally, this is more annoying than expensive — every dashboard has region as a filter, every alert has region in its title.

The total multi-region tax for us is ~25% of our infrastructure bill. Worth it for the reliability win, but the bill is non-trivial.

What's surprised us in production

A regional service we depended on going down even though our region was up. We use a SaaS for email; their us-east-1 endpoint went down once. Our app in us-east-1 was healthy but couldn't send emails. Failing over to us-west-2 didn't help (the SaaS endpoint was the issue, not our region). Lesson: multi-region only helps when the failure is regional. SaaS dependencies have their own failure modes.

Cross-region replication delays during major events. During an AWS-wide elevated-error-rate incident, RDS replica lag spiked to several minutes. We hadn't seen lag like that in normal operation. Made us nervous about how much data we'd lose if we had to fail over right then.

DNS caching by clients beyond our control. Some corporate networks cache DNS aggressively (10+ minutes). When we fail over, those users keep hitting the old region. There's not much we can do: we recommend client-side retries, and mobile apps can implement their own DNS resolution.

Subtle data inconsistencies during failover game days. Idempotency keys for some background jobs were stored only in the primary. After failover and failback, some jobs ran twice. We migrated all idempotency keys to a multi-region store (DynamoDB global tables).
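
The pattern behind the fix is "claim the key before running the job". Here is a minimal Python sketch under assumptions: `KeyStore`, `InMemoryKeyStore`, and `run_once` are hypothetical names, and the in-memory set merely stands in for a store that every region can see.

```python
from typing import Callable, Protocol, Set


class KeyStore(Protocol):
    def put_if_absent(self, key: str) -> bool: ...


class InMemoryKeyStore:
    """Stand-in (assumption) for a store replicated to every region."""

    def __init__(self) -> None:
        self._seen: Set[str] = set()

    def put_if_absent(self, key: str) -> bool:
        # Returns True only for the first caller to claim this key.
        if key in self._seen:
            return False
        self._seen.add(key)
        return True


def run_once(store: KeyStore, key: str, job: Callable[[], None]) -> bool:
    """Run `job` only if `key` has not been claimed; return whether it ran."""
    if not store.put_if_absent(key):
        return False
    job()
    return True
```

Two caveats on this sketch. Claiming the key before the job runs means a job that crashes midway will not be retried automatically. And a store that replicates asynchronously between regions only makes this check as strong as its own consistency guarantees, so two regions claiming the same key at nearly the same moment remains a case to think through.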

What we wish we'd designed differently

Decisions we'd revisit:

Started with single-region, made it multi-region later. Painful retrofit. If we'd been multi-region from the start, the data layer would be different (probably DynamoDB global tables instead of RDS). Going from "RDS primary in one region" to "we should be multi-region" was a multi-quarter migration.

Treating the secondary region as warm standby vs hot. We keep it warm: it receives 5% of traffic. A truly hot multi-region setup with active-active load balancing would be more resilient (and more complex). For us, warm is the right balance.

Game-day frequency. We do a multi-region failover game day twice a year. That's not enough — every game day reveals something we'd forgotten. Quarterly would be better.

What I'd tell a team considering it

First, are you really multi-AZ? Most teams skip multi-AZ then jump to multi-region. Multi-AZ is the bigger win for resilience and is much simpler. Get multi-AZ right before multi-region.

Pick a data strategy first. The data layer determines feasibility. Single-primary is simplest; active-active is hardest; per-region-sharded is somewhere in between.

Practice failover. A multi-region setup that's never failed over is theater, not resilience. Game days, twice a year minimum.

Account for the cost. ~25% premium on infrastructure is realistic. The benefit (resilience to regional outages) is real but the cost is also real.

Don't skip cross-region observability. Every dashboard, every alert, every runbook needs regional awareness from the start. Retrofitting this is annoying.

Multi-region is a real reliability win, but the work is mostly outside the deployment pipeline. Compute is the easy part. Data, traffic, state coherence, and operational discipline are where the actual investment happens.


About Kiril Urbonas