We run mostly on AWS but use GCP for specific workloads. The honest cost-benefit analysis of multi-cloud, plus the patterns that make it not awful.
For about three years we've been multi-cloud — primarily AWS, with specific workloads on GCP. The "multi-cloud or single-cloud" debate has gotten religious; this post is the pragmatic version. What we actually do, why, and where the costs hide.
We're not multi-cloud for resilience. (More on why that's mostly mythology in a moment.) We're multi-cloud because:
We are NOT multi-cloud for:
A common pitch: "if AWS goes down, our GCP failover keeps us running." In practice, this is much harder than people make it sound:
For most teams, multi-region within a single cloud gives you 90% of the resilience benefit at 10% of the operational cost. We do that. We do NOT do "AWS or GCP failover."
The exception: regulated industries or critical infrastructure where "what if AWS goes down" is a real threat model. Those teams have the resources and reasons. Most don't.
Our split:
On AWS (the primary):
On GCP:
Cross-cloud connectivity:
A few specific pain points:
Two of everything. Two IAM systems. Two networking stacks. Two billing dashboards. Two consoles. Engineers need to context-switch between them. We've standardized on Terraform for infrastructure to abstract some of this, but the cloud-specific resources still differ.
Cross-cloud networking is fiddly. VPN tunnels need maintenance. Throughput limits become bottlenecks for high-volume data transfer (we hit 1Gbps on a single VPN tunnel; had to set up multiple parallel tunnels). Latency between clouds varies (~30-50ms typical between us-east-1 AWS and us-central1 GCP).
Identity federation is finicky. Setting up OIDC federation so AWS workloads can assume GCP roles took 2 weeks of trial and error. Documentation exists but the failure modes are unhelpful ("invalid token" with no detail on why).
Cost visibility. Each cloud has its own billing; reconciling total cost requires pulling both into a common system. We have a custom dashboard that aggregates; building it took meaningful effort.
Vendor relationship overhead. Two account teams, two enterprise discount agreements, two reservations to manage. AWS reservations don't help with GCP commitments and vice versa.
The benefits we get from multi-cloud:
Best-of-breed for specific services. TPUs for ML training. BigQuery for analytics. SES for transactional email (AWS — GCP's equivalent is weaker). DynamoDB for some specific workloads.
Negotiating leverage. Both vendors know we have the other. Discount conversations go better when we can credibly threaten to migrate workloads.
No catastrophic single-vendor lock-in. We're locked into specific services on each cloud, but the ratio of services on each gives us optionality.
These benefits are real but small compared to the operational cost.
A few practices that reduce the multi-cloud tax:
Standardize everything that can be standardized. Terraform for infrastructure (different providers, similar patterns). GitHub Actions / Argo CD for CI/CD across both. Datadog for monitoring across both. Snowflake or BigQuery for analytics across both.
Pick one as primary. Most workloads run on AWS. New services default to AWS. GCP is for specific use cases. This avoids the "where should this go" decision becoming a chronic debate.
Cross-cloud only where it matters. If a service is on AWS, its dependencies should be on AWS too unless there's a specific reason. We don't deploy app servers on AWS that talk to databases on GCP — too much cross-cloud network cost and complexity.
Treat cross-cloud connectivity as a service with limits. It's not free or unlimited. Architect for "minimize cross-cloud calls" — batch data transfer, async where possible.
Strong automation around setup. Provisioning a workload that spans both clouds shouldn't be manual. Our Terraform modules handle the cross-cloud IAM federation, networking, and DNS automatically.
For our org of ~40 services, multi-cloud overhead breakdown (rough):
Total: a few thousand a month + meaningful eng time. The benefits (TPU savings, BigQuery efficiency) are also a few thousand a month plus quality wins. Roughly break-even on direct cost; positive on quality of specific services.
This is the case for our specific shape. A team that wanted multi-cloud without specific workloads driving it would pay the overhead without getting comparable benefit.
AWS strengths for us:
GCP strengths for us:
Where we don't see clear winners:
Have a specific reason. "We want to be multi-cloud" without a workload-driven reason is a recipe for paying overhead without getting the benefit.
Pick one as primary. Don't try to split 50-50. The operational cognitive load is too high.
Don't do multi-cloud for "resilience" against the primary going down. Multi-region within one cloud is much cheaper for similar effective resilience.
Standardize tooling across. Terraform, monitoring, CI/CD. The differences become smaller when the abstractions are common.
Watch the cross-cloud bandwidth bill. It sneaks up. Architect to minimize cross-cloud calls.
Account for the time cost. Engineers spend real time on multi-cloud issues. Budget for it.
If you have a specific workload that runs better on a different cloud, multi-cloud is a reasonable choice. If you're chasing some abstract benefit (resilience, lock-in avoidance, "cloud agnostic"), the math usually doesn't work. The teams I've seen succeed at multi-cloud have clear, workload-driven reasons. The teams that struggle treat it as an aspiration.
Get the latest tutorials, guides, and insights on AI, DevOps, Cloud, and Infrastructure delivered directly to your inbox.
A real story of removing console-only changes, adding drift detection, and getting Terraform back in charge.
A field report from rolling out retrieval-augmented generation in production, including cache bugs, bad embeddings, and how we fixed them.
Explore more articles in this category
Backups are easy. Restores are hard. The quarterly drill we run, what's failed during it, and the discipline that makes "we have backups" actually mean something.
Replication is the foundation of database HA. What we monitor, how we practice failover, and the gotchas that show up only when you actually fail over.
Why Postgres connection limits bite at unexpected times, the pooling layer we put in front, and the pool-mode tradeoffs we learned the hard way.