We run mostly on AWS but use GCP for specific workloads. The honest cost-benefit analysis of multi-cloud, plus the patterns that make it not awful.

Multi-Cloud Infrastructure: When and How

For about three years we've been multi-cloud — primarily AWS, with specific workloads on GCP. The "multi-cloud or single-cloud" debate has gotten religious; this post is the pragmatic version. What we actually do, why, and where the costs hide.

Why we're multi-cloud at all

We're not multi-cloud for resilience. (More on why that's mostly mythology in a moment.) We're multi-cloud because:

Specific workloads run materially better on GCP. Our ML training pipeline uses GCP's TPU offerings; the cost-per-training-hour is meaningfully lower than equivalent AWS GPU instances.
One acquisition. A team we acquired ran on GCP. Migrating them to AWS would have been months of work for marginal benefit.
Some specific GCP services are better than AWS equivalents for our use cases (BigQuery for analytics being the clearest example).

We are NOT multi-cloud for:

Avoiding vendor lock-in (we're locked into both, just less deeply into each)
Resilience against a cloud-wide outage (we'll come back to this)
Cost optimization across clouds (the savings rarely cover the operational overhead)

The mythology of multi-cloud resilience

A common pitch: "if AWS goes down, our GCP failover keeps us running." In practice, this is much harder than people make it sound:

True multi-cloud failover requires running fully duplicated infrastructure in both clouds. Same services, same data, same operational tooling. Most teams that say they're "multi-cloud" don't actually do this.
Data replication across clouds is slow and expensive. Cross-cloud bandwidth costs $0.05-0.15/GB. For databases with high write volume, the bill is brutal.
Failing over from AWS to GCP is a major event that touches DNS, IAM, networking, and every dependency. Building the runbooks and rehearsing them regularly is months of work.
Most "AWS outages" are scoped to a service or region, not the whole cloud. By our rough estimate, multi-region within AWS handles about 95% of these.
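To put the bandwidth point in numbers, here is a back-of-the-envelope sketch using the $0.05-0.15/GB range quoted above. The 500 GB/day replication volume is a made-up figure for illustration, not our actual traffic.

```python
def monthly_replication_cost(gb_per_day: float, usd_per_gb: float, days: int = 30) -> float:
    """Egress cost of shipping gb_per_day across clouds for one month."""
    return gb_per_day * days * usd_per_gb


# Hypothetical workload: 500 GB/day of replicated database writes.
GB_PER_DAY = 500

low = monthly_replication_cost(GB_PER_DAY, 0.05)   # cheap end of the quoted range
high = monthly_replication_cost(GB_PER_DAY, 0.15)  # expensive end of the quoted range
print(f"${low:,.0f} to ${high:,.0f} per month")    # $750 to $2,250 per month
```

That covers only the wire cost of one replication stream. It leaves out the duplicated compute and storage on the receiving side.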

For most teams, multi-region within a single cloud gives you roughly 90% of the resilience benefit at roughly 10% of the operational cost. We do that. We do not do AWS-to-GCP failover.

The exception: regulated industries or critical infrastructure where "what if AWS goes down" is a real threat model. Those teams have the resources and reasons. Most don't.

What we actually run where

Our split:

On AWS (the primary):

Customer-facing apps (EKS clusters, RDS, ElastiCache, S3)
Most of our infrastructure tooling (CI/CD, monitoring, secret management)
Production data warehouse on Redshift (legacy choice; would pick BigQuery if starting today)

On GCP:

ML training pipelines (TPUs)
Analytics on BigQuery (data is replicated from production)
A few legacy services from the acquisition

Cross-cloud connectivity:

IPsec VPN between AWS VPCs and GCP VPCs (could use Direct Connect / Partner Interconnect for higher throughput)
IAM federation (workloads can assume cross-cloud roles via OIDC)
Shared monitoring (Datadog ingests from both)

What's hard about running both

A few specific pain points:

Two of everything. Two IAM systems. Two networking stacks. Two billing dashboards. Two consoles. Engineers need to context-switch between them. We've standardized on Terraform for infrastructure to abstract some of this, but the cloud-specific resources still differ.

Cross-cloud networking is fiddly. VPN tunnels need maintenance. Throughput limits become bottlenecks for high-volume data transfer (we hit the roughly 1 Gbps ceiling of a single VPN tunnel and had to set up multiple parallel tunnels). Latency between clouds varies (~30-50ms is typical between AWS us-east-1 and GCP us-central1).
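A rough sizing sketch for the parallel-tunnel setup, assuming the ~1 Gbps per-tunnel ceiling mentioned above. The 70% headroom factor and the example volumes are assumptions for illustration. The sketch also assumes traffic spreads evenly across tunnels, which depends on the routing setup and is not guaranteed.

```python
import math


def tunnels_needed(target_gbps: float, per_tunnel_gbps: float = 1.0,
                   headroom: float = 0.7) -> int:
    """Parallel tunnels required so each runs at no more than `headroom` of its ceiling."""
    usable_per_tunnel = per_tunnel_gbps * headroom
    return math.ceil(target_gbps / usable_per_tunnel)


def transfer_hours(data_tb: float, tunnels: int, per_tunnel_gbps: float = 1.0,
                   headroom: float = 0.7) -> float:
    """Hours to move `data_tb` decimal terabytes over the given number of tunnels."""
    gigabits = data_tb * 1000 * 8                      # TB -> GB -> gigabits
    aggregate_gbps = tunnels * per_tunnel_gbps * headroom
    return gigabits / aggregate_gbps / 3600


n = tunnels_needed(target_gbps=2.5)                    # hypothetical sustained target
print(n, round(transfer_hours(data_tb=10, tunnels=n), 1))
```

A single large flow is usually still bound by one tunnel, so this kind of sizing only helps when the transfer can be split into many concurrent streams.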

Identity federation is finicky. Setting up OIDC federation so AWS workloads can impersonate GCP service accounts took two weeks of trial and error. Documentation exists, but the error messages are unhelpful ("invalid token" with no detail on why).
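When the exchange fails with a bare "invalid token", one thing that helps is decoding the token locally and comparing its claims with what the receiving side is configured to accept. A minimal debugging sketch, assuming the token is a standard three-part JWT. The function names and the expected issuer and audience values are placeholders, not our real configuration.

```python
import base64
import json


def decode_jwt_claims(token: str) -> dict:
    """Return the payload claims of a JWT WITHOUT verifying its signature.

    Debugging aid only; never use unverified claims for authorization.
    """
    parts = token.split(".")
    if len(parts) != 3:
        raise ValueError("expected header.payload.signature")
    payload = parts[1]
    payload += "=" * (-len(payload) % 4)  # restore the base64 padding JWTs strip
    return json.loads(base64.urlsafe_b64decode(payload))


def claim_mismatches(claims: dict, expected: dict) -> dict:
    """Map claim name -> (expected, actual) for every claim that differs.

    Simplification: compares values directly, so a list-valued `aud`
    will not match a single expected string.
    """
    return {
        name: (want, claims.get(name))
        for name, want in expected.items()
        if claims.get(name) != want
    }
```

Typical usage is `claim_mismatches(decode_jwt_claims(token), {"iss": ..., "aud": ...})`. An empty result suggests the problem lies elsewhere, for example token expiry or the trust configuration on the receiving side.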

Cost visibility. Each cloud has its own billing; reconciling total cost requires pulling both into a common system. We have a custom dashboard that aggregates; building it took meaningful effort.
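The core of such an aggregation is mapping each provider's billing rows onto one shared shape before summing. A simplified sketch follows. The input field names (`team_tag`, `cost_usd`, `labels`, `cost`) are hypothetical stand-ins, not the real export schemas of either cloud, and this is not the dashboard's actual code.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class CostRow:
    cloud: str
    team: str
    usd: float


def from_aws(row: dict) -> CostRow:
    # Hypothetical columns; a real AWS billing export names these differently.
    return CostRow("aws", row["team_tag"], float(row["cost_usd"]))


def from_gcp(row: dict) -> CostRow:
    # Hypothetical columns; a real GCP billing export names these differently.
    return CostRow("gcp", row["labels"]["team"], float(row["cost"]))


def total_by_team(rows: list[CostRow]) -> dict[str, float]:
    """Sum spend per team across both clouds."""
    totals: dict[str, float] = defaultdict(float)
    for row in rows:
        totals[row.team] += row.usd
    return dict(totals)


rows = [
    from_aws({"team_tag": "ml", "cost_usd": "120.50"}),
    from_gcp({"labels": {"team": "ml"}, "cost": 300}),
    from_aws({"team_tag": "web", "cost_usd": "80"}),
]
print(total_by_team(rows))
```

Most of the real effort is in consistent tagging: a team label that is missing or spelled differently in one cloud silently splits the totals.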

Vendor relationship overhead. Two account teams, two enterprise discount agreements, two reservations to manage. AWS reservations don't help with GCP commitments and vice versa.

What's easy

The benefits we get from multi-cloud:

Best-of-breed for specific services. TPUs for ML training. BigQuery for analytics. SES for transactional email (AWS — GCP's equivalent is weaker). DynamoDB for some specific workloads.

Negotiating leverage. Both vendors know we have the other. Discount conversations go better when we can credibly threaten to migrate workloads.

No catastrophic single-vendor lock-in. We're locked into specific services on each cloud, but because neither cloud hosts everything, we keep the option of shifting workloads between them.

These benefits are real but small compared to the operational cost.

The patterns that make it bearable

A few practices that reduce the multi-cloud tax:

Standardize everything that can be standardized. Terraform for infrastructure (different providers, similar patterns). GitHub Actions / Argo CD for CI/CD across both. Datadog for monitoring across both. Snowflake or BigQuery for analytics across both.

Pick one as primary. Most workloads run on AWS. New services default to AWS. GCP is for specific use cases. This avoids the "where should this go" decision becoming a chronic debate.

Cross-cloud only where it matters. If a service is on AWS, its dependencies should be on AWS too unless there's a specific reason. We don't deploy app servers on AWS that talk to databases on GCP — too much cross-cloud network cost and complexity.
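A rule like this is easy to check mechanically if you keep a map of where each service runs. A small sketch of such a check; the service names and the allowlist are invented for illustration, and we are not claiming this is the exact tooling we run.

```python
def cross_cloud_edges(placement: dict[str, str],
                      deps: dict[str, list[str]],
                      allowed: frozenset = frozenset()) -> list[tuple[str, str]]:
    """Return (service, dependency) pairs that span clouds and are not allowlisted."""
    violations = []
    for service, targets in deps.items():
        for target in targets:
            crosses = placement[service] != placement[target]
            if crosses and (service, target) not in allowed:
                violations.append((service, target))
    return violations


# Hypothetical catalog: which cloud each service lives on, and what it calls.
placement = {"checkout-api": "aws", "orders-db": "aws",
             "analytics-export": "aws", "warehouse": "gcp"}
deps = {"checkout-api": ["orders-db", "warehouse"],
        "analytics-export": ["warehouse"]}

# The replication job is a deliberate, documented exception.
allowed = frozenset({("analytics-export", "warehouse")})

print(cross_cloud_edges(placement, deps, allowed))
# [('checkout-api', 'warehouse')]
```

Run in CI, a check like this turns "unless there's a specific reason" into an explicit allowlist entry that someone has to justify in review.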

Treat cross-cloud connectivity as a service with limits. It's not free or unlimited. Architect for "minimize cross-cloud calls" — batch data transfer, async where possible.
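One way to apply "batch data transfer" in code is to buffer records and ship them in chunks rather than one call per record. A minimal, single-threaded sketch; the `Batcher` class and the `send` callable are illustrative names, and a production version would also want a time-based flush and retry handling.

```python
from typing import Any, Callable


class Batcher:
    """Buffer items and hand them to `send` in batches of up to `max_items`."""

    def __init__(self, send: Callable[[list], Any], max_items: int = 500) -> None:
        self._send = send
        self._max_items = max_items
        self._buffer: list = []

    def add(self, item: Any) -> None:
        self._buffer.append(item)
        if len(self._buffer) >= self._max_items:
            self.flush()

    def flush(self) -> None:
        """Send whatever is buffered; call this on shutdown so nothing is dropped."""
        if self._buffer:
            batch, self._buffer = self._buffer, []
            self._send(batch)


sent: list = []                      # stands in for a cross-cloud API call
batcher = Batcher(sent.append, max_items=3)
for record in range(7):
    batcher.add(record)
batcher.flush()
print(sent)  # [[0, 1, 2], [3, 4, 5], [6]]
```

Seven records become three cross-cloud calls instead of seven. The trade-off is added latency for records that sit in the buffer, which is why this suits asynchronous paths.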

Strong automation around setup. Provisioning a workload that spans both clouds shouldn't be manual. Our Terraform modules handle the cross-cloud IAM federation, networking, and DNS automatically.

Cost reality

For our org of ~40 services, the rough monthly multi-cloud overhead breaks down as:

Cross-cloud network: ~$1,200/month (data transfer)
VPN appliances and management: ~$400/month
Engineer time on cross-cloud issues: ~12 hours/month
Duplicated tooling subscriptions where one cloud doesn't qualify: ~$300/month

Total: roughly $1,900 a month in direct spend, plus about 12 hours of engineering time. The benefits (TPU savings, BigQuery efficiency) are also a few thousand a month, plus quality wins. Net: roughly break-even on direct cost, and positive on the quality of specific services.
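The arithmetic behind that total, as a small sketch. The dollar figures and the 12 hours come from the breakdown above; the $100/hour loaded engineering rate is purely an assumption for illustration, so substitute your own.

```python
# Monthly figures from the breakdown above (USD).
DIRECT_COSTS = {
    "cross_cloud_network": 1200,
    "vpn_appliances": 400,
    "duplicated_tooling": 300,
}
ENGINEER_HOURS = 12


def monthly_overhead(hourly_rate_usd: float) -> tuple[float, float]:
    """Return (direct spend, direct spend plus engineer time) in USD per month."""
    direct = sum(DIRECT_COSTS.values())
    return direct, direct + ENGINEER_HOURS * hourly_rate_usd


# The hourly rate is an assumed value, not a figure from our books.
direct, loaded = monthly_overhead(hourly_rate_usd=100)
print(direct, loaded)  # 1900 3100
```

Under that assumed rate, engineer time adds more than half again to the direct bill, which is why the time cost belongs in the comparison.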

This is the case for our specific shape. A team that wanted multi-cloud without specific workloads driving it would pay the overhead without getting comparable benefit.

What we're not doing

Active-active across clouds. Not worth the complexity for our risk profile.
Spot/preemptible across clouds based on price. Sounds good, doesn't work out (the savings are small, the orchestration complex).
Cross-cloud Kubernetes clusters (a single cluster with nodes in both clouds). Was a recurring pitch a few years ago; we don't see anyone doing it well in production.

Specific workloads: where each cloud shines

AWS strengths for us:

ECS / EKS ecosystem maturity
RDS Postgres (we know it well)
IAM granularity
SES, SQS, SNS — the messaging primitives
Cost: spot instances are very cheap

GCP strengths for us:

TPUs for training
BigQuery for analytics (vs Redshift complexity)
Native Kubernetes (GKE) feels more polished than EKS
Better default networking (VPCs are global; AWS VPCs are regional)

Where we don't see clear winners:

Object storage: S3 vs GCS — both are fine
Compute: EC2 vs GCE — both are fine
Managed databases: RDS vs Cloud SQL — both have rough edges

What I'd tell a team considering multi-cloud

Have a specific reason. "We want to be multi-cloud" without a workload-driven reason is a recipe for paying overhead without getting the benefit.

Pick one as primary. Don't try to split 50-50. The operational cognitive load is too high.

Don't do multi-cloud for "resilience" against the primary going down. Multi-region within one cloud is much cheaper for similar effective resilience.

Standardize tooling across clouds. Terraform, monitoring, CI/CD. The differences become smaller when the abstractions are common.

Watch the cross-cloud bandwidth bill. It sneaks up. Architect to minimize cross-cloud calls.

Account for the time cost. Engineers spend real time on multi-cloud issues. Budget for it.

If you have a specific workload that runs better on a different cloud, multi-cloud is a reasonable choice. If you're chasing some abstract benefit (resilience, lock-in avoidance, "cloud agnostic"), the math usually doesn't work. The teams I've seen succeed at multi-cloud have clear, workload-driven reasons. The teams that struggle treat it as an aspiration.

About Kiril Urbonas
