We removed the corporate VPN, set up workload identity everywhere, and made every service prove who it is on every call. This is the actual implementation, including what worked and what we abandoned.

Zero Trust in a Multi-Cloud Environment

The phrase "zero trust" is overused to the point of meaninglessness. In practice, what we built is: every workload has an identity, every connection is authenticated, the corporate VPN is gone, and access decisions happen per-request based on the workload's identity rather than its network location. This is the multi-cloud version, where the workloads run across AWS and GCP and need to authenticate to each other.

What we threw out #

A few patterns we explicitly removed:

The corporate VPN. It was the gate to "internal" services. Once on the VPN, you had broad network access, so compromising one VPN account meant compromising everything "internal." We replaced VPN-gated access with per-service auth.

IP allowlists between services. Service A allows ingress from Service B's CIDR. This is brittle (CIDRs change) and weak (compromised B reaches A; lateral movement is trivial).

Long-lived credentials in environment variables. Every long-lived credential is a credential that can leak.

The implicit "all internal traffic is trusted" assumption. Internal traffic gets the same auth and encryption as external traffic.

The components we run #

The system has four moving pieces:

Identity providers for humans (Okta) and workloads (cloud-native — IAM Roles for AWS, Service Accounts for GCP).
Identity-aware proxies in front of every internal service that humans access (we use Cloudflare Access; could be Tailscale, AWS Verified Access, or self-hosted Pomerium).
mTLS between services (via service mesh — Linkerd in our case).
Per-call authorization in services for sensitive operations.

Each piece is independently useful; together they replace the VPN.

Identity for humans: SSO + IdP-aware proxies #

Humans authenticate to Okta. Okta enforces MFA via hardware tokens. Once authenticated, users federate from Okta into each cloud and receive short-lived credentials (1-hour AWS sessions, GCP credentials, etc.) for backend access.

For accessing internal web services (admin dashboards, internal APIs, monitoring tools), we route through Cloudflare Access:

The service has no public direct access (it's behind a Cloudflare-only ingress).
Users go to service.internal.company.com → Cloudflare Access challenges them → Okta SSO → back to Cloudflare → service.
The service receives a JWT from Cloudflare with the user's identity and group membership.
The service makes per-route authorization decisions.

This is replacement #1 for the VPN. Engineers don't need to "connect to the corporate network"; they go to URLs and authenticate.
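
The per-route decision in the last step can be sketched roughly as follows. This is a minimal illustration, not our production code: it assumes the JWT's signature, expiry, and audience have already been verified upstream, that group membership arrives in a `groups` claim, and the `ROUTE_POLICY` table and group names are hypothetical.

```python
from fnmatch import fnmatchcase

# Hypothetical route policy: path pattern -> groups allowed to reach it.
# The first matching pattern wins, so list specific patterns first.
ROUTE_POLICY = [
    ("/admin/*", {"platform-admins"}),
    ("/reports/*", {"platform-admins", "finance"}),
    ("/healthz", None),  # None means any authenticated user
]


def is_route_allowed(claims, path):
    """Decide whether the user described by verified JWT claims may access path.

    Assumes signature, expiry, and audience checks already happened, and that
    the identity-aware proxy puts group membership in a "groups" claim.
    """
    user_groups = set(claims.get("groups", []))
    for pattern, allowed_groups in ROUTE_POLICY:
        if fnmatchcase(path, pattern):
            if allowed_groups is None:
                return True
            return bool(user_groups & allowed_groups)
    return False  # no rule matched: deny by default
```

Denying when no rule matches keeps a newly added route closed until someone writes a policy for it.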

Identity for workloads: cloud-native, federated #

Each workload has an identity:

AWS: IAM role attached to the EC2 instance / ECS task / EKS pod (via IRSA) / Lambda.
GCP: Service Account attached to the GCE instance / GKE workload / Cloud Run service.

Cross-cloud, we use workload identity federation:

A GCP workload assumes an AWS role via OIDC token federation. No long-lived AWS access keys in GCP.
An AWS workload assumes a GCP service account via OIDC token federation. No long-lived GCP keys in AWS.

The trust setup is configured once per cross-cloud pair. After that, workloads exchange tokens dynamically and get short-lived (1-hour) credentials.
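
As a rough sketch of the GCP-to-AWS direction: the workload asks the GCP metadata server for an identity token, then exchanges it at AWS STS for temporary credentials. The code assumes `sts_client` is something like `boto3.client("sts")` (boto3 is an assumption here, not something named above), that the AWS role's trust policy already accepts Google-issued tokens for the chosen audience, and the session name is a placeholder.

```python
import urllib.request
from urllib.parse import quote

# GCE/GKE metadata endpoint that returns a Google-signed OIDC identity token.
METADATA_IDENTITY_URL = (
    "http://metadata.google.internal/computeMetadata/v1/"
    "instance/service-accounts/default/identity?audience={audience}"
)


def fetch_gcp_identity_token(audience):
    """Ask the metadata server for an identity token for this workload."""
    request = urllib.request.Request(
        METADATA_IDENTITY_URL.format(audience=quote(audience, safe="")),
        headers={"Metadata-Flavor": "Google"},
    )
    with urllib.request.urlopen(request, timeout=5) as response:
        return response.read().decode("utf-8")


def assume_aws_role(sts_client, role_arn, identity_token):
    """Exchange the OIDC token for temporary AWS credentials valid for 1 hour."""
    response = sts_client.assume_role_with_web_identity(
        RoleArn=role_arn,
        RoleSessionName="gcp-workload",  # hypothetical session name
        WebIdentityToken=identity_token,
        DurationSeconds=3600,
    )
    # Dict with AccessKeyId, SecretAccessKey, SessionToken, Expiration.
    return response["Credentials"]
```

Passing the STS client in as a parameter keeps the exchange logic easy to exercise with a stub, which helps when debugging federation failures.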

This eliminated the last set of long-lived cloud credentials we had stored anywhere. Every cloud credential in our system now expires within an hour of being issued.

mTLS between services #

Inside the cluster, every service-to-service call is mTLS. Both sides present certs; both sides validate. The certs encode the workload identity (e.g., spiffe://cluster.local/ns/payments/sa/api).

We use Linkerd, which handles mTLS automatically. Cert rotation is automatic (every 24h by default). We don't think about certs anymore.

The benefit is two-fold:

Encryption in transit is universal. No more "is this connection TLS or not?" — yes.
Authenticated identity for authorization decisions. The receiving service knows exactly which workload is calling.

For cross-cluster calls, we extended this with a federated trust between cluster CAs. A workload in cluster A presenting its mesh-issued cert can authenticate to a workload in cluster B.
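
To make the identity format concrete, here is a small sketch that parses an identity URI of the shape shown above and checks its trust domain against a list of federated domains. The mesh does the actual certificate validation; this only illustrates what a service can do with the identity afterwards. The function names and the second trust domain in the usage note are hypothetical.

```python
from typing import NamedTuple
from urllib.parse import urlsplit


class WorkloadIdentity(NamedTuple):
    trust_domain: str
    namespace: str
    service_account: str


def parse_workload_id(uri):
    """Parse spiffe://<trust-domain>/ns/<namespace>/sa/<service-account>."""
    parts = urlsplit(uri)
    segments = parts.path.strip("/").split("/")
    well_formed = (
        parts.scheme == "spiffe"
        and bool(parts.netloc)
        and len(segments) == 4
        and segments[0] == "ns"
        and segments[2] == "sa"
        and bool(segments[1])
        and bool(segments[3])
    )
    if not well_formed:
        raise ValueError(f"not a recognized workload identity: {uri!r}")
    return WorkloadIdentity(parts.netloc, segments[1], segments[3])


def is_trusted(identity, trusted_domains):
    """Accept only identities issued under a trust domain we federate with."""
    return identity.trust_domain in trusted_domains
```

For example, `is_trusted(parse_workload_id("spiffe://cluster.local/ns/payments/sa/api"), {"cluster.local", "cluster-b.example"})` would accept the caller, while an identity from an unlisted trust domain would be rejected.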

Per-call authorization #

Authentication says "who is this." Authorization says "are they allowed to do X." We enforce authorization in the application or via a policy proxy.

For sensitive operations, the service receives the caller's identity (from mTLS or JWT) and checks against an authorization policy:

```python
@require_workload("payments-api")
def cancel_subscription(user_id: str):
    ...
```

The decorator validates the caller is the payments-api workload (and rejects anything else with 403). Different operations can have different allowed callers.
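
One way such a decorator could be written is sketched below. It assumes request middleware has already extracted the caller's identity (from the mTLS peer or the JWT) and stored it in a context variable; `current_caller` and the `Forbidden` exception are hypothetical names, and a real service would map `Forbidden` to an HTTP 403 response in its framework.

```python
import functools
from contextvars import ContextVar

# Hypothetical: middleware sets this per request from the mTLS peer or JWT.
current_caller = ContextVar("current_caller", default=None)


class Forbidden(Exception):
    """Raised when the caller may not perform the operation (maps to HTTP 403)."""

    status_code = 403


def require_workload(*allowed):
    """Allow the wrapped function to run only for the named calling workloads."""
    allowed_set = frozenset(allowed)

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            caller = current_caller.get()
            # An unset caller (None) is never in the allowed set, so it is rejected.
            if caller not in allowed_set:
                raise Forbidden(f"workload {caller!r} may not call {func.__name__}")
            return func(*args, **kwargs)

        return wrapper

    return decorator


@require_workload("payments-api")
def cancel_subscription(user_id):
    return f"cancelled subscription for {user_id}"
```

Because the decorator takes a variable number of names, an operation with several legitimate callers can list them all, for example `@require_workload("payments-api", "billing-worker")`.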

For more complex policies, we use OPA (Open Policy Agent). The service queries OPA with the request context; OPA returns allow/deny. This centralizes policy without coupling it to the service code.
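
A minimal sketch of that query, using OPA's REST Data API: the service POSTs an `input` document and reads back `result`. The sidecar address, the policy path (`httpapi/authz/allow`), and the shape of the input are all assumptions made for illustration.

```python
import json
import urllib.request

# Hypothetical OPA sidecar address and policy path.
OPA_URL = "http://localhost:8181/v1/data/httpapi/authz/allow"


def build_opa_request(caller, action, resource):
    """Build the POST request carrying the request context as OPA input."""
    body = json.dumps(
        {"input": {"caller": caller, "action": action, "resource": resource}}
    )
    return urllib.request.Request(
        OPA_URL,
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def parse_decision(response_body):
    """Return True only for an explicit true result; anything else is a deny."""
    try:
        document = json.loads(response_body)
    except ValueError:
        return False
    return isinstance(document, dict) and document.get("result") is True


def is_allowed(caller, action, resource):
    """Query OPA and fail closed if it is unreachable or returns an error."""
    try:
        request = build_opa_request(caller, action, resource)
        with urllib.request.urlopen(request, timeout=2) as response:
            return parse_decision(response.read())
    except OSError:
        return False
```

Treating a missing `result` as a deny matters because OPA returns an empty document when the rule is undefined for the given input.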

What we deliberately didn't do #

Hardware-backed identity for every workload. SPIFFE/SPIRE with TPM-backed identities is the gold standard. The complexity is real; the marginal security benefit over cloud-native identities is small for our threat model. Maybe in a few years.

Continuous device attestation for human access. "Is the device the user is on actually their managed laptop?" Cloudflare Access + Okta have features for this; we enabled basic posture checks (managed device, encrypted disk, MFA enrolled) but didn't build deeper attestation.

Application-layer encryption in addition to mTLS. Some teams encrypt data fields in the payload; we rely on transport encryption + at-rest encryption. For specific fields (PII, secrets) we layer field-level encryption; not universally.

eBPF-enforced network policies as a primary control. Cilium can do this; it's powerful. We use NetworkPolicies as a fallback layer but rely on mTLS + auth as the primary control.

Migration: how we got from "VPN-everything" to here #

It took about 18 months. The order:

1. SSO for everything. Okta in front of every cloud console and every internal tool. This took 3 months; it was slow because changing auth on each tool means coordinating with that tool's admin.
2. Identity-aware proxy for internal web tools. Cloudflare Access in front of admin dashboards. ~2 months.
3. Workload identity for cloud-to-cloud. Federation setup, then replacing long-lived keys with short-lived federated credentials. ~3 months.
4. mTLS via service mesh. ~4 months, including running the mesh and tuning it.
5. Per-call authorization for sensitive operations. Ongoing; about 60% of services have it now.
6. Decommission the VPN. Once everyone had SSO, the IdP-aware proxy, and cloud federation, the VPN had no remaining users. We turned it off; nobody complained.

The order was important. Removing the VPN before having replacements would have stranded people. Each replacement had to fully cover the VPN's role for some subset of users before we could remove anything.

What broke during migration #

A few specific issues:

Long-running scripts using long-lived credentials. Engineers had scripts on their machines using AWS access keys. When we forced SSO-only, those scripts broke. We provided a wrapper (aws-vault style) that fetches short-lived credentials and runs the script. Took a few weeks for everyone to migrate.
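
The core of such a wrapper is small. The sketch below assumes the short-lived credentials have already been obtained (how is left out) and arrive as a dict with the same keys AWS STS uses; it then runs the script with them injected through the standard AWS environment variables. The function name is hypothetical.

```python
import os
import subprocess


def run_with_credentials(command, credentials):
    """Run command with short-lived AWS credentials injected via environment.

    command is an argument list such as ["python", "report.py"]; credentials
    is a dict with AccessKeyId, SecretAccessKey, and SessionToken keys.
    """
    env = dict(os.environ)  # copy so the parent environment stays untouched
    env.update(
        {
            "AWS_ACCESS_KEY_ID": credentials["AccessKeyId"],
            "AWS_SECRET_ACCESS_KEY": credentials["SecretAccessKey"],
            "AWS_SESSION_TOKEN": credentials["SessionToken"],
        }
    )
    completed = subprocess.run(command, env=env, check=False)
    return completed.returncode
```

The credentials exist only in the child process's environment, so nothing long-lived is written to disk or left in the engineer's shell.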

Hard-coded IPs in security groups. Services that allowlisted specific bastion or developer IPs needed updating to reference SGs or to use Cloudflare-tunneled access. We catalogued and migrated them over a quarter.

Cron jobs and background tasks. Some long-running jobs assumed credentials would still be valid 24 hours later. With 1-hour credentials, they had to be updated to refresh. Most cloud SDKs handle this transparently; older custom scripts didn't.
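
For the custom scripts, the fix usually amounts to a small cache that re-fetches credentials shortly before they expire. A minimal sketch, assuming a caller-supplied `fetch` function that returns the credentials together with their expiry as epoch seconds (the class name and the 5-minute margin are illustrative choices, and the class is not thread-safe):

```python
import time


class RefreshingCredentials:
    """Cache short-lived credentials and renew them shortly before expiry."""

    def __init__(self, fetch, refresh_margin=300.0, clock=time.time):
        self._fetch = fetch  # returns (credentials, expires_at_epoch_seconds)
        self._margin = refresh_margin
        self._clock = clock
        self._credentials = None
        self._expires_at = 0.0

    def get(self):
        """Return valid credentials, fetching new ones if expiry is near."""
        expiring = self._clock() >= self._expires_at - self._margin
        if self._credentials is None or expiring:
            self._credentials, self._expires_at = self._fetch()
        return self._credentials
```

A long-running job then calls `get()` each time it needs credentials instead of reading them once at startup.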

Third-party integrations. Several SaaS tools we used wanted long-lived credentials. We worked with the vendors to add OIDC support; some did, one didn't. We replaced the one that didn't.

Operational reality #

What it looks like day-to-day now:

Engineers run aws sso login once per day; that's their authentication.
Workloads run with cloud-native identity; no key management.
Service-to-service traffic is encrypted and authenticated automatically.
Audit logs show "user X via Okta did Y" or "workload A called workload B" — clean audit trail.
Security incident on a leaked credential: the exposure window is at most an hour (until the credential expires) instead of months.

What's hard #

Onboarding new tools. Every new service requires identity provisioning. We have a checklist; it's still ~30 minutes per service.

Debugging auth failures. When mTLS or token federation fails, the error messages are not always helpful. We've built runbooks for the common failure modes.

Cost. Identity-aware proxy services cost money (Cloudflare Access in our case). The total adds up to a few hundred dollars/month. Compared to "the cost of one breach," cheap. Compared to "free VPN we self-hosted," not free.

Multi-cloud token plumbing. When a workload in GCP needs to call something in AWS that needs to call something in Azure, the token-exchange chain gets complicated. We limit cross-cloud chains to two hops; deeper requires explicit intermediary services.
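
A rule like this can be checked mechanically, for example in a design review script. The sketch below is only an illustration of the counting, not how we enforce it: it assumes a call path is written as the list of clouds each service in the chain runs in, and the function names are hypothetical.

```python
MAX_CROSS_CLOUD_HOPS = 2


def count_cross_cloud_hops(call_path):
    """Count how many times a call path crosses from one cloud to another."""
    return sum(1 for a, b in zip(call_path, call_path[1:]) if a != b)


def check_call_path(call_path):
    """Return the hop count, or raise if the chain exceeds the limit."""
    hops = count_cross_cloud_hops(call_path)
    if hops > MAX_CROSS_CLOUD_HOPS:
        raise ValueError(
            f"{hops} cross-cloud hops in {call_path}; "
            f"limit is {MAX_CROSS_CLOUD_HOPS}, add an intermediary service"
        )
    return hops
```

Under this counting, the GCP to AWS to Azure chain described above is exactly two hops, so it sits at the limit.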

What I'd tell a team starting #

Start with SSO and a centralized IdP. Without that foundation, the rest doesn't compose.

Use cloud-native workload identity, not third-party. AWS IAM Roles, GCP Service Accounts. Federation between clouds. Avoid bringing in a third-party identity solution unless you have a specific reason.

Don't underestimate the migration time. Replacing VPN + long-lived credentials in a real org is 12-18 months. Plan accordingly.

Service mesh for mTLS is the easy path. Linkerd or Istio. The alternative — every service implementing TLS itself — is harder and inconsistent.

Per-call authorization grows over time. Don't try to retrofit it everywhere on day one. Start with the highest-risk operations and expand.

Decommission the VPN at the end, not the start. It's the safety net during migration. Cut it only when everyone has migrated to the replacements.

Zero trust isn't a product or a checkbox. It's a set of related decisions to remove implicit trust from your architecture. Some of those decisions are easy; some take quarters of engineering work. The end state — short-lived credentials, encrypted traffic, identity-based authorization — is meaningfully more secure than what most orgs run today, and the operational ergonomics are usually better than the VPN-based world it replaces.

About Kiril Urbonas