We removed the corporate VPN, set up workload identity everywhere, and made every service prove who it is on every call. The actual implementation, with what worked and what we abandoned.
The phrase "zero trust" is overused to the point of meaninglessness. In practice, what we built is: every workload has an identity, every connection is authenticated, the corporate VPN is gone, and access decisions happen per-request based on the workload's identity rather than its network location. This is the multi-cloud version, where the workloads run across AWS and GCP and need to authenticate to each other.
A few patterns we explicitly removed:
The corporate VPN. It was the gate to "internal" services. Once on the VPN, you had broad network access. Compromise of one VPN account = compromise of "internal." We replaced VPN-gated access with per-service auth.
IP allowlists between services. Service A allows ingress from Service B's CIDR. This is brittle (CIDRs change) and weak (compromised B reaches A; lateral movement is trivial).
Long-lived credentials in environment variables. Every long-lived credential is a credential that can leak.
The implicit "all internal traffic is trusted" assumption. Internal traffic gets the same auth and encryption as external traffic.
The system has four moving pieces:
Each piece is independently useful; together they replace the VPN.
Humans authenticate to Okta. Okta enforces MFA via hardware tokens. Once authenticated, Okta issues short-lived credentials (1-hour AWS sessions, GCP credentials, etc.) for backend access.
For accessing internal web services (admin dashboards, internal APIs, monitoring tools), we route through Cloudflare Access:
An engineer opens service.internal.company.com → Cloudflare Access challenges them → Okta SSO → back to Cloudflare → service.
This is replacement #1 for the VPN. Engineers don't need to "connect to the corporate network"; they go to URLs and authenticate.
Each workload has an identity:
Cross-cloud, we use workload identity federation:
The trust setup is configured once per cross-cloud pair. After that, workloads exchange tokens dynamically and get short-lived (1-hour) credentials.
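The exchange can be pictured with a toy model. This is a sketch under stated assumptions, not the AWS STS or GCP STS API: `TrustRule`, `ShortLivedCredential`, `exchange_token`, and every claim value are hypothetical, and a real exchange verifies the token's signature against the issuer's published keys before it looks at any claims.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class TrustRule:
    """One configured trust: tokens from `issuer` for `subject` may obtain `role`."""
    issuer: str
    audience: str
    subject: str
    role: str


@dataclass(frozen=True)
class ShortLivedCredential:
    role: str
    expires_at: datetime


def exchange_token(claims: dict, rules: list, now: datetime,
                   ttl: timedelta = timedelta(hours=1)) -> ShortLivedCredential:
    """Swap already-verified identity token claims for a credential valid for `ttl`."""
    if claims["exp"] <= now.timestamp():
        raise PermissionError("identity token has expired")
    for rule in rules:
        presented = (claims["iss"], claims["aud"], claims["sub"])
        if presented == (rule.issuer, rule.audience, rule.subject):
            return ShortLivedCredential(role=rule.role, expires_at=now + ttl)
    raise PermissionError("no trust rule matches this token")


# Usage with made-up values: one trust rule, one token that matches it.
now = datetime.now(timezone.utc)
rules = [TrustRule("https://accounts.google.com", "example-audience",
                   "example-subject", "orders-reader")]
claims = {"iss": "https://accounts.google.com", "aud": "example-audience",
          "sub": "example-subject", "exp": now.timestamp() + 300}
credential = exchange_token(claims, rules, now)
print(credential.expires_at - now)  # 1:00:00
```

The point of the model is that the trust rule is static configuration while the credential is minted per exchange, so nothing long-lived has to be stored on either side.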
This eliminated the last set of long-lived cloud credentials we had stored anywhere. Every cloud credential in our system now expires within an hour.
Inside the cluster, every service-to-service call is mTLS. Both sides present certs; both sides validate. The certs encode the workload identity (e.g., spiffe://cluster.local/ns/payments/sa/api).
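To make the identity format concrete, here is a minimal sketch that splits an ID like the one above into its parts. The `ns/<namespace>/sa/<service account>` path layout is assumed from that example, and `WorkloadIdentity` and `parse_spiffe_id` are hypothetical names, not part of any mesh's API.

```python
from dataclasses import dataclass
from urllib.parse import urlparse


@dataclass(frozen=True)
class WorkloadIdentity:
    trust_domain: str
    namespace: str
    service_account: str


def parse_spiffe_id(spiffe_id: str) -> WorkloadIdentity:
    """Split an ID shaped like spiffe://<domain>/ns/<namespace>/sa/<account>."""
    parsed = urlparse(spiffe_id)
    if parsed.scheme != "spiffe" or not parsed.netloc:
        raise ValueError(f"not a SPIFFE ID: {spiffe_id!r}")
    parts = parsed.path.strip("/").split("/")
    if len(parts) != 4 or parts[0] != "ns" or parts[2] != "sa":
        raise ValueError(f"unexpected path layout: {parsed.path!r}")
    return WorkloadIdentity(parsed.netloc, parts[1], parts[3])


identity = parse_spiffe_id("spiffe://cluster.local/ns/payments/sa/api")
print(identity)
# WorkloadIdentity(trust_domain='cluster.local', namespace='payments', service_account='api')
```

Authorization rules can then match on the namespace and service account rather than on anything network-derived.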
We use Linkerd, which handles mTLS automatically. Cert rotation is automatic (every 24h by default). We don't think about certs anymore.
The benefit is two-fold:
For cross-cluster calls, we extended this with a federated trust between cluster CAs. A workload in cluster A presenting its mesh-issued cert can authenticate to a workload in cluster B.
Authentication says "who is this." Authorization says "are they allowed to do X." We enforce authorization in the application or via a policy proxy.
For sensitive operations, the service receives the caller's identity (from mTLS or JWT) and checks against an authorization policy:
@require_workload("payments-api")
def cancel_subscription(user_id: str):
    ...
The decorator validates the caller is the payments-api workload (and rejects anything else with 403). Different operations can have different allowed callers.
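One way such a decorator could be written is sketched below. It is an illustration, not our production code: `current_caller` and `Forbidden` are hypothetical names, and the sketch assumes request middleware has already verified the caller's identity (from the mTLS cert or JWT) and stored it in a context variable.

```python
import functools
from contextvars import ContextVar

# Assumption: middleware sets this per request after verifying the caller's identity.
current_caller = ContextVar("current_caller", default=None)


class Forbidden(Exception):
    """Raised when the caller may not run the operation; map this to HTTP 403."""
    status_code = 403


def require_workload(*allowed: str):
    """Let the wrapped function run only for the named caller workloads."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            caller = current_caller.get()
            if caller not in allowed:
                raise Forbidden(f"workload {caller!r} may not call {func.__name__}")
            return func(*args, **kwargs)
        return wrapper
    return decorator


@require_workload("payments-api")
def cancel_subscription(user_id: str) -> str:
    return f"cancelled subscription for {user_id}"
```

Because the decorator takes a variable number of names, an operation with several legitimate callers can list them all, and a request with no verified identity is rejected by default.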
For more complex policies, we use OPA (Open Policy Agent). The service queries OPA with the request context; OPA returns allow/deny. This centralizes policy without coupling it to the service code.
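A query of that kind might look like the sketch below, which uses OPA's REST Data API (`POST /v1/data/<path>` with an `input` document). The policy path `authz/allow`, the shape of the input, and the function names are assumptions for illustration; the URL assumes OPA running locally on its default port.

```python
import json
import urllib.request

OPA_URL = "http://localhost:8181"  # OPA's default listen port
POLICY_PATH = "authz/allow"        # hypothetical policy rule


def build_opa_request(caller: str, operation: str) -> urllib.request.Request:
    """Build the Data API request carrying the request context as `input`."""
    body = json.dumps({"input": {"caller": caller, "operation": operation}}).encode()
    return urllib.request.Request(
        f"{OPA_URL}/v1/data/{POLICY_PATH}",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def is_allowed(response_body: bytes) -> bool:
    """Deny unless OPA explicitly returns {"result": true}."""
    return json.loads(response_body).get("result") is True


def check(caller: str, operation: str) -> bool:
    """Send the query to OPA and return its decision."""
    request = build_opa_request(caller, operation)
    with urllib.request.urlopen(request, timeout=2) as response:
        return is_allowed(response.read())
```

OPA omits `result` when the rule is undefined for the given input, so treating anything other than an explicit `true` as a deny keeps the check fail-closed.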
Hardware-backed identity for every workload. SPIFFE/SPIRE with TPM-backed identities is the gold standard. The complexity is real; the marginal security benefit over cloud-native identities is small for our threat model. Maybe in a few years.
Continuous device attestation for human access. "Is the device the user is on actually their managed laptop?" Cloudflare Access + Okta have features for this; we enabled basic posture checks (managed device, encrypted disk, MFA enrolled) but didn't build deeper attestation.
Application-layer encryption in addition to mTLS. Some teams encrypt data fields in the payload; we rely on transport encryption + at-rest encryption. For specific fields (PII, secrets) we layer field-level encryption; not universally.
eBPF-enforced network policies as a primary control. Cilium can do this; it's powerful. We use NetworkPolicies as a fallback layer but rely on mTLS + auth as the primary control.
It took about 18 months. The order:
The order was important. Removing the VPN before having replacements would have stranded people. Each replacement had to fully cover the VPN's role for some subset of users before we could remove anything.
A few specific issues:
Long-running scripts using long-lived credentials. Engineers had scripts on their machines using AWS access keys. When we forced SSO-only, those scripts broke. We provided a wrapper (aws-vault style) that fetches short-lived credentials and runs the script. Took a few weeks for everyone to migrate.
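The shape of such a wrapper is roughly the following. This is a sketch, not the tool we shipped: `run_with_credentials` and `fake_fetch` are hypothetical, and `fake_fetch` stands in for whatever SSO-backed lookup actually returns the short-lived session. Only the environment variable names are real; they are the ones the AWS SDKs and CLI read.

```python
import os
import subprocess
import sys
from typing import Callable, Mapping, Sequence


def run_with_credentials(fetch: Callable[[], Mapping[str, str]],
                         command: Sequence[str]) -> subprocess.CompletedProcess:
    """Run `command` with freshly fetched credentials in the child's environment only."""
    env = {**os.environ, **fetch()}
    return subprocess.run(command, env=env, capture_output=True, text=True, check=False)


def fake_fetch() -> Mapping[str, str]:
    # Stand-in for the real short-lived credential lookup; values are placeholders.
    return {
        "AWS_ACCESS_KEY_ID": "ASIAEXAMPLE",
        "AWS_SECRET_ACCESS_KEY": "example-secret",
        "AWS_SESSION_TOKEN": "example-session-token",
    }


result = run_with_credentials(
    fake_fetch,
    [sys.executable, "-c", "import os; print(os.environ['AWS_ACCESS_KEY_ID'])"],
)
print(result.stdout.strip())  # ASIAEXAMPLE
```

The credentials exist only in the child process's environment, so nothing is written to disk and the existing script runs unmodified.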
Hard-coded IPs in security groups. Services that whitelisted specific bastion or developer IPs needed updating to reference SGs or to use Cloudflare-tunneled access. We catalogued and migrated them over a quarter.
Cron jobs and background tasks. Some long-running jobs assumed credentials would still be valid 24 hours later. With 1-hour credentials, they had to be updated to refresh. Most cloud SDKs handle this transparently; older custom scripts didn't.
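For custom scripts, the fix usually amounts to asking for the credential each time it is needed and re-fetching shortly before expiry. A minimal sketch, with hypothetical names (`Credential`, `RefreshingCredential`) and an assumed five-minute safety margin:

```python
import time
from dataclasses import dataclass
from typing import Callable


@dataclass
class Credential:
    token: str
    expires_at: float  # seconds since the epoch


class RefreshingCredential:
    """Hands out a credential, re-fetching it shortly before it expires."""

    def __init__(self, fetch: Callable[[], Credential],
                 margin_seconds: float = 300.0,
                 clock: Callable[[], float] = time.time):
        self._fetch = fetch
        self._margin = margin_seconds
        self._clock = clock
        self._current = None

    def get(self) -> Credential:
        """Return a credential with at least `margin_seconds` of life left."""
        stale = (self._current is None
                 or self._current.expires_at - self._clock() < self._margin)
        if stale:
            self._current = self._fetch()
        return self._current
```

A long-running job calls `get()` before each batch of work instead of reading the credential once at startup, which is the assumption that broke under 1-hour lifetimes.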
Third-party integrations. Several SaaS tools we used wanted long-lived credentials. We worked with the vendors to add OIDC support; some did, one didn't. We replaced the one that didn't.
What it looks like day-to-day now:
Engineers run aws sso login once per day; that's their authentication.
Onboarding new tools. Every new service requires identity provisioning. We have a checklist; it's still ~30 minutes per service.
Debugging auth failures. When mTLS or token federation fails, the error messages are not always helpful. We've built runbooks for the common failure modes.
Cost. Identity-aware proxy services cost money (Cloudflare Access, AWS IAM Identity Center pricing). The total adds up to a few hundred dollars/month. Compared to "the cost of one breach," cheap. Compared to "free VPN we self-hosted," not free.
Multi-cloud token plumbing. When a workload in GCP needs to call something in AWS that needs to call something in Azure, the token-exchange chain gets complicated. We limit cross-cloud chains to two hops; deeper requires explicit intermediary services.
Start with SSO and a centralized IdP. Without that foundation, the rest doesn't compose.
Use cloud-native workload identity, not third-party. AWS IAM Roles, GCP Service Accounts. Federation between clouds. Avoid bringing in a third-party identity solution unless you have a specific reason.
Don't underestimate the migration time. Replacing VPN + long-lived credentials in a real org is 12-18 months. Plan accordingly.
Service mesh for mTLS is the easy path. Linkerd or Istio. The alternative — every service implementing TLS itself — is harder and inconsistent.
Per-call authorization grows over time. Don't try to retrofit it everywhere on day one. Start with the highest-risk operations and expand.
Decommission the VPN at the end, not the start. It's the safety net during migration. Cut it only when everyone has migrated to the replacements.
Zero trust isn't a product or a checkbox. It's a set of related decisions to remove implicit trust from your architecture. Some of those decisions are easy; some take quarters of engineering work. The end state — short-lived credentials, encrypted traffic, identity-based authorization — is meaningfully more secure than what most orgs run today, and the operational ergonomics are usually better than the VPN-based world it replaces.