A flat VPC is fine until you need to prove who can reach what. Five segmentation patterns that work in AWS without requiring a service mesh.
When we built our AWS VPC in 2022, we used the AWS quick-start template: one VPC, public subnets for ingress, private subnets for everything else, a NAT gateway per AZ for egress. It worked. It also meant that anything in the private subnets could reach anything else in the private subnets, and proving "service A cannot reach the customer database" required reading IAM policies and security group rules across 30+ resources.
By 2024, the security team had questions. We did a segmentation refactor. The result is below — five patterns we now use, none requiring a service mesh, all auditable.
Three things drove the design:
We have three subnet tiers in each VPC: public, app, and data.
Each tier has its own subnet (one per AZ), its own route table, and its own NACL.
VPC 10.0.0.0/16
├── public 10.0.0.0/24, 10.0.1.0/24, 10.0.2.0/24
├── app 10.0.10.0/24, 10.0.11.0/24, 10.0.12.0/24
└── data 10.0.20.0/24, 10.0.21.0/24, 10.0.22.0/24
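A minimal sketch of how that layout could be expressed in Terraform. The CIDRs come from the diagram above; the variable names (`vpc_id`, `azs`), the local names, and the tag keys are illustrative assumptions, not taken from our repository.

```hcl
# Illustrative inputs: an existing VPC and three AZ names, in order.
variable "vpc_id" {
  type        = string
  description = "ID of the 10.0.0.0/16 VPC"
}

variable "azs" {
  type        = list(string)
  description = "Three availability zones, one per subnet in each tier"
}

locals {
  # Tier name => one /24 per AZ, matching the layout above.
  tier_cidrs = {
    public = ["10.0.0.0/24", "10.0.1.0/24", "10.0.2.0/24"]
    app    = ["10.0.10.0/24", "10.0.11.0/24", "10.0.12.0/24"]
    data   = ["10.0.20.0/24", "10.0.21.0/24", "10.0.22.0/24"]
  }

  # Flatten to one entry per subnet, keyed like "app-0", "data-2".
  subnets = merge([
    for tier, cidrs in local.tier_cidrs : {
      for i, cidr in cidrs : "${tier}-${i}" => {
        tier = tier
        cidr = cidr
        az   = var.azs[i]
      }
    }
  ]...)
}

resource "aws_subnet" "tier" {
  for_each          = local.subnets
  vpc_id            = var.vpc_id
  cidr_block        = each.value.cidr
  availability_zone = each.value.az

  tags = {
    Name = each.key
    Tier = each.value.tier
  }
}
```

Keying the subnets by tier and index keeps later references readable, for example `aws_subnet.tier["data-0"].id` when associating route tables or NACLs.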
The data tier has NO route to the internet — not via NAT, not via VPC peering. If a database in this tier ever wants to call out, it can't. Useful both as defense-in-depth (a compromised database can't exfiltrate freely) and as a forcing function (we noticed a workload was making outbound HTTP calls that should have been inbound; the data-tier placement caught it).
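One way to get this property is a dedicated route table for the data subnets that declares no routes at all, so only the implicit local route exists. This is a sketch under assumptions: `vpc_id` and `data_subnet_ids` are hypothetical inputs, and the resource names are made up for illustration.

```hcl
variable "vpc_id" {
  type = string
}

variable "data_subnet_ids" {
  type        = list(string)
  description = "IDs of the data-tier subnets, one per AZ"
}

resource "aws_route_table" "data" {
  vpc_id = var.vpc_id

  # Deliberately no route blocks and no separate aws_route resources:
  # the only route is the implicit local route for the VPC CIDR, so
  # there is no path to a NAT gateway, internet gateway, or peering
  # connection from these subnets.
  tags = {
    Name = "rt-data"
  }
}

resource "aws_route_table_association" "data" {
  for_each       = toset(var.data_subnet_ids)
  subnet_id      = each.value
  route_table_id = aws_route_table.data.id
}
```

The explicit association matters: a subnet without one falls back to the VPC's main route table, which may well carry a default route.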
We use security groups as the actual policy enforcement layer. Naming convention:
sg-app-server (general application servers)
sg-database-postgres (postgres servers)
sg-cache-redis (redis servers)
sg-bastion (jump hosts)
sg-loadbalancer-public (public ALBs)
Each security group's inbound rules reference other security groups, not CIDR blocks:
resource "aws_security_group_rule" "postgres_from_app" {
  type                     = "ingress"
  from_port                = 5432
  to_port                  = 5432
  protocol                 = "tcp"
  security_group_id        = aws_security_group.database_postgres.id
  source_security_group_id = aws_security_group.app_server.id
}
The advantage: when an app server scales up or moves to a new IP, no rules need to update. The membership-based reference handles it.
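The same membership-based reference works in the outbound direction. The sketch below pairs the ingress rule above with a matching egress rule on the app server group. The two group definitions are included only to make the example self-contained, and `vpc_id` is a hypothetical input. The convention names are applied as `Name` tags here because EC2 rejects group names that begin with `sg-`.

```hcl
variable "vpc_id" {
  type = string
}

resource "aws_security_group" "app_server" {
  name   = "app-server"
  vpc_id = var.vpc_id
  tags   = { Name = "sg-app-server" }
}

resource "aws_security_group" "database_postgres" {
  name   = "database-postgres"
  vpc_id = var.vpc_id
  tags   = { Name = "sg-database-postgres" }
}

# Egress from app servers to postgres. For an egress rule,
# source_security_group_id names the destination group.
resource "aws_security_group_rule" "app_to_postgres" {
  type                     = "egress"
  from_port                = 5432
  to_port                  = 5432
  protocol                 = "tcp"
  security_group_id        = aws_security_group.app_server.id
  source_security_group_id = aws_security_group.database_postgres.id
}
```

Terraform removes the default allow-all egress rule from security groups it creates, so with this setup the app servers can reach only what the explicit egress rules list.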
The auditability win: "who can reach postgres" = grep the security group config for aws_security_group.database_postgres.id as the target, list the source SGs. Done in 30 seconds.
Some workloads in our staging account legitimately need to reach a copy of production data (anonymized). Cross-account, cross-VPC traffic happens via VPC peering with route table entries only on the specific subnets that need it:
[ staging app-tier ] --peering--> [ prod data-tier (anonymized replica) ]
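The peering exists. On the staging side, the route to the prod-anonymized-replica subnet is only in staging-app's route table. On the production side, only the replica subnet carries a return route to the staging app-tier CIDR, which reply traffic needs because route tables are stateless. No other production subnet has a route to staging, so the rest of production cannot reach staging via the peering connection.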
The security group on the production side then explicitly allows the staging app-tier's CIDR.
This pattern handles the "I need cross-VPC access" case without resorting to opening larger holes.
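A sketch of the narrowly scoped routes and the production-side rule, under several assumptions: all variable names are hypothetical, port 5432 assumes the replica is postgres, and in practice the staging and production resources would be applied through separate provider configurations because they live in different accounts.

```hcl
variable "peering_connection_id" { type = string }
variable "staging_app_route_table_ids" { type = list(string) }
variable "staging_app_cidrs" { type = list(string) }
variable "prod_replica_subnet_cidr" { type = string }
variable "prod_replica_route_table_id" { type = string }
variable "prod_replica_sg_id" { type = string }

# Staging side: only the app-tier route tables get a route, and only
# to the replica subnet, not to the whole production VPC CIDR.
resource "aws_route" "staging_app_to_replica" {
  for_each                  = toset(var.staging_app_route_table_ids)
  route_table_id            = each.value
  destination_cidr_block    = var.prod_replica_subnet_cidr
  vpc_peering_connection_id = var.peering_connection_id
}

# Production side: a return route on the replica subnet's route table
# only. Route tables are stateless, so replies need a path back.
resource "aws_route" "replica_return_to_staging_app" {
  for_each                  = toset(var.staging_app_cidrs)
  route_table_id            = var.prod_replica_route_table_id
  destination_cidr_block    = each.value
  vpc_peering_connection_id = var.peering_connection_id
}

# Production side: the replica's security group admits the staging
# app-tier CIDRs on the database port (5432 is an assumption).
resource "aws_security_group_rule" "replica_from_staging_app" {
  type              = "ingress"
  from_port         = 5432
  to_port           = 5432
  protocol          = "tcp"
  security_group_id = var.prod_replica_sg_id
  cidr_blocks       = var.staging_app_cidrs
}
```

Security groups are stateful, so the ingress rule alone covers replies at that layer. What keeps production from initiating connections to staging is the absence of any inbound rule for production on the staging groups, combined with the narrow routes.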
When a workload in one account needs to reach a service in another (e.g., our analytics workload calling our internal data-platform), PrivateLink is the right tool.
[ workload account ] --VPC endpoint--> [ data-platform NLB --PrivateLink--> data-platform service ]
The benefits over VPC peering:
We use PrivateLink for any cross-account internal API. The internal-DNS layer points at the VPC endpoint; from the consumer's perspective, it's a normal internal hostname.
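A minimal sketch of both halves of a PrivateLink connection. The variable names are hypothetical, and the two resources are shown together only for readability: the endpoint service belongs in the provider account and the interface endpoint in the consumer account, so a real setup would use separate provider configurations or pass the service name between stacks.

```hcl
variable "data_platform_nlb_arn" { type = string }
variable "consumer_account_id" { type = string }
variable "consumer_vpc_id" { type = string }
variable "consumer_subnet_ids" { type = list(string) }
variable "consumer_endpoint_sg_id" { type = string }

# Provider account: expose the NLB as an endpoint service and restrict
# who may connect to it.
resource "aws_vpc_endpoint_service" "data_platform" {
  acceptance_required        = true
  network_load_balancer_arns = [var.data_platform_nlb_arn]
  allowed_principals         = ["arn:aws:iam::${var.consumer_account_id}:root"]
}

# Consumer account: an interface endpoint in the workload's subnets.
# The attached security group controls which workloads may call it.
resource "aws_vpc_endpoint" "data_platform" {
  vpc_id             = var.consumer_vpc_id
  service_name       = aws_vpc_endpoint_service.data_platform.service_name
  vpc_endpoint_type  = "Interface"
  subnet_ids         = var.consumer_subnet_ids
  security_group_ids = [var.consumer_endpoint_sg_id]
}
```

The consumer sees only the endpoint's network interfaces in its own subnets. Nothing else in the provider VPC is routable from the consumer side, and the provider cannot initiate connections back through the endpoint.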
Pattern 1 prevented our data tier from reaching the internet at all. For the application tier, which legitimately reaches third-party APIs, we don't ban egress — we filter it.
AWS Network Firewall sits in front of the NAT gateway:
[ app subnets ] → [ network firewall ] → [ NAT gateway ] → [ internet ]
The firewall has stateful rules:
*.openai.com (we use OpenAI APIs)
*.amazonaws.com (AWS service APIs)
*.datadoghq.com (monitoring)
Outbound traffic to anything outside the allowlist is dropped and logged. The logs go to CloudWatch and we have alerts on unusual outbound destinations.
This caught one real issue: a developer testing a new dependency pulled in a transitively-dependent package that called out to an analytics service we hadn't approved. The firewall blocked it; the log entry surfaced the issue within an hour.
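A domain allowlist like the one above could be written as a stateful rule group along these lines. This is a sketch, not our exact configuration: the resource name, rule group name, and capacity are assumptions, and the firewall, firewall policy, and logging configuration that would reference this rule group are omitted.

```hcl
resource "aws_networkfirewall_rule_group" "egress_allowlist" {
  name     = "egress-domain-allowlist" # hypothetical name
  type     = "STATEFUL"
  capacity = 100 # assumed; must cover the number of generated rules

  rule_group {
    rules_source {
      rules_source_list {
        # ALLOWLIST passes traffic to the listed domains and drops
        # HTTP/TLS traffic to any other domain.
        generated_rules_type = "ALLOWLIST"

        # Match on the TLS SNI field and the HTTP Host header.
        target_types = ["TLS_SNI", "HTTP_HOST"]

        # A leading dot matches the domain and all of its subdomains.
        targets = [
          ".openai.com",
          ".amazonaws.com",
          ".datadoghq.com",
        ]
      }
    }
  }
}
```

A domain list only inspects HTTP and TLS traffic, so other protocols leaving the app subnets would need their own stateful rules or a restrictive default action in the firewall policy.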
A typical inter-service call now flows like:
public ALB
-> security group: allows :443 from 0.0.0.0/0 (it's a public LB)
-> app server (in app subnet)
-> security group: allows :443 from sg-loadbalancer-public, allows :6379 to sg-cache-redis, allows :5432 to sg-database-postgres
-> redis or postgres (in data subnet)
-> security group: allows :6379 from sg-app-server (redis)
-> security group: allows :5432 from sg-app-server (postgres)
Each connection traverses at minimum: subnet route table, source SG egress rules, destination SG ingress rules. Each is a control point. If anything is misconfigured, the connection fails fast.
ModifySecurityGroupRules). Alerts on unusual modifications outside CI windows.
A few things we considered and rejected:
Service mesh for east-west zero-trust (Istio, Linkerd, Cilium with mTLS). Powerful but operationally heavy. Our security groups + NACLs achieve most of what we need at the network layer; mTLS at the application layer is a separate concern we handle with Istio inside the cluster, not across clusters.
Per-service VPCs. Some teams advocate one VPC per microservice. We tried it on a handful and the operational overhead was massive (peering, route tables, DNS, etc.). We use one VPC per environment per region instead, with security groups for service-level segmentation.
Inline traffic inspection (DPI, IDS-style). Heavy, expensive, and our threat model didn't justify it. Network firewall does L7 hostname filtering, which is sufficient.
We started without explicit subnet tier separation. The data tier was added later by re-laying networking. If we were starting fresh, we'd build the tiers from day one — adding them later is doable but tedious.
We also accumulated some legacy CIDR-based security group rules early on. Migrating to membership-based references was a quarter-long project. Starting with membership-based is the right move.
The audit time win was huge. Before this, "can service X reach database Y" was a question that took an engineer 30-60 minutes to answer with confidence. Now it's a grep of the security group config, plus a check of the route tables, plus a glance at the network firewall. 5 minutes max.
The performance hit was negligible. I worried that adding network firewall to the egress path would cost noticeable latency. It doesn't — added <1ms. AWS does the work efficiently.
Application teams got faster, not slower. Counterintuitive. The clearer rules made it easier for app teams to know "what do I need to add to make my service talk to redis" — they look at how other services do it, copy the pattern, done. Before the cleanup, the answer was "ask the platform team" and that was a queue.
Start with subnet tiering. It's structural and adding it later is the most painful part of the refactor. Public, app, data — three tiers. Most cases fit one of those.
Use security group references, not CIDRs. Even if you have a small fleet today, the membership-based references age much better.
Pick whether you'll use a network firewall for egress filtering early. If yes, the routing has to be set up to direct traffic through it from day one. If no, you can add it later but the data-tier-no-egress pattern (no internet at all) is still worth adopting.
The temptation is to defer security and "do it later when we have time." Network architecture doesn't refactor cleanly later. The earlier the segmentation goes in, the less it costs.