A flat VPC is fine until you need to prove who can reach what. Five segmentation patterns that work in AWS without requiring a service mesh.

Cloud Networking Segmentation Patterns

When we built our AWS VPC in 2022, we used the AWS quick-start template: one VPC, public subnets for ingress, private subnets for everything else, a NAT gateway per AZ for egress. It worked. It also meant that anything in the private subnets could reach anything else in the private subnets, and proving "service A cannot reach the customer database" required reading IAM policies and security group rules across 30+ resources.

By 2024, the security team had questions. We did a segmentation refactor. The result is below — five patterns we now use, none requiring a service mesh, all auditable.

The principles #

Three things drove the design:

Default-deny. New workloads can't talk to anything they haven't been explicitly allowed to talk to.
Auditability. The answer to "who can reach the customer database" should be a one-page list, not a multi-hour investigation.
Operational sanity. We have ~8 engineers managing this. Patterns that require a full-time network engineer to maintain are out.

Pattern 1: Tier separation by subnet #

We have three subnet tiers in each VPC:

Public (with internet gateway): only ingress load balancers
Application (private, NAT egress allowed): service workloads
Data (private, no NAT, no internet egress): databases, queues, secret stores

Each tier has its own subnet (one per AZ), its own route table, and its own NACL.

```
VPC 10.0.0.0/16
├── public  10.0.0.0/24, 10.0.1.0/24, 10.0.2.0/24
├── app     10.0.10.0/24, 10.0.11.0/24, 10.0.12.0/24
└── data    10.0.20.0/24, 10.0.21.0/24, 10.0.22.0/24
```

The data tier has NO route to the internet — not via NAT, not via VPC peering. If a database in this tier ever wants to call out, it can't. Useful both as defense-in-depth (a compromised database can't exfiltrate freely) and as a forcing function (we noticed a workload was making outbound HTTP calls that should have been inbound; the data-tier placement caught it).
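A minimal Terraform sketch of how a data tier like this might be declared. The variable names, tags, and the choice of a single shared route table for the tier are assumptions for illustration; only the CIDRs come from the layout above. The point it shows is that "no internet" is expressed by the absence of any route other than the implicit local one.

```hcl
# Hypothetical inputs: the VPC and AZ list are assumed to be defined elsewhere.
variable "vpc_id" {
  type        = string
  description = "ID of the existing 10.0.0.0/16 VPC"
}

variable "azs" {
  type        = list(string)
  description = "Three availability zones, one per data subnet"
}

locals {
  data_cidrs = ["10.0.20.0/24", "10.0.21.0/24", "10.0.22.0/24"]
}

resource "aws_subnet" "data" {
  count             = length(local.data_cidrs)
  vpc_id            = var.vpc_id
  cidr_block        = local.data_cidrs[count.index]
  availability_zone = var.azs[count.index]

  tags = {
    Name = "data-${var.azs[count.index]}"
    Tier = "data"
  }
}

# No routes are defined on this table, so it carries only the implicit
# "local" route for the VPC CIDR: no NAT gateway, no internet gateway.
resource "aws_route_table" "data" {
  vpc_id = var.vpc_id

  tags = {
    Name = "data"
    Tier = "data"
  }
}

resource "aws_route_table_association" "data" {
  count          = length(aws_subnet.data)
  subnet_id      = aws_subnet.data[count.index].id
  route_table_id = aws_route_table.data.id
}
```

Keeping the route table free of `aws_route` resources also makes review easy: any pull request that adds a route to the data tier stands out immediately.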

Pattern 2: Security groups as the primary control #

We use security groups as the actual policy enforcement layer. Naming convention:

```
sg-app-server          (general application servers)
sg-database-postgres   (postgres servers)
sg-cache-redis         (redis servers)
sg-bastion             (jump hosts)
sg-loadbalancer-public (public ALBs)
```

Each security group's inbound rules reference other security groups, not CIDR blocks:

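```hcl
# Allow postgres traffic (5432/tcp) into the database group, but only from
# members of the app-server group. No CIDR blocks are involved.
resource "aws_security_group_rule" "postgres_from_app" {
  type                     = "ingress"
  from_port                = 5432
  to_port                  = 5432
  protocol                 = "tcp"
  security_group_id        = aws_security_group.database_postgres.id
  source_security_group_id = aws_security_group.app_server.id
}
```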

The advantage: when an app server scales up or moves to a new IP, no rules need to update. The membership-based reference handles it.

The auditability win: "who can reach postgres" = grep the security group config for aws_security_group.database_postgres.id as the target, list the source SGs. Done in 30 seconds.
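The ingress rule covers the destination side only. For default-deny to hold on the source side as well, the app-server group needs a matching egress rule rather than an allow-all. The sketch below is one way that could look; the group definitions are assumptions (the post does not show them), and the convention names are placed in `Name` tags because AWS does not allow a security group's actual name to start with `sg-`.

```hcl
variable "vpc_id" {
  type = string
}

# When Terraform creates a security group it removes AWS's default
# allow-all egress rule, so these groups start with no egress at all.
resource "aws_security_group" "app_server" {
  name   = "app-server"
  vpc_id = var.vpc_id

  tags = {
    Name = "sg-app-server"
  }
}

resource "aws_security_group" "database_postgres" {
  name   = "database-postgres"
  vpc_id = var.vpc_id

  tags = {
    Name = "sg-database-postgres"
  }
}

# Egress from app servers to postgres. For type = "egress",
# source_security_group_id names the destination group.
resource "aws_security_group_rule" "app_to_postgres" {
  type                     = "egress"
  from_port                = 5432
  to_port                  = 5432
  protocol                 = "tcp"
  security_group_id        = aws_security_group.app_server.id
  source_security_group_id = aws_security_group.database_postgres.id
}
```

With both halves expressed as group references, the same search answers the question in either direction: which groups can reach postgres, and what a given app-server group is allowed to reach.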

Pattern 3: VPC peering with explicit traffic acceptance #

Some workloads in our staging account legitimately need to reach a copy of production data (anonymized). Cross-account, cross-VPC traffic happens via VPC peering with route table entries only on the specific subnets that need it:

```
[ staging app-tier ] --peering--> [ prod data-tier (anonymized replica) ]
```

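The peering exists. The staging app-tier route table has a route to the prod anonymized-replica subnet, and nothing broader. On the production side, only the replica subnet's route table has a return route, scoped to the staging app-tier CIDR; VPC route tables are stateless, so replies cannot get back without it. No other production subnet has a route to the peering connection. Production still cannot initiate connections to staging: under default-deny, no staging security group allows inbound traffic from production.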

The security group on the production side then explicitly allows the staging app-tier's CIDR.

This pattern handles the "I need cross-VPC access" case without resorting to opening larger holes.
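As a rough Terraform sketch of this pattern: the provider aliases, profiles, and every variable name below are hypothetical, and the peering connection itself is assumed to exist already. It shows the narrow route on the staging side, the security group rule on the production side, and a return route scoped to the staging app-tier CIDR, which VPC peering needs for replies because route tables are stateless.

```hcl
provider "aws" {
  alias   = "staging"
  profile = "staging" # hypothetical CLI profile
}

provider "aws" {
  alias   = "prod"
  profile = "prod" # hypothetical CLI profile
}

variable "peering_connection_id" { type = string }
variable "staging_app_route_table_id" { type = string }
variable "prod_replica_route_table_id" { type = string }
variable "staging_app_cidr" { type = string }  # staging app-tier range
variable "prod_replica_cidr" { type = string } # subnet holding the anonymized replica
variable "prod_replica_sg_id" { type = string }

# Staging app tier -> replica subnet only, not the whole production VPC.
resource "aws_route" "staging_app_to_replica" {
  provider                  = aws.staging
  route_table_id            = var.staging_app_route_table_id
  destination_cidr_block    = var.prod_replica_cidr
  vpc_peering_connection_id = var.peering_connection_id
}

# Return path for replies, present only on the replica subnet's route table.
resource "aws_route" "replica_return_to_staging_app" {
  provider                  = aws.prod
  route_table_id            = var.prod_replica_route_table_id
  destination_cidr_block    = var.staging_app_cidr
  vpc_peering_connection_id = var.peering_connection_id
}

# Production-side security group: postgres from the staging app-tier CIDR only.
resource "aws_security_group_rule" "replica_from_staging_app" {
  provider          = aws.prod
  type              = "ingress"
  from_port         = 5432
  to_port           = 5432
  protocol          = "tcp"
  security_group_id = var.prod_replica_sg_id
  cidr_blocks       = [var.staging_app_cidr]
}
```

Port 5432 is an assumption based on the postgres examples elsewhere in this post; substitute whatever the replica actually listens on.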

Pattern 4: PrivateLink for cross-account services #

When a workload in one account needs to reach a service in another (e.g., our analytics workload calling our internal data-platform), PrivateLink is the right tool.

```
[ workload account: VPC endpoint ] --PrivateLink--> [ data-platform account: NLB --> data-platform service ]
```

The benefits over VPC peering:

The data-platform service exposes only specific ports
The consumer can't see (or access) the producer's full VPC
IAM policies control who can create endpoints

We use PrivateLink for any cross-account internal API. The internal-DNS layer points at the VPC endpoint; from the consumer's perspective, it's a normal internal hostname.
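A sketch of the two halves of a PrivateLink setup in Terraform, under stated assumptions: the provider aliases, profiles, and variables are hypothetical, and the NLB in front of the data-platform service is assumed to exist already. The producer publishes an endpoint service and names which principals may connect; the consumer creates an interface endpoint in its own VPC.

```hcl
provider "aws" {
  alias   = "producer"
  profile = "data-platform" # hypothetical CLI profile
}

provider "aws" {
  alias   = "consumer"
  profile = "analytics" # hypothetical CLI profile
}

variable "data_platform_nlb_arn" { type = string }
variable "consumer_account_id" { type = string }
variable "consumer_vpc_id" { type = string }
variable "consumer_subnet_ids" { type = list(string) }
variable "consumer_endpoint_sg_id" { type = string }

# Producer side: expose the NLB as an endpoint service. Only the listed
# principals can create endpoints against it.
resource "aws_vpc_endpoint_service" "data_platform" {
  provider                   = aws.producer
  acceptance_required        = false
  network_load_balancer_arns = [var.data_platform_nlb_arn]
  allowed_principals         = ["arn:aws:iam::${var.consumer_account_id}:root"]
}

# Consumer side: an interface endpoint that appears as ENIs in the
# consumer's own subnets, guarded by the consumer's own security group.
resource "aws_vpc_endpoint" "data_platform" {
  provider           = aws.consumer
  vpc_id             = var.consumer_vpc_id
  service_name       = aws_vpc_endpoint_service.data_platform.service_name
  vpc_endpoint_type  = "Interface"
  subnet_ids         = var.consumer_subnet_ids
  security_group_ids = [var.consumer_endpoint_sg_id]
}
```

Setting `acceptance_required = true` instead would add a manual or automated approval step for each new endpoint connection, which may be preferable when the list of allowed principals is broad.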

Pattern 5: Egress firewall (AWS Network Firewall) for outbound control #

Pattern 1 prevented our data tier from reaching the internet at all. For the application tier, which legitimately reaches third-party APIs, we don't ban egress — we filter it.

AWS Network Firewall sits in front of the NAT gateway:

```
[ app subnets ] → [ network firewall ] → [ NAT gateway ] → [ internet ]
```

The firewall has stateful rules:

Allow *.openai.com (we use OpenAI APIs)
Allow *.amazonaws.com (AWS service APIs)
Allow *.datadoghq.com (monitoring)
Allow specific known third parties (Stripe, Mailchimp, etc.)
Deny everything else, log

Outbound traffic to anything outside the allowlist is dropped and logged. The logs go to CloudWatch and we have alerts on unusual outbound destinations.
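The allowlist and logging described above could be expressed roughly as follows. This is a sketch, not our exact configuration: the resource names, capacity, log group name, and retention are assumptions, and the firewall itself is assumed to exist. In a Network Firewall domain list a leading dot stands in for the wildcard, so `.openai.com` covers the domain and its subdomains.

```hcl
variable "firewall_arn" {
  type        = string
  description = "ARN of the existing AWS Network Firewall"
}

# Stateful domain allowlist, matched against the TLS SNI and HTTP Host header.
resource "aws_networkfirewall_rule_group" "egress_allowlist" {
  name     = "egress-allowlist"
  type     = "STATEFUL"
  capacity = 100

  rule_group {
    rules_source {
      rules_source_list {
        generated_rules_type = "ALLOWLIST"
        target_types         = ["HTTP_HOST", "TLS_SNI"]
        targets = [
          ".openai.com",
          ".amazonaws.com",
          ".datadoghq.com",
          # specific third parties (Stripe, Mailchimp, etc.) would be added here
        ]
      }
    }
  }
}

resource "aws_cloudwatch_log_group" "firewall_alerts" {
  name              = "/network-firewall/egress-alerts" # hypothetical name
  retention_in_days = 90
}

# ALERT logs carry the traffic dropped by stateful rules.
resource "aws_networkfirewall_logging_configuration" "egress" {
  firewall_arn = var.firewall_arn

  logging_configuration {
    log_destination_config {
      log_destination = {
        logGroup = aws_cloudwatch_log_group.firewall_alerts.name
      }
      log_destination_type = "CloudWatchLogs"
      log_type             = "ALERT"
    }
  }
}
```

The rule group still has to be referenced from a firewall policy attached to the firewall; that wiring is omitted here to keep the sketch short.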

This caught one real issue: a developer testing a new dependency pulled in a transitively-dependent package that called out to an analytics service we hadn't approved. The firewall blocked it; the log entry surfaced the issue within an hour.

What this combination buys us #

A typical inter-service call now flows like:

```
public ALB
  -> security group: allows :443 from 0.0.0.0/0 (it's a public LB)
  -> app server (in app subnet)
       -> security group: allows :443 from sg-loadbalancer-public, allows :6379 to sg-cache-redis, allows :5432 to sg-database-postgres
  -> redis or postgres (in data subnet)
       -> security group: allows :6379 from sg-app-server (redis)
       -> security group: allows :5432 from sg-app-server (postgres)
```

Each connection traverses at minimum: subnet route table, source SG egress rules, destination SG ingress rules. Each is a control point. If anything is misconfigured, the connection fails fast.

What we monitor #

Security group changes (CloudTrail event ModifySecurityGroupRules). Alerts on unusual modifications outside CI windows.
NACL drops at the data tier (theoretically zero; any non-zero is investigated)
Network firewall drops by destination (allowed-but-rare destinations get aggregated; new destinations alert)
Cross-account peering traffic volume (sudden spikes indicate misconfiguration or compromise)
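For the first item, one possible wiring is an EventBridge rule on CloudTrail management events that forwards to an alerting topic. The sketch below assumes a hypothetical SNS topic passed in as `alerts_topic_arn`. It also matches the `Authorize*` and `Revoke*` events alongside `ModifySecurityGroupRules`, since adding or removing a rule is recorded under those names; that wider list is an assumption, not a statement of what we alert on.

```hcl
variable "alerts_topic_arn" {
  type        = string
  description = "SNS topic that receives security group change alerts (hypothetical)"
}

resource "aws_cloudwatch_event_rule" "sg_changes" {
  name        = "security-group-rule-changes"
  description = "Security group rule changes recorded by CloudTrail"

  event_pattern = jsonencode({
    source        = ["aws.ec2"]
    "detail-type" = ["AWS API Call via CloudTrail"]
    detail = {
      eventSource = ["ec2.amazonaws.com"]
      eventName = [
        "ModifySecurityGroupRules",
        "AuthorizeSecurityGroupIngress",
        "AuthorizeSecurityGroupEgress",
        "RevokeSecurityGroupIngress",
        "RevokeSecurityGroupEgress",
      ]
    }
  })
}

# The topic's access policy must allow events.amazonaws.com to publish.
resource "aws_cloudwatch_event_target" "sg_changes_to_sns" {
  rule = aws_cloudwatch_event_rule.sg_changes.name
  arn  = var.alerts_topic_arn
}
```

Filtering out changes made by the CI role during deploy windows would happen downstream of this rule, for example in whatever consumes the topic.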

Common patterns we don't use #

A few things we considered and rejected:

Service mesh as the segmentation layer for east-west zero-trust (Istio, Linkerd, Cilium with mTLS). Powerful but operationally heavy. Our security groups + NACLs achieve most of what we need at the network layer. mTLS at the application layer is a separate concern: we handle it with Istio inside a single cluster, and we don't extend the mesh across clusters or rely on it for the segmentation described here.

Per-service VPCs. Some teams advocate one VPC per microservice. We tried it on a handful and the operational overhead was massive (peering, route tables, DNS, etc.). We use one VPC per environment per region instead, with security groups for service-level segmentation.

Inline traffic inspection (DPI, IDS-style). Heavy, expensive, and our threat model didn't justify it. Network firewall does L7 hostname filtering, which is sufficient.

What we'd do differently #

We started without explicit subnet tier separation. The data tier was added later by reworking the subnet and route table layout. If we were starting fresh, we'd build the tiers from day one; adding them later is doable but tedious.

We also accumulated some legacy CIDR-based security group rules early on. Migrating to membership-based references was a quarter-long project. Starting with membership-based is the right move.

What surprised us #

The audit time win was huge. Before this, "can service X reach database Y" was a question that took an engineer 30-60 minutes to answer with confidence. Now it's a grep of the security group config, plus a check of the route tables, plus a glance at the network firewall. 5 minutes max.

The performance hit was negligible. I worried that adding network firewall to the egress path would cost noticeable latency. It doesn't: in our setup it added less than 1 ms.

Application teams got faster, not slower. Counterintuitive. The clearer rules made it easier for app teams to know "what do I need to add to make my service talk to redis" — they look at how other services do it, copy the pattern, done. Before the cleanup, the answer was "ask the platform team" and that was a queue.

What I'd tell a team starting #

Start with subnet tiering. It's structural and adding it later is the most painful part of the refactor. Public, app, data — three tiers. Most cases fit one of those.

Use security group references, not CIDRs. Even if you have a small fleet today, the membership-based references age much better.

Pick whether you'll use a network firewall for egress filtering early. If yes, the routing has to be set up to direct traffic through it from day one. If no, you can add it later but the data-tier-no-egress pattern (no internet at all) is still worth adopting.

The temptation is to defer security and "do it later when we have time." Network architecture doesn't refactor cleanly later. The earlier the segmentation goes in, the less it costs.

About Kiril Urbonas