VPCs, subnets, route tables, gateways. The mental model that finally made cloud networking click after I stopped trying to map it 1:1 to physical networks.


Cloud Networking Fundamentals: VPCs, Subnets, and Routing

I came to cloud networking after years of working with physical networks. The vocabulary is familiar (VLAN, subnet, route table, NAT), but the mapping isn't 1:1. For the first few years I kept tripping over small differences. This is the mental model I wish I'd had at the start.

The fundamental mental shift

In a physical network, the wires define connectivity. Computers connected to the same switch are on the same broadcast domain by default; routing between switches requires a router; firewalls are physical or virtual appliances inserted in the path.

In a cloud VPC, the routing tables define connectivity. There are no wires. Two VMs can sit in adjacent IP ranges and be completely unable to reach each other if the route tables and security groups don't allow it. They can also be in different regions and talk directly over inter-region VPC peering. The physical layout doesn't matter; the configuration does.

Once that clicked for me, everything else got easier.

VPC: the boundary

A VPC is a private network. It has:

A primary CIDR block (e.g., 10.0.0.0/16) and optionally secondary blocks
Subnets that carve up the CIDR
Route tables that say where traffic goes
Network ACLs and security groups that filter

A VPC is regional. You can't span a VPC across regions; for that you peer VPCs together.

We pick /16 CIDRs (65,536 addresses, the largest block AWS allows for a VPC) for new VPCs by default. /16 is overkill for most workloads, but a VPC's primary CIDR can't be resized later; growing a /20 means bolting on secondary blocks or migrating. Picking too big costs nothing; picking too small costs a migration.

Subnets: per-AZ slices

A subnet is a slice of the VPC's CIDR, bound to a single Availability Zone. You can't have a subnet that spans AZs.

For a regional service that needs HA across 3 AZs, you need at least 3 subnets — one per AZ. The application is deployed once per AZ, into the subnet of that AZ.

Subnet CIDR sizing: AWS reserves 5 IPs in every subnet (the network address, the VPC router, DNS, one address reserved for future use, and the last, broadcast, address). A /24 (256 addresses) gives you 251 usable. If your subnet hosts containers (each pod gets its own VPC IP under EKS's default CNI), you need bigger subnets. We use /22 (1,024 addresses, 1,019 usable) for EKS subnets and /24 for VM-only subnets.
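The usable-address arithmetic is simple enough to check with Python's standard `ipaddress` module. A minimal sketch, assuming the fixed 5-address AWS reservation described above:

```python
import ipaddress

# AWS reserves 5 addresses per subnet: network (+0), VPC router (+1),
# DNS (+2), future use (+3), and the last (broadcast) address.
AWS_RESERVED_PER_SUBNET = 5

def usable_ips(cidr: str) -> int:
    """Number of assignable IPv4 addresses in an AWS subnet of this size."""
    return ipaddress.ip_network(cidr).num_addresses - AWS_RESERVED_PER_SUBNET

for cidr in ("10.0.0.0/27", "10.0.1.0/24", "10.0.4.0/22"):
    print(cidr, usable_ips(cidr))
```

The three sizes print 27, 251 and 1,019 usable addresses respectively, which is why a "small" /27 runs out quickly once pods start taking IPs.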

Public vs private subnets

The "public" vs "private" distinction is just about routing:

Public subnet: the route table contains a default route (0.0.0.0/0) pointing to the internet gateway. Resources here can reach the internet directly, provided they have a public IP (auto-assigned at launch or an Elastic IP attached later).
Private subnet: the route table's default route points to a NAT gateway (or there is no default route at all). With a NAT route, resources can reach out but can't be reached from the internet directly.

There's no flag on a subnet that says "public" or "private." It's a description of how the route table is configured.
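Because the label is just a description of the route table, it can be derived mechanically. A minimal sketch, assuming a route table is represented as a plain dict of destination CIDR to target ID (a hypothetical shape, not an AWS API response); it relies only on AWS's `igw-` and `nat-` ID prefixes, and uses the public/private/isolated naming of the layout below:

```python
def classify_subnet(routes: dict[str, str]) -> str:
    """Label a subnet from its route table (hypothetical shape: destination CIDR -> target ID)."""
    default_target = routes.get("0.0.0.0/0")
    if default_target is None:
        return "isolated"   # local route only: no path to the internet at all
    if default_target.startswith("igw-"):
        return "public"     # default route goes straight to the internet gateway
    if default_target.startswith("nat-"):
        return "private"    # outbound via NAT, not reachable inbound
    return "other"          # e.g. a default route to a transit gateway or appliance

print(classify_subnet({"10.0.0.0/16": "local", "0.0.0.0/0": "igw-0example"}))
print(classify_subnet({"10.0.0.0/16": "local", "0.0.0.0/0": "nat-0example"}))
print(classify_subnet({"10.0.0.0/16": "local"}))
```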

Our standard layout per VPC:

3 public subnets (one per AZ) — for ALBs, NAT gateways
3 private subnets — for application workloads
3 isolated subnets — for databases (no NAT, no internet)
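One way to carve that layout out of a /16, sketched with the standard `ipaddress` module. It assumes the sizing from the previous section (/22 for the EKS-hosting private tier, /24 for the public and isolated tiers); the AZ suffixes and the ordering of tiers inside the block are illustrative choices, not a prescribed scheme:

```python
import ipaddress

AZS = ["a", "b", "c"]  # hypothetical AZ suffixes

def plan_layout(vpc_cidr: str) -> dict[str, dict[str, str]]:
    """Carve 3 private /22s plus 3 public and 3 isolated /24s from a VPC CIDR."""
    vpc = ipaddress.ip_network(vpc_cidr)
    blocks = vpc.subnets(new_prefix=22)          # generator of /22 blocks, in address order
    private = [next(blocks) for _ in AZS]        # one /22 per AZ for EKS workloads
    # Split the next two /22s into /24s for the VM-only tiers.
    small = [s for _ in range(2) for s in next(blocks).subnets(new_prefix=24)]
    public, isolated = small[0:3], small[3:6]
    return {
        "public": {az: str(n) for az, n in zip(AZS, public)},
        "private": {az: str(n) for az, n in zip(AZS, private)},
        "isolated": {az: str(n) for az, n in zip(AZS, isolated)},
    }

for tier, subnets in plan_layout("10.0.0.0/16").items():
    print(tier, subnets)
```

Allocating the larger blocks first keeps every subnet aligned on its own prefix boundary, and the rest of the /16 stays free for later tiers.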

Route tables: the actual logic

A route table contains rules:

```
10.0.0.0/16     local                # the VPC's own CIDR
0.0.0.0/0       nat-gateway-id       # default → NAT
10.50.0.0/16    transit-gateway-id   # peer VPC → transit gateway
```

When traffic leaves an instance, the route table picks the most specific matching route (longest-prefix match). Traffic to the VPC's own CIDR stays local. Traffic to the other VPC's range (10.50.0.0/16 here) goes to the transit gateway. Everything else hits the default route.

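Each subnet is associated with exactly one route table, but multiple subnets can share one; all 3 isolated subnets, for example, can share a table that holds only the local route. Private subnets get one route table per AZ, so each can send 0.0.0.0/0 to the NAT gateway in its own AZ.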
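Longest-prefix matching can be sketched in a few lines with the standard `ipaddress` module. This is a model of the lookup rule, not how AWS implements it; the route list mirrors the example route table above, and the target names are the same placeholders:

```python
import ipaddress

# The example route table above, as (destination CIDR, target) pairs.
ROUTES = [
    ("10.0.0.0/16", "local"),
    ("0.0.0.0/0", "nat-gateway-id"),
    ("10.50.0.0/16", "transit-gateway-id"),
]

def lookup(dest_ip: str, routes=ROUTES) -> str:
    """Return the target of the most specific route that contains dest_ip."""
    addr = ipaddress.ip_address(dest_ip)
    matches = [(ipaddress.ip_network(cidr), target)
               for cidr, target in routes
               if addr in ipaddress.ip_network(cidr)]
    if not matches:
        raise LookupError(f"no route to {dest_ip}")
    # Longest prefix (largest prefixlen) wins, so 0.0.0.0/0 only applies as a fallback.
    _, target = max(matches, key=lambda m: m[0].prefixlen)
    return target

for ip in ("10.0.3.7", "10.50.1.1", "8.8.8.8"):
    print(ip, "->", lookup(ip))
```

Every address matches 0.0.0.0/0, which is exactly why the default route must lose to any more specific entry.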

Internet gateway: the actual edge

An internet gateway (IGW) is what makes a public subnet "public." It's a logical attachment to the VPC. In the route table for a public subnet, 0.0.0.0/0 points to the IGW.

Caveat: even with an IGW, an instance needs a public IP (or Elastic IP) to be reachable from the internet. The IGW does the NAT translation between the instance's private IP and its public IP.

A VPC can have at most one IGW attached. You don't create multiple IGWs for redundancy; the IGW itself is HA within the region.

NAT gateway: the costly part

A NAT gateway lets private-subnet instances reach the internet (for software updates, API calls to external services) without being reachable inbound.

NAT gateways live in public subnets and have an Elastic IP. Private subnets' route tables point 0.0.0.0/0 at the NAT gateway.

Two cost gotchas:

NAT gateways are charged hourly (~$0.045/hr) and per GB processed (~$0.045/GB). For high-throughput workloads, the GB charges dominate. We've had services that egress 5TB/day; the NAT bill was $5,000+/month for them alone.
For HA, you want one NAT gateway per AZ. So 3 AZs = 3 NAT gateways = roughly $100/month base before any traffic.
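Both gotchas come down to simple arithmetic. A back-of-the-envelope sketch, assuming the prices quoted above (they vary by region) and the common 730-hours-per-month billing approximation; it covers only the NAT hourly and processing charges, not internet data-transfer-out, which is billed separately:

```python
HOURS_PER_MONTH = 730    # common billing approximation (365 * 24 / 12)
NAT_HOURLY_USD = 0.045   # per gateway-hour, as quoted above; varies by region
NAT_PER_GB_USD = 0.045   # per GB processed, as quoted above

def nat_monthly_cost(gateways: int, gb_processed_per_month: float) -> float:
    """NAT hourly + data-processing charges only (excludes data transfer out)."""
    hourly = gateways * NAT_HOURLY_USD * HOURS_PER_MONTH
    processing = gb_processed_per_month * NAT_PER_GB_USD
    return hourly + processing

# Base cost of one NAT gateway per AZ across 3 AZs, before any traffic.
print(round(nat_monthly_cost(3, 0), 2))
# A service pushing 5 TB/day (5,000 GB x 30 days) through a single gateway.
print(round(nat_monthly_cost(1, 5_000 * 30), 2))
```

The first case comes to about $99/month; the second to about $6,780/month, almost all of it processing charges, which is consistent with the "$5,000+" figure above.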

The NAT bill is a frequent target for cost optimization. Common moves:

VPC endpoints for AWS services (S3, DynamoDB, ECR, etc.) so traffic doesn't traverse NAT
Single NAT gateway in dev/staging (sacrifices AZ failover for cost)
Reviewing what's egressing — sometimes it's a misconfigured service

We discovered that one of our services was downloading a 1GB ML model from S3 on every pod startup, and the traffic was going through NAT. Switching to an S3 gateway endpoint cut $400/month off the bill.

VPC endpoints: the other path to AWS services

VPC endpoints give private connectivity to AWS services without going through the internet (or NAT). Two flavors:

Gateway endpoints: for S3 and DynamoDB only. Free. They add a route to the route table that sends traffic for those services to AWS's internal network.
Interface endpoints: for most other services (SSM, KMS, ECR, etc.). They create an ENI in your VPC. Cost ~$0.01/hr per AZ + per-GB charges (lower than NAT's per-GB).

We use gateway endpoints universally (free, no reason not to). Interface endpoints are a per-service decision based on traffic volume.
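The per-service decision is a break-even calculation. A rough sketch, assuming the ~$0.01/hr per AZ figure above and an interface-endpoint processing rate of $0.01/GB (an assumption based on the first pricing tier; check your region). The NAT gateway's own hourly charge is left out because you still need NAT for other egress, so only the per-GB rates compete:

```python
HOURS_PER_MONTH = 730
NAT_PER_GB_USD = 0.045
ENDPOINT_HOURLY_PER_AZ_USD = 0.01
ENDPOINT_PER_GB_USD = 0.01  # assumption: first-tier interface endpoint rate

def endpoint_monthly_savings(gb_per_month: float, azs: int = 3) -> float:
    """Positive means an interface endpoint is cheaper than sending this traffic via NAT."""
    nat_cost = gb_per_month * NAT_PER_GB_USD
    endpoint_cost = (azs * ENDPOINT_HOURLY_PER_AZ_USD * HOURS_PER_MONTH
                     + gb_per_month * ENDPOINT_PER_GB_USD)
    return nat_cost - endpoint_cost

def break_even_gb(azs: int = 3) -> float:
    """Monthly GB at which the endpoint's fixed cost is paid back by cheaper per-GB rates."""
    fixed = azs * ENDPOINT_HOURLY_PER_AZ_USD * HOURS_PER_MONTH
    return fixed / (NAT_PER_GB_USD - ENDPOINT_PER_GB_USD)

print(round(break_even_gb(), 1))
```

Under these assumptions a 3-AZ interface endpoint pays for itself at roughly 626 GB/month of traffic to that service; below that, NAT is cheaper.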

Security groups vs NACLs

Two firewalls operate at different layers:

Security groups (SG): stateful, per-ENI. "Allow inbound port 443 from SG-of-the-load-balancer." Stateful means return traffic is automatically allowed.

Network ACLs (NACL): stateless, per-subnet. Both directions must be allowed explicitly, so return traffic needs its own rule covering the ephemeral port range clients use (1024-65535 covers most).
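The difference between the two is easiest to see in a toy model. This is a hypothetical simulation, not an AWS API: a flow is identified by (client IP, client ephemeral port, server port); the stateful filter remembers allowed requests, while the stateless one checks each packet against static rules only:

```python
from dataclasses import dataclass, field

@dataclass
class StatefulFilter:
    """Security-group-like: replies to an allowed inbound flow pass automatically."""
    inbound_ports: set[int]
    tracked: set[tuple[str, int, int]] = field(default_factory=set)

    def allow_request(self, client_ip: str, client_port: int, server_port: int) -> bool:
        if server_port in self.inbound_ports:
            self.tracked.add((client_ip, client_port, server_port))  # remember the connection
            return True
        return False

    def allow_reply(self, client_ip: str, client_port: int, server_port: int) -> bool:
        return (client_ip, client_port, server_port) in self.tracked

@dataclass
class StatelessFilter:
    """NACL-like: every packet is checked against static rules, with no memory."""
    inbound_ports: set[int]
    outbound_port_ranges: list[tuple[int, int]] = field(default_factory=list)

    def allow_request(self, client_ip: str, client_port: int, server_port: int) -> bool:
        return server_port in self.inbound_ports

    def allow_reply(self, client_ip: str, client_port: int, server_port: int) -> bool:
        # The reply's destination port is the client's ephemeral port.
        return any(lo <= client_port <= hi for lo, hi in self.outbound_port_ranges)

sg = StatefulFilter(inbound_ports={443})
nacl = StatelessFilter(inbound_ports={443})
flow = ("203.0.113.10", 50123, 443)

print(sg.allow_request(*flow), sg.allow_reply(*flow))      # True True
print(nacl.allow_request(*flow), nacl.allow_reply(*flow))  # True False
nacl.outbound_port_ranges.append((1024, 65535))            # add the ephemeral-range rule
print(nacl.allow_reply(*flow))                             # True
```

The NACL accepts the request but drops the reply until the ephemeral-range outbound rule exists, which is the same failure described under common mistakes below.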

We use security groups for almost everything. NACLs are used only as a coarse defense-in-depth layer (e.g., deny all SSH at the NACL level for non-bastion subnets, just in case a security group is misconfigured).

Most of our security group complexity is solved by SG-references rather than IP-based rules. Example: "the database SG allows port 5432 from the app SG." When we scale the app, new instances get the app SG and automatically have access. No CIDR updates needed.
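A toy model of why SG references scale better than CIDR rules. The rule shape and the IDs `sg-app` / `sg-batch` are hypothetical, for illustration only; the point is that the rule names a group membership, so adding instances never touches the rule:

```python
# Hypothetical rule shape: the source is a security group, not a CIDR.
DB_SG_RULES = [{"protocol": "tcp", "port": 5432, "source_sg": "sg-app"}]

def allowed(rules: list[dict], port: int, source_sgs: set[str]) -> bool:
    """True if any rule permits `port` from an instance carrying one of `source_sgs`."""
    return any(r["port"] == port and r["source_sg"] in source_sgs for r in rules)

instances = {"app-1": {"sg-app"}, "app-2": {"sg-app"}, "batch-1": {"sg-batch"}}

# Scaling out: the new instance just gets the app SG; DB_SG_RULES is untouched.
instances["app-3"] = {"sg-app"}

for name, sgs in instances.items():
    print(name, allowed(DB_SG_RULES, 5432, sgs))
```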

Cross-VPC connectivity

Once you have multiple VPCs (which you will — separate accounts, separate environments), connecting them is a separate problem:

VPC peering: 1:1 connections between two VPCs. Simple but doesn't scale — for 5 VPCs, you have 10 peerings, all of which need route table entries everywhere.

Transit gateway (TGW): hub-and-spoke. Each VPC connects to the TGW; routing happens centrally. Costs more (per-attachment hourly + per-GB processed) but vastly simpler at scale.

PrivateLink: exposes a specific service from one VPC to another, without full network peering. Used for service-provider scenarios.

We standardized on TGW once we hit 4 VPCs. The TGW costs are real but the alternative — a full mesh of peerings — was unmaintainable.
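The scaling difference is just combinatorics. Because peering isn't transitive, a full mesh needs a connection for every pair of VPCs, while a transit gateway needs one attachment per VPC:

```python
def full_mesh_peerings(vpcs: int) -> int:
    """Peering connections needed so every VPC can reach every other VPC."""
    return vpcs * (vpcs - 1) // 2

def tgw_attachments(vpcs: int) -> int:
    """With a transit gateway hub, each VPC needs exactly one attachment."""
    return vpcs

for n in (3, 4, 5, 10):
    print(f"{n:>2} VPCs: {full_mesh_peerings(n):>2} peerings vs {tgw_attachments(n):>2} TGW attachments")
```

Peerings grow quadratically (10 at 5 VPCs, 45 at 10), and each one also needs route entries in every affected route table on both sides.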

Common networking mistakes

Mistakes I've made or seen:

Subnets too small. A /27 (32 IPs, 27 usable after AWS reservations) seems fine for "a few servers" but EKS pods eat IPs fast. Use bigger subnets than you think you need.

Overlapping CIDRs across VPCs. Two VPCs both use 10.0.0.0/16? They can't be peered, ever. Without overlap-prevention up front, peering becomes "we need to renumber an entire VPC." We standardized on /16 blocks per VPC from a registered /12 range; overlap is impossible by design.

Forgetting NACLs are stateless. Adding an inbound rule but not the corresponding outbound rule for the ephemeral port range. Symptoms: connection attempts hang and time out, because the replies are dropped on the way back. We mostly avoid this by leaving NACLs at their defaults and using security groups for filtering.

NAT gateway in the wrong AZ. A private subnet in AZ-a with a route to a NAT gateway in AZ-b works, but it adds cross-AZ data charges and makes AZ-a's egress depend on AZ-b. We pair NAT gateways with route tables per AZ.

Security group sprawl. Hundreds of security groups, half unused, naming inconsistent. We have a quarterly cleanup that drops orphaned SGs.
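The overlapping-CIDR mistake is cheap to guard against in tooling. A minimal sketch with the standard `ipaddress` module: one function flags overlapping VPC CIDRs, the other hands out the next free /16 from an org-wide /12. The org block `10.16.0.0/12` and the VPC names are hypothetical placeholders:

```python
import ipaddress
from itertools import combinations

def find_overlaps(cidrs: dict[str, str]) -> list[tuple[str, str]]:
    """Return every pair of named VPCs whose CIDRs overlap (these can never be peered)."""
    nets = {name: ipaddress.ip_network(c) for name, c in cidrs.items()}
    return [(a, b) for a, b in combinations(nets, 2) if nets[a].overlaps(nets[b])]

def allocate_vpc_cidr(org_block: str, taken: list[str]) -> str:
    """Pick the first /16 inside the org block that overlaps nothing already allocated."""
    used = [ipaddress.ip_network(c) for c in taken]
    for candidate in ipaddress.ip_network(org_block).subnets(new_prefix=16):
        if not any(candidate.overlaps(u) for u in used):
            return str(candidate)
    raise RuntimeError(f"{org_block} has no free /16 left")

print(find_overlaps({"prod": "10.0.0.0/16", "staging": "10.0.0.0/16", "dev": "10.1.0.0/16"}))
print(allocate_vpc_cidr("10.16.0.0/12", ["10.16.0.0/16", "10.17.0.0/16"]))
```

A /12 holds sixteen /16s, so allocating only from that registry makes overlap impossible as long as every VPC goes through it.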

Debugging connectivity

When traffic doesn't flow, I work through this checklist:

Source security group: does it allow outbound to the destination?
Destination security group: does it allow inbound from the source?
Route table: does the source's subnet have a route to the destination's CIDR?
NACL (both source and destination subnets): allow both directions?
DNS: is the source resolving the destination correctly? (Often the actual problem.)

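VPC Flow Logs are the bottom-of-stack tool. They record accepted and rejected traffic per ENI, aggregated into flows over a capture window rather than one record per packet. Filtering them by source and destination IP tells you whether the traffic arrived and whether it was accepted or rejected, which narrows down the layer that's dropping it. Slow but authoritative.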
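A minimal sketch of that filtering step, assuming records in the default (version 2) flow log format and already downloaded as text lines; the sample records, account ID and ENI ID are made up:

```python
from typing import Iterable, Iterator, NamedTuple

class Flow(NamedTuple):
    interface_id: str
    srcaddr: str
    dstaddr: str
    srcport: int
    dstport: int
    action: str  # "ACCEPT" or "REJECT"

def parse_default_format(lines: Iterable[str]) -> Iterator[Flow]:
    """Parse default-format records:
    version account-id interface-id srcaddr dstaddr srcport dstport protocol
    packets bytes start end action log-status
    """
    for line in lines:
        f = line.split()
        if len(f) != 14 or f[13] != "OK":
            continue  # skip the header line and NODATA/SKIPDATA records
        yield Flow(f[2], f[3], f[4], int(f[5]), int(f[6]), f[12])

def between(flows: Iterable[Flow], a: str, b: str) -> list[Flow]:
    """All flows between two addresses, in either direction."""
    return [fl for fl in flows if {fl.srcaddr, fl.dstaddr} == {a, b}]

# Hypothetical sample: app (10.0.1.5) -> database (10.0.20.9) on 5432.
SAMPLE = [
    "version account-id interface-id srcaddr dstaddr srcport dstport protocol packets bytes start end action log-status",
    "2 123456789012 eni-0example 10.0.1.5 10.0.20.9 49152 5432 6 4 240 1700000000 1700000060 ACCEPT OK",
    "2 123456789012 eni-0example 10.0.20.9 10.0.1.5 5432 49152 6 4 240 1700000000 1700000060 REJECT OK",
]

for fl in between(parse_default_format(SAMPLE), "10.0.1.5", "10.0.20.9"):
    print(fl.srcaddr, "->", fl.dstaddr, fl.dstport, fl.action)
```

In this made-up sample the request is accepted but the reply to the ephemeral port is rejected. Since security groups let return traffic through automatically, that pattern usually points at a stateless NACL rule.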

What I'd tell someone learning

Stop trying to map cloud networking to physical networking. The concepts overlap but the mental model is "configuration defines connectivity," not "wires define connectivity."

Plan CIDRs before you need them. Pick a /12 for your org, allocate /16s per VPC from it, never overlap. This decision is hard to undo.

NAT gateway costs sneak up. Watch the GB-processed charges; that's where the surprises live. Use VPC endpoints for AWS service traffic.

Use security groups, not NACLs. Stateful + ENI-scoped + SG-references = the right tool for almost everything.

Read VPC Flow Logs at least once. When you next have a connectivity issue, query them. The first time you do this, your ability to debug improves a lot.

Cloud networking isn't conceptually harder than physical networking; it's just different in ways that surprise people who came from the physical world. Once the mental model is right, everything else falls into place.


About Kiril Urbonas