VPCs, subnets, route tables, gateways. The mental model that finally made cloud networking click after I stopped trying to map it 1:1 to physical networks.
I came to cloud networking after years of working with physical networks. The vocabulary is familiar (VLAN, subnet, route table, NAT), but the mapping isn't 1:1. For the first few years I kept tripping on small differences. This is the mental model I wish I'd had at the start.
In a physical network, the wires define connectivity. Computers connected to the same switch are on the same broadcast domain by default; routing between switches requires a router; firewalls are physical or virtual appliances inserted in the path.
In a cloud VPC, the route tables define connectivity. There are no wires. Two VMs can sit in adjacent IP ranges and be totally unable to reach each other if the route tables and security groups don't allow it. They can also be in different regions and reach each other directly over inter-region VPC peering. The physical layout doesn't matter; the configuration does.
Once that clicked for me, everything else got easier.
A VPC is a private network. It has a primary CIDR block (for example 10.0.0.0/16) and optionally secondary blocks.
A VPC is regional. You can't span a VPC across regions; for that you peer VPCs together.
We pick /16 CIDRs (65k addresses) for new VPCs by default. /16 is overkill for most workloads but you can't expand a /20 to a /16 without surgery later. Picking too big costs nothing; picking too small costs a migration.
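To make the sizing trade-off concrete, here is a minimal sketch using Python's standard `ipaddress` module. The helper name `subnets_available` is made up for illustration; it counts how many subnets of a given size fit in a VPC block.

```python
import ipaddress


def subnets_available(vpc_cidr, subnet_prefix):
    """Return how many subnets of the given prefix length fit in the VPC block."""
    vpc = ipaddress.ip_network(vpc_cidr)
    if subnet_prefix < vpc.prefixlen:
        return 0  # the requested subnet is larger than the VPC itself
    return 2 ** (subnet_prefix - vpc.prefixlen)


for cidr in ("10.0.0.0/16", "10.0.0.0/20"):
    total = ipaddress.ip_network(cidr).num_addresses
    print(cidr, "addresses:", total, "/22 subnets:", subnets_available(cidr, 22))
```

A /20 holds only four /22 subnets, which is barely one per AZ with one to spare; a /16 holds 64.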
A subnet is a slice of the VPC's CIDR, bound to a single Availability Zone. You can't have a subnet that spans AZs.
For a regional service that needs HA across 3 AZs, you need at least 3 subnets — one per AZ. The application is deployed once per AZ, into the subnet of that AZ.
Subnet CIDR sizing: each AWS subnet reserves 5 IPs (network, broadcast, AWS reserved). A /24 (256 addresses) gives you 251 usable. If your subnet hosts containers (each pod gets an IP under EKS's default CNI), you need bigger subnets. We use /22 (1024 addresses) for EKS subnets, /24 for VM-only subnets.
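The arithmetic above can be sketched in a few lines. This is an illustration only: the AZ labels and the helper names `usable_ips` and `carve_per_az` are hypothetical, and the carving simply hands out the first blocks of the VPC range in order.

```python
import ipaddress

AWS_RESERVED_PER_SUBNET = 5  # reserved addresses per subnet, as described above


def usable_ips(cidr):
    """Addresses left in a subnet after the per-subnet reservations."""
    return ipaddress.ip_network(cidr).num_addresses - AWS_RESERVED_PER_SUBNET


def carve_per_az(vpc_cidr, prefix, azs):
    """Assign one subnet of the given size to each AZ, in order."""
    blocks = ipaddress.ip_network(vpc_cidr).subnets(new_prefix=prefix)
    return {az: str(block) for az, block in zip(azs, blocks)}


print(usable_ips("10.0.1.0/24"))
print(carve_per_az("10.0.0.0/16", 22, ["az-a", "az-b", "az-c"]))
```

Each subnet maps to exactly one AZ, so three AZs means three entries in the result.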
The "public" vs "private" distinction is just about routing:
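A public subnet has a default route (0.0.0.0/0) pointing to the internet gateway. Resources here can reach the internet directly. They also have public IPs (or get one when assigned).
A private subnet has no route to the internet gateway; outbound internet traffic, if allowed at all, goes through a NAT gateway.
There's no flag on a subnet that says "public" or "private." It's a description of how the route table is configured.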
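Since "public" is only a property of the route table, you can classify a subnet from its routes alone. A minimal sketch, assuming routes are given as a plain dict of destination CIDR to target ID (a made-up representation, not an AWS API response) and relying on the `igw-` prefix that AWS internet gateway IDs carry:

```python
def classify_subnet(routes):
    """Return 'public' if the default route targets an internet gateway."""
    default_target = routes.get("0.0.0.0/0", "")
    if default_target.startswith("igw-"):
        return "public"
    return "private"


# Hypothetical route tables for illustration.
public_routes = {"10.0.0.0/16": "local", "0.0.0.0/0": "igw-0abc"}
private_routes = {"10.0.0.0/16": "local", "0.0.0.0/0": "nat-0def"}
isolated_routes = {"10.0.0.0/16": "local"}

print(classify_subnet(public_routes))
print(classify_subnet(private_routes))
print(classify_subnet(isolated_routes))
```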
Our standard layout per VPC:
A route table contains rules:
10.0.0.0/16 local # the VPC's own CIDR
0.0.0.0/0 nat-gateway-id # default → NAT
10.50.0.0/16 transit-gateway-id # peer VPC → transit gateway
When traffic leaves an instance, the route table is consulted and the most specific (longest-prefix) match wins. Traffic to the VPC's own CIDR stays local. Traffic to the peer VPC's range goes to the transit gateway. Everything else hits the default route.
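The lookup can be sketched as a longest-prefix match over the three example routes above. This is a simplified model using the placeholder target names from the table, not how AWS implements it internally:

```python
import ipaddress

ROUTES = [
    ("10.0.0.0/16", "local"),
    ("0.0.0.0/0", "nat-gateway-id"),
    ("10.50.0.0/16", "transit-gateway-id"),
]


def lookup(dest_ip, routes=ROUTES):
    """Return the target of the most specific route that contains dest_ip."""
    addr = ipaddress.ip_address(dest_ip)
    matches = []
    for cidr, target in routes:
        network = ipaddress.ip_network(cidr)
        if addr in network:
            matches.append((network.prefixlen, target))
    if not matches:
        raise LookupError("no route to " + dest_ip)
    # The longest prefix is the most specific match, regardless of list order.
    return max(matches)[1]


for ip in ("10.0.3.7", "10.50.1.1", "8.8.8.8"):
    print(ip, "->", lookup(ip))
```

Note that the order of entries in the table does not matter; only prefix length does.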
Each subnet associates with one route table. Multiple subnets can share a route table; that's how all 3 private subnets share "send 0.0.0.0/0 to a NAT."
An internet gateway (IGW) is what makes a public subnet "public." It's a logical attachment to the VPC. In the route table for a public subnet, 0.0.0.0/0 points to the IGW.
Caveat: even with an IGW, an instance needs a public IP (or Elastic IP) to be reachable from the internet. The IGW performs one-to-one NAT between the instance's private IP and its public IP.
A VPC can have at most one IGW attached. You don't create multiple IGWs for redundancy; the IGW itself is HA within the region.
A NAT gateway lets private-subnet instances reach the internet (for software updates, API calls to external services) without being reachable inbound.
NAT gateways live in public subnets and have an Elastic IP. Private subnets' route tables point 0.0.0.0/0 at the NAT gateway.
Two cost gotchas:
The NAT bill is a frequent target for cost optimization. Common moves:
We discovered one of our services was downloading a 1 GB ML model from S3 on every pod startup, and the traffic was going through the NAT gateway. Switching to an S3 VPC endpoint cut $400/month off the bill.
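A rough way to estimate this kind of charge before it shows up on the bill. The function name, the startup count, and the per-GB rate below are all placeholders for illustration; substitute the current data-processing rate for your region and your own traffic numbers.

```python
def nat_processing_cost(gb_per_startup, startups_per_month, rate_per_gb):
    """Monthly NAT data-processing charge for one recurring download."""
    return gb_per_startup * startups_per_month * rate_per_gb


# Hypothetical inputs: 1 GB per pod startup, 2000 startups a month,
# and an assumed rate of 0.05 per GB (not a quoted price).
estimate = nat_processing_cost(1.0, 2000, 0.05)
print(round(estimate, 2))
```

The point of the sketch is that the charge scales with pod churn, not with the number of services, which is why a single download on startup can dominate.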
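VPC endpoints give private connectivity to AWS services without going through the internet (or NAT). Two flavors: gateway endpoints, which exist only for S3 and DynamoDB and add a route table entry at no charge, and interface endpoints, which place an ENI in your subnet and bill hourly plus per GB processed.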
We use gateway endpoints universally (free, no reason not to). Interface endpoints are a per-service decision based on traffic volume.
Two firewalls operate at different layers:
Security groups (SG): stateful, per-ENI. "Allow inbound port 443 from SG-of-the-load-balancer." Stateful means return traffic is automatically allowed.
Network ACLs (NACL): stateless, per-subnet. Both directions must be allowed explicitly, so return traffic needs its own rule covering the ephemeral port range.
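The stateful/stateless difference is easiest to see in a toy model. The classes below are a deliberately simplified sketch (port-only rules, no CIDRs, no rule numbers or explicit denies), and the ephemeral range of 1024 to 65535 is an assumption for the example:

```python
EPHEMERAL_PORTS = range(1024, 65536)  # assumed reply-port range for this sketch


class SecurityGroup:
    """Stateful: an allowed inbound flow is remembered, so its reply is allowed."""

    def __init__(self, inbound_ports):
        self.inbound_ports = set(inbound_ports)
        self.tracked = set()

    def inbound(self, client_ip, client_port, server_port):
        allowed = server_port in self.inbound_ports
        if allowed:
            self.tracked.add((client_ip, client_port, server_port))
        return allowed

    def reply(self, client_ip, client_port, server_port):
        return (client_ip, client_port, server_port) in self.tracked


class NetworkAcl:
    """Stateless: each direction is checked against its own rules."""

    def __init__(self, inbound_ports, outbound_ports):
        self.inbound_ports = inbound_ports
        self.outbound_ports = outbound_ports

    def inbound(self, client_ip, client_port, server_port):
        return server_port in self.inbound_ports

    def reply(self, client_ip, client_port, server_port):
        # The reply leaves toward the client's source port.
        return client_port in self.outbound_ports


flow = ("203.0.113.5", 50000, 443)
sg = SecurityGroup([443])
broken_nacl = NetworkAcl([443], [])
fixed_nacl = NetworkAcl([443], EPHEMERAL_PORTS)

print(sg.inbound(*flow), sg.reply(*flow))
print(broken_nacl.inbound(*flow), broken_nacl.reply(*flow))
print(fixed_nacl.inbound(*flow), fixed_nacl.reply(*flow))
```

The `broken_nacl` case lets the request in and then drops the reply, which is the failure mode stateless filtering invites.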
We use security groups for almost everything. NACLs are used only as a coarse defense-in-depth layer (e.g., deny all SSH at the NACL level for non-bastion subnets, just in case a security group is misconfigured).
Most of our security group complexity is solved by SG-references rather than IP-based rules. Example: "the database SG allows port 5432 from the app SG." When we scale the app, new instances get the app SG and automatically have access. No CIDR updates needed.
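A small model of why SG references remove the need for CIDR updates. The data shapes here (a membership dict and a rules dict) and the instance and group names are invented for illustration; they are not an AWS API format.

```python
def is_allowed(src_instance, dst_instance, port, membership, rules):
    """Check whether any SG on the destination admits the source's SGs on this port.

    membership: instance name -> set of security group names
    rules: destination SG name -> list of (port, allowed source SG name)
    """
    src_groups = membership.get(src_instance, set())
    for dst_sg in membership.get(dst_instance, set()):
        for rule_port, source_sg in rules.get(dst_sg, []):
            if rule_port == port and source_sg in src_groups:
                return True
    return False


membership = {"app-1": {"app-sg"}, "db-1": {"db-sg"}, "batch-1": {"batch-sg"}}
rules = {"db-sg": [(5432, "app-sg")]}

print(is_allowed("app-1", "db-1", 5432, membership, rules))
print(is_allowed("batch-1", "db-1", 5432, membership, rules))

# Scaling out: the new instance only needs the app SG; the rules are untouched.
membership["app-2"] = {"app-sg"}
print(is_allowed("app-2", "db-1", 5432, membership, rules))
```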
Once you have multiple VPCs (which you will — separate accounts, separate environments), connecting them is a separate problem:
VPC peering: 1:1 connections between two VPCs. Simple but doesn't scale — for 5 VPCs, you have 10 peerings, all of which need route table entries everywhere.
Transit gateway (TGW): hub-and-spoke. Each VPC connects to the TGW; routing happens centrally. Costs more (per-attachment hourly + per-GB processed) but vastly simpler at scale.
PrivateLink: exposes a specific service from one VPC to another, without full network peering. Used for service-provider scenarios.
We standardized on TGW once we hit 4 VPCs. The TGW costs are real but the alternative — a full mesh of peerings — was unmaintainable.
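The scaling difference is simple arithmetic: a full mesh of n VPCs needs n(n-1)/2 peerings, while hub-and-spoke needs one attachment per VPC. A short sketch (function names are made up for the example):

```python
def full_mesh_peerings(vpc_count):
    """Peering connections needed so every VPC reaches every other VPC."""
    return vpc_count * (vpc_count - 1) // 2


def tgw_attachments(vpc_count):
    """Attachments needed when every VPC connects to one transit gateway."""
    return vpc_count


for n in (2, 4, 5, 10, 20):
    print(n, "VPCs:", full_mesh_peerings(n), "peerings vs", tgw_attachments(n), "attachments")
```

Peering count grows quadratically, and each peering also brings route table entries on both sides, so the maintenance burden grows faster than the count alone suggests.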
Mistakes I've made or seen:
Subnets too small. A /27 (32 IPs, 27 usable after AWS reservations) seems fine for "a few servers" but EKS pods eat IPs fast. Use bigger subnets than you think you need.
Overlapping CIDRs across VPCs. Two VPCs both use 10.0.0.0/16? They can't be peered, ever. Without overlap-prevention up front, peering becomes "we need to renumber an entire VPC." We standardized on /16 blocks per VPC from a registered /12 range; overlap is impossible by design.
Forgetting NACLs are stateless. Adding an inbound rule but not the corresponding outbound rule for the ephemeral port range. Symptoms: requests arrive but the replies are dropped, so connections hang and time out. We mostly avoid this by leaving NACLs at defaults and using security groups for filtering.
NAT gateway in the wrong AZ. A private subnet in AZ-a with a route to a NAT gateway in AZ-b is technically fine but adds cross-AZ data charges. We pair NAT gateways with route tables per-AZ.
Security group sprawl. Hundreds of security groups, half unused, naming inconsistent. We have a quarterly cleanup that drops orphaned SGs.
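Of these, the overlapping-CIDR mistake is the easiest to catch mechanically. A minimal sketch of an overlap check and a "next free /16" allocator using the standard `ipaddress` module; the org range 10.0.0.0/12, the VPC names, and the function names are hypothetical:

```python
import ipaddress


def overlapping_pairs(vpc_cidrs):
    """Return every pair of VPC names whose CIDR blocks overlap."""
    names = sorted(vpc_cidrs)
    nets = {name: ipaddress.ip_network(vpc_cidrs[name]) for name in names}
    return [
        (a, b)
        for i, a in enumerate(names)
        for b in names[i + 1:]
        if nets[a].overlaps(nets[b])
    ]


def next_free_block(org_cidr, allocated, prefix=16):
    """Return the first block of the given size in org_cidr that overlaps nothing allocated."""
    taken = [ipaddress.ip_network(cidr) for cidr in allocated]
    for candidate in ipaddress.ip_network(org_cidr).subnets(new_prefix=prefix):
        if not any(candidate.overlaps(block) for block in taken):
            return str(candidate)
    raise ValueError("no free /%d left in %s" % (prefix, org_cidr))


vpcs = {"prod": "10.0.0.0/16", "staging": "10.1.0.0/16", "legacy": "10.0.0.0/16"}
print(overlapping_pairs(vpcs))
print(next_free_block("10.0.0.0/12", ["10.0.0.0/16", "10.1.0.0/16"]))
```

Running a check like this in CI against the list of allocated blocks is one way to make overlap impossible in practice rather than by convention.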
When traffic doesn't flow, I work through this checklist:
VPC Flow Logs are the bottom-of-stack tool. They record the accept/reject decision for each flow, per ENI. Querying them for "SOURCE_IP DEST_IP" shows whether the traffic was accepted or rejected and in which direction, which narrows down the layer that is dropping it. Slow but authoritative.
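If the logs are exported as text, a filter like the following is often enough. This sketch assumes the default (version 2) space-separated record format; the sample records, account ID, and ENI ID are invented for the example, and custom log formats would need a different field list.

```python
FIELDS = (
    "version account_id interface_id srcaddr dstaddr srcport dstport "
    "protocol packets bytes start end action log_status"
).split()


def parse_record(line):
    """Split one default-format flow log line into a dict keyed by field name."""
    values = line.split()
    if len(values) != len(FIELDS):
        raise ValueError("unexpected field count: %d" % len(values))
    return dict(zip(FIELDS, values))


def rejected_between(lines, src, dst):
    """Return parsed REJECT records for traffic from src to dst."""
    hits = []
    for line in lines:
        record = parse_record(line)
        if (record["srcaddr"], record["dstaddr"], record["action"]) == (src, dst, "REJECT"):
            hits.append(record)
    return hits


# Invented sample records in the default format.
sample = [
    "2 123456789012 eni-0aaa 10.0.1.15 10.0.2.20 50000 5432 6 3 180 1700000000 1700000060 REJECT OK",
    "2 123456789012 eni-0aaa 10.0.1.15 10.0.2.21 50001 443 6 10 4200 1700000000 1700000060 ACCEPT OK",
]
for record in rejected_between(sample, "10.0.1.15", "10.0.2.20"):
    print(record["interface_id"], record["dstport"], record["action"])
```

The `interface_id` on a REJECT record tells you which ENI made the decision, which is usually the fastest way to see which side of the connection to inspect.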
Stop trying to map cloud networking to physical networking. The concepts overlap but the mental model is "configuration defines connectivity," not "wires define connectivity."
Plan CIDRs before you need them. Pick a /12 for your org, allocate /16s per VPC from it, never overlap. This decision is hard to undo.
NAT gateway costs sneak up. Watch the GB-processed charges; that's where the surprises live. Use VPC endpoints for AWS service traffic.
Use security groups, not NACLs. Stateful + ENI-scoped + SG-references = the right tool for almost everything.
Read VPC Flow Logs at least once. When you next have a connectivity issue, query them. The first time you do this, your ability to debug improves a lot.
Cloud networking isn't conceptually harder than physical networking; it's just different in ways that surprise people who came from the physical world. Once the mental model is right, everything else falls into place.