How a packet actually gets from the internet to a pod, walked layer by layer. Plus the things that surprise people the first time they hit them.
Most "Kubernetes networking" content stays high-level. You learn that pods have IPs, services have cluster IPs, ingress routes external traffic — and then in production you hit a problem and none of that is detailed enough to debug. This is the version I wish I'd had when I was learning: where each piece actually lives, what touches each packet, and where the surprises are.
The flat-network promise: every pod has its own IP address. Any pod can reach any other pod's IP without NAT. Pods see their own IP the same way other pods see them.
This isn't magic; it's implemented by a CNI (Container Network Interface) plugin. Common options: Cilium, Calico, AWS VPC CNI, Flannel. They all give you the flat network; they differ in how they implement it.
The flat network means cross-pod traffic doesn't traverse the kube-proxy layer. Pod A talking directly to Pod B's IP is a single network hop (whatever the CNI implements — usually Linux routing or BPF), no Kubernetes service object involved.
A Service is a stable virtual IP (ClusterIP) that load-balances to a set of pods. The pods are selected by label.
When you call service-name.namespace.svc.cluster.local, DNS resolves to the ClusterIP. Then... what?
The answer depends on kube-proxy's mode:
iptables mode (default for many years): kube-proxy programs iptables rules on every node. Each rule has random-probability DNAT to one of the backend pods. Traffic to the ClusterIP gets DNATed to one specific pod IP at the kernel level, then routed normally.
IPVS mode: similar but uses Linux IPVS instead of iptables. Better performance with many services (iptables rules are evaluated sequentially, so matching is O(n) in the number of rules; IPVS uses hash tables).
eBPF mode (Cilium): no kube-proxy needed. eBPF programs at socket / TC level do the DNAT directly.
For most clusters under ~5,000 services, iptables mode is fine. Above that, IPVS or eBPF starts mattering.
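To make the "random-probability DNAT" in iptables mode concrete, here is a minimal simulation, assuming the commonly described scheme where the rule for backend i matches with probability 1/(n - i) and the last rule always matches. The pod IPs and function names are made up for illustration; this models the rule chain, it does not read real iptables state.

```python
import random
from collections import Counter


def rule_probabilities(n):
    """Per-rule match probabilities for a service with n backends."""
    return [1.0 / (n - i) for i in range(n)]


def pick_backend(backends, rng=random):
    """Walk the rules in order: rule i matches with probability 1/(n - i),
    and the final rule always matches."""
    n = len(backends)
    for i, backend in enumerate(backends):
        remaining = n - i
        if remaining == 1 or rng.random() < 1.0 / remaining:
            return backend
    raise ValueError("service has no backends")


if __name__ == "__main__":
    pods = ["10.244.1.5", "10.244.2.7", "10.244.3.9"]  # hypothetical pod IPs
    print(rule_probabilities(len(pods)))
    print(Counter(pick_backend(pods) for _ in range(30_000)))
```

The per-rule probabilities look uneven, but because each rule is only reached when the earlier ones did not match, every backend ends up with roughly the same share of new connections.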
The thing to understand: ClusterIP isn't a real network address. It exists only inside the cluster, in the kernel's iptables/ipvs/bpf state. Outside the cluster, ClusterIPs are unreachable.
Three Service types for exposure:
ClusterIP: internal only. Default.
NodePort: opens a port (30000-32767 by default) on every node. External traffic can hit <any-node-ip>:<nodeport> and gets forwarded to the service. Mainly useful for development; rarely the right answer in production.
LoadBalancer: provisions a cloud load balancer (AWS NLB or Classic ELB, GCP LB, etc.) that points at the NodePort. The cloud LB is a real public-internet endpoint. This is what most production services use for north-south traffic.
A subtle thing: LoadBalancer Services are usually L4 (TCP). For HTTP routing (host-based, path-based), you want Ingress, which is L7.
Ingress is a higher-level abstraction. An Ingress resource defines rules: "host=api.example.com, path=/v1/* → service=api-v1; path=/v2/* → service=api-v2." An Ingress Controller (nginx-ingress, traefik, etc.) reads these rules and configures itself.
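The rule matching described above can be sketched in a few lines. This is a simplified model, assuming exact host matching and prefix matching on whole path segments with the longest prefix winning; real ingress controllers have their own matching details. The Rule type and route function are hypothetical names, and the rules mirror the example in the text.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Rule:
    host: str
    path_prefix: str
    service: str


RULES = [
    Rule("api.example.com", "/v1", "api-v1"),
    Rule("api.example.com", "/v2", "api-v2"),
]


def route(host: str, path: str, rules=RULES) -> Optional[str]:
    """Return the backend service for the longest matching path prefix,
    or None when no rule matches."""
    matches = [
        r for r in rules
        if r.host == host
        and (path == r.path_prefix
             or path.startswith(r.path_prefix.rstrip("/") + "/"))
    ]
    if not matches:
        return None
    return max(matches, key=lambda r: len(r.path_prefix)).service
```

Matching on whole segments means /v10/x does not fall under the /v1 rule, which is usually what you want.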
The Ingress Controller is just a regular pod (well, a Deployment) running an HTTP proxy. It receives traffic via a LoadBalancer Service of its own.
So a request from the internet to a pod traverses: cloud LB → ingress controller → backend Service (kube-proxy DNAT) → backend pod. Three "hops" in terms of L7/L4 boundaries.
For TLS termination, the ingress controller usually does it (with certs from cert-manager).
To see how service-name.namespace.svc.cluster.local resolves, look inside a pod. There, /etc/resolv.conf looks like:
nameserver 10.96.0.10
search default.svc.cluster.local svc.cluster.local cluster.local
options ndots:5
The nameserver is the cluster's DNS service (CoreDNS by default), at the ClusterIP 10.96.0.10. CoreDNS pods know about every Service in the cluster and answer A/AAAA queries.
The search list means service-name (no dots) gets searched as service-name.default.svc.cluster.local, then service-name.svc.cluster.local, etc. This is why short names work inside the cluster but not outside.
ndots:5 is a famous cause of latency. With ndots=5, any name with fewer than 5 dots is tried against the search list first. A query like api.external.com (2 dots) gets searched against each cluster suffix (returning NXDOMAIN every time) before it is finally sent as-is to resolve externally. We've seen ~5ms added to every external DNS lookup because of this.
Mitigation: use FQDNs (with trailing dot — api.external.com.) for external lookups, or set ndots:1 in pod DNS config.
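To see why the trailing dot and a lower ndots help, here is a small sketch of the order in which a resolver tries names, assuming the usual resolv.conf semantics: names with at least ndots dots are tried as-is first, and names ending in a dot skip the search list entirely. The candidate_queries function is a hypothetical helper, not a real resolver API.

```python
SEARCH = ["default.svc.cluster.local", "svc.cluster.local", "cluster.local"]


def candidate_queries(name, search=SEARCH, ndots=5):
    """Return the fully qualified names a resolver would try, in order."""
    if name.endswith("."):
        return [name]  # already absolute: the search list is skipped
    as_is = name + "."
    with_suffixes = [f"{name}.{domain}." for domain in search]
    if name.count(".") >= ndots:
        return [as_is] + with_suffixes
    return with_suffixes + [as_is]


if __name__ == "__main__":
    for query in candidate_queries("api.external.com"):
        print(query)
```

With the defaults above, api.external.com produces three cluster-suffixed queries before the real one. With a trailing dot it produces a single query, and with ndots=1 the real name is tried first.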
By default, the flat network means any pod can reach any other pod. NetworkPolicies restrict this:
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: api-allow-frontend
  namespace: production
spec:
  # Pods this policy applies to
  podSelector:
    matchLabels:
      app: api
  ingress:
    # Allowed sources and ports
    - from:
        - podSelector:
            matchLabels:
              app: frontend
      ports:
        - port: 8080
          protocol: TCP
This says: pods labeled app: api in the production namespace accept inbound TCP/8080 only from pods labeled app: frontend in the same namespace. All other inbound traffic to those pods is denied.
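The evaluation logic behind that statement can be modeled roughly as follows. This is a simplified sketch covering ingress only, assuming label-equality selectors and that a bare podSelector in "from" refers to pods in the policy's own namespace. The dictionary layout and function names are invented for illustration and are not how any CNI stores policies.

```python
def selects(selector, labels):
    """An empty selector matches every pod; otherwise all pairs must match."""
    return all(labels.get(k) == v for k, v in selector.items())


POLICY = {
    "namespace": "production",
    "pod_selector": {"app": "api"},
    "ingress": [{"from": [{"app": "frontend"}], "ports": [("TCP", 8080)]}],
}


def ingress_allowed(policies, dst_ns, dst_labels, src_ns, src_labels, proto, port):
    """Decide whether a connection to the destination pod is allowed."""
    applicable = [
        p for p in policies
        if p["namespace"] == dst_ns and selects(p["pod_selector"], dst_labels)
    ]
    if not applicable:
        return True  # no policy selects the pod, so it is not isolated
    for policy in applicable:
        for rule in policy["ingress"]:
            peer_ok = any(
                src_ns == policy["namespace"] and selects(sel, src_labels)
                for sel in rule["from"]
            )
            if peer_ok and (proto, port) in rule["ports"]:
                return True
    return False
```

The key behavior is the first branch: a pod that no policy selects accepts everything, and a pod becomes restricted as soon as at least one policy selects it.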
Important: NetworkPolicies are implemented by the CNI. Some CNIs (Flannel by default) don't support them. Make sure your CNI does (Cilium, Calico, the various managed K8s CNIs do).
Default deny is also worth setting up. By default, pods accept all traffic. We deploy a default-deny NetworkPolicy in every namespace and require teams to explicitly allow what they need:
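apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny
spec:
  # Empty selector: applies to every pod in the namespace
  podSelector: {}
  # No ingress or egress rules are listed, so both directions are denied
  policyTypes:
    - Ingress
    - Egress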
This is a real security improvement — without it, a compromised pod can connect anywhere in the cluster.
Almost every service-to-service call inside the cluster goes through DNS. CoreDNS is therefore on the hot path for everything.
Things that have hurt us:
CoreDNS pod restarts during high-load periods cause a temporary spike in DNS failures. We added PodDisruptionBudgets and increased the replica count.
The 5-second timeout in glibc's getaddrinfo means a DNS query that times out blocks the calling thread for 5 seconds. With high request volume, blocked threads cascade into pod-level latency. We added dnsConfig settings to lower the timeout.
ndots:5 causing all external lookups to first try cluster suffixes. Mentioned above. The fix is real and we've seen ~10% latency improvement on services with heavy external API calls.
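As a rough illustration of the dnsConfig change mentioned above, the sketch below builds the dnsConfig fragment of a pod spec and shows the resolv.conf options line it corresponds to. The specific values (1 second timeout, 2 attempts, ndots 1) are assumptions for the example, not the settings we run, and the helper names are hypothetical.

```python
import json


def dns_config(timeout_s, attempts, ndots):
    """Build a pod-spec dnsConfig fragment; option values are strings."""
    return {
        "options": [
            {"name": "timeout", "value": str(timeout_s)},
            {"name": "attempts", "value": str(attempts)},
            {"name": "ndots", "value": str(ndots)},
        ]
    }


def resolv_options_line(config):
    """Render the options as they would appear in /etc/resolv.conf."""
    parts = [f"{opt['name']}:{opt['value']}" for opt in config["options"]]
    return "options " + " ".join(parts)


if __name__ == "__main__":
    cfg = dns_config(timeout_s=1, attempts=2, ndots=1)  # assumed values
    print(json.dumps({"dnsConfig": cfg}, indent=2))
    print(resolv_options_line(cfg))
```

Whatever values you choose, test them under load first: a shorter timeout turns slow lookups into retries, which helps blocked threads but adds query volume on CoreDNS.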
CoreDNS metrics are essential, and we alert on them.
Once you add a service mesh (Istio, Linkerd), traffic flow gets another layer:
Pod A → sidecar A (proxy) → network → sidecar B (proxy) → Pod B
The mesh sidecars handle mTLS, observability, retries. They don't replace Services — they augment them. A request still goes through DNS resolution and (depending on mesh) might still hit kube-proxy's DNAT.
This adds 1-2ms of latency typically. For most services, fine. For latency-critical services, sometimes you exclude them from the mesh.
A user clicks "load orders" in our app. The request flow:
api.example.com → returns ALB IP
host=api.example.com, path=/orders/* → forwards to orders-service.production.svc.cluster.local
db.production.svc.cluster.local
About 5 layers of proxying/redirection in this flow. Each adds a few hundred microseconds. Total overhead vs "direct connection": ~3-5ms. Worth it for the operational benefits.
ClusterIP isn't reachable from outside the cluster. Obvious in retrospect; not obvious when you first try to debug from a non-cluster machine. Use NodePort or port-forward.
Conntrack table overflow. Linux's connection tracker has a default limit (~262k entries). High-throughput services can fill this, after which new connections get dropped. We bumped net.netfilter.nf_conntrack_max to 2M on busy nodes.
Source IP loss. When external traffic enters through a NodePort or LoadBalancer Service, kube-proxy can forward it to a pod on another node, and by default the source IP is replaced with a node IP along the way. If your app needs the real client IP, set Service.spec.externalTrafficPolicy: Local. This preserves the source IP but has its own tradeoffs: only nodes with backing pods receive traffic.
ARP table overflow on big clusters. Each pod IP needs an ARP entry on the nodes that see it. Above ~10k pods per cluster, you can hit ARP cache limits. We bumped net.ipv4.neigh.default.gc_thresh3 accordingly.
Cross-AZ data charges. Pod-to-pod traffic across AZs costs cloud-provider data charges. For high-traffic services, this adds up. We use topology.kubernetes.io/zone topology spread constraints + topology-aware service routing to keep traffic within an AZ when possible.
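Going back to the conntrack limit mentioned above: a quick way to see how close a node is to it is to compare the kernel's current entry count with the maximum. This is a minimal sketch, assuming a Linux node with the conntrack module loaded so that both /proc files exist. The 80% warning threshold is an arbitrary choice for illustration.

```python
from pathlib import Path

COUNT_PATH = Path("/proc/sys/net/netfilter/nf_conntrack_count")
MAX_PATH = Path("/proc/sys/net/netfilter/nf_conntrack_max")


def utilization(count, maximum):
    """Fraction of the conntrack table currently in use."""
    if maximum <= 0:
        raise ValueError("nf_conntrack_max must be positive")
    return count / maximum


def read_utilization(count_path=COUNT_PATH, max_path=MAX_PATH):
    """Read both counters from /proc and return the usage fraction."""
    count = int(Path(count_path).read_text().strip())
    maximum = int(Path(max_path).read_text().strip())
    return utilization(count, maximum)


if __name__ == "__main__":
    try:
        ratio = read_utilization()
    except OSError:
        print("conntrack counters not available on this host")
    else:
        status = "WARN" if ratio >= 0.8 else "ok"  # arbitrary threshold
        print(f"{status}: conntrack table {ratio:.1%} full")
```

Running something like this from a node-level agent gives you a warning before new connections start getting dropped, rather than after.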
When traffic doesn't flow, the layers to check:
kubectl exec into source pod, curl http://<dest-pod-ip>:port. If this fails, it's a CNI/NetworkPolicy issue.
nslookup dest-service. If this fails, it's a CoreDNS issue.
curl http://<cluster-ip>:port. If this fails, it's kube-proxy.
This top-down approach finds the broken layer in a few steps. Don't try to debug at multiple layers at once.
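The same layer-by-layer checks can be scripted and run from inside the source pod. This is a sketch only, assuming Python is available in the container. The helper names and the arguments to build_checks (pod IP, service name, ClusterIP, port) are placeholders you would fill in yourself.

```python
import socket


def tcp_check(host, port, timeout=2.0):
    """Return a check that succeeds if a TCP connection can be opened."""
    def check():
        with socket.create_connection((host, port), timeout=timeout):
            pass
    return check


def dns_check(name):
    """Return a check that succeeds if the name resolves."""
    def check():
        socket.getaddrinfo(name, None)
    return check


def build_checks(pod_ip, service_name, cluster_ip, port):
    """Order matters: pod network first, then DNS, then the Service VIP."""
    return [
        ("CNI/NetworkPolicy", tcp_check(pod_ip, port)),
        ("CoreDNS", dns_check(service_name)),
        ("kube-proxy", tcp_check(cluster_ip, port)),
    ]


def first_broken_layer(checks):
    """Run (layer, check) pairs in order; return the first layer that fails,
    or None when every check passes."""
    for layer, check in checks:
        try:
            check()
        except OSError:  # covers timeouts, refused connections, DNS errors
            return layer
    return None
```

Stopping at the first failure is deliberate: it keeps you on one layer at a time instead of chasing symptoms from a higher layer that are caused by a lower one.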
Read your CNI's docs. The flat-network abstraction is implemented differently by each CNI. Knowing which one you're running and how it works (especially around encapsulation / source-NAT behavior) is essential.
Set NetworkPolicies as default-deny. It's a real security improvement and forces explicit traffic flow documentation. Yes, it's annoying upfront. Yes, it's worth it.
Watch CoreDNS metrics. It's on the hot path for everything. When it's slow, your whole cluster is slow.
Use FQDNs for external lookups inside pods. ndots:5 + cluster suffixes will burn cycles otherwise.
Don't expose ClusterIPs to the outside world. They're internal abstractions. Use Ingress for external HTTP, LoadBalancer Services for external L4.
K8s networking has more layers than most people expect. Most of the time, it works. When it doesn't, the layers help: you can localize the problem to the right one and fix it. That's the value of the model — not that it's simple, but that it's debuggable.