A systematic approach to debugging Linux network issues. The tools that earn their place and the order I use them in.
When something on the network is broken, panic-running random tools is a tempting and unproductive approach. After enough production debugging, I've landed on a systematic order: layer by layer, simplest checks first. This post is that order, with the tools and what they tell you.
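Network debugging works best when you check from the bottom up: link (interfaces and carrier), then IP addressing and routing, then transport (ports and connections), then the application protocols (DNS, TLS, HTTP).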
Most network issues are at one layer. Check each in order; the answer surfaces.
For cloud / containerized work, "physical layer" is virtual but the same concept applies — virtual NICs, virtual networks, etc.
ip link show
Shows network interfaces and their state. Look for UP and LOWER_UP flags:
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 ...
UP means the OS thinks it's up. LOWER_UP means the link layer (cable / virtual link) is up. If LOWER_UP is missing, no carrier — physical issue.
If the interface isn't there at all, kernel module or hardware issue.
ethtool eth0 shows link speed, duplex, link partner. Useful for physical debugging (less so for cloud / virtual).
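The UP / LOWER_UP check above is easy to script when a host has many interfaces. A minimal sketch, assuming the one-line-per-interface format printed by ip -o link show; the function name no_carrier_ifaces is made up for illustration:

```shell
# Print interfaces that are administratively UP but have no carrier
# (UP present, LOWER_UP missing). Reads `ip -o link show` output on stdin,
# so it can also be run against saved output.
no_carrier_ifaces() {
  awk -F': ' '
    {
      flags = $3
      sub(/>.*/, "", flags)   # drop everything after the <...> flag list
      sub(/^</, "", flags)    # drop the leading "<"
      n = split(flags, f, ",")
      up = 0; lower = 0
      for (i = 1; i <= n; i++) {
        if (f[i] == "UP") up = 1
        if (f[i] == "LOWER_UP") lower = 1
      }
      if (up && !lower) print $2
    }'
}

# Usage:
#   ip -o link show | no_carrier_ifaces
```

An empty result means every interface that is configured up also has link.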
ip addr show
Shows IP addresses on each interface. Confirm the right IP is on the right interface.
ip route show
Shows the routing table. The default route is usually default via <gateway> dev <interface>. If the default is missing or wrong, packets to the internet have nowhere to go.
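A quick scripted version of the default-route check, as a sketch. It assumes the usual ip route show line format (default via <gateway> dev <interface> ...); default_route is a hypothetical helper name:

```shell
# Print "<gateway> <interface>" for the first default route found on stdin.
# Exits non-zero if there is no default route at all.
default_route() {
  awk '
    $1 == "default" {
      gw = "-"; dev = "-"          # "-" when the route has no via/dev keyword
      for (i = 2; i < NF; i++) {
        if ($i == "via") gw = $(i + 1)
        if ($i == "dev") dev = $(i + 1)
      }
      print gw, dev
      found = 1
      exit
    }
    END { if (!found) exit 1 }'
}

# Usage:
#   ip route show | default_route || echo "no default route"
```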
ip neigh show # ARP table
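Maps IPs to MAC addresses on the local network. If you can't reach a host on the same subnet, check its entry. REACHABLE means the mapping was confirmed recently. STALE only means it hasn't been verified lately, which is normal for an idle peer. FAILED or INCOMPLETE means ARP requests are going unanswered: the host is down, on a different L2 segment, or not responding to ARP.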
ping <ip>
The classic. Works if the destination responds to ICMP and routing is correct. Some hosts block ICMP, so ping failing doesn't always mean connectivity is broken — but ping working confirms basic L3 connectivity.
traceroute <host>
# or
mtr <host> # better, runs continuously
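Shows the path packets take. Useful for identifying where in the network the problem is. If every hop past a certain point stops responding, including the destination, the problem most likely starts there. A single silent hop in the middle usually just means that router doesn't answer probes.
mtr (My Traceroute) is the better tool: it sends packets continuously and shows packet loss per hop, which lets you spot a flaky middle link. Loss that shows up at one hop but not at the hops after it is usually that router rate-limiting its ICMP replies, not real loss.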
ss -tlnp # Listening TCP ports with process info
ss -tunap # All TCP+UDP, listening and connected
ss is the modern replacement for netstat. Shows what's listening and what connections are open.
For testing if a port is reachable from somewhere:
nc -zv <host> <port> # Test if port is open
curl -v https://host:port/ # Test HTTPS
telnet <host> <port> # Old-school, still works
nc -zv is great for quick port-open testing.
For looking at actual TCP connections:
ss -tn state established # Established connections
ss -ti # With detailed TCP info
The -i shows congestion window, RTT estimates, retransmits, etc. — diagnostic gold for slow connections.
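To find which connections are retransmitting, the ss -ti output can be filtered. This is a rough sketch that assumes the layout I see from ss -tin on current iproute2: one connection line followed by an indented detail line, with a retrans:<in-flight>/<total> field that only appears when it is non-zero. The function name retrans_by_conn is my own:

```shell
# Print "<total retransmits> <local> -> <peer>" for every connection that
# has retransmitted, highest count first. Reads `ss -tin` output on stdin.
retrans_by_conn() {
  awk '
    /^[^ \t]/ {                    # connection line: last two fields are the addresses
      local_addr = $(NF - 1); peer = $NF
      next
    }
    {                              # indented detail line for the connection above
      for (i = 1; i <= NF; i++)
        if ($i ~ /^retrans:[0-9]+\/[0-9]+$/) {
          split($i, p, /[:\/]/)    # p[2] = in flight, p[3] = total
          print p[3], local_addr, "->", peer
        }
    }' | sort -rn
}

# Usage (no -p, so the peer address stays the last column):
#   ss -tin | retrans_by_conn | head
```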
If basic connectivity works but the application doesn't:
dig <hostname> # DNS lookup
dig +trace <hostname> # Full resolution path
dig is the right tool for DNS debugging. nslookup works but is less informative. host is even simpler.
For checking specific resolvers:
dig @8.8.8.8 example.com
dig @<resolver-ip> example.com
Useful for debugging "why does this work from outside the cluster but not inside" (different resolvers).
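When comparing resolvers, the two lines of dig output that matter most are the query time and the server that actually answered. A small sketch that pulls both out; it assumes dig's standard ";; Query time:" and ";; SERVER:" footer lines, and dig_timing is a hypothetical helper name:

```shell
# Print "<server> <query time in ms>" from dig output on stdin.
# Exits non-zero if no query time was found (e.g. the query never completed).
dig_timing() {
  awk '
    /^;; Query time:/ { ms = $4 }
    /^;; SERVER:/     { server = $3 }
    END {
      if (ms == "") exit 1
      print server, ms
    }'
}

# Usage: compare a public resolver with the one the cluster uses.
#   dig @8.8.8.8 example.com | dig_timing
#   dig @<resolver-ip> example.com | dig_timing
```

A repeated query often returns in a few milliseconds because the resolver has cached the answer, so compare first-query times.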
For TLS:
openssl s_client -connect host:443 -servername host
Shows the TLS handshake, cert chain, etc. -servername is important for SNI.
Common TLS problems: expired cert, wrong intermediate cert, hostname mismatch. openssl s_client shows all of these clearly.
For HTTP:
curl -v https://example.com/
curl -v --resolve example.com:443:1.2.3.4 https://example.com/
-v shows the full request/response including headers. --resolve lets you bypass DNS to test against a specific IP.
After enough network debugging, certain patterns recur:
"Connection times out" but not "connection refused". Usually a firewall (security group, network ACL, or host firewall) silently dropping packets. Connection refused means the host is reachable but the port is closed; timeout means the host isn't reachable or the firewall is dropping.
"Connection refused". The service isn't listening on the port, OR it's listening on a different interface (e.g., 127.0.0.1 only). Check ss -tlnp on the destination.
"Name or service not known". DNS is broken. Check /etc/resolv.conf, the resolver itself, etc. Common in containers when DNS config is wrong.
Intermittent failures. TCP retransmits, packet loss, MTU issues. ss -ti shows retransmit counts; mtr shows packet loss per hop.
Slow connections. Often DNS — every connection has a DNS step. dig @<resolver> to see resolver latency. Sometimes it's a slow firewall (deep packet inspection on a hot path).
SSL/TLS handshake failures. Cert mismatches, protocol version mismatches, cipher mismatches. openssl s_client shows the handshake stage that fails.
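Several of the patterns above map directly onto curl's exit codes, which makes a first triage scriptable. A sketch using exit codes documented in the curl man page (6, 7, 28, 35, 60); the function name explain_curl_exit and the hint wording are my own:

```shell
# Translate a curl exit code into the failure pattern it most likely points to.
explain_curl_exit() {
  case "$1" in
    0)  echo "ok: request completed" ;;
    6)  echo "dns: could not resolve host; check /etc/resolv.conf and the resolver" ;;
    7)  echo "refused-or-unreachable: nothing listening on that address/port, or no route" ;;
    28) echo "timeout: packets are probably being dropped by a firewall or security group" ;;
    35) echo "tls-handshake: protocol or cipher mismatch; try openssl s_client" ;;
    60) echo "tls-cert: certificate not trusted; check chain, expiry and hostname" ;;
    *)  echo "other: see EXIT CODES in man curl" ;;
  esac
}

# Usage:
#   curl -sS -o /dev/null --connect-timeout 5 https://example.com/
#   explain_curl_exit $?
```

Without --connect-timeout, a silently dropped connection can take minutes to fail, so set it explicitly when probing.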
For containers and Kubernetes, each container (or, in Kubernetes, each pod) normally has its own network namespace. Tools like ip and ss show only the state of the namespace they run in.
To check inside a container's namespace from the host:
nsenter -t <pid> -n ip addr
nsenter -t <pid> -n ss -tlnp
Or with Docker:
docker exec <container> ss -tlnp
Or with kubectl:
kubectl exec <pod> -- ss -tlnp
A common mistake: running ss on the host and being confused that you don't see the container's listening port. Different namespace; need to enter it.
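The nsenter form needs the container's PID, and it is the one that still works when the image ships without ss or ip, because the binaries come from the host. A sketch for Docker, assuming docker inspect is available and you are root on the host; ns_run is a hypothetical wrapper name:

```shell
# Run a host command inside a Docker container's network namespace.
# Usage: ns_run <container> <command> [args...]
ns_run() {
  container=$1
  shift
  # .State.Pid is the PID of the container's main process as seen from the host
  pid=$(docker inspect -f '{{.State.Pid}}' "$container") || return 1
  nsenter -t "$pid" -n "$@"
}

# Usage:
#   ns_run <container> ss -tlnp
#   ns_run <container> ip addr
```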
When you really need to know what's on the wire:
tcpdump -i any host 1.2.3.4 -nnvv
Captures packets matching the filter and prints them. Useful for confirming whether packets are actually leaving or arriving at the host.
tcpdump -i any port 443 -w out.pcap
Capture to a file; open in Wireshark for analysis. Wireshark's UI is much better for non-trivial inspection.
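For HTTPS traffic, decryption requires session keys logged by the client (for example via SSLKEYLOGFILE, which browsers and many TLS clients honor) or, for connections using RSA key exchange only, the server's private key. The private key alone is not enough for TLS 1.3 or any forward-secret (ECDHE) cipher suite. Plain HTTP can be inspected directly.
VPC Flow Logs. AWS records the accept/reject decision for traffic flows per ENI, aggregated over a capture window rather than per packet. Slow but authoritative. Query in CloudWatch Logs Insights or Athena.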
Security group reachability test. AWS has a feature ("Reachability Analyzer") that tells you if a connection between two ENIs is allowed by the security groups + NACLs + route tables. Saves a lot of debugging.
Network policies in Kubernetes. When pod-to-pod traffic is blocked, the CNI's NetworkPolicy enforcement is often the cause. Check the active policies.
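ip link show type veth to see the containers' host-side interfaces (a shell glob like veth* is not a valid interface name for ip). Useful for checking traffic counters: ip -s link show veth123abc.
Conntrack table fullness. cat /proc/sys/net/netfilter/nf_conntrack_count. If the value is close to nf_conntrack_max, new connections are at risk: once the table is full, the kernel drops packets for new connections and logs "nf_conntrack: table full, dropping packet".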
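The conntrack check is worth turning into something a monitoring job can call. A minimal sketch; the 80% warning threshold is an arbitrary assumption, and conntrack_usage is a made-up name. The directory argument exists only so the function can be pointed at test files:

```shell
# Print "<count>/<max> (<pct>%)" for the conntrack table.
# Exits non-zero when usage is at or above 80% (assumed threshold).
conntrack_usage() {
  dir=${1:-/proc/sys/net/netfilter}
  count=$(cat "$dir/nf_conntrack_count") || return 1
  max=$(cat "$dir/nf_conntrack_max") || return 1
  pct=$(( count * 100 / max ))
  echo "$count/$max (${pct}%)"
  [ "$pct" -lt 80 ]
}

# Usage:
#   conntrack_usage || echo "conntrack table is getting full"
```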
Specific checklist:
dig with timing.
mtr, ss -ti.
iperf3 between source and destination.
The order: cheap and fast first; harder later.
A few tools that are less relevant than they used to be:
netstat: replaced by ss. ss is faster and gives more info.
ifconfig: replaced by ip. ip is more flexible.
route / route add: replaced by ip route.
arp -a: replaced by ip neigh.
The new tools are part of iproute2. Worth learning if you've been using the old ones for years.
Layer-by-layer beats random. Check L1 → L2 → L3 → ... in order. Most issues live at one layer.
ss over netstat. ip over ifconfig. Modern tools, more useful output.
Read manpages. man ip-route, man ss. They're not Wikipedia articles; they're terse but informative.
mtr is your friend for "where in the path is the problem."
tcpdump for "is the packet actually being sent." When other tools don't agree, look at the wire.
Cloud has additional tools. VPC Flow Logs, Reachability Analyzer, etc. Use them.
Network namespaces matter. When working with containers, remember which namespace you're in.
Linux network debugging has the advantage of being well-tooled and well-documented. The tools are old, stable, and understood. The patterns are layered. Most issues yield to a systematic approach. The skill is in working the layers methodically rather than guessing — which is what makes the difference between 30 minutes of debugging and 3 hours.