When the service is slow and the network is suspect, these are the tools we reach for, in this order, with the exact flags that find the answer.
When a service is slow and the network is the suspect, you reach for tools. The good news is there are only a few you really need. The bad news is the man pages are long and the right flags are buried. This is a field-notes guide to the commands we run, in the order we run them, when we're trying to figure out why something is broken.
Tools covered: ss, tcpdump, bpftrace, iftop, mtr. Not netstat (deprecated; ss is faster and more accurate).
ss: the connections view
ss (socket statistics) replaces the old netstat. First questions on any latency issue: how many connections do we have, are any in a weird state, are queues backed up?
Connection counts by state.
# -H drops the header line so wc -l counts only sockets
ss -Htan state established | wc -l
ss -Htan state time-wait | wc -l
ss -tan | awk 'NR>1 {print $1}' | sort | uniq -c
A huge number of TIME-WAIT sockets (tens of thousands) usually points at short-lived connections: on Linux, the side that closes first holds the socket in TIME-WAIT for about 60s. Solutions: connection reuse on the client (keepalive, pooling), or net.ipv4.tcp_tw_reuse = 1, which lets the kernel reuse TIME-WAIT sockets for new outgoing connections.
A huge number of CLOSE-WAIT is worse — it means the local side hasn't closed sockets that the peer has closed. Usually an application bug (forgot to close the socket).
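A minimal sketch of turning the state counts into a quick check. The function name, the TW_MAX and CW_MAX variables, and their default thresholds are all hypothetical; pick numbers from your own baseline.

```shell
# Sketch: tally TCP states from `ss -tan` output on stdin, warn on stderr.
# summarize_states, TW_MAX and CW_MAX are hypothetical names; the default
# thresholds are guesses, not recommendations.
summarize_states() {
  awk -v tw_max="${TW_MAX:-10000}" -v cw_max="${CW_MAX:-100}" '
    NR > 1 { n[$1]++ }    # skip the header, count by the State column
    END {
      for (s in n) printf "%s %d\n", s, n[s]
      if (n["TIME-WAIT"] > tw_max)
        print "WARN: TIME-WAIT high, check connection reuse" > "/dev/stderr"
      if (n["CLOSE-WAIT"] > cw_max)
        print "WARN: CLOSE-WAIT high, check for unclosed sockets" > "/dev/stderr"
    }' | LC_ALL=C sort
}
```

Usage: ss -tan | summarize_states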
Listen queue depth.
ss -ltn
The Recv-Q column for a listening socket = current backlog (connections waiting to be accepted). The Send-Q column = the backlog limit (the value passed to listen(), capped by net.core.somaxconn). If Recv-Q is consistently near Send-Q, your application isn't calling accept() fast enough, usually an event loop issue or thread starvation.
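The Recv-Q versus Send-Q comparison is easy to automate. This is a sketch only: the function name is made up, and the 80% cutoff is an arbitrary assumption rather than a known-good threshold.

```shell
# Sketch: print listeners whose accept queue is at least 80% full.
# Reads `ss -ltn` output on stdin. Columns assumed:
#   State Recv-Q Send-Q Local-Address:Port Peer-Address:Port
full_listeners() {
  awk 'NR > 1 && $3 > 0 && $2 >= 0.8 * $3 { print $4, $2 "/" $3 }'
}
```

Usage: ss -ltn | full_listeners. One sample proves little; run it in a loop (for example under watch) and look for listeners that stay on the list.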
Per-socket details.
ss -tin
The -i flag adds TCP-level info: RTT, congestion window, retransmits, RTO. A connection with high retrans is having packet loss; one with high RTT is geographically distant or the network is slow.
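On a host with thousands of sockets, the -i output is too much to eyeball. The sketch below filters it down to connections worth a look. It assumes the layout iproute2 usually prints (a socket line starting with ESTAB, then an indented detail line containing rtt:<avg>/<var> and, when there were retransmits, retrans:<in-flight>/<total>); the function name and the MAX_RTT_MS variable are hypothetical.

```shell
# Sketch: list established sockets with high RTT or any retransmits.
# Reads `ss -tin` output on stdin. MAX_RTT_MS (default 100) is an assumed knob.
slow_sockets() {
  awk -v max_rtt="${MAX_RTT_MS:-100}" '
    /^ESTAB/ { peer = $5; next }          # remember the peer for the detail line
    /rtt:/ {
      rtt = ""; retrans = "0/0"
      for (i = 1; i <= NF; i++) {
        if ($i ~ /^rtt:/)     { split(substr($i, 5), a, "/"); rtt = a[1] }
        if ($i ~ /^retrans:/) { retrans = substr($i, 9) }
      }
      if (rtt + 0 > max_rtt + 0 || retrans != "0/0")
        print peer, "rtt_ms=" rtt, "retrans=" retrans
    }'
}
```

Usage: ss -tin | slow_sockets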
tcpdump: the packet view
When ss says something is wrong but doesn't tell you what, tcpdump does. Capture, analyze in Wireshark.
Capture to a file (don't try to read live — too noisy).
tcpdump -i any -w /tmp/cap.pcap -s 0 'host <peer-ip> and port <port>'
-s 0 = capture full packet (not just headers). -w writes binary, much faster than printing. Always filter by host + port; capturing everything makes the file unusable.
For TLS-encrypted traffic you can still see TCP-level details (handshake, retransmits, RTT) — you just can't read the payload.
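Captures left running tend to fill disks, so it helps to bound them. Here is a sketch of a wrapper that limits the capture with timeout and always applies a host and port filter. The function name, the output path pattern, and the DRY_RUN switch are hypothetical conveniences, not part of tcpdump.

```shell
# Sketch: time-bounded, filtered capture.
# capture PEER PORT [SECONDS]; set DRY_RUN=1 to print the command instead of running it.
capture() {
  peer=$1 port=$2 secs=${3:-30}
  out="/tmp/cap-${peer}-${port}.pcap"
  set -- timeout "$secs" tcpdump -i any -nn -s 0 -w "$out" "host $peer and port $port"
  if [ "${DRY_RUN:-0}" = 1 ]; then
    printf '%s\n' "$*"     # show what would run
  else
    "$@"                   # needs root or CAP_NET_RAW
  fi
}
```

Usage: capture 10.0.0.5 443 30, then open the resulting pcap in Wireshark.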
Reading live (for sanity checks).
tcpdump -i any -nn -A 'host <peer> and port 80'
-nn = no DNS/port resolution (faster). -A = ASCII payload (useful for HTTP).
Common scenarios we capture for:
tcpdump is too coarse: bpftrace
tcpdump shows packets. Sometimes you need to know what the kernel is doing with those packets. bpftrace runs eBPF programs from one-liners.
Find which processes are opening connections to a target.
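bpftrace -e 'tracepoint:syscalls:sys_enter_connect /args->uservaddr->sa_family == 2/ {
  // sa_family 2 = AF_INET; sa_data[0..1] is the port, sa_data[2..5] the IPv4 address.
  // sa_data is a signed char array, so mask each octet to print 0-255.
  printf("%s -> %d.%d.%d.%d\n", comm,
    args->uservaddr->sa_data[2] & 0xff, args->uservaddr->sa_data[3] & 0xff,
    args->uservaddr->sa_data[4] & 0xff, args->uservaddr->sa_data[5] & 0xff);
}'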
(Cleaner with the official tcpconnect tool from bpfcc-tools, but the inline version works without dependencies.)
Histogram of TCP round-trip times.
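bpftrace -e 'kprobe:tcp_rcv_established { $sk = (struct sock *)arg0; $tp = (struct tcp_sock *)arg0; @srtt_ms[ntop($sk->__sk_common.skc_daddr)] = hist(($tp->srtt_us >> 3) / 1000); }'
This samples the kernel's smoothed RTT estimate (srtt_us, which is stored shifted left by 3) each time an established socket receives a segment, and gives you an RTT distribution in milliseconds per IPv4 peer. It needs kernel type information (BTF) to resolve the structs. The histogram is keyed by peer rather than comm because this probe often fires in softirq context, where comm is whatever task happened to be on the CPU. Useful for "this service has weird latency to its dependencies."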
Drop tracing.
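bpftrace -e 'tracepoint:skb:kfree_skb { @[kstack] = count(); }'
Counts packets the kernel discards, grouped by kernel stack. Run for 30s, press Ctrl-C, and read the top drop sites. Not every kfree_skb call is a real loss, but the dominant stacks usually point at firewall rules, queue overflow, or netfilter. The tracepoint is the more portable hook: on recent kernels kfree_skb is an inline wrapper, so a kprobe on it may not attach.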
The bcc-tools package has dozens of pre-built tools for common needs: tcptracer, tcpretrans, tcptop, tcpsubnet. We start with these before writing custom bpftrace.
iftop
When you want a live view of bandwidth by connection:
iftop -i eth0 -n
Interactive view, sorted by current bandwidth. Useful for "who is saturating my NIC?" Real answer: the egress on a streaming endpoint, the rsync that someone started during business hours, the runaway log shipper.
mtr
When the question is "is this a us-or-them problem?", mtr shows the hop-by-hop path with packet loss and latency.
mtr -rwn -c 100 <target>
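-r = report mode (run, then exit). -w = wide output. -n = no DNS lookups. -c 100 = 100 packets. The output shows each hop and what percentage of packets it lost. Read loss from the bottom up: loss that appears at one middle hop and vanishes at the next is usually that router rate-limiting its ICMP replies, not real loss. Loss that starts at a middle hop and carries through to the destination is the network's fault; loss that shows up only at the destination is yours.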
We use this when escalating to a cloud provider — "your network is dropping at hop X" is much more actionable than "the network is slow."
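When pasting an mtr report into a ticket, it helps to state which hop the loss actually starts at. The sketch below walks the report from the last hop backwards and reports the first hop of the run of lossy hops that reaches the destination; loss that does not reach the destination is ignored on the assumption that it is ICMP rate limiting. The function name is hypothetical, and the parsing assumes the usual report layout where hop lines start with a field like "3.|--".

```shell
# Sketch: find where end-to-end loss begins in `mtr -rwn` output on stdin.
persistent_loss() {
  awk '
    $1 ~ /^[0-9]+[.][|]--$/ { n++; host[n] = $2; loss[n] = $3 + 0 }   # "12.0%" -> 12
    END {
      first = 0
      # walk back from the destination while hops still show loss
      for (i = n; i >= 1 && loss[i] > 0; i--) first = i
      if (first)
        printf "loss persists from hop %d (%s) to the destination\n", first, host[first]
      else
        print "no loss reaches the destination"
    }'
}
```

Usage: mtr -rwn -c 100 <target> | persistent_loss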
For a "service is slow" report:
1. ss -tin on the source. Are connections established? RTT high? Retransmits?
2. ss -ltn on the destination. Is the listen queue full?
3. tcpdump between source and destination. Capture for 30 seconds; look at the pattern.
4. mtr if cross-region. Is there path loss?
5. bpftrace / tcptop / tcptracer if the application-level symptoms don't match what packets show.
tcpdump on a busy server can drop packets. The tool itself can't keep up with kernel-level traffic. Use -B (kernel buffer size) and capture to a fast disk. If still dropping, capture on a specific interface or filter aggressively. Dropped tcpdump packets show up as "first one is fine, then traces look broken", which is confusing.
tcpdump with TLS reveals less than you think. You see the handshake, sizes, timing. You don't see payload. Use mitmproxy or capture on the unencrypted side (load balancer's internal hop) for content visibility.
Conntrack table fills up. On heavy-traffic gateways, the netfilter connection-tracking table can fill, causing connection drops. Compare /proc/sys/net/netfilter/nf_conntrack_count against nf_conntrack_max in the same directory. If they are close, raise the max or skip conntrack for the relevant traffic.
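A small sketch of that comparison as a percentage. The function name is made up, and the optional directory argument exists only so the function can be pointed at test fixtures instead of the real /proc path.

```shell
# Sketch: conntrack table utilisation as a percentage.
# conntrack_usage [DIR]; DIR defaults to the real sysctl directory.
conntrack_usage() {
  dir=${1:-/proc/sys/net/netfilter}
  count=$(cat "$dir/nf_conntrack_count") || return 1   # fails if conntrack is not loaded
  max=$(cat "$dir/nf_conntrack_max") || return 1
  echo "$(( count * 100 / max ))% ($count/$max)"
}
```

What counts as "close" is a judgment call; sustained readings in the high double digits on a gateway are worth acting on before the table is actually full.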
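Reverse path filter. rp_filter = 1 (strict mode, set by default on many distributions; the kernel's own default is 0) drops packets that arrive on an interface that wouldn't be used to route back to the source. Bites in multi-NIC setups; an asymmetric route silently drops. Check /proc/sys/net/ipv4/conf/all/rp_filter and the per-interface /proc/sys/net/ipv4/conf/<iface>/rp_filter: the kernel applies the higher of the two.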
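Because the setting in force on an interface is the maximum of the "all" value and that interface's own value, reading one file can mislead. A sketch that prints the effective value per interface follows; the function name is hypothetical, and the optional directory argument is there only for testing against fixtures.

```shell
# Sketch: effective rp_filter per interface = max(conf/all, conf/<iface>).
# rp_filter_effective [CONF_DIR]; 0 = off, 1 = strict, 2 = loose.
rp_filter_effective() {
  conf=${1:-/proc/sys/net/ipv4/conf}
  all=$(cat "$conf/all/rp_filter") || return 1
  for f in "$conf"/*/rp_filter; do
    iface=$(basename "$(dirname "$f")")
    case $iface in all|default) continue ;; esac   # not real interfaces
    val=$(cat "$f")
    [ "$all" -gt "$val" ] && val=$all
    echo "$iface $val"
  done
}
```

Any interface reporting 1 on a host with asymmetric routing is a candidate for the silent drops described above.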
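MTU mismatches. When packets of certain sizes are dropped silently (large requests hang, small requests succeed), suspect MTU. The classic culprit is a VPN or tunnel whose MTU is smaller than the endpoints assume. Probe with ping -M do -s 1472 <target>: -M do sets the don't-fragment bit, and 1472 bytes of payload plus 28 bytes of IPv4 and ICMP headers makes a full 1500-byte packet.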
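If the 1472-byte probe fails, the next question is what size does get through. The sketch below binary-searches the largest don't-fragment payload that gets a reply and adds back the 28 header bytes (20 for an IPv4 header without options, 8 for ICMP). The function names and the default search bounds are assumptions, and it only works where ICMP echo is allowed along the path.

```shell
# probe SIZE TARGET: succeeds if a don't-fragment ping with SIZE payload bytes is answered.
probe() { ping -M do -c 1 -W 1 -s "$1" "$2" >/dev/null 2>&1; }

# Sketch: path_mtu TARGET [LOW] [HIGH] prints the estimated IPv4 path MTU.
path_mtu() {
  target=$1 lo=${2:-500} hi=${3:-1472}
  probe "$lo" "$target" || { echo "no reply even at $lo bytes" >&2; return 1; }
  while [ "$lo" -lt "$hi" ]; do
    mid=$(( (lo + hi + 1) / 2 ))          # round up so the loop always makes progress
    if probe "$mid" "$target"; then lo=$mid; else hi=$(( mid - 1 )); fi
  done
  echo $(( lo + 28 ))                     # payload + IPv4 header + ICMP header
}
```

A result below 1500 on a path you expected to be plain Ethernet is the cue to go looking for the tunnel.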
Network debugging rewards calm and a systematic approach. The kernel has been doing this for decades and is usually right about what it's seeing. The tools above expose what the kernel knows; the discipline is reading the output without jumping to conclusions. Most of the time the answer is in ss or tcpdump; the eBPF tools are for the harder remainder.