Six months ago we adopted bcc / bpftrace for ad-hoc production debugging on Linux nodes. Before that, our toolchain was strace, tcpdump, perf, and a lot of guessing. Three real incidents below where eBPF tools cut investigation time from hours to minutes — and the specific commands that did it.
eBPF (extended Berkeley Packet Filter) lets you safely run small programs in the kernel. For SREs, the immediate value is the toolkit built on top of it:

- bcc: a collection of ready-made, single-purpose tools (tcptracer, opensnoop, execsnoop, etc.)
- bpftrace: a high-level tracing language, think awk for kernel events

If you've ever wanted to answer "what's actually happening on this kernel right now" without instrumenting the application, eBPF is the answer.
Symptom: a Go service had p99 latency spiking to 4 seconds. The service's own metrics said it was processing requests in 50ms. The discrepancy was coming from somewhere outside the service.
We first suspected network latency to the LB, but tcpdump showed nothing unusual on the wire.
# bpftrace one-liner: trace getrandom() syscalls and time them
sudo bpftrace -e '
tracepoint:syscalls:sys_enter_getrandom { @start[tid] = nsecs; }
tracepoint:syscalls:sys_exit_getrandom /@start[tid]/ {
  @us = hist((nsecs - @start[tid]) / 1000);
  delete(@start[tid]);
}'
# After 60 seconds:
@us:
[1, 2) 14217 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[2, 4) 3401 |@@@@@@@@@@@@ |
[4, 8) 217 | |
[8, 16) 1 | |
[1M, 2M) 47 | | ← 1–2 seconds!
[2M, 4M) 12 | | ← 2–4 seconds!
Some getrandom() calls were taking 1–4 seconds. That's the kernel's blocking-on-entropy behavior in early-boot or low-entropy conditions.
The service spawned subprocesses per request that linked TLS. Each TLS handshake called getrandom for nonces. Under burst load, the entropy pool drained and subsequent calls blocked.
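You can check whether a host is in this failure mode directly, without tracing anything. A quick sketch, assuming a Linux host (note that on kernels 5.6 and later the pool reports full once initialized and getrandom() no longer blocks after boot, so low values mainly matter on older kernels):

```shell
# Entropy available to the kernel RNG (Linux-specific /proc path).
# Persistently low values on older kernels mean getrandom() can block.
cat /proc/sys/kernel/random/entropy_avail
```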
Switched the service to reuse TLS connections (it had been creating one per request, which was wasteful anyway) and tuned kernel.random.urandom_min_reseed_secs on the host. p99 went from 4s to 80ms.
Without eBPF: would have eventually instrumented Go's crypto/rand calls (hours of code change + redeploy). With bpftrace: diagnosed in 12 minutes.
Symptom: a service intermittently failed to reach an internal API. Connections succeeded most of the time. About 1 in 50 returned a connection-reset.
tcpdump would have helped if we could catch one in flight, but the failures were spread across nodes and pods.
# bcc tool: tcpstates — trace TCP state transitions kernel-wide
sudo /usr/share/bcc/tools/tcpstates | grep -E "ESTABLISHED|CLOSE_WAIT"
# Then measure connect latency for just our service's outbound connections
sudo /usr/share/bcc/tools/tcpconnlat -p $(pgrep my-service)
PID COMM IP SADDR DADDR DPORT LAT(ms)
12891 my-service 4 10.0.1.4 10.0.2.7 443 18.43
12891 my-service 4 10.0.1.4 10.0.2.7 443 1010.21 ← !!!
12891 my-service 4 10.0.1.4 10.0.2.7 443 1010.34 ← !!!
12891 my-service 4 10.0.1.4 10.0.2.7 443 23.11
tcpconnlat shows TCP connection latency. Outliers were clustered at exactly 1010ms.
That number is the kernel's SYN retransmission timeout at work: when a SYN goes unanswered, Linux waits 1 second (the initial RTO; it doubles on each retry, and net.ipv4.tcp_syn_retries caps the number of attempts) before retransmitting. 1010ms is that 1-second wait plus ~10ms of normal connect latency. So we were dropping SYN packets to the destination on a small fraction of connections.
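Sketching the backoff schedule makes the signature easy to recognize (a minimal sketch, assuming the default net.ipv4.tcp_syn_retries of 6 and the 1-second initial RTO doubling on each retry):

```shell
# SYN retransmission schedule: 1s initial RTO, doubling each retry.
rto=1; elapsed=0
for attempt in 1 2 3 4 5 6; do
  elapsed=$((elapsed + rto))
  echo "retransmit ${attempt} at +${elapsed}s"
  rto=$((rto * 2))
done
```

A single dropped SYN therefore adds roughly 1 second; a fully blackholed connect keeps retransmitting well past the one-minute mark before failing.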
A misconfigured network policy was rate-limiting outbound connections from the source pod's namespace. The first SYN of a burst would be dropped; the retransmit succeeded.
Removed the rate-limit (it was a leftover from a forgotten security review). Connection-reset rate dropped to zero immediately.
Without eBPF: would have spent hours coordinating between two teams (network and platform), pulling tcpdump from suspect nodes, correlating timestamps. With tcpconnlat: diagnosed in 8 minutes (most of which was reading the docs to understand the 1010ms number).
Symptom: a logging service alerted on disk fill, but df -h showed the disk at 47% full.
du -sh /var/log matched df — there was no hidden data. But df and the kernel disagreed: the kernel was returning ENOSPC to writes.
This used to be an hours-long mystery. Open files holding deleted data, inode exhaustion, and obscure quota systems can all create this discrepancy.
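A quick first pass over those possibilities is worth doing before reaching for bpftrace (paths here are illustrative):

```shell
# Block usage — the "Mounted on" column also reveals which mount
# actually backs the path (a separate partition is easy to miss).
df -h /var/log
# Inode usage — inode exhaustion surfaces as ENOSPC too.
df -i /var/log
```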
# bpftrace: trace ENOSPC errors back to the file/process generating them
sudo bpftrace -e '
kretprobe:vfs_write /retval == -28/ { // -28 = -ENOSPC
@[comm, kstack(3)] = count();
}'
# After 30 seconds:
@[fluent-bit, vfs_write+0x...
ext4_file_write_iter+0x...
ksys_write+0x...]: 47
# Now find the file
sudo /usr/share/bcc/tools/opensnoop -n fluent-bit | head -10
PID COMM FD ERR PATH
12345 fluent-bit 9 0 /var/log/audit/audit.log ← writes here
12345 fluent-bit -1 28 /var/log/audit/audit.log ← ENOSPC
The file was on /var/log/audit which was a separate mounted filesystem from /var/log. df -h had been showing /var/log (47% full) but the actual writes were to /var/log/audit (100% full).
Audit logging filled its dedicated partition. The other "disk full" symptoms were noise.
Rotated the audit logs and configured space_left / space_left_action = suspend in auditd.conf so audit logging suspends before the partition fills, preventing the same issue.
Without eBPF: would have eventually run mount and noticed the separate filesystem (after maybe an hour of staring at df output). With bpftrace: diagnosed in 6 minutes including the time to look up the ENOSPC errno.
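Looking up an errno value like the -28 above doesn't require leaving the shell (a one-liner, assuming python3 is on the host):

```shell
# Map errno 28 to its symbolic name and message.
python3 -c 'import errno, os; print(errno.errorcode[28], "-", os.strerror(28))'
# → ENOSPC - No space left on device
```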
| Question | Tool |
|---|---|
| What syscalls is this process running? | syscount |
| How long does syscall X take? | bpftrace -e 'tracepoint:syscalls:sys_*' |
| What files is anyone opening? | opensnoop |
| What processes are exec'ing? | execsnoop |
| TCP connection latency | tcpconnlat, tcptracer |
| TCP retransmits | tcpretrans |
| Disk I/O patterns | biolatency, biotop |
| Page faults | perf stat -e page-faults (eBPF-backed options also exist) |
| CPU profiling without instrumentation | profile (bcc), parca |
We installed bcc-tools on every prod node as part of our standard image. The runtime overhead when not in use is zero (probes are inactive); the value when something goes wrong is enormous.
Our median time-to-diagnose for a Linux-kernel-related incident dropped from ~4 hours (interview-driven, instrument-and-redeploy debugging) to ~25 minutes (run targeted eBPF tool, read output, fix). That's not from clever tools alone — it's from having the right tool ready when the question is asked.
A few habits that make this work:

- opensnoop on a busy host emits thousands of lines per second. Filter aggressively.
- Pre-install bcc-tools on every production host. The cost of having it pre-installed is essentially zero.
- Keep perf for CPU. eBPF doesn't replace traditional profilers; they complement each other.
- Save your bpftrace one-liners. Building a personal collection of probes makes diagnostic time even shorter.

eBPF tools are not "advanced." They get called that because the traditional Unix tools (strace, tcpdump) sit behind steep learning curves. The eBPF tools are easier: opensnoop does one thing (show file opens) clearly, and tcpconnlat does one thing (show TCP connect latency) clearly.
The shift is from "instrument the application" to "ask the kernel what's happening." Once you've done that successfully twice, you'll wonder how you debugged Linux without it.