Six months ago we adopted bcc / bpftrace for ad-hoc production debugging on Linux nodes. Before that, our toolchain was strace, tcpdump, perf, and a lot of guessing. Three real incidents below where eBPF tools cut investigation time from hours to minutes — and the specific commands that did it.
eBPF (extended Berkeley Packet Filter) lets you safely run small programs in the kernel. For SREs, the immediate value is the toolkit built on top of it:

- bcc: a collection of ready-made, single-purpose tools (tcptracer, opensnoop, execsnoop, etc.)
- bpftrace: a high-level tracing language, think awk for kernel events

If you've ever wanted to answer "what's actually happening on this kernel right now" without instrumenting the application, eBPF is the answer.
Symptom: a Go service had p99 latency spiking to 4 seconds. The service's own metrics said it was processing requests in 50ms. The discrepancy was coming from somewhere outside the service.
We first suspected network latency to the LB, but tcpdump showed nothing unusual on the wire.
# bpftrace one-liner: trace getrandom() syscalls and time them
sudo bpftrace -e '
tracepoint:syscalls:sys_enter_getrandom { @start[tid] = nsecs; }
tracepoint:syscalls:sys_exit_getrandom /@start[tid]/ {
  @us = hist((nsecs - @start[tid]) / 1000);
  delete(@start[tid]);
}'
# After 60 seconds:
@us:
[1, 2) 14217 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[2, 4) 3401 |@@@@@@@@@@@@ |
[4, 8) 217 | |
[8, 16) 1 | |
[1M, 2M) 47 | | ← 1–2 seconds!
[2M, 4M) 12 | | ← 2–4 seconds!
Some getrandom() calls were taking 1–4 seconds. That's the kernel's blocking-on-entropy behavior in early-boot or low-entropy conditions.
The service spawned subprocesses per request that linked TLS. Each TLS handshake called getrandom for nonces. Under burst load, the entropy pool drained and subsequent calls blocked.
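You can check whether a host is in this failure mode directly, without tracing anything. A quick sketch, assuming a Linux host (note that on kernels 5.6 and later the pool reports full once initialized and getrandom() no longer blocks after boot, so low values mainly matter on older kernels):

```shell
# Entropy available to the kernel RNG (Linux-specific /proc path).
# Persistently low values on older kernels mean getrandom() can block.
cat /proc/sys/kernel/random/entropy_avail
```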
Switched the service to reuse TLS connections (it had been creating one per request, which was wasteful anyway) and tuned kernel.random.urandom_min_reseed_secs on the host. p99 went from 4s to 80ms.
Without eBPF: would have eventually instrumented Go's crypto/rand calls (hours of code change + redeploy). With bpftrace: diagnosed in 12 minutes.
Symptom: a service intermittently failed to reach an internal API. Connections succeeded most of the time. About 1 in 50 returned a connection-reset.
tcpdump would have helped if we could catch one in flight, but the failures were spread across nodes and pods.
# bcc tool: tcpstates — trace TCP state transitions kernel-wide
sudo /usr/share/bcc/tools/tcpstates | grep -E "ESTABLISHED|CLOSE_WAIT"
# Then measure connect latency for just our service's outbound connections
sudo /usr/share/bcc/tools/tcpconnlat -p $(pgrep my-service)
PID COMM IP SADDR DADDR DPORT LAT(ms)
12891 my-service 4 10.0.1.4 10.0.2.7 443 18.43
12891 my-service 4 10.0.1.4 10.0.2.7 443 1010.21 ← !!!
12891 my-service 4 10.0.1.4 10.0.2.7 443 1010.34 ← !!!
12891 my-service 4 10.0.1.4 10.0.2.7 443 23.11
tcpconnlat shows TCP connection latency. Outliers were clustered at exactly 1010ms.
That number is the kernel's SYN retransmission timeout at work: when a SYN goes unanswered, Linux waits 1 second (the initial RTO; it doubles on each retry, and net.ipv4.tcp_syn_retries caps the number of attempts) before retransmitting. 1010ms is that 1-second wait plus ~10ms of normal connect latency. So we were dropping SYN packets to the destination on a small fraction of connections.
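Sketching the backoff schedule makes the signature easy to recognize (a minimal sketch, assuming the default net.ipv4.tcp_syn_retries of 6 and the 1-second initial RTO doubling on each retry):

```shell
# SYN retransmission schedule: 1s initial RTO, doubling each retry.
rto=1; elapsed=0
for attempt in 1 2 3 4 5 6; do
  elapsed=$((elapsed + rto))
  echo "retransmit ${attempt} at +${elapsed}s"
  rto=$((rto * 2))
done
```

A single dropped SYN therefore adds roughly 1 second; a fully blackholed connect keeps retransmitting well past the one-minute mark before failing.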
A misconfigured network policy was rate-limiting outbound connections from the source pod's namespace. The first SYN of a burst would be dropped; the retransmit succeeded.
Removed the rate-limit (it was a leftover from a forgotten security review). Connection-reset rate dropped to zero immediately.
Without eBPF: would have spent hours coordinating between two teams (network and platform), pulling tcpdump from suspect nodes, correlating timestamps. With tcpconnlat: diagnosed in 8 minutes (most of which was reading the docs to understand the 1010ms number).
Symptom: a logging service alerted on disk fill, but df -h showed the disk at 47% full.
du -sh /var/log matched df — there was no hidden data. But df and the kernel disagreed: the kernel was returning ENOSPC to writes.
This used to be an hours-long mystery. Open files holding deleted data, inode exhaustion, and obscure quota systems can all create this discrepancy.
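A quick first pass over those possibilities is worth doing before reaching for bpftrace (paths here are illustrative):

```shell
# Block usage — the "Mounted on" column also reveals which mount
# actually backs the path (a separate partition is easy to miss).
df -h /var/log
# Inode usage — inode exhaustion surfaces as ENOSPC too.
df -i /var/log
```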
# bpftrace: trace ENOSPC errors back to the file/process generating them
sudo bpftrace -e '
kretprobe:vfs_write /retval == -28/ { // -28 = -ENOSPC
@[comm, kstack(3)] = count();
}'
# After 30 seconds:
@[fluent-bit, vfs_write+0x...
ext4_file_write_iter+0x...
ksys_write+0x...]: 47
# Now find the file
sudo /usr/share/bcc/tools/opensnoop -n fluent-bit | head -10
PID COMM FD ERR PATH
12345 fluent-bit 9 0 /var/log/audit/audit.log ← writes here
12345 fluent-bit -1 28 /var/log/audit/audit.log ← ENOSPC
The file was on /var/log/audit which was a separate mounted filesystem from /var/log. df -h had been showing /var/log (47% full) but the actual writes were to /var/log/audit (100% full).
Audit logging filled its dedicated partition. The other "disk full" symptoms were noise.
Rotated the audit logs and configured space_left / space_left_action = suspend in auditd.conf so audit logging suspends before the partition fills, preventing the same issue.
Without eBPF: would have eventually run mount and noticed the separate filesystem (after maybe an hour of staring at df output). With bpftrace: diagnosed in 6 minutes including the time to look up the ENOSPC errno.
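Looking up an errno value like the -28 above doesn't require leaving the shell (a one-liner, assuming python3 is on the host):

```shell
# Map errno 28 to its symbolic name and message.
python3 -c 'import errno, os; print(errno.errorcode[28], "-", os.strerror(28))'
# → ENOSPC - No space left on device
```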
| Question | Tool |
|---|---|
| What syscalls is this process running? | syscount |
| How long does syscall X take? | bpftrace -e 'tracepoint:syscalls:sys_*' |
| What files is anyone opening? | opensnoop |
| What processes are exec'ing? | execsnoop |
| TCP connection latency | tcpconnlat, tcptracer |
| TCP retransmits | tcpretrans |
| Disk I/O patterns | biolatency, biotop |
| Page faults | perf stat -e page-faults (eBPF-backed options also exist) |
| CPU profiling without instrumentation | profile (bcc), parca |
We installed bcc-tools on every prod node as part of our standard image. The runtime overhead when not in use is zero (probes are inactive); the value when something goes wrong is enormous.
Our median time-to-diagnose for a Linux-kernel-related incident dropped from ~4 hours (interview-driven, instrument-and-redeploy debugging) to ~25 minutes (run targeted eBPF tool, read output, fix). That's not from clever tools alone — it's from having the right tool ready when the question is asked.
A few habits that make this work:

- opensnoop on a busy host emits thousands of lines per second. Filter aggressively.
- Pre-install bcc-tools on every production host. The cost of having it pre-installed is essentially zero.
- Keep perf for CPU. eBPF doesn't replace traditional profilers; they complement each other.
- Save your bpftrace one-liners. Building a personal collection of probes makes diagnostic time even shorter.

eBPF tools are not "advanced." They get called that because the traditional Unix tools (strace, tcpdump) sit behind steep learning curves. The eBPF tools are easier: opensnoop does one thing (show file opens) clearly, and tcpconnlat does one thing (show TCP connect latency) clearly.
The shift is from "instrument the application" to "ask the kernel what's happening." Once you've done that successfully twice, you'll wonder how you debugged Linux without it.