Step-by-step debugging of a production Linux server hitting 100% CPU. From top to perf to the actual fix.
Last month our API servers started responding in 8 seconds instead of 200ms. Here's exactly how we diagnosed and fixed it using standard Linux tools.
top -bn1 | head -20
Output showed:
A node process consuming 94% CPU.
ps aux --sort=-%cpu | head -5
The culprit was our Node.js API process, PID 28431.
strace -c -p 28431 -e trace=read,write,futex
Results: futex calls (lock contention) and reads from a file descriptor together made up 89% of syscalls.
ls -la /proc/28431/fd | wc -l
# Result: 4,847
lsof -p 28431 | grep -c "TCP"
# Result: 4,201 TCP connections
We had 4,201 open TCP connections. Our connection pool had no limit, and a downstream service was responding slowly, causing connections to pile up.
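The descriptor count can also be broken down by kind, which makes it obvious when sockets dominate. The following is a minimal sketch, assuming Linux and permission to read /proc/<pid>/fd (same user or root). The function names classifyFdTarget and countFdsByKind are ours for illustration, not part of any tool used above.

```javascript
const fs = require('fs');
const path = require('path');

// Classify a /proc/<pid>/fd symlink target such as "socket:[12345]",
// "pipe:[678]", "anon_inode:[eventpoll]", or a regular path.
function classifyFdTarget(target) {
  if (target.startsWith('socket:')) return 'socket';
  if (target.startsWith('pipe:')) return 'pipe';
  if (target.startsWith('anon_inode:')) return 'anon_inode';
  return 'file';
}

// Count open descriptors by kind for a PID (hypothetical helper, Linux only).
function countFdsByKind(pid) {
  const dir = `/proc/${pid}/fd`;
  const counts = {};
  for (const name of fs.readdirSync(dir)) {
    let target;
    try {
      target = fs.readlinkSync(path.join(dir, name));
    } catch (err) {
      continue; // descriptor was closed between readdir and readlink
    }
    const kind = classifyFdTarget(target);
    counts[kind] = (counts[kind] || 0) + 1;
  }
  return counts;
}
```

Calling countFdsByKind(28431) would return an object keyed by kind. A process leaking connections should show a socket count far above everything else.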
ss -tnp | grep "pid=28431," | awk '{print $5}' | sort | uniq -c | sort -rn | head
Result: 3,800 connections to port 5432 (PostgreSQL) in ESTABLISHED state. The database wasn't the bottleneck—our pool was creating connections faster than queries completed.
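The same per-destination tally can be done in code if you capture the ss output from a script. This is a rough sketch, assuming the default `ss -tn` column order (State, Recv-Q, Send-Q, local address, peer address, process). The countByPeer name and the sample addresses are made up for illustration.

```javascript
// Tally connections per peer "address:port" from the text printed by `ss -tn`.
// Assumes the default column order, where the fifth column is the peer.
function countByPeer(ssOutput) {
  const counts = new Map();
  for (const line of ssOutput.split('\n')) {
    const cols = line.trim().split(/\s+/);
    // Skip the header row and blank or short lines.
    if (cols.length < 5 || cols[0] === 'State') continue;
    const peer = cols[4];
    counts.set(peer, (counts.get(peer) || 0) + 1);
  }
  // Highest count first, like `sort | uniq -c | sort -rn`.
  return [...counts.entries()].sort((a, b) => b[1] - a[1]);
}

// Hypothetical sample input; real output comes from running ss on the host.
const sample = [
  'State  Recv-Q  Send-Q  Local Address:Port  Peer Address:Port  Process',
  'ESTAB  0       0       10.0.0.5:41234      10.0.0.9:5432      users:(("node",pid=28431,fd=31))',
  'ESTAB  0       0       10.0.0.5:41236      10.0.0.9:5432      users:(("node",pid=28431,fd=32))',
  'ESTAB  0       0       10.0.0.5:50112      10.0.0.7:6379      users:(("node",pid=28431,fd=33))',
].join('\n');

console.log(countByPeer(sample));
```

Grouping on the peer column rather than the local one matters here: local ports are ephemeral and nearly all unique, so only the peer side reveals which destination is accumulating connections.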
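const { Pool } = require('pg'); // assumes node-postgres, which these option names match

const pool = new Pool({
  max: 20, // was unlimited
  idleTimeoutMillis: 30000, // close clients that sit idle for 30 s
  connectionTimeoutMillis: 5000, // give up after 5 s waiting for a connection
});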
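To see why a cap plus a timeout stops the pile-up, it helps to look at the mechanism in isolation. The sketch below is not the pool library's implementation. It is a toy limiter, with a hypothetical BoundedSlots class, showing the general idea: at most max holders at once, extra callers queue, and a queued caller fails after a timeout instead of opening yet another connection.

```javascript
// Toy bounded limiter: at most `max` holders at once; extra callers wait
// in a FIFO queue and are rejected after `timeoutMs`.
class BoundedSlots {
  constructor(max, timeoutMs) {
    this.max = max;
    this.timeoutMs = timeoutMs;
    this.inUse = 0;
    this.waiters = [];
  }

  acquire() {
    if (this.inUse < this.max) {
      this.inUse += 1;
      return Promise.resolve();
    }
    return new Promise((resolve, reject) => {
      const waiter = { resolve, timer: null };
      waiter.timer = setTimeout(() => {
        // Give up: drop this waiter from the queue and reject.
        this.waiters = this.waiters.filter((w) => w !== waiter);
        reject(new Error('timed out waiting for a free slot'));
      }, this.timeoutMs);
      this.waiters.push(waiter);
    });
  }

  release() {
    const next = this.waiters.shift();
    if (next) {
      // Hand the slot to the oldest waiter; inUse stays the same.
      clearTimeout(next.timer);
      next.resolve();
    } else {
      this.inUse -= 1;
    }
  }
}
```

With a slow downstream, callers back up in the queue and time out quickly, so the number of open connections stays at the cap rather than growing into the thousands.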
# /etc/prometheus/alerts/fd_alert.yml
- alert: HighFileDescriptors
  expr: process_open_fds > 1000
  for: 5m
  labels:
    severity: warning
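The alert only fires if the process actually exports process_open_fds. Prometheus client libraries typically provide it out of the box. If yours does not, a hand-rolled version might look like the sketch below, which uses only the Node.js standard library and assumes Linux. The helper names and the idea of a dedicated metrics port are our assumptions for illustration.

```javascript
const fs = require('fs');
const http = require('http');

// Count this process's open descriptors via /proc (Linux only).
// Returns null where /proc/self/fd is unavailable. The handle opened to
// list the directory may itself be counted, so the value can be one high.
function countOpenFds() {
  try {
    return fs.readdirSync('/proc/self/fd').length;
  } catch (err) {
    return null;
  }
}

// Render one gauge in the Prometheus text exposition format.
function renderGauge(name, help, value) {
  return `# HELP ${name} ${help}\n# TYPE ${name} gauge\n${name} ${value}\n`;
}

// Serve the gauge at /metrics on the given port (hypothetical helper).
function startMetricsServer(port) {
  return http
    .createServer((req, res) => {
      if (req.url !== '/metrics') {
        res.writeHead(404);
        res.end();
        return;
      }
      const fds = countOpenFds();
      const body =
        fds === null
          ? ''
          : renderGauge('process_open_fds', 'Number of open file descriptors.', fds);
      res.writeHead(200, { 'Content-Type': 'text/plain; version=0.0.4' });
      res.end(body);
    })
    .listen(port);
}
```

Calling startMetricsServer on a port of your choice and pointing a Prometheus scrape job at it would give the alert rule a series to evaluate.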
Start with top, then drill down with ps, strace, and lsof.
[...] ulimit), memory [...]
The issue wasn't a code bug in the traditional sense; it was a missing configuration. Default "unlimited" settings cause more outages than most teams realize.