Step-by-step debugging of a production Linux server hitting 100% CPU. From top to perf to the actual fix.
Last month our API servers started responding in 8 seconds instead of 200ms. Here's exactly how we diagnosed and fixed it using standard Linux tools.
top -bn1 | head -20
Output showed:
A node process was consuming 94% CPU. To identify the exact process:

ps aux --sort=-%cpu | head -5
The culprit was our Node.js API process, PID 28431.
strace -c -p 28431 -e trace=read,write,futex
Results: 89% of syscalls were futex (lock contention) and read from a file descriptor.
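The title mentions perf, and this is where it earns its place: sampling on-CPU stacks confirms whether the time is really going to lock handling rather than JavaScript execution. A minimal sketch of that step (the exact flags are a reconstruction; perf usually needs root, and it degrades gracefully here if it isn't installed):

```shell
# Sample on-CPU call stacks for the suspect PID, then summarize.
# Requires perf (linux-tools) and typically root privileges.
pid=28431
if command -v perf >/dev/null 2>&1; then
  perf record -g -p "$pid" -- sleep 10 \
    || echo "perf record failed (needs root and a live PID)"
  perf report --stdio 2>/dev/null | head -30
else
  echo "perf not installed; install linux-tools for your kernel"
fi
```

In our case the hot frames were futex and syscall paths, matching what strace had already suggested.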
ls -la /proc/28431/fd | wc -l
# Result: 4,847
lsof -p 28431 | grep -c "TCP"
# Result: 4,201 TCP connections
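A raw descriptor count says little about what the descriptors actually are. A quick way to break them down by type, sketched here against the current shell ($$) so it runs anywhere; point it at the suspect PID in production:

```shell
# Classify a process's descriptors (socket, pipe, regular file, ...).
# /proc/PID/fd entries are symlinks; the link target names the type.
ls -l /proc/$$/fd | awk 'NR>1 {print $NF}' | sed 's/:.*//' \
  | sort | uniq -c | sort -rn
```

For our process, sockets dominated the list, which pointed straight at the connection layer.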
We had 4,201 open TCP connections. Our connection pool had no limit, and a downstream service was responding slowly, causing connections to pile up.
ss -tnp | grep 28431 | awk '{print $5}' | sort | uniq -c | sort -rn | head
Result: 3,800 connections to port 5432 (PostgreSQL) in ESTABLISHED state. The database wasn't the bottleneck—our pool was creating connections faster than queries completed.
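The sort | uniq -c | sort -rn idiom in that pipeline is worth internalizing; here it is on canned sample data (addresses are made up) so the aggregation is easy to follow without a live server:

```shell
# Count occurrences of each peer address and rank by frequency,
# exactly as the ss pipeline above does with real connections.
printf '%s\n' \
  10.0.0.5:5432 10.0.0.5:5432 10.0.0.5:5432 10.0.0.9:6379 |
  sort | uniq -c | sort -rn | head
# Top line: "3 10.0.0.5:5432" — the hottest peer floats to the top
```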
const pool = new Pool({
  max: 20,                      // was unlimited
  idleTimeoutMillis: 30000,
  connectionTimeoutMillis: 5000,
});
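After rolling out the capped pool, a quick sanity check on the app host (assuming Postgres on its default port 5432) confirms the connection count stays at or below the pool max:

```shell
# Count established outbound connections to Postgres.
# After the fix this should hover at or below the pool max of 20.
n=$(ss -tn 2>/dev/null | awk '$5 ~ /:5432$/' | wc -l)
echo "$n connections to :5432"
```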
# /etc/prometheus/alerts/fd_alert.yml
groups:
  - name: fd-alerts
    rules:
      - alert: HighFileDescriptors
        expr: process_open_fds > 1000
        for: 5m
        labels:
          severity: warning
Start broad with top, then drill down with ps, strace, and lsof; check file descriptor limits (ulimit) and memory along the way. The issue wasn't a code bug in the traditional sense; it was a missing configuration. Default "unlimited" settings are the cause of more outages than most teams realize.
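The descriptor ceilings mentioned above (ulimit) can be inspected at three levels; shown here against the current shell, but the /proc path works for any PID:

```shell
# File descriptor limits, from narrowest to widest scope.
ulimit -n                               # soft limit for this shell
grep "Max open files" /proc/$$/limits   # limits of a specific process
cat /proc/sys/fs/file-max               # kernel-wide ceiling
```

Had our process hit its ceiling, the symptom would have been EMFILE errors instead of a slow pile-up, which is part of why the problem went unnoticed for so long.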