Step-by-step debugging of a production Linux server hitting 100% CPU. From top to strace to ss to the actual fix.

Linux Performance Troubleshooting: A Real Incident Walkthrough

Last month our API servers started responding in 8 seconds instead of 200ms. Here's exactly how we diagnosed and fixed it using standard Linux tools.

Step 1: top - Get the Big Picture

bash.bash

top -bn1 | head -20

Output showed:

CPU: 98% user, 1% system, 1% idle
One process (node) consuming 94% CPU
Load average: 12.4 on a 4-core machine
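
A load average only means something relative to the core count. As a rough sketch of that arithmetic (the helper name `loadPerCore` is an illustrative assumption, not part of any tool we used), Node's built-in `os` module can run the same check:

```javascript
const os = require('os');

// Runnable (or uninterruptibly waiting) tasks per core.
// Sustained values above ~1.0 mean work is queuing for CPU time.
function loadPerCore(loadAvg, cores) {
  if (cores <= 0) throw new RangeError('cores must be positive');
  return loadAvg / cores;
}

// Figures from the incident: load average 12.4 on 4 cores.
console.log(loadPerCore(12.4, 4).toFixed(2)); // 3.10

// The same check against the live machine (1-minute load average).
// os.cpus() can be empty in unusual sandboxes, so fall back to 1 core.
const cores = os.cpus().length || 1;
const live = loadPerCore(os.loadavg()[0], cores);
console.log(`live load per core: ${live.toFixed(2)}`);
```

At roughly three tasks per core, the machine had about three times more runnable work than it could schedule.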

Step 2: Identify the Process

bash.bash

ps aux --sort=-%cpu | head -5

The culprit was our Node.js API process, PID 28431.

Step 3: strace - What's It Doing?

bash.bash

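# -f follows every thread of the process; no -e filter, so the
# percentages cover all syscalls. Press Ctrl-C to print the summary.
strace -c -f -p 28431

Results: futex calls (a sign of lock contention) and read calls on file descriptors made up 89% of syscalls, which made open descriptors the next thing to check.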

Step 4: Check File Descriptors

bash.bash

# Plain ls avoids counting the "total", "." and ".." lines that ls -la adds
ls /proc/28431/fd | wc -l
# Result: roughly 4,847

lsof -p 28431 | grep -c "TCP"
# Result: 4,201 TCP connections

We had 4,201 open TCP connections. Our connection pool had no limit, and a downstream service was responding slowly, causing connections to pile up.
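
The same breakdown can be read straight from `/proc`, where each entry in `/proc/<pid>/fd` is a symlink whose target names the descriptor's kind. The following is a minimal sketch, assuming a Linux host; the function names `classifyFdTargets` and `readFdTargets` are hypothetical helpers, and the `socket` bucket covers UDP and Unix sockets as well as TCP.

```javascript
const fs = require('fs');
const path = require('path');

// Group /proc/<pid>/fd symlink targets by kind.
function classifyFdTargets(targets) {
  const counts = { socket: 0, pipe: 0, anon_inode: 0, file: 0 };
  for (const target of targets) {
    if (target.startsWith('socket:')) counts.socket += 1;
    else if (target.startsWith('pipe:')) counts.pipe += 1;
    else if (target.startsWith('anon_inode:')) counts.anon_inode += 1;
    else counts.file += 1; // regular files, devices, and anything else
  }
  return counts;
}

// Read the link target of every descriptor a process holds (Linux only).
function readFdTargets(pid = 'self') {
  const dir = `/proc/${pid}/fd`;
  const targets = [];
  for (const name of fs.readdirSync(dir)) {
    try {
      targets.push(fs.readlinkSync(path.join(dir, name)));
    } catch (err) {
      // The descriptor was closed between readdir and readlink; skip it.
    }
  }
  return targets;
}

// Inspect the current process when /proc is available.
if (process.platform === 'linux' && fs.existsSync('/proc/self/fd')) {
  console.log(classifyFdTargets(readFdTargets()));
}
```

Logging this from inside the service on a timer gives an early signal when the socket count starts climbing, without needing shell access to the host.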

Step 5: Confirm with ss

bash.bash

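# $5 is the peer address:port column ($4 is the local side);
# matching "pid=28431," avoids false hits on port numbers
ss -tnp | grep "pid=28431," | awk '{print $5}' | sort | uniq -c | sort -rn | head

Result: 3,800 connections to port 5432 (PostgreSQL) in ESTABLISHED state. The database wasn't the bottleneck. Our pool was opening new connections faster than queries completed.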
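
The pile-up follows from Little's Law: the average number of requests in flight equals the arrival rate multiplied by the time each one takes. Here is a rough sketch of that arithmetic. The 50 requests per second is a hypothetical traffic level chosen for illustration, and the API's 200 ms and 8 s response times stand in for how long each request holds a connection, which is an assumption rather than a measurement.

```javascript
// Little's Law: in-flight = arrival rate * time per request.
// With no pool cap, every in-flight request holds its own connection.
function connectionsInFlight(requestsPerSecond, latencyMs) {
  return (requestsPerSecond * latencyMs) / 1000;
}

// With a cap, the pool can complete at most max / latency requests per second.
function maxThroughput(poolMax, latencyMs) {
  return (poolMax * 1000) / latencyMs;
}

// Hypothetical load of 50 requests per second.
console.log(connectionsInFlight(50, 200));  // 10 connections at 200 ms
console.log(connectionsInFlight(50, 8000)); // 400 connections at 8 s

// A cap of 20 bounds the connection count, but not the waiting.
console.log(maxThroughput(20, 200));  // 100 requests per second
console.log(maxThroughput(20, 8000)); // 2.5 requests per second
```

The second pair of numbers shows why a cap alone is not enough: once latency degrades, callers queue for a connection, so a checkout timeout is needed to fail them quickly instead of letting them wait.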

The Fix

Added connection pool limits in our database client:

javascript.javascript

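// Assumes node-postgres (pg); the option names below match its Pool config
const { Pool } = require('pg');

const pool = new Pool({
  max: 20,                       // cap on open connections (was unlimited)
  idleTimeoutMillis: 30000,      // close clients that sit idle for 30 s
  connectionTimeoutMillis: 5000, // give up if no connection is available within 5 s
});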

Added a circuit breaker for the slow downstream service
Set up alerts for open file descriptor count:

yaml.yaml

# /etc/prometheus/alerts/fd_alert.yml
groups:
  - name: file_descriptors   # group name is illustrative
    rules:
      - alert: HighFileDescriptors
        expr: process_open_fds > 1000
        for: 5m
        labels:
          severity: warning
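
For the circuit breaker, a minimal sketch of the pattern might look like the following. The class name `CircuitBreaker`, the consecutive-failure threshold, and the cooldown are all illustrative assumptions rather than the production implementation, and `fetchFromDownstream` in the usage comment is a hypothetical function.

```javascript
// Minimal circuit breaker:
//   closed    -> open    after N consecutive failures
//   open      -> half-open once the cooldown has elapsed
//   half-open -> closed  on a success, or back to open on a failure
class CircuitBreaker {
  constructor({ failureThreshold = 5, cooldownMs = 10000, now = Date.now } = {}) {
    this.failureThreshold = failureThreshold;
    this.cooldownMs = cooldownMs;
    this.now = now;        // injectable clock, useful in tests
    this.failures = 0;
    this.openedAt = null;  // null means the circuit is closed
  }

  get state() {
    if (this.openedAt === null) return 'closed';
    return this.now() - this.openedAt >= this.cooldownMs ? 'half-open' : 'open';
  }

  recordSuccess() {
    this.failures = 0;
    this.openedAt = null;
  }

  recordFailure() {
    this.failures += 1;
    if (this.state === 'half-open' || this.failures >= this.failureThreshold) {
      this.openedAt = this.now(); // start or restart the cooldown
    }
  }

  // Wrap an async call and fail fast while the circuit is open.
  async call(fn) {
    if (this.state === 'open') throw new Error('circuit open');
    try {
      const result = await fn();
      this.recordSuccess();
      return result;
    } catch (err) {
      this.recordFailure();
      throw err;
    }
  }
}

// Usage (fetchFromDownstream is hypothetical and should reject on timeout):
// const breaker = new CircuitBreaker({ failureThreshold: 3, cooldownMs: 5000 });
// const data = await breaker.call(() => fetchFromDownstream());
```

This sketch lets every caller through while half-open; a stricter version would admit a single trial request. It also only helps with a slow dependency if the wrapped call has its own timeout, since a request that never rejects is never counted as a failure.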

Best Practices for Linux Performance

Start with top, then drill down with ps, strace, lsof
Check file descriptors when CPU is high—connection leaks are a common cause
Set limits on everything: connection pools, file descriptors (ulimit), memory
Monitor proactively: track open FDs, TCP connections, and load average
Reproduce in staging before making production changes when possible
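
To make "set limits on everything" concrete, here is a generic sketch of what a cap such as `max: 20` does: it admits a fixed number of tasks and makes the rest wait their turn. The class name `ConcurrencyLimit` is hypothetical, and this is not how any particular database client implements its pool.

```javascript
// Cap how many tasks run at once; extra callers wait in a FIFO queue.
class ConcurrencyLimit {
  constructor(max) {
    if (!Number.isInteger(max) || max < 1) {
      throw new RangeError('max must be a positive integer');
    }
    this.max = max;
    this.active = 0;    // slots currently held
    this.waiting = [];  // resolve callbacks of queued callers
  }

  // Resolves once a slot is available.
  acquire() {
    if (this.active < this.max) {
      this.active += 1;
      return Promise.resolve();
    }
    return new Promise((resolve) => this.waiting.push(resolve));
  }

  // Hand the slot straight to the next waiter, or free it.
  release() {
    const next = this.waiting.shift();
    if (next) next();
    else if (this.active > 0) this.active -= 1;
  }

  // Run an async task inside a slot, releasing it even if the task throws.
  async run(task) {
    await this.acquire();
    try {
      return await task();
    } finally {
      this.release();
    }
  }
}
```

The waiting queue here is itself unbounded, so a production version would also cap the queue length and time out waiters, which is the role `connectionTimeoutMillis` plays in the pool configuration.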

The issue wasn't a code bug in the traditional sense—it was a missing configuration. Default "unlimited" settings are the cause of more outages than most teams realize.

About Kiril Urbonas