How processes actually live and die on Linux, the tools that show what's happening, and the patterns we use for monitoring service health.

Linux Process Management and Monitoring

Process management is one of those topics where the tools are 30 years old, the abstractions are well-understood, and most engineers still get tripped up by the details. After enough production debugging, this is the working mental model: how processes live and die on Linux, the tools that show what's happening, and how we monitor service health.

The process lifecycle #

A process is created by fork() (which copies the parent) or by clone() with various flags (the more general modern syscall). It runs a new program via execve(), which replaces its memory image with that program. It ends by calling exit() or by being killed by a signal.

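The parent process is responsible for collecting the exit status of its children with wait() or waitpid(). If it never does, the child becomes a zombie: a dead process whose entry remains in the process table because nobody collected its status. One exception: a parent that explicitly sets SIGCHLD to SIG_IGN tells the kernel to discard exit statuses, so its children are reaped automatically instead of becoming zombies.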

If the parent dies before the child, the child is reparented to PID 1 (init / systemd), or to a closer ancestor that has registered itself as a subreaper. PID 1 typically reaps zombies. Without a proper PID 1, zombies accumulate.

This matters for containers. A container's PID 1 is typically your application; if your application doesn't reap children, you'll accumulate zombies. Use a proper init like tini for container PID 1.
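The zombie state is easy to reproduce. Here is a minimal Python sketch, assuming a Linux host with /proc mounted; the names state_of and observe_zombie are made up for illustration. It forks a child that exits immediately, reads the child's state from /proc/<pid>/status before reaping it, then collects the exit status with waitpid().

```python
import os
import time


def state_of(pid: int) -> str:
    """Return the one-letter state code from /proc/<pid>/status."""
    with open(f"/proc/{pid}/status") as f:
        for line in f:
            if line.startswith("State:"):
                # The line looks like "State:\tZ (zombie)".
                return line.split()[1]
    raise RuntimeError(f"no State line for pid {pid}")


def observe_zombie(exit_code: int = 7, timeout: float = 2.0):
    """Fork a child, watch it become a zombie, then reap it."""
    pid = os.fork()
    if pid == 0:
        # Child: exit right away without running Python cleanup handlers.
        os._exit(exit_code)

    # Parent: the dead child stays in the process table until we wait for it.
    deadline = time.monotonic() + timeout
    state = state_of(pid)
    while state != "Z" and time.monotonic() < deadline:
        time.sleep(0.01)
        state = state_of(pid)

    # Reaping: waitpid() collects the exit status and frees the table entry.
    _, status = os.waitpid(pid, 0)
    return state, os.WEXITSTATUS(status)
```

A parent that skips the waitpid() call leaves the entry in place for as long as the parent itself lives, which is the accumulation a proper PID 1 prevents.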

Process state codes #

When you look at ps, processes have state codes:

R: running or runnable (in CPU queue)
S: sleeping (waiting for an event, interruptible)
D: uninterruptible sleep (usually IO)
T: stopped (Ctrl-Z or signal)
Z: zombie

The interesting ones for debugging:

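D state (uninterruptible sleep) usually means the process is blocked in a kernel call that can't be interrupted, typically IO. It is common when disks are slow or NFS hangs. Lots of D-state processes usually means IO is the bottleneck. They also count toward the Linux load average, so load can be high while the CPUs sit idle.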

Z state = zombie. Lots of zombies suggests a parent that's not reaping. Find the parent and figure out why.

ps -eo pid,ppid,state,comm shows state plus parent PID, which helps with both.
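The same information can be read straight from /proc when you want it in a script. The sketch below groups zombies by parent PID. It assumes Linux's /proc/<pid>/stat layout; parse_stat and zombies_by_parent are hypothetical helper names, and the proc_root parameter exists only to make the function easy to exercise.

```python
import os
from collections import defaultdict


def parse_stat(text: str):
    """Parse the start of /proc/<pid>/stat into (pid, comm, state, ppid)."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split on the last ')' rather than on whitespace.
    head, tail = text.rsplit(")", 1)
    pid_str, comm = head.split("(", 1)
    fields = tail.split()
    return int(pid_str), comm, fields[0], int(fields[1])


def zombies_by_parent(proc_root: str = "/proc"):
    """Map parent PID -> list of (pid, comm) for every zombie found."""
    result = defaultdict(list)
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a per-process directory
        try:
            with open(os.path.join(proc_root, entry, "stat")) as f:
                pid, comm, state, ppid = parse_stat(f.read())
        except OSError:
            continue  # process exited mid-scan, or is not readable
        if state == "Z":
            result[ppid].append((pid, comm))
    return dict(result)
```

The parent PIDs that come back with long lists are the processes to investigate.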

The tools you actually use #

ps: still the workhorse. ps aux for all processes; ps -ef for full format; ps -ef --forest for the hierarchy (plain ps --forest only covers the processes on your own terminal).

top / htop: live process view. htop is friendlier; top is everywhere.

pidof <name>: returns PID(s) for a process by name.

pgrep <pattern>: like pidof but more flexible. The pattern is a regex matched against the process name; pgrep -f matches against the full command line instead.

pstree: process hierarchy as a tree. pstree -p includes PIDs.

lsof -p <pid>: open files for a process. Includes network sockets.

strace -p <pid>: trace syscalls. Heavy overhead; use sparingly.

/proc/<pid>/: kernel-exposed info per process. /proc/<pid>/status has summary info; /proc/<pid>/cmdline has the full command line, with arguments separated by NUL bytes; /proc/<pid>/limits shows resource limits.

For most production debugging, htop + lsof + /proc is the toolkit. strace and gdb come out for harder problems.
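Reading /proc directly takes very little code. The following sketch assumes Linux; split_cmdline, read_cmdline and read_status are illustrative names, and the default set of status keys is just a selection we find useful.

```python
def split_cmdline(raw: bytes):
    """Split the NUL-separated bytes of /proc/<pid>/cmdline into arguments."""
    if not raw:
        return []  # kernel threads and zombies have an empty cmdline
    if raw.endswith(b"\0"):
        raw = raw[:-1]  # drop the terminator after the last argument
    return [arg.decode(errors="replace") for arg in raw.split(b"\0")]


def read_cmdline(pid: int):
    """Return the argument list of a running process."""
    with open(f"/proc/{pid}/cmdline", "rb") as f:
        return split_cmdline(f.read())


def read_status(pid: int, keys=("Name", "State", "PPid", "Threads", "VmRSS")):
    """Return selected "Key: value" fields from /proc/<pid>/status."""
    wanted = {}
    with open(f"/proc/{pid}/status") as f:
        for line in f:
            key, _, value = line.partition(":")
            if key in keys:
                wanted[key] = value.strip()
    return wanted
```

Splitting on NUL rather than on spaces matters: an argument such as "daemon off;" contains a space and would otherwise be broken in two.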

Signals: how processes die (or don't) #

Common signals:

SIGTERM (15): please terminate. Default for kill <pid>. Process can catch and clean up.
SIGINT (2): interrupt. Sent by Ctrl-C.
SIGKILL (9): terminate immediately. Cannot be caught or ignored. Last resort.
SIGHUP (1): traditionally "hangup" but often used to mean "reload config."
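SIGSTOP (19): stop process. Cannot be caught or ignored. Ctrl-Z sends the catchable SIGTSTP (20) instead. SIGCONT (18) resumes. (These three numbers are for x86/ARM; use the names in scripts.)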

The lifecycle of a clean shutdown:

Process receives SIGTERM
Process stops accepting new work
Process finishes in-flight work
Process exits

If a process doesn't exit within a timeout, you escalate to SIGKILL.
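From the service's side, the four steps above can be sketched as follows. This is a toy worker under stated assumptions, not how any particular framework does it: GracefulWorker and its methods are hypothetical names, and jobs are plain callables pulled from an in-memory queue.

```python
import queue
import signal


class GracefulWorker:
    """Toy worker that drains queued jobs after a stop request."""

    def __init__(self):
        self.jobs = queue.Queue()
        self.accepting = True
        self.results = []

    def install_handler(self):
        # Must run in the main thread: Python only allows signal
        # handlers to be registered there.
        signal.signal(signal.SIGTERM, self.handle_sigterm)

    def handle_sigterm(self, signum, frame):
        # Steps 1 and 2: SIGTERM arrived, stop accepting new work.
        # The handler only flips a flag; the main loop does the rest.
        self.accepting = False

    def submit(self, job) -> bool:
        """Queue a job; returns False once shutdown has started."""
        if not self.accepting:
            return False
        self.jobs.put(job)
        return True

    def run(self, idle_wait: float = 0.1) -> int:
        # Keep going until a stop was requested AND the queue is empty.
        while self.accepting or not self.jobs.empty():
            try:
                job = self.jobs.get(timeout=idle_wait)
            except queue.Empty:
                continue
            self.results.append(job())  # Step 3: finish in-flight work
        return 0  # Step 4: the caller exits with this status
```

A real entry point would call install_handler() and then sys.exit(worker.run()). Keeping the handler down to a flag flip avoids doing real work inside signal context.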

For our services, the systemd config has:

code

[Service]
KillSignal=SIGTERM
TimeoutStopSec=30s

systemd sends SIGTERM, waits 30 seconds, then SIGKILL if still running. Long-running tasks get more time:

code

# 5 minutes (comments need their own line; systemd has no trailing comments)
TimeoutStopSec=300s

Don't use SIGKILL routinely. It gives the process no chance to clean up: in-flight requests are cut off, buffered writes are lost, and lock or temp files are left behind.
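When you manage a child process yourself rather than through systemd, the same SIGTERM-then-SIGKILL escalation is a few lines. This is a sketch of the pattern, not of systemd's implementation; stop_gracefully is a hypothetical helper built on the standard subprocess module.

```python
import subprocess


def stop_gracefully(proc: subprocess.Popen, timeout: float) -> int:
    """Send SIGTERM, wait up to `timeout` seconds, then escalate to SIGKILL."""
    proc.terminate()  # SIGTERM: the process may catch it and clean up
    try:
        proc.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        proc.kill()  # SIGKILL: cannot be caught or ignored
        proc.wait()  # reap the child so it does not linger as a zombie
    # On POSIX, a negative return code means "killed by that signal number".
    return proc.returncode
```

The return value tells you which path was taken: the negated SIGTERM number if the process died on the first signal without handling it, the negated SIGKILL number if escalation was needed, or the process's own exit code if it caught SIGTERM and exited by itself.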

systemd: the modern process manager #

systemd is the init system on most modern Linux. It manages services as "units":

code

# /etc/systemd/system/myservice.service
[Unit]
Description=My Service
Wants=network-online.target
After=network-online.target

[Service]
Type=exec
User=myservice
ExecStart=/usr/local/bin/myservice
Restart=on-failure
RestartSec=5

[Install]
WantedBy=multi-user.target

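Day-to-day control goes through systemctl: systemctl start myservice, plus stop, restart and status in the same form. systemctl enable myservice makes it start at boot. After editing a unit file, run systemctl daemon-reload so systemd picks up the change.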

Why systemd vs running things from rc.local or whatever:

Auto-restart on failure (with limits)
Proper logging via journald
Resource limits via cgroups
Dependency management (start after network)
Standardized control interface

For our production hosts, every service is a systemd unit. We mostly avoid cron as well: new services use systemd timers for scheduled jobs.

Monitoring service health #

Three layers:

Layer 1: process is running. "Is the systemd unit active?" Easy to check; insufficient.

Layer 2: process is healthy. Service-specific health endpoint (/healthz returning 200). Active health, not just "process exists."

Layer 3: service is functioning correctly. Are real requests succeeding? What's the error rate?

Most monitoring tools cover Layer 1 (process exists). The interesting signals are in Layers 2 and 3.

Our standard:

systemd auto-restart for Layer 1 issues (process crashes)
Health check endpoint for Layer 2 (called by the load balancer; unhealthy instances get pulled from rotation)
Application metrics (Prometheus) for Layer 3 (error rate, latency, etc.)

A service that passes Layer 1 but fails Layer 2 is the silent-failure case worth monitoring.
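To make the three layers concrete, here is a rough sketch of one probe per layer. The function names, the 1% error-rate threshold and the 2-second timeout are assumptions for illustration; in our setup the real Layer 3 numbers come from Prometheus rather than from a function argument.

```python
import subprocess
import urllib.error
import urllib.request

# Ignore any HTTP_PROXY setting: a health probe should hit the service directly.
_opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))


def unit_active(unit: str) -> bool:
    """Layer 1: does systemd consider the unit active?"""
    try:
        result = subprocess.run(["systemctl", "is-active", "--quiet", unit])
    except FileNotFoundError:
        return False  # systemctl is not installed on this host
    return result.returncode == 0


def endpoint_healthy(url: str, timeout: float = 2.0) -> bool:
    """Layer 2: does the health endpoint answer 200 within the timeout?"""
    try:
        with _opener.open(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Covers 4xx/5xx responses, refused connections and timeouts.
        return False


def error_rate_ok(total: int, errors: int, max_rate: float = 0.01) -> bool:
    """Layer 3: is the observed error rate at or under the threshold?"""
    if total == 0:
        return True  # no traffic; treated as OK here, a real alert may differ
    return errors / total <= max_rate
```

The silent-failure case shows up as unit_active() returning True while endpoint_healthy() returns False.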

Resource limits #

Limits prevent one service from starving others:

code

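# In systemd unit. Comments go on their own lines: systemd does not
# support trailing comments after a value.
[Service]
# File descriptors
LimitNOFILE=65536
# Processes/threads, counted per user (RLIMIT_NPROC)
LimitNPROC=4096
# Hard memory cap for the unit's cgroup; the OOM killer steps in here
MemoryMax=2G
# 2 CPUs worth max
CPUQuota=200%
# Total tasks (processes + threads) in the unit's cgroup
TasksMax=512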

Each limit corresponds to a real production failure mode:

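LimitNOFILE: a service handling many connections will hit the default soft limit of 1024, after which accept() and open() fail with EMFILE and new connections are dropped.
LimitNPROC: counted per user rather than per service, so it only helps when the service runs under its own User=; TasksMax is the per-unit guard.
MemoryMax: a runaway service gets OOM-killed inside its own cgroup instead of taking down the whole node.
CPUQuota: useful on multi-tenant nodes.
TasksMax: a thread leak gets caught at the limit.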

Defaults are often too low for production services. Set explicitly.
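One way to confirm a limit actually reached the process is to have the service check it at startup. A minimal sketch using Python's standard resource module follows; has_fd_headroom and the reserve of 64 descriptors (for logs, sockets to dependencies and so on) are assumptions, not a recommendation.

```python
import resource


def nofile_limits():
    """Return (soft, hard) RLIMIT_NOFILE for the current process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def has_fd_headroom(expected_connections: int, reserve: int = 64) -> bool:
    """True if the soft fd limit covers the expected connections plus a reserve."""
    soft, _hard = nofile_limits()
    if soft == resource.RLIM_INFINITY:
        return True
    return soft >= expected_connections + reserve
```

A service expecting 10,000 concurrent connections could log a loud warning when has_fd_headroom(10000) is false, which usually points at a missing LimitNOFILE in the unit.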

Logging: journald + structured #

Standard pattern:

Service writes to stdout/stderr
systemd captures to journald
journalctl -u myservice reads them back

For application logs, structured JSON is the norm:

json.json

{"timestamp":"2024-04-25T18:32:00Z","level":"INFO","msg":"started","pid":12345}

Structured logs are searchable, indexable, and parseable by log aggregators. Plain-text logs are fine for one-off scripts; for production services, structured.
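Producing that shape from Python's standard logging module takes one small formatter. This is a sketch: JsonFormatter and configure_logging are names chosen here, and the field set simply mirrors the example line above.

```python
import json
import logging
import sys
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object on one line."""

    def format(self, record: logging.LogRecord) -> str:
        ts = datetime.fromtimestamp(record.created, tz=timezone.utc)
        return json.dumps({
            "timestamp": ts.strftime("%Y-%m-%dT%H:%M:%SZ"),
            "level": record.levelname,
            "msg": record.getMessage(),
            "pid": record.process,
        })


def configure_logging(level: int = logging.INFO) -> None:
    """Send JSON logs to stdout, where systemd captures them for journald."""
    handler = logging.StreamHandler(sys.stdout)
    handler.setFormatter(JsonFormatter())
    root = logging.getLogger()
    root.handlers[:] = [handler]
    root.setLevel(level)
```

After configure_logging(), a call such as logging.info("started") writes one JSON line to stdout.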

journalctl features useful for debugging:

sh.sh

journalctl -u myservice                # All logs
journalctl -u myservice -f              # Follow (like tail -f)
journalctl -u myservice --since "1 hour ago"
journalctl -u myservice -p err          # err and more severe (crit, alert, emerg)
journalctl -u myservice -o json         # JSON output

For production, logs ship to a central aggregator (Elasticsearch, Datadog, etc.) via Fluent Bit or similar. journalctl is for ad-hoc debugging on the host.
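The JSON output mode emits one object per line, which makes ad-hoc filtering scriptable. The sketch below pulls error-level messages out of such a stream. error_messages is a hypothetical helper; PRIORITY, MESSAGE, _PID and _SYSTEMD_UNIT are standard journal fields, and the fallback priority of 6 for entries without one is an assumption.

```python
import json


def error_messages(lines, max_priority: int = 3):
    """Yield (unit, pid, message) for entries at err severity or worse."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        entry = json.loads(line)
        # Journal field values are strings; a lower PRIORITY is more severe.
        priority = int(entry.get("PRIORITY", 6))
        message = entry.get("MESSAGE")
        if priority > max_priority or not isinstance(message, str):
            # Non-string MESSAGE values (binary payloads) are skipped here.
            continue
        yield entry.get("_SYSTEMD_UNIT"), entry.get("_PID"), message
```

Fed from a pipe, for example journalctl -u myservice -o json | python3 errors.py with the script iterating over sys.stdin, this gives a quick error summary without leaving the host.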

Runtime introspection: when something is wrong #

When a process is misbehaving:

It's using too much CPU: top shows which. perf profiles the actual hotspots.

It's using too much memory: ps -eo pid,rss,comm shows resident set size. pmap <pid> shows the memory map. For Go/Java/Python, runtime profilers tell you where the memory is going.

It's hung: strace -p <pid> shows what syscall it's stuck in. Often a network call to a slow upstream.

It's slow: perf top shows what's eating CPU. For specific functions, perf record then perf report.

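It's leaking file descriptors: track lsof -p <pid> | wc -l over time, or ls /proc/<pid>/fd | wc -l for an exact descriptor count (lsof output also includes a header line and non-descriptor entries such as cwd and mapped files). If the number keeps growing, find what's not being closed.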
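Sampling the descriptor count is easy to automate. This is a small sketch assuming Linux and permission to read the target's /proc/<pid>/fd (same user or root); fd_count and looks_like_leak are illustrative names, and "every sample higher than the last" is a deliberately crude trend test.

```python
import os


def fd_count(pid: int) -> int:
    """Count open file descriptors by listing /proc/<pid>/fd."""
    return len(os.listdir(f"/proc/{pid}/fd"))


def looks_like_leak(samples, min_growth: int = 1) -> bool:
    """True if each sample exceeds the previous one by at least min_growth."""
    if len(samples) < 3:
        return False  # too few points to call it a trend
    return all(b - a >= min_growth for a, b in zip(samples, samples[1:]))
```

Collect fd_count(pid) once a minute into a list and pass it to looks_like_leak(); a healthy service fluctuates, a leaking one only climbs.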

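For containerized apps, you usually do this from the host (the container often lacks the tools). nsenter lets you enter the container's namespaces from the host: for example, nsenter -t <pid> -n <command> runs a host tool inside the network namespace of the container process <pid>.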

What we monitor at the system level #

Beyond per-service monitoring, host-level metrics:

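Load average: 1m, 5m, 15m. Sustained load above the number of CPUs = oversubscribed (on Linux, load also counts tasks in D state, so check IO before blaming CPU).
Memory usage: available (free + reclaimable cache) vs used. OOM-killer logs as alerts.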
Disk usage: per filesystem. Free space < 15% = warn; < 5% = critical.
Disk IO: latency, queue depth. High p99 IO latency hurts apps.
Network: per-interface throughput, dropped packets, retransmits.
Conntrack table fullness: nearing capacity = connections dropping.

We use node_exporter for these (with Prometheus). Standard setup, well-understood metrics.
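For a quick check on a host without the exporter, two of these rules fit in a few lines of standard-library Python. The sketch below encodes the load-per-CPU comparison and the 15% / 5% disk thresholds from the list; the function names are made up here.

```python
import os
import shutil


def load_per_cpu():
    """Return the 1m, 5m and 15m load averages divided by the CPU count."""
    cpus = os.cpu_count() or 1
    return tuple(avg / cpus for avg in os.getloadavg())


def disk_severity(free_fraction: float) -> str:
    """Map a filesystem's free-space fraction to ok / warn / critical."""
    if free_fraction < 0.05:
        return "critical"
    if free_fraction < 0.15:
        return "warn"
    return "ok"


def check_filesystem(path: str = "/") -> str:
    """Classify the filesystem that contains `path`."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        return "ok"  # pseudo-filesystems report no capacity
    return disk_severity(usage.free / usage.total)
```

A value above 1.0 from load_per_cpu() corresponds to the "load > number of CPUs" rule.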

What I'd tell someone learning #

Read the man pages for ps, top, and lsof. They have features people don't know about.

Use systemd for services on modern Linux. rc.local and supervisord are anachronisms.

Set resource limits explicitly. Defaults are wrong for most production services.

SIGTERM, then escalate to SIGKILL. Never SIGKILL first.

Health checks at multiple layers. Process exists, process is healthy, service is functioning.

/proc/<pid>/ is your friend. When tools don't show what you need, the kernel exposes it directly.

Structured logs. When you need to search them, you'll be glad they're not plain text.

Process management on Linux is one of those areas where the fundamentals are stable. The tools have been the same for decades. The patterns (systemd, structured logs, resource limits, layered health checks) are well-known. Most production debugging comes back to these basics — being good at them is worth the practice.

About Kiril Urbonas