When everything seems "slow," a baseline gives you something to measure against. The capture-and-compare workflow we use on every Linux host.
When you hit a performance issue on a Linux host, the most useful thing you can have is a recorded "this is what normal looks like." Without it, every metric is suspect and every diagnosis starts from "is this number high?" We adopted a small, consistent baseline-capture workflow about 18 months ago. It's the first thing we run on any newly provisioned host, and the first thing we compare against during an incident.
A baseline is a snapshot of how a host behaves under representative load. It is not a benchmark (synthetic workloads can lie) and not a stress test (peak load tells you a different story). It answers one question: "here's what this machine's CPU, memory, I/O, and network look like during a normal hour."
The point isn't to memorise numbers. The point is to have something to subtract from the current state when something feels off.
Six datasets, captured over a one-hour window:
mpstat 60 60 (one minute samples for 60 minutes)
free -m snapshots every 60s, plus /proc/meminfo parsing
iostat -x 60 60
sar -n DEV 60 60
pidstat 60 60 (top 20 by CPU and memory)
uptime start/end, kernel version, sysctl dump
A small script wraps these into a tarball with a timestamp:
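#!/bin/bash
# Capture a host baseline: SAMPLES one-minute samples (default 60) into a tarball.
set -euo pipefail
HOST=$(hostname)
DATE=$(date +%Y-%m-%dT%H%M%S)
SAMPLES="${SAMPLES:-60}"   # e.g. SAMPLES=5 for a short capture during an incident
OUT_DIR=/var/baselines
DIR=$(mktemp -d)
# Single quotes so $DIR is expanded (and quoted) when the trap fires
trap 'rm -rf "$DIR"' EXIT
mkdir -p "$OUT_DIR"
uptime > "$DIR/uptime-start.txt"
# Samplers run in parallel, one 60-second sample per iteration
mpstat 60 "$SAMPLES" > "$DIR/mpstat.log" &
iostat -x 60 "$SAMPLES" > "$DIR/iostat.log" &
sar -n DEV 60 "$SAMPLES" > "$DIR/sar-net.log" &
pidstat 60 "$SAMPLES" > "$DIR/pidstat.log" &
for i in $(seq 1 "$SAMPLES"); do
    cat /proc/meminfo > "$DIR/meminfo-$i.log"
    free -m > "$DIR/free-$i.log"
    sleep 60
done &
# Static system configuration
uname -a > "$DIR/uname.txt"
# sysctl -a can exit non-zero on unreadable keys; don't let set -e abort the capture
sysctl -a > "$DIR/sysctl.txt" 2>/dev/null || true
cat /proc/cpuinfo > "$DIR/cpuinfo.txt"
wait
uptime > "$DIR/uptime-end.txt"
tar czf "$OUT_DIR/${HOST}-${DATE}.tgz" -C "$DIR" .
echo "Baseline saved: $OUT_DIR/${HOST}-${DATE}.tgz"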
It runs as a cron job once when a host is provisioned, and again any time the host's role changes (e.g., it gets a new workload, a kernel upgrade, a config change). The tarballs are 2-3 MB each.
Three triggers, in order of importance:
We don't capture continuously. The point isn't telemetry — it's a reference point. Live telemetry (Prometheus) is separate; baseline is for when telemetry isn't enough.
When something looks wrong, we run the same capture script for 5 minutes (SAMPLES=5 env override on the script). A companion compare script then pulls a key set of metrics from the most recent baseline and the current capture:
COMPARING: 2026-04-25T14:32:11 vs baseline 2026-03-12T09:00:00
CPU usage:         baseline avg %usr=22      current avg %usr=68      ΔΔ +46
                   baseline avg %iowait=2    current avg %iowait=18   ΔΔ +16
Memory:            baseline avail=14GB       current avail=2GB        DROP -12GB
                   baseline cache=8GB        current cache=1GB        DROP -7GB
Disk I/O (sda):    baseline await=4ms        current await=92ms       ΔΔ +88ms
                   baseline %util=12         current %util=99         ΔΔ +87
Network (eth0):    baseline rxbps=80M        current rxbps=85M        stable
                   baseline txbps=20M        current txbps=22M        stable
Top procs by CPU:  baseline: nginx (8%), java (4%)
                   current:  java (62%), gc-thread (18%)
The comparison view tells you immediately: this incident is CPU + I/O bound, the Java process is the source, network is unrelated. Without the baseline, you'd be looking at "Java is using 62% CPU" and asking "is that high?" — with the baseline, you know it's 58 percentage points above normal.
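The compare script itself isn't reproduced here, but the core of the CPU comparison is small. A minimal sketch, assuming sysstat-style logs such as the mpstat.log written by the capture script (default output, one "all" row per sample): average one named column, then print the signed difference. The function names avg_col and delta are hypothetical, chosen for illustration.

```shell
# Sketch only: avg_col and delta are illustrative names, not the real compare script.

# avg_col COLUMN FILE
# Average a named column (e.g. %usr, %iowait) over all sample rows of a
# sysstat-style log. The column is located from the header row, so the
# 12-hour (extra AM/PM field) and 24-hour timestamp layouts both work.
avg_col() {
    awk -v col="$1" '
        /^Average/ { next }                  # skip the summary row
        {
            is_header = 0
            for (i = 1; i <= NF; i++)
                if ($i == col) { idx = i; is_header = 1 }
            if (is_header) next              # header row carries no data
        }
        idx && NF >= idx && $idx ~ /^[0-9.]+$/ { sum += $idx; n++ }
        END { if (n) printf "%.1f\n", sum / n; else exit 1 }
    ' "$2"
}

# delta BASELINE CURRENT
# Print the signed difference CURRENT - BASELINE with one decimal place.
delta() {
    awk -v b="$1" -v c="$2" 'BEGIN { printf "%+.1f\n", c - b }'
}
```

With a baseline and a current capture unpacked side by side, something like `delta "$(avg_col %usr baseline/mpstat.log)" "$(avg_col %usr current/mpstat.log)"` would produce the kind of delta shown in the comparison output. Multi-device logs such as iostat -x would need an extra filter on the device name.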
Three categories of issue:
Slow drift. A host's behaviour changes gradually over weeks. Metrics dashboards only show recent windows; a week-over-week shift is hard to see in real-time. A fresh capture compared to the baseline from 6 weeks ago surfaces drift instantly.
Workload-specific norms. "5% iowait" is fine on most hosts and alarming on a host that's normally at 0.5%. The baseline encodes what "normal" means for THIS host's role, not a generic threshold.
Post-change regressions. After a kernel upgrade, the new baseline can be compared against the pre-upgrade one. If memory pressure went up 20% with no workload change, that's a kernel-level cost we should know about.
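For the post-change case, the sysctl dump in each tarball makes kernel-parameter changes easy to spot. A minimal sketch, assuming both tarballs were produced by the capture script and therefore contain ./sysctl.txt; diff_sysctl is a hypothetical helper name.

```shell
# Sketch only: diff_sysctl is an illustrative name.

# diff_sysctl OLD_TGZ NEW_TGZ
# Print kernel parameters that differ between two baseline tarballs.
# Assumes each tarball holds ./sysctl.txt, as written by `tar -C "$DIR" .`.
diff_sysctl() {
    local old new status=0
    old=$(mktemp) && new=$(mktemp) || return 2
    # Sort so that ordering differences between kernels don't show up as changes
    tar -xzOf "$1" ./sysctl.txt | sort > "$old"
    tar -xzOf "$2" ./sysctl.txt | sort > "$new"
    # diff exits 1 when the files differ; only a status above 1 is an error
    diff "$old" "$new" || status=$?
    rm -f "$old" "$new"
    [ "$status" -le 1 ]
}
```

Lines prefixed with `<` come from the older baseline and lines prefixed with `>` from the newer one.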
Some patterns we've seen repeatedly when comparing baselines to current state:
%iowait jump with r/s jump and no w/s change. Indicates increased read load: usually a process that started reading large files, or a warm cache that went cold (e.g., after a service restart). Look at pidstat -d for the responsible process.
avail memory drop with no process growth. Page cache being squeezed, often by a separate process that's grown its anonymous (heap) memory. The remaining cache is still reclaimable, but performance suffers because reads that used to hit the cache now go to disk.
%steal non-zero. You're on a virtualised host and the hypervisor is taking CPU from you. Not your fault, but knowing it stops you from chasing application bugs that aren't there.
Network bandwidth same but pps doubled. Smaller packets — either smaller HTTP responses, more handshake traffic, or a connection storm. Worth investigating.
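The last pattern is just arithmetic: average packet size is bytes per second divided by packets per second. A small sketch, assuming throughput in bits per second as in the comparison output; the packet rates in the usage note are made-up numbers for illustration, and avg_packet_bytes is a hypothetical helper name.

```shell
# Sketch only: avg_packet_bytes is an illustrative name.

# avg_packet_bytes BITS_PER_SEC PACKETS_PER_SEC
# Average packet size in whole bytes (truncated); prints 0 when there is no traffic.
avg_packet_bytes() {
    awk -v bps="$1" -v pps="$2" 'BEGIN {
        if (pps <= 0) { print 0; exit }
        printf "%d\n", bps / 8 / pps
    }'
}
```

For example, `avg_packet_bytes 80000000 10000` prints 1000, while `avg_packet_bytes 85000000 20000` prints 531: roughly the same bandwidth, about half the packet size.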
perf record traces. Useful when you have a known hot path; not useful as part of a routine baseline.
strace of every process. Massive volume, mostly noise.
The baseline is meant to be lightweight enough that it's run automatically and the tarballs are small enough to be kept indefinitely. Anything more invasive should be on-demand during active investigation.
Two things, both fixed:
The first baseline script didn't include cpuinfo or sysctl. When we hit a performance issue on a host that turned out to have transparent huge pages disabled (someone had set transparent_hugepage=never weeks earlier), the baseline didn't tell us that. Adding the system-config dump caught the next case before it bit us.
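One caveat worth flagging: the transparent huge pages mode is exposed through sysfs (/sys/kernel/mm/transparent_hugepage/enabled) rather than as a sysctl key, and a boot-time transparent_hugepage=never shows up in /proc/cmdline. A minimal sketch of capturing those alongside the sysctl dump, assuming a mainline-style sysfs layout; capture_kernel_knobs is a hypothetical name, and the optional second argument exists only so the function can be exercised against a fake directory tree.

```shell
# Sketch only: capture_kernel_knobs is an illustrative name.

# capture_kernel_knobs OUT_DIR [ROOT]
# Copy kernel settings that `sysctl -a` does not cover into OUT_DIR.
# ROOT is an optional path prefix (empty on a real host).
capture_kernel_knobs() {
    local out_dir="$1" root="${2:-}" f name
    for f in /sys/kernel/mm/transparent_hugepage/enabled \
             /sys/kernel/mm/transparent_hugepage/defrag \
             /proc/cmdline; do
        # /proc/cmdline -> proc_cmdline
        name=$(printf '%s' "${f#/}" | tr '/' '_')
        # Skip silently when a file is absent (e.g. THP not compiled in)
        if [ -r "${root}${f}" ]; then
            cat "${root}${f}" > "${out_dir}/${name}.txt"
        fi
    done
}
```

In the capture script this would be a single call such as `capture_kernel_knobs "$DIR"` next to the sysctl dump.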
The second: we didn't initially correlate the baseline timestamp with deploys. When an incident hit, the question "what changed since the baseline" required cross-referencing manually. We now annotate the baseline with the git commit deployed at capture time. Comparison shows the diff between the baseline's version and the current version automatically.
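The annotation step can be as small as one extra file in the tarball. A rough sketch, assuming the deployed code lives in a git checkout on the host; the function name annotate_deploy, the deploy.txt filename, and the idea of passing the checkout path as an argument are all assumptions made for illustration.

```shell
# Sketch only: annotate_deploy and deploy.txt are illustrative names.

# annotate_deploy OUT_DIR REPO_DIR
# Record the commit checked out in REPO_DIR next to the other baseline files.
annotate_deploy() {
    local out_dir="$1" repo_dir="$2" commit
    # Fall back to "unknown" if git is missing or REPO_DIR is not a checkout
    commit=$(git -C "$repo_dir" rev-parse HEAD 2>/dev/null) || commit=unknown
    printf 'commit=%s\ncaptured_at=%s\n' \
        "$commit" "$(date -u +%Y-%m-%dT%H:%M:%SZ)" > "$out_dir/deploy.txt"
}
```

At compare time, the two recorded commits can be fed to `git log --oneline OLD..NEW` to list what was deployed in between.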
For containerised workloads where the host is shared, baseline at the host level is misleading — your container's behaviour is mixed with neighbours'. We use cgroup-scoped versions of the same script for our K8s nodes (capturing per-cgroup metrics from /sys/fs/cgroup/...). Same idea, different filesystem path.
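A minimal sketch of what a cgroup-scoped snapshot could look like, assuming cgroup v2 (the unified hierarchy), where each cgroup directory exposes files such as cpu.stat, memory.current, memory.stat, and io.stat. The function name capture_cgroup is hypothetical, and the cgroup directory is passed in as an argument because the exact path depends on the container runtime.

```shell
# Sketch only: capture_cgroup is an illustrative name; assumes cgroup v2.

# capture_cgroup CGROUP_DIR OUT_DIR
# Snapshot the per-cgroup CPU, memory, and I/O counters into OUT_DIR.
capture_cgroup() {
    local cg_dir="$1" out_dir="$2" f
    for f in cpu.stat memory.current memory.stat io.stat; do
        # Controllers that are not enabled for this cgroup have no file
        if [ -r "$cg_dir/$f" ]; then
            cat "$cg_dir/$f" > "$out_dir/$f"
        fi
    done
}
```

The counters in cpu.stat and io.stat are cumulative, so a rate needs two snapshots taken a known interval apart, much like the once-a-minute meminfo loop in the host-level script.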
For ephemeral workloads (lambdas, serverless), the baseline concept doesn't apply — there's no persistent host to baseline. We use vendor-provided telemetry instead.
The script runs in the background; it costs ~1% CPU during the capture hour. The tarball storage cost is negligible (~3 MB × 50 hosts × 4 baselines/year = ~600 MB across the fleet annually).
The compare script is the one we run during incidents. It takes about 90 seconds to run end to end. The cognitive value during a 3 AM incident is enormous — having "this is what normal looked like" available within two minutes changes the whole texture of the investigation.
Baselines are unsexy. Nobody publishes blog posts about "I have a tarball of mpstat from six weeks ago." But the first time you're at 3 AM staring at a server that "feels slow" and you can run a 90-second compare to see exactly what's drifted, they stop looking unsexy and start looking essential.