A practical Linux performance tuning playbook for production servers. The kernel parameters, disk and network tweaks that earn their place, and the ones that turned out to be folklore.
Linux is well-tuned out of the box for general workloads. For specific high-throughput production workloads, defaults leave performance on the table. This is the playbook we apply to production hosts (database servers, ingress nodes, network appliances), with the actual production reasons each tweak earns its place. Anything not on this list, we leave at default.
Tuning without measuring is voodoo. Our process for any new tweak:
sysctl config; note the production reason in a comment.
If a change doesn't measurably help, we revert it. "We thought it would help" is not a reason to keep a tweak.
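The keep-or-revert decision is easier to defend when it compares the median of several benchmark runs rather than a single run. A minimal sketch, assuming a "higher is better" metric such as TPS; the helper names `median` and `verdict` and the minimum-gain threshold are illustrative assumptions, not part of our tooling:

```shell
#!/bin/sh
# Median of whitespace-separated numbers on stdin (one result per benchmark run).
median() {
    tr -s '[:space:]' '\n' | sed '/^$/d' | sort -n | awk '
        { v[NR] = $1 }
        END {
            if (NR == 0) exit 1
            if (NR % 2) print v[(NR + 1) / 2]
            else print (v[NR / 2] + v[NR / 2 + 1]) / 2
        }'
}

# Prints "keep" when the after-median beats the before-median by at least
# min_gain_pct percent, otherwise "revert". The threshold is an assumption:
# pick one larger than the run-to-run noise you observe.
verdict() {
    awk -v b="$1" -v a="$2" -v min="$3" 'BEGIN {
        if ((a - b) / b * 100 >= min) print "keep"; else print "revert"
    }'
}
```

For example, `verdict "$(median < before.txt)" "$(median < after.txt)" 3` would print `keep` only for a gain of at least 3%; `before.txt` and `after.txt` are hypothetical files holding one result per line.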
These earn their place on most production hosts:
# Allow more concurrent connections
net.core.somaxconn = 4096 # listen() backlog
net.ipv4.tcp_max_syn_backlog = 4096 # SYN backlog
net.core.netdev_max_backlog = 5000 # NIC ingress queue
# Conntrack: increase if hitting the table
net.netfilter.nf_conntrack_max = 2000000 # default ~262k
net.netfilter.nf_conntrack_tcp_timeout_established = 300
# TCP window scaling and buffer sizing
net.core.rmem_max = 134217728
net.core.wmem_max = 134217728
net.ipv4.tcp_rmem = 4096 87380 67108864
net.ipv4.tcp_wmem = 4096 65536 67108864
# TIME_WAIT handling for many short-lived connections
net.ipv4.tcp_tw_reuse = 1
net.ipv4.tcp_fin_timeout = 15
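Whether the backlog settings are needed at all shows up in the kernel's own counters. A minimal sketch for reading one TcpExt counter (for example ListenOverflows or ListenDrops) from /proc/net/netstat, which stores each group as a header line of names followed by a line of values. `tcpext_counter` is a hypothetical helper name, and the optional file argument exists only so it can be tried on saved output:

```shell
#!/bin/sh
# Print the value of one TcpExt counter from a /proc/net/netstat-style file.
tcpext_counter() {
    name=$1
    file=${2:-/proc/net/netstat}
    awk -v want="$name" '
        # First TcpExt line: column names. Remember where the wanted one is.
        $1 == "TcpExt:" && !seen {
            for (i = 2; i <= NF; i++) if ($i == want) col = i
            seen = 1
            next
        }
        # Second TcpExt line: values in the same column order.
        $1 == "TcpExt:" && seen { if (col) print $col; exit }
    ' "$file"
}
```

Sampling `tcpext_counter ListenOverflows` before and after a load burst shows whether the accept queue overflowed; a counter that never moves suggests the backlog was not the problem.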
Specific reasons:
somaxconn = 4096: an Nginx reverse proxy with the 1024 default overflowed its accept queue when bursts of new connections arrived during slow upstream responses. Raising it stopped the SYN drops we saw under burst load.
nf_conntrack_max: a connection-heavy proxy filled the conntrack table at around 240k connections. New connections were dropped silently from the application's point of view. Raising the limit (and increasing the table memory) fixed it.
tcp_tw_reuse = 1: a service making 5,000+ outbound connections per second was running out of ephemeral ports because of TIME_WAIT accumulation. tcp_tw_reuse lets the kernel safely reuse TIME_WAIT sockets for new outbound connections, and the problem went away.
What we do NOT enable:
tcp_tw_recycle: removed in kernel 4.12 because it broke NAT'd clients. Don't use, don't try.
net.ipv4.tcp_timestamps = 0: sometimes recommended, but timestamps are needed for PAWS protection on long connections, and tcp_tw_reuse depends on them. Leave at default.
Default file descriptor limits are too low for many services:
# /etc/security/limits.d/production.conf
* soft nofile 1048576
* hard nofile 1048576
# /etc/systemd/system/myservice.service.d/override.conf
[Service]
LimitNOFILE=1048576
The limits.d file applies to PAM login sessions only; systemd services take their limit from LimitNOFILE in the unit file or a drop-in. For containers, set this in the container's spec.
When a service hits the default soft limit of 1024, it can no longer accept new connections, and the errors it logs (typically "Too many open files") are easy to misread. The fix is universal: raise it. The cost is essentially zero (FDs use minimal kernel memory until used).
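To confirm that a raised limit actually reached a running process, compare its effective limit with the descriptors it holds. A minimal sketch that reads the per-process files under /proc; the function names are hypothetical:

```shell
#!/bin/sh
# Soft "Max open files" limit from a /proc/<pid>/limits-style file.
nofile_soft_limit() {
    awk '/^Max open files/ { print $4 }' "$1"
}

# Number of descriptors a process currently holds. Reading /proc/<pid>/fd
# needs the same user as the process, or root.
open_fd_count() {
    ls "/proc/$1/fd" | wc -l
}
```

For a process with PID 1234 (a placeholder), `nofile_soft_limit /proc/1234/limits` and `open_fd_count 1234` give the two numbers to compare. A service still showing 1024 after the change usually means the new limit was set somewhere the service does not read.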
For NVMe SSDs (most of our database servers):
# Use 'none' or 'mq-deadline' for NVMe
echo none > /sys/block/nvme0n1/queue/scheduler
# Adjust read-ahead based on workload
# Sequential workloads (e.g., sequential scans): higher
# Random workloads (e.g., OLTP): lower
blockdev --setra 256 /dev/nvme0n1
Why:
none (no scheduler) is often the right answer.
For magnetic disks (rare in our fleet now), mq-deadline or bfq work better.
We measure by running fio benchmarks before and after. Typical NVMe: ~5-15% latency reduction switching from mq-deadline to none, ~10% throughput improvement on random IO.
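Before benchmarking, it helps to confirm which scheduler and read-ahead are actually in effect. A minimal sketch with two hypothetical helpers: one extracts the active scheduler (the bracketed entry in the sysfs file), the other converts a `blockdev --setra` value, which is counted in 512-byte sectors, into KiB:

```shell
#!/bin/sh
# Read a scheduler line such as "[none] mq-deadline kyber" on stdin and
# print the active (bracketed) scheduler.
active_scheduler() {
    sed -n 's/.*\[\([^]]*\)].*/\1/p'
}

# Convert a read-ahead value in 512-byte sectors to KiB.
readahead_kib() {
    echo $(( $1 * 512 / 1024 ))
}
```

Usage: `active_scheduler < /sys/block/nvme0n1/queue/scheduler` and `readahead_kib "$(blockdev --getra /dev/nvme0n1)"`. Values written through sysfs or blockdev do not survive a reboot, so they need to be reapplied at boot by whatever mechanism the host already uses.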
For ext4 (our default):
/dev/nvme0n1 /var/lib/postgresql ext4 defaults,noatime,nodiratime 0 2
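noatime disables access-time updates on file reads. With strict atime, every read updates the inode, which means reads become writes. The kernel's default, relatime, already skips most of those updates, but noatime removes them entirely. For a busy filesystem, the remaining overhead is still meaningful.
nodiratime does the same for directories. noatime already implies it, so listing both is redundant but harmless.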
We've seen 3-8% I/O reduction switching to noatime. No downsides for our workloads (we don't depend on access times).
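An fstab entry only takes effect after a remount, so it is worth checking the live mount table rather than the file. A minimal sketch, with `mount_has_option` as a hypothetical helper that reads /proc/mounts (or a saved copy passed as the third argument):

```shell
#!/bin/sh
# Succeeds (exit 0) if the mount point lists the given option in a
# /proc/mounts-style file, fails (exit 1) otherwise.
mount_has_option() {
    mountpoint=$1
    option=$2
    file=${3:-/proc/mounts}
    awk -v mp="$mountpoint" -v opt="$option" '
        $2 == mp {
            # Field 4 is the comma-separated option list.
            n = split($4, opts, ",")
            for (i = 1; i <= n; i++) if (opts[i] == opt) found = 1
        }
        END { exit !found }
    ' "$file"
}
```

For example: `mount_has_option /var/lib/postgresql noatime || echo "noatime not active"`.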
For xfs:
noatime, nodiratime, attr2, inode64, noquota are the defaults we use for production data filesystems.
# How aggressively to swap (default 60)
vm.swappiness = 10
# Dirty page writeback tuning
vm.dirty_ratio = 15 # default 20
vm.dirty_background_ratio = 5 # default 10
vm.dirty_expire_centisecs = 1000 # 10 seconds
swappiness = 10 means the kernel prefers to evict cached file pages over swapping process memory. For database servers, this is what you want — process memory is hot; swap is slow.
Some teams set swappiness = 0 to "disable swap." This is wrong; it's too aggressive and can lead to OOM kills when the system has plenty of swap available.
The dirty ratios control when dirty pages get flushed. Defaults are too lazy for high-write workloads — when the kernel finally flushes, it stalls everything. Lower thresholds mean more frequent, smaller flushes; smoother latency.
Specifically: a database server writing 200MB/s would build up GB of dirty pages, then stall everything for seconds when fsync hit. Tuning these reduced the stalls dramatically.
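To get a feel for what the ratios mean in bytes, a rough calculation helps. This sketch applies the percentage to total RAM, which is an assumption that overstates the real threshold: the kernel applies it to dirtyable memory (roughly free plus reclaimable pages). The 64 GiB host in the usage note is also an assumed example, and both helper names are hypothetical:

```shell
#!/bin/sh
# Upper bound on the dirty-page threshold in MiB for a given RAM size (MiB)
# and ratio (percent). Integer arithmetic, rounds down.
dirty_limit_mib() {
    echo $(( $1 * $2 / 100 ))
}

# Read one field (in kB) from a /proc/meminfo-style file, e.g. Dirty or Writeback.
meminfo_kib() {
    awk -v f="$1:" '$1 == f { print $2 }' "${2:-/proc/meminfo}"
}
```

On an assumed 64 GiB host, `dirty_limit_mib 65536 20` gives 13107 MiB at the default ratio and `dirty_limit_mib 65536 15` gives 9830 MiB at the tuned one. Watching `meminfo_kib Dirty` during a write-heavy run shows how close the host gets to either figure.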
For latency-sensitive servers:
# Performance governor: keeps cores at their highest frequency instead of scaling down
for c in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do
    echo performance > "$c"
done
# Disable C-states for ultra-low-latency (high power cost)
# /etc/default/grub: GRUB_CMDLINE_LINUX="... intel_idle.max_cstate=1 processor.max_cstate=1"
The performance governor keeps CPUs at max frequency. Cost: ~5-15% more power. Benefit: latency floors are flat, no warm-up time.
For very latency-sensitive workloads (low-latency trading, real-time), CPU pinning can help. Pin the process to specific cores; pin IRQs away from those cores. This is overkill for most workloads; we use it only on a few specific services.
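Governor settings are easy to lose on reboot or when CPUs are hot-added, so a quick audit is useful. A minimal sketch that lists any CPU not running the performance governor; `non_performance_cpus` and the `CPU_ROOT` override are hypothetical names, the latter only there so the function can be tried against a fake directory tree:

```shell
#!/bin/sh
# Print "<cpu> <governor>" for every CPU whose governor is not "performance".
# Prints nothing when all CPUs are set correctly or cpufreq is unavailable.
non_performance_cpus() {
    root=${CPU_ROOT:-/sys/devices/system/cpu}
    for f in "$root"/cpu[0-9]*/cpufreq/scaling_governor; do
        [ -r "$f" ] || continue
        gov=$(cat "$f")
        if [ "$gov" != performance ]; then
            cpu=${f#"$root"/}          # e.g. cpu3/cpufreq/scaling_governor
            echo "${cpu%%/*} $gov"     # e.g. cpu3 powersave
        fi
    done
}
```

Empty output means every CPU that exposes cpufreq is on the performance governor.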
What we DON'T do:
Changes go in /etc/sysctl.d/99-production-tuning.conf:
# Network
net.core.somaxconn = 4096
net.ipv4.tcp_max_syn_backlog = 4096
...
Apply with sysctl -p /etc/sysctl.d/99-production-tuning.conf, then verify with sysctl <key> to confirm the value took effect.
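Checking keys one at a time gets tedious, so the verification can be scripted. A minimal sketch that compares every key in a config file with the live value, assuming each key maps to a file under /proc/sys with dots replaced by slashes (true for ordinary keys, not for ones containing interface names with dots). `sysctl_drift` and the `PROC_SYS` override are hypothetical names:

```shell
#!/bin/sh
# Print one line per key whose live value differs from the config file.
# Lines without a matching /proc/sys file (typos, unloaded modules) are skipped.
sysctl_drift() {
    conf=$1
    root=${PROC_SYS:-/proc/sys}
    # Strip comments and blank lines, then split each line at '='.
    sed -e 's/#.*//' -e '/^[[:space:]]*$/d' "$conf" |
    while IFS='=' read -r key want; do
        key=$(printf '%s' "$key" | tr -d '[:space:]')
        path=$root/$(printf '%s' "$key" | tr . /)
        [ -f "$path" ] || continue
        # Normalize whitespace so "4096 87380 ..." matches tab-separated /proc output.
        want=$(printf '%s' "$want" | tr -s '[:space:]' ' ' | sed -e 's/^ //' -e 's/ $//')
        live=$(tr -s '[:space:]' ' ' < "$path" | sed 's/ $//')
        [ "$want" = "$live" ] || echo "$key: wanted '$want', live '$live'"
    done
}
```

Running `sysctl_drift /etc/sysctl.d/99-production-tuning.conf` after applying should print nothing. Any output points at a key that was overridden by a later file or never applied.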
For cloud images, we bake these into the AMI. For Kubernetes, the host-level tuning is on the node images; pod-level limits are in pod specs.
Things that matter on K8s nodes:
# Allow more conntrack entries (kube-proxy creates many)
net.netfilter.nf_conntrack_max = 2000000
# Larger NIC ingress queue (packets waiting for the kernel to process them)
net.core.netdev_max_backlog = 5000
# Inotify limits for kubelet's many watches
fs.inotify.max_user_watches = 524288
fs.inotify.max_user_instances = 8192
# More PIDs for many pods/containers
kernel.pid_max = 4194304
On a 32-vCPU node with 100+ pods, default inotify limits get exhausted and kubelet starts erroring. The above limits prevent that.
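Exhaustion is easier to catch as a trend than as an outage. A minimal sketch of two hypothetical helpers: one expresses any count as a percentage of its limit (usable for the conntrack table), the other counts the inotify instances a single process holds by looking for `anon_inode:inotify` descriptors under /proc:

```shell
#!/bin/sh
# Integer percentage of a limit that is currently in use.
usage_pct() {
    echo $(( $1 * 100 / $2 ))
}

# Number of inotify instances held by one PID. Needs permission to read
# /proc/<pid>/fd; prints 0 if the directory cannot be read.
inotify_instances() {
    ls -l "/proc/$1/fd" 2>/dev/null | grep -c 'anon_inode:inotify'
}
```

For conntrack, `usage_pct "$(cat /proc/sys/net/netfilter/nf_conntrack_count)" "$(cat /proc/sys/net/netfilter/nf_conntrack_max)"` gives the fill level. A table holding 240000 entries against a 262144 limit would report 91.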
For one of our database hosts, before and after this tuning:
| Metric | Default | Tuned | Δ |
|---|---|---|---|
| pgbench TPS (read-heavy) | 18,500 | 22,800 | +23% |
| p99 query latency | 24ms | 16ms | -33% |
| IOPS sustained (random read) | 280k | 320k | +14% |
| TCP retransmits/min | ~12 | ~3 | -75% |
| OOM events/month | 1-2 | 0 | -100% |
Most of the Δ came from network and IO scheduler tweaks. The memory tuning prevented OOMs but didn't change throughput much.
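The Δ column in the table is the plain relative change, (tuned - default) / default. A small sketch for reproducing it from raw before and after numbers; `pct_change` is a hypothetical helper name:

```shell
#!/bin/sh
# Signed percentage change from a "before" value to an "after" value,
# rounded to a whole percent.
pct_change() {
    awk -v b="$1" -v a="$2" 'BEGIN { printf "%+.0f%%\n", (a - b) / b * 100 }'
}
```

`pct_change 18500 22800` prints `+23%` and `pct_change 24 16` prints `-33%`, matching the TPS and p99 latency rows.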
A few things from older guides that don't earn their place:
Disabling THP (Transparent Huge Pages). Recommended for old MongoDB / Redis versions. Modern versions cope fine. We leave THP enabled (the default).
Lots of small per-connection tunables like net.ipv4.tcp_keepalive_time. Defaults are fine for most workloads. Touch them only if you have a specific symptom they address.
tcp_mtu_probing. Sometimes recommended; we've seen it cause problems in mixed environments. Default is fine.
Custom kernel builds to remove unused features. Not worth the maintenance burden vs the marginal benefit.
Measure first. Tune second. Re-measure to verify. Without measurement, you can't tell which tweaks helped.
One change at a time. Stacking changes hides which one caused which effect.
The defaults are usually right. Linux kernel defaults represent millions of hours of tuning. Be skeptical of any tweak that doesn't have a clear, measurable production reason.
Network and IO are usually the bottlenecks. Start there. CPU tuning is for specific niche cases.
Document why each tweak is there. Six months later, when someone asks "why is nf_conntrack_max so high," the comment in your sysctl config saves a half-day of debugging.
Linux performance tuning is mostly mechanical. A small set of tweaks earns its place across most production servers. The rest is folklore — copying patterns from other people's blog posts without measuring whether they actually help. The discipline is in the measurement, not the tweaks themselves.