A practical Linux performance tuning playbook for production servers. The kernel parameters, disk and network tweaks that earn their place, and the ones that turned out to be folklore.

Linux Performance Tuning for Production Servers

Linux is well tuned out of the box for general workloads. For specific high-throughput production workloads, the defaults leave performance on the table. This is the playbook we apply to production hosts (database servers, ingress nodes, network appliances), with the production reason each tweak earns its place. Anything not on this list, we leave at default.

The methodology before the tweaks

Tuning without measuring is voodoo. Our process for any new tweak:

1. Establish baseline. Run the actual workload under realistic load. Record p50/p95/p99 latency, throughput, CPU/memory/IO utilization.
2. Hypothesize. "If we increase X, we expect Y to improve because Z."
3. Apply the change in isolation. One change at a time.
4. Re-measure. Did Y improve? By how much?
5. Verify no regressions in unrelated metrics.
6. Document and persist. Add to our sysctl config; note the production reason in a comment.

If a change doesn't measurably help, we revert it. "We thought it would help" is not a reason to keep a tweak.
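
The baseline and re-measure steps come down to comparing the same percentiles before and after a change. A minimal sketch, assuming latencies were captured one value per line; the `percentile` helper and the file names in the usage comment are hypothetical, and it uses the simple nearest-rank definition rather than interpolation:

```shell
#!/usr/bin/env bash
# percentile P: print the P-th percentile (nearest-rank method) of the
# numbers read from stdin, one per line. P must be an integer from 1 to 100.
percentile() {
  sort -n | awk -v p="$1" '
    { v[NR] = $1 }
    END {
      if (NR == 0) exit 1
      rank = int((p * NR + 99) / 100)   # integer ceil(p/100 * N)
      if (rank < 1) rank = 1
      print v[rank]
    }'
}

# Hypothetical usage: one latency (ms) per line in each capture file.
#   for p in 50 95 99; do
#     echo "p$p: $(percentile "$p" < baseline.txt) -> $(percentile "$p" < tuned.txt)"
#   done
```

Whatever tool produces the numbers, the point is to compute them the same way on both sides of the change.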

Network: the most common bottleneck

These earn their place on most production hosts:

# Allow more concurrent connections
net.core.somaxconn = 4096                    # listen() backlog
net.ipv4.tcp_max_syn_backlog = 4096          # SYN backlog
net.core.netdev_max_backlog = 5000           # NIC ingress queue

# Conntrack: increase if hitting the table
net.netfilter.nf_conntrack_max = 2000000     # default ~262k
net.netfilter.nf_conntrack_tcp_timeout_established = 300

# TCP window scaling and buffer sizing
net.core.rmem_max = 134217728
net.core.wmem_max = 134217728
net.ipv4.tcp_rmem = 4096 87380 67108864
net.ipv4.tcp_wmem = 4096 65536 67108864

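# Many short-lived outbound connections
net.ipv4.tcp_tw_reuse = 1                    # reuse TIME_WAIT sockets for new outbound connections
net.ipv4.tcp_fin_timeout = 15                # FIN_WAIT_2 timeout; does not shorten TIME_WAIT (fixed at 60s)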

Specific reasons:

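somaxconn = 4096: an Nginx reverse proxy running with the 1024 default queued bursts of new connections during slow upstream responses, and the accept queue overflowed. Raising it stopped the connection drops we saw under burst load. The effective backlog is the smaller of somaxconn and the value the application passes to listen(), so Nginx also needs backlog= on its listen directive.
nf_conntrack_max: a connection-heavy proxy filled the conntrack table at around 240k connections. New connections were dropped, with only an "nf_conntrack: table full, dropping packet" line in the kernel log to show for it. Raising the limit (and sizing the conntrack hash table, nf_conntrack_buckets, to match) fixed it.
tcp_tw_reuse = 1: a service making 5,000+ outbound connections per second was running out of ephemeral ports because of TIME_WAIT accumulation. tcp_tw_reuse lets the kernel safely reuse TIME_WAIT sockets for new outbound connections (it relies on TCP timestamps and does nothing for inbound ones). Problem gone.
TCP buffer sizing: we have services that move large payloads across high-latency links. With default buffer sizes, a connection couldn't keep the pipe full. Larger buffers improved throughput by 3x.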
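
Two of the reasons above are back-of-envelope arithmetic. A rough sketch with hypothetical helper names: the bandwidth-delay product gives the buffer one connection needs to keep a link full, and the ephemeral port range divided by the 60-second TIME_WAIT period gives the rate of new outbound connections to a single destination address and port that can be sustained without reuse. The link speed and RTT are illustrative assumptions, not measurements from our fleet.

```shell
# bdp_bytes MBIT_PER_S RTT_MS: bandwidth-delay product in bytes.
# bytes = (Mbit/s * 1e6 / 8) * (ms / 1000) = Mbit/s * ms * 125
bdp_bytes() {
  echo $(( $1 * $2 * 125 ))
}

# max_conn_rate LOW HIGH: new connections per second to one destination
# ip:port that the ephemeral range LOW..HIGH sustains when every closed
# socket sits in TIME_WAIT for 60 seconds.
max_conn_rate() {
  echo $(( ($2 - $1 + 1) / 60 ))
}

bdp_bytes 10000 50          # assumed 10 Gbit/s link, 50 ms RTT -> 62500000
max_conn_rate 32768 60999   # common default port range         -> 470
```

Under those assumptions the 62,500,000-byte product sits just below the 67108864 (64 MiB) ceiling in tcp_rmem and tcp_wmem, and roughly 470 connections per second to one destination is far short of 5,000+ if most of that traffic targets the same backend.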

What we do NOT enable:

tcp_tw_recycle: removed in kernel 4.12 because it broke NAT'd clients. Don't use, don't try.
net.ipv4.tcp_timestamps = 0: sometimes recommended, but timestamps are needed for PAWS (protection against wrapped sequence numbers) on fast or long-lived connections, and tcp_tw_reuse depends on them. Leave at the default (1).

File descriptors

Default file descriptor limits are too low for many services:

# /etc/security/limits.d/production.conf
* soft nofile 1048576
* hard nofile 1048576

# /etc/systemd/system/myservice.service.d/override.conf
[Service]
LimitNOFILE=1048576

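For containers, set this in the container's spec. For systemd services, the limit comes from LimitNOFILE in the unit file (or a drop-in like the one above); /etc/security/limits.d only applies to PAM login sessions, so services started by systemd never see it.

When a service hits the default soft limit of 1024, accept() and open() start failing with EMFILE ("Too many open files") and new connections are dropped, often with errors that don't point at the real cause. The fix is universal: raise it. The cost is essentially zero (FDs use minimal kernel memory until used).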
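
To see how close a running process is to its limit, the kernel exposes both numbers under /proc. A minimal sketch, Linux only; the helper names are hypothetical, and reading another user's process usually needs root:

```shell
# soft_nofile: read a /proc/<pid>/limits listing on stdin and print the
# soft "Max open files" limit (the 4th whitespace-separated field).
soft_nofile() {
  awk '/^Max open files/ { print $4 }'
}

# fd_usage PID: print "open/limit" for a running process.
fd_usage() {
  local pid="$1" open limit
  open=$(ls "/proc/$pid/fd" | wc -l)
  limit=$(soft_nofile < "/proc/$pid/limits")
  echo "$open/$limit"
}

# Hypothetical usage, assuming a unit named myservice as in the drop-in above:
#   fd_usage "$(systemctl show -p MainPID --value myservice)"
```

Checking the live process rather than the config file confirms that the limit actually reached the service.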

Disk: I/O scheduler and read-ahead

For NVMe SSDs (most of our database servers):

# Use 'none' or 'mq-deadline' for NVMe
echo none > /sys/block/nvme0n1/queue/scheduler

# Adjust read-ahead based on workload
# Sequential workloads (e.g., sequential scans): higher
# Random workloads (e.g., OLTP): lower
# --setra counts 512-byte sectors: 256 = 128 KiB, the usual kernel default
blockdev --setra 256 /dev/nvme0n1

Why:

For NVMe with native queueing, the kernel's I/O scheduler usually adds latency without benefit. none (no scheduler) is often the right answer.
Read-ahead controls how aggressively the kernel pre-reads. For random-access workloads (databases), aggressive read-ahead just wastes IO.

For magnetic disks (rare in our fleet now), mq-deadline or bfq work better.

We measure by running fio benchmarks before and after. Typical NVMe: ~5-15% latency reduction switching from mq-deadline to none, ~10% throughput improvement on random IO.
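
Before and after each fio run it helps to record what the device is actually set to. A small sketch with hypothetical helper names; it only parses values, so it can be tried on any machine, and the sample scheduler string is illustrative:

```shell
# active_scheduler: read the contents of /sys/block/<dev>/queue/scheduler
# on stdin and print the entry shown in [brackets], which is the active one.
active_scheduler() {
  sed -n 's/.*\[\(.*\)\].*/\1/p'
}

# setra_kib SECTORS: convert a blockdev --setra/--getra value, which is
# counted in 512-byte sectors, to KiB.
setra_kib() {
  echo $(( $1 * 512 / 1024 ))
}

echo "[none] mq-deadline kyber bfq" | active_scheduler   # prints: none
setra_kib 256                                            # prints: 128

# On a real host:
#   active_scheduler < /sys/block/nvme0n1/queue/scheduler
#   setra_kib "$(blockdev --getra /dev/nvme0n1)"
```

Neither the echo into sysfs nor blockdev --setra survives a reboot, so whatever value wins the benchmark also has to be persisted (for example through a udev rule or a boot-time unit).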

Filesystem mount options

For ext4 (our default):

/dev/nvme0n1 /var/lib/postgresql ext4 defaults,noatime,nodiratime 0 2

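noatime disables access-time updates on file reads. With strict atime, every read updates the inode, which means reads become writes. The kernel default since 2.6.30 is relatime, which already limits updates to reads where the atime is older than the last modification or more than a day old; noatime removes what is left. For a busy filesystem, this is a meaningful overhead.

nodiratime does the same for directories. noatime already implies it, so listing both is redundant but harmless.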

We've seen 3-8% I/O reduction switching to noatime. No downsides for our workloads (we don't depend on access times).

For xfs:

noatime, nodiratime, attr2, inode64, noquota are the options we use for production data filesystems. On current kernels attr2 and inode64 are already the XFS defaults, so they mainly document intent.
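
An fstab entry only takes effect at the next mount, so it is worth checking the live mount table as well. A minimal sketch, assuming the /proc/mounts field layout (device, mount point, type, options); the `mount_has_option` helper is hypothetical:

```shell
# mount_has_option MOUNTPOINT OPTION: read /proc/mounts-style lines on
# stdin and succeed (exit 0) if MOUNTPOINT is mounted with OPTION.
mount_has_option() {
  awk -v mp="$1" -v opt="$2" '
    $2 == mp {
      n = split($4, opts, ",")
      for (i = 1; i <= n; i++) if (opts[i] == opt) found = 1
    }
    END { exit(found ? 0 : 1) }'
}

# On a real host:
#   mount_has_option /var/lib/postgresql noatime < /proc/mounts \
#     && echo "noatime active" || echo "noatime missing"
```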

Memory and swap

# How aggressively to swap (default 60)
vm.swappiness = 10

# Dirty page writeback tuning
vm.dirty_ratio = 15           # default 20
vm.dirty_background_ratio = 5  # default 10
vm.dirty_expire_centisecs = 1000  # 10 seconds

swappiness = 10 means the kernel prefers to evict cached file pages over swapping process memory. For database servers, this is what you want — process memory is hot; swap is slow.

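Some teams set swappiness = 0 to "disable swap." It does not disable swap; it tells the kernel to avoid swapping process memory until the file cache is almost gone. That is too aggressive and can lead to OOM kills while the system still has plenty of swap available.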

The dirty ratios control when dirty pages get flushed. Defaults are too lazy for high-write workloads — when the kernel finally flushes, it stalls everything. Lower thresholds mean more frequent, smaller flushes; smoother latency.

Specifically: a database server writing 200MB/s would build up GB of dirty pages, then stall everything for seconds when fsync hit. Tuning these reduced the stalls dramatically.
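
Because the ratios are percentages, the absolute thresholds grow with the size of the host. A rough sketch of the arithmetic, with a hypothetical helper and a hypothetical 256 GiB host; it uses total RAM as a stand-in for the kernel's "available memory" (free plus reclaimable pages), which is somewhat lower in practice:

```shell
# dirty_limit_mib MEM_MIB RATIO: approximate dirty-page threshold in MiB
# for a given ratio, treating MEM_MIB as the memory the ratio applies to.
dirty_limit_mib() {
  echo $(( $1 * $2 / 100 ))
}

mem=262144                   # hypothetical host: 256 GiB expressed in MiB
dirty_limit_mib "$mem" 20    # default dirty_ratio            -> 52428
dirty_limit_mib "$mem" 15    # tuned dirty_ratio              -> 39321
dirty_limit_mib "$mem" 10    # default dirty_background_ratio -> 26214
dirty_limit_mib "$mem" 5     # tuned dirty_background_ratio   -> 13107
```

Even the tuned values are tens of GiB on a host this size. The kernel also offers absolute counterparts, vm.dirty_bytes and vm.dirty_background_bytes, which may be worth measuring on large-memory machines; this playbook does not cover them.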

CPU: governor and pinning

For latency-sensitive servers:

# Performance governor: keeps cores at their highest frequency
for c in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do
  echo performance > "$c"
done

# Disable C-states for ultra-low-latency (high power cost)
# /etc/default/grub: GRUB_CMDLINE_LINUX="... intel_idle.max_cstate=1 processor.max_cstate=1"

The performance governor keeps CPUs at max frequency. Cost: ~5-15% more power. Benefit: latency floors are flat, no warm-up time.

For very latency-sensitive workloads (low-latency trading, real-time), CPU pinning can help. Pin the process to specific cores; pin IRQs away from those cores. This is overkill for most workloads; we use it only on a few specific services.
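
Pinning mostly means turning core lists into affinity masks. A minimal sketch, assuming cores 2 and 3 are reserved for the service and cores 0 and 1 take interrupts; the core numbers, `$PID`, `$IRQ`, and the `cpu_mask` helper are all hypothetical:

```shell
# cpu_mask CORE...: print the hexadecimal affinity mask with one bit set
# per listed core. Sketch limited to cores 0-31 (a single 32-bit group).
cpu_mask() {
  local mask=0 core
  for core in "$@"; do
    mask=$(( mask | (1 << core) ))
  done
  printf '%x\n' "$mask"
}

cpu_mask 2 3    # prints: c   (cores reserved for the service)
cpu_mask 0 1    # prints: 3   (cores left for interrupts)

# Hypothetical usage (requires root):
#   taskset -cp 2,3 "$PID"                          # pin the process
#   cpu_mask 0 1 > "/proc/irq/$IRQ/smp_affinity"    # steer one IRQ away
```

If irqbalance is running it may rewrite smp_affinity, so it typically has to be configured or stopped for the IRQ placement to stick.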

What we DON'T do:

Disable hyperthreading. Modern CPUs do better with HT enabled for nearly all workloads.
Disable Spectre/Meltdown mitigations. The performance cost is real but security matters more. Kernel updates have made the mitigations cheaper.

Sysctl: persisting changes

Changes go in /etc/sysctl.d/99-production-tuning.conf:

# Network
net.core.somaxconn = 4096
net.ipv4.tcp_max_syn_backlog = 4096
...

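Apply with sysctl -p /etc/sysctl.d/99-production-tuning.conf (or sysctl --system to reload every file in order). Verify with sysctl <key> to confirm the value took effect. The net.netfilter.* keys exist only while the nf_conntrack module is loaded, so they can fail to apply early in boot.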

For cloud images, we bake these into the AMI. For Kubernetes, the host-level tuning is on the node images; pod-level limits are in pod specs.
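
Checking keys one at a time gets tedious, so the verification can be scripted. A sketch under stated assumptions: the file holds one "key = value" per line with comments on their own lines (starting with "#" or ";"), and both function names are hypothetical:

```shell
# sysctl_pairs: read a sysctl.d file on stdin and print "key value" lines,
# skipping comments, blank lines, and anything without an "=".
sysctl_pairs() {
  awk -F= '
    /^[ \t]*([#;]|$)/ { next }
    NF >= 2 {
      key = $1; val = substr($0, index($0, "=") + 1)
      gsub(/^[ \t]+|[ \t]+$/, "", key)
      gsub(/^[ \t]+|[ \t]+$/, "", val)
      gsub(/[ \t]+/, " ", val)       # tcp_rmem-style lists: single spaces
      print key, val
    }'
}

# sysctl_drift FILE: print every key whose live value differs from FILE.
sysctl_drift() {
  sysctl_pairs < "$1" | while read -r key want; do
    have=$(sysctl -n "$key" 2>/dev/null | tr -s '[:space:]' ' ' | sed 's/ $//')
    [ "$have" = "$want" ] || echo "$key: want '$want', have '${have:-unset}'"
  done
}

# Usage on a host:
#   sysctl_drift /etc/sysctl.d/99-production-tuning.conf
```

No output means the running kernel matches the file; a key reported as unset usually points at a module that is not loaded.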

Specific to Kubernetes nodes

Things that matter on K8s nodes:

# Allow more conntrack entries (kube-proxy creates many)
net.netfilter.nf_conntrack_max = 2000000

# Deeper per-CPU ingress packet queue for bursts of traffic
net.core.netdev_max_backlog = 5000

# Inotify limits for kubelet's many watches
fs.inotify.max_user_watches = 524288
fs.inotify.max_user_instances = 8192

# More PIDs for many pods/containers
kernel.pid_max = 4194304

On a 32-vCPU node with 100+ pods, the default inotify limits get exhausted and kubelet starts erroring. The limits above prevent that.
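
Raised limits still fill up eventually, so it helps to alert on headroom rather than wait for drops. A minimal sketch for the conntrack table, with hypothetical helper names and an arbitrary 80% threshold; the sample numbers are the 240k entries and ~262k default mentioned earlier:

```shell
# usage_pct USED LIMIT: integer percentage of LIMIT consumed.
usage_pct() {
  echo $(( $1 * 100 / $2 ))
}

# warn_if_high NAME USED LIMIT [THRESHOLD]: print a warning when usage
# reaches THRESHOLD percent (default 80).
warn_if_high() {
  local pct
  pct=$(usage_pct "$2" "$3")
  if [ "$pct" -ge "${4:-80}" ]; then
    echo "WARN $1 at ${pct}% ($2/$3)"
  fi
}

warn_if_high conntrack 240000 262144   # prints: WARN conntrack at 91% (240000/262144)

# On a node (Linux, nf_conntrack module loaded):
#   warn_if_high conntrack \
#     "$(cat /proc/sys/net/netfilter/nf_conntrack_count)" \
#     "$(cat /proc/sys/net/netfilter/nf_conntrack_max)"
```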

What we measured before and after

For one of our database hosts, before and after this tuning:

Metric	Default	Tuned	Δ
pgbench TPS (read-heavy)	18,500	22,800	+23%
p99 query latency	24ms	16ms	-33%
IOPS sustained (random read)	280k	320k	+14%
TCP retransmits/min	~12	~3	-75%
OOM events/month	1-2	0	-100%

Most of the Δ came from network and IO scheduler tweaks. The memory tuning prevented OOMs but didn't change throughput much.
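
The Δ column is plain percentage change relative to the default. A small sketch that reproduces it from the table's numbers; the `pct_change` helper is hypothetical:

```shell
# pct_change BEFORE AFTER: signed percentage change relative to BEFORE,
# rounded to a whole percent.
pct_change() {
  awk -v a="$1" -v b="$2" 'BEGIN { printf "%.0f\n", (b - a) / a * 100 }'
}

pct_change 18500 22800   # pgbench TPS         -> 23
pct_change 24 16         # p99 latency (ms)    -> -33
pct_change 280 320       # random-read kIOPS   -> 14
pct_change 12 3          # retransmits per min -> -75
```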

A few things from older guides that don't earn their place:

Disabling THP (Transparent Huge Pages). Recommended for old MongoDB / Redis versions. Modern versions cope fine. We leave THP at the distribution default (always or madvise, depending on the distro).

Lots of small per-connection tunables like net.ipv4.tcp_keepalive_time. Defaults are fine for most workloads. Touch them only if you have a specific symptom they address.

tcp_mtu_probing. Sometimes recommended; we've seen it cause problems in mixed environments. Default is fine.

Custom kernel builds to remove unused features. Not worth the maintenance burden vs the marginal benefit.

What I'd tell someone starting

Measure first. Tune second. Re-measure to verify. Without measurement, you can't tell which tweaks helped.

One change at a time. Stacking changes hides which one caused which effect.

The defaults are usually right. Linux kernel defaults represent millions of hours of tuning. Be skeptical of any tweak that doesn't have a clear, measurable production reason.

Network and IO are usually the bottlenecks. Start there. CPU tuning is for specific niche cases.

Document why each tweak is there. Six months later, when someone asks "why is nf_conntrack_max so high," the comment in your sysctl config saves a half-day of debugging.

Linux performance tuning is mostly mechanical. A small set of tweaks earns its place across most production servers. The rest is folklore — copying patterns from other people's blog posts without measuring whether they actually help. The discipline is in the measurement, not the tweaks themselves.

About Kiril Urbonas