Filesystem choice, mount options, IO schedulers — the per-host tweaks that actually moved disk performance for our database and storage workloads.
Filesystem and disk-level tuning is one of those topics where defaults are fine for general use and meaningfully wrong for specific workloads. We learned this the slow way — debugging database performance issues that turned out to be IO-related. This post is the working playbook for filesystem and disk tuning, with the actual measurable improvements we've seen.
Disk tuning matters most when:
For most application servers (web frontends, stateless APIs), default filesystem settings are fine. Don't tune for the sake of tuning.
For production Linux, the realistic options are ext4 and xfs. We use both:
ext4: default on most distros. Mature, stable, well-supported. Good for most workloads up to medium size.
xfs: better at scaling to large filesystems and parallel writes. Default for RHEL family. Our database servers (Postgres) use xfs.
Specific reasons we'd pick one over the other:
We don't use btrfs in production. Cool features but operationally fiddly. We don't use ZFS on Linux for the same reason (great filesystem but operational complexity).
These earn their place on most production filesystems:
/dev/nvme1n1 /var/lib/postgresql ext4 defaults,noatime,nodiratime 0 2
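noatime: stop updating access timestamps on reads. Without it, a read can turn into a write (updating the inode's atime), which is wasteful. Modern kernels default to relatime, which skips most of these updates but not all of them.
nodiratime: same for directory access times. noatime already implies it, so listing both is redundant but harmless.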
For our database server, noatime reduced IO by ~6%. Free.
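It's easy to edit fstab and forget to remount, so it's worth confirming the option is actually live. A minimal sketch, assuming a file in /proc/mounts layout; the helper name has_mount_opt is ours, not a standard tool:

```shell
# has_mount_opt MOUNTS_FILE MOUNTPOINT OPTION
# Succeeds if MOUNTPOINT appears in MOUNTS_FILE (same layout as /proc/mounts)
# with OPTION among its comma-separated mount options (field 4).
# Hypothetical helper for illustration.
has_mount_opt() {
    awk -v mp="$2" -v opt="$3" '
        $2 == mp {
            n = split($4, opts, ",")
            # Exact match per option, so "relatime" never counts as "noatime".
            for (i = 1; i <= n; i++) if (opts[i] == opt) found = 1
        }
        END { exit (found ? 0 : 1) }
    ' "$1"
}

# Usage on a live host:
#   has_mount_opt /proc/mounts /var/lib/postgresql noatime && echo "noatime active"
```

The comparison is per option rather than a substring grep, which matters here because relatime contains "atime".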
For xfs:
/dev/nvme1n1 /var/lib/data xfs defaults,noatime,nodiratime,attr2,inode64 0 2
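inode64: allows inodes to be allocated anywhere on the filesystem. Older kernels defaulted to 32-bit inode numbers, which confines inodes to the first 1TB; with inode64, large filesystems can place inodes anywhere. That matters for filesystems > 1TB. Kernels since 3.7 enable inode64 by default, so on current systems listing it is explicit rather than required. attr2 is likewise the default behavior, and newer kernels mark it deprecated as a mount option.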
What we don't enable:
data=writeback on ext4 (faster writes, more data loss risk on crash). The default data=ordered is safer.
nobarrier (disables write barriers). Faster but unsafe on power loss.
nodelalloc (disables delayed allocation). Sometimes recommended; we haven't seen consistent benefit.
The pattern: tweak for things that are clearly safe; don't disable safety features for marginal performance gains.
On modern NVMe SSDs, the kernel's IO scheduler often adds latency. Default schedulers are designed for spinning disks; for NVMe, they often hurt.
echo none > /sys/block/nvme0n1/queue/scheduler
For our NVMe-backed database, switching from mq-deadline (default) to none reduced p99 read latency by ~12% and improved IOPS by ~8%. NVMe handles its own queueing; the kernel's queue management is redundant.
For magnetic disks (rare in our fleet now), mq-deadline or bfq work better.
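Before and after changing the scheduler, it helps to see which one is active. The sysfs scheduler file lists every available scheduler and puts the active one in square brackets. A small sketch that pulls it out; active_scheduler is a hypothetical helper name:

```shell
# active_scheduler SCHEDULER_FILE
# Prints the scheduler shown in [brackets] in a sysfs queue/scheduler file,
# e.g. "none [mq-deadline] kyber bfq" -> "mq-deadline".
# Hypothetical helper for illustration.
active_scheduler() {
    sed -n 's/.*\[\([^]]*\)].*/\1/p' "$1"
}

# Usage on a live host:
#   for f in /sys/block/*/queue/scheduler; do
#       echo "$f: $(active_scheduler "$f")"
#   done
```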
Persist across reboots via udev:
# /etc/udev/rules.d/60-iosched.rules
ACTION=="add|change", KERNEL=="nvme[0-9]*n[0-9]*", ENV{DEVTYPE}=="disk", ATTR{queue/scheduler}="none"
The kernel reads ahead of the actual read request, anticipating sequential access. For random-access workloads (databases), aggressive read-ahead wastes IO.
blockdev --setra 256 /dev/nvme0n1 # 256 blocks of 512 bytes = 128KB
For Postgres with random access patterns, we set read-ahead to 256 (lower than the 4096 default on many distros). Reduced IO load by ~10% with no read latency impact.
For sequential workloads (log writes, large file copies), keep higher read-ahead.
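Persist via a udev rule (setting queue/read_ahead_kb) or a boot script. Read-ahead is a block-device setting, not a filesystem one, so tune2fs doesn't cover it.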
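One thing that trips people up: blockdev --setra counts 512-byte sectors, while the sysfs knob queue/read_ahead_kb counts kilobytes. A quick sketch of the conversion, with helper names of our own choosing:

```shell
# setra_to_kb SECTORS: convert a `blockdev --setra` value (512-byte sectors) to KB.
setra_to_kb() { echo $(( $1 * 512 / 1024 )); }

# kb_to_setra KB: convert a queue/read_ahead_kb value to 512-byte sectors.
kb_to_setra() { echo $(( $1 * 1024 / 512 )); }

# Usage:
#   setra_to_kb 256     # the Postgres setting above, in KB
#   kb_to_setra "$(cat /sys/block/nvme0n1/queue/read_ahead_kb)"
```

So the 256 setting corresponds to read_ahead_kb=128, and a 4096 setting corresponds to 2048 KB.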
For NVMe SSDs and modern filesystems, alignment is usually correct out of the box. Worth checking with new disks:
parted /dev/nvme0n1 align-check optimal 1
Misaligned filesystems can be ~10-30% slower. Modern tooling avoids this; legacy systems might have issues.
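If parted isn't handy, the arithmetic is easy to do by hand. The sysfs start file for a partition reports its first sector in 512-byte units, and a partition is aligned when that byte offset is a multiple of the alignment boundary. A rough sketch, assuming a 1 MiB boundary as a conservative default (parted's "optimal" check uses the device-reported IO size, so the two can disagree); is_aligned is a hypothetical helper:

```shell
# is_aligned START_SECTOR [ALIGN_BYTES]
# Succeeds if a partition starting at START_SECTOR (512-byte units, as in
# /sys/block/<disk>/<partition>/start) falls on an ALIGN_BYTES boundary.
# ALIGN_BYTES defaults to 1 MiB, which is an assumption, not a parted rule.
is_aligned() {
    local start_sector=$1
    local align_bytes=${2:-1048576}
    [ $(( start_sector * 512 % align_bytes )) -eq 0 ]
}

# Usage on a live host:
#   is_aligned "$(cat /sys/block/nvme0n1/nvme0n1p1/start)" && echo aligned
```

A start sector of 2048 passes; the legacy start sector of 63 does not.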
mkfs.ext4 -b <block-size> controls block size. Larger blocks = less metadata overhead but more space waste for small files.
For databases that mostly do 8KB or 16KB IO (Postgres pages are 8KB by default), ext4's default 4KB block size is fine. Don't try to "optimize" by matching block sizes; the kernel handles it.
For Postgres specifically, transparent hugepages (THP) used to be recommended off. Modern Postgres with modern kernel handles THP fine. We leave it on default.
For some workloads (HotSpot JVM, custom databases), explicit hugepages help. We don't use them in our stack.
vm.dirty_ratio = 15
vm.dirty_background_ratio = 5
vm.dirty_expire_centisecs = 1000
Mentioned in the Linux performance post; relevant here too. These control when dirty pages get flushed to disk. Defaults are too lazy for high-write workloads — when the kernel finally flushes, it stalls everything.
For our database server, lower dirty thresholds reduced write latency variance significantly. p99 write latency went from spiky (sometimes 200ms+) to steady (~30ms).
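To see whether dirty pages are piling up between flushes, the Dirty and Writeback lines in /proc/meminfo are the quickest signal. A minimal sketch that extracts both; dirty_kb is our own helper name, and the output format is arbitrary:

```shell
# dirty_kb MEMINFO_FILE
# Prints the Dirty and Writeback values (in kB) from a file in
# /proc/meminfo layout. Hypothetical helper for illustration.
dirty_kb() {
    awk '
        $1 == "Dirty:"     { d = $2 }
        $1 == "Writeback:" { w = $2 }
        END { printf "dirty_kb=%d writeback_kb=%d\n", d, w }
    ' "$1"
}

# Usage on a live host, sampled once per second:
#   while sleep 1; do dirty_kb /proc/meminfo; done
```

A sawtooth in dirty_kb with large peaks is the pattern that tends to line up with write-latency spikes.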
For RAID or LVM-striped volumes, the stripe size matters. ext4 wants to know about it via mkfs options:
mkfs.ext4 -E stride=16,stripe-width=32 /dev/...
Where stride = chunk_size / block_size and stripe-width = stride * num_data_disks. If the filesystem doesn't know the stripe geometry, writes can fall awkwardly across stripes, hurting performance.
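The two formulas are simple enough to script so nobody does the division in their head at 2am. A sketch using the definitions above; the function name and the example geometry (64KB chunk, 4KB block, 2 data disks) are assumptions chosen because they reproduce stride=16,stripe-width=32:

```shell
# ext4_stripe_opts CHUNK_KB BLOCK_KB DATA_DISKS
# Prints the -E argument for mkfs.ext4 given the RAID chunk size, the
# filesystem block size, and the number of data-bearing disks
# (exclude parity disks). Hypothetical helper for illustration.
ext4_stripe_opts() {
    local chunk_kb=$1
    local block_kb=$2
    local data_disks=$3
    local stride=$(( chunk_kb / block_kb ))
    echo "stride=${stride},stripe-width=$(( stride * data_disks ))"
}

# Usage:
#   mkfs.ext4 -E "$(ext4_stripe_opts 64 4 2)" /dev/...
```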
Most cloud-managed disks aren't user-RAIDed (the cloud handles the redundancy at a lower layer), so this is less common. For specific high-IOPS setups, it matters.
Once you've tuned, monitor:
Tools:
iostat -x 1: per-device stats. The await columns (r_await and w_await on current sysstat) are average IO latency including queue time. Older versions also print svctm (service time), but it is deprecated and unreliable, and newer releases drop it, so watch await instead.
iotop: which processes are doing IO. Useful for "why is the disk pegged."
nvme smart-log: NVMe-specific health (wear leveling, error counts, temperature). NVMe SSDs have a finite write endurance; SMART tells you how close to it you are.
For Prometheus, node_exporter exposes node_disk_* metrics that capture all of this.
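When iostat isn't installed, the same raw counters are in /proc/diskstats: field 4 is reads completed, field 7 is milliseconds spent reading, field 8 is writes completed, field 11 is milliseconds spent writing. A rough sketch that divides them; note this gives an average since boot, not the per-interval figure iostat reports, and avg_io_ms is a hypothetical helper name:

```shell
# avg_io_ms DISKSTATS_FILE DEVICE
# Prints average read and write latency (ms per completed IO, since boot)
# for DEVICE from a file in /proc/diskstats layout.
# Hypothetical helper for illustration.
avg_io_ms() {
    LC_ALL=C awk -v dev="$2" '
        $3 == dev {
            r = ($4 > 0) ? $7 / $4 : 0     # ms reading / reads completed
            w = ($8 > 0) ? $11 / $8 : 0    # ms writing / writes completed
            printf "read_ms=%.2f write_ms=%.2f\n", r, w
        }
    ' "$1"
}

# Usage on a live host:
#   avg_io_ms /proc/diskstats nvme0n1
```

For interval latency, take two samples and divide the deltas instead of the totals.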
Production issues:
Postgres slowing down dramatically every 5 minutes. The vm.dirty_* ratios were at default. Every 5 minutes the kernel would flush GB of dirty pages, IO saturated, queries stalled. Tuned the ratios; problem gone.
Database server's NVMe wearing out faster than expected. Heavy write amplification because of how data=ordered writes journal entries. We moved logs to a separate disk and kept data on the main NVMe. Per-disk write rate dropped, wear leveling improved.
File operations slow on a large filesystem. Discovered the filesystem was 80%+ full. ext4, like most journaling and CoW filesystems, gets slower as it fills up because free space fragments. The right answer wasn't "tune more"; it was "free up space" or "extend the filesystem."
XFS error during a kernel upgrade. A specific kernel had an XFS regression. Rolled back the kernel; reported upstream. Subsequent kernel fixed it. Lesson: don't rush kernel upgrades on database hosts.
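The full-filesystem incident above is the kind of thing a one-line check catches early. A minimal sketch that reads df -P output and lists mount points above a usage threshold; fs_over_pct is our own name, and it assumes mount points without spaces:

```shell
# fs_over_pct THRESHOLD
# Reads `df -P` output on stdin and prints every mount point whose
# Capacity column exceeds THRESHOLD percent.
# Hypothetical helper; assumes mount points contain no spaces.
fs_over_pct() {
    awk -v limit="$1" '
        NR > 1 {
            pct = $5
            sub(/%/, "", pct)
            if (pct + 0 > limit + 0) print $6
        }
    '
}

# Usage on a live host:
#   df -P | fs_over_pct 85
```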
For AWS EBS:
gp3 is the default for most workloads. Configure IOPS and throughput separately from size.
io2 for very high-IOPS needs. Significantly more expensive.
We don't use gp2 anymore (older generation); gp3 is cheaper and faster.
For specific high-IOPS database workloads, we've moved from gp2 to gp3 with provisioned IOPS. Cost is similar; performance is meaningfully better.
For high-throughput sequential workloads, st1 (HDD-based, throughput-optimized) is much cheaper than SSD. Used for our log archival; not for anything latency-sensitive.
Tweaks that don't earn their place for our workloads:
Custom kernel I/O policies via cgroups. Useful for multi-tenant scenarios; we don't have meaningful tenancy at this layer.
Filesystem-specific compression (e.g., btrfs/zfs compression). Operational complexity > the storage savings for us.
Tuning swap aggressively. We mostly avoid swap (set swappiness=10); the workloads are sized to fit in RAM. When we hit swap, we add RAM, not tune.
Multi-pathing for redundancy at the host level. Cloud-managed disks handle this internally.
Default filesystem (ext4 or xfs) with noatime is the baseline. Don't tune unless you have a specific reason.
For NVMe, set IO scheduler to none. Free win on most modern systems.
Watch dirty page accumulation. Default thresholds cause periodic stalls under heavy writes.
Tune read-ahead for your workload. Random access: lower; sequential: higher.
Monitor IO latency, not just throughput. Throughput numbers can hide latency spikes.
Don't disable journaling or barriers. Speed isn't worth data loss.
Keep 15%+ free. Filesystems get slow when full.
Disk performance tuning is one of those areas where defaults are reasonable for most workloads and noticeably wrong for specific ones. The wins are real — 10-20% on the right workload — but the discipline is in measuring before and after, not following a checklist blindly. The cargo-cult version of disk tuning makes things slower more often than faster.