Filesystem choice, mount options, IO schedulers — the per-host tweaks that actually moved disk performance for our database and storage workloads.

Filesystem Optimization for Production Storage

Filesystem and disk-level tuning is one of those topics where defaults are fine for general use and meaningfully wrong for specific workloads. We learned this the slow way — debugging database performance issues that turned out to be IO-related. This post is the working playbook for filesystem and disk tuning, with the actual measurable improvements we've seen.

When this matters #

Disk tuning matters most when:

You run write-heavy workloads (databases, message queues, log aggregators)
p99 latency matters and IO is on the critical path
You're at a scale where 10-20% throughput gain is real money

For most application servers (web frontends, stateless APIs), default filesystem settings are fine. Don't tune for the sake of tuning.

Filesystem choice #

For production Linux, the realistic options are ext4 and xfs. We use both:

ext4: default on most distros. Mature, stable, well-supported. Good for most workloads up to medium size.

xfs: better at scaling to large filesystems and parallel writes. Default for RHEL family. Our database servers (Postgres) use xfs.

Specific reasons we'd pick one over the other:

Database with heavy parallel writes: xfs (better internal locking)
Many small files (Maildir-style): ext4 (lower overhead per file)
Very large filesystem (>16TB): xfs (handles scale better)
General-purpose: ext4 is fine

We don't use btrfs in production. Cool features but operationally fiddly. We don't use ZFS on Linux for the same reason (great filesystem but operational complexity).

Mount options #

These earn their place on most production filesystems:

code

/dev/nvme1n1 /var/lib/postgresql ext4 defaults,noatime,nodiratime 0 2

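noatime: stop updating access timestamps on reads. With strict atime, every read also dirties the inode, which turns reads into writes. Modern kernels default to relatime, which only updates atime when it is older than mtime/ctime or more than 24 hours old, so the saving is smaller than it used to be, but noatime removes the remaining writes.

nodiratime: same for directory access times. noatime already implies it, so listing both is harmless but redundant.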

For our database server, noatime reduced IO by ~6%. Free.

For xfs:

code

/dev/nvme1n1 /var/lib/data xfs defaults,noatime,nodiratime,attr2,inode64 0 2

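inode64: allows inodes to be allocated anywhere on the filesystem. The old inode32 behavior kept inodes in the first 1TB, which hurt locality and could run out of inode space on large filesystems. inode64 has been the XFS default since kernel 3.7, so on current kernels the option documents intent rather than changing behavior. attr2 is likewise already the default, and recent kernels mark it deprecated; drop it if mount logs a warning.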
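To see which mounts on a host still lack noatime, /proc/mounts has everything needed. This is a minimal sketch rather than something lifted from our tooling: the helper names has_mount_opt and audit_noatime are made up for illustration, and it only looks at ext4 and xfs.

```shell
#!/bin/sh
# Sketch only: has_mount_opt and audit_noatime are hypothetical helper names.

# has_mount_opt OPTS NAME: succeed if the comma-separated OPTS contains NAME exactly.
has_mount_opt() {
    case ",$1," in
        *",$2,"*) return 0 ;;
        *) return 1 ;;
    esac
}

# audit_noatime [FILE]: print ext4/xfs mounts whose options lack noatime.
# FILE defaults to /proc/mounts (fields: device mountpoint fstype options dump pass).
audit_noatime() {
    while read -r _dev mnt fstype opts _rest; do
        case "$fstype" in
            ext4|xfs)
                has_mount_opt "$opts" noatime || printf '%s (%s): %s\n' "$mnt" "$fstype" "$opts"
                ;;
        esac
    done < "${1:-/proc/mounts}"
}
```

Running audit_noatime with no argument reads the live /proc/mounts; an empty result means every ext4 and xfs mount already has noatime.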

What we don't enable:

data=writeback on ext4 (faster writes, more data loss risk on crash). The default data=ordered is safer.
nobarrier (disables write barriers). Faster but unsafe on power loss.
nodelalloc (disables delayed allocation). Sometimes recommended; we haven't seen consistent benefit.

The pattern: tweak for things that are clearly safe; don't disable safety features for marginal performance gains.

IO scheduler #

On modern NVMe SSDs, the kernel's IO scheduler often adds latency. Default schedulers are designed for spinning disks; for NVMe, they often hurt.

sh.sh

echo none > /sys/block/nvme0n1/queue/scheduler

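For our NVMe-backed database, switching from mq-deadline (the default we started with; some kernels and distros already pick none for NVMe, so check cat /sys/block/<dev>/queue/scheduler first) to none reduced p99 read latency by ~12% and improved IOPS by ~8%. NVMe handles its own queueing; the kernel's queue management is redundant.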

For magnetic disks (rare in our fleet now), mq-deadline or bfq work better.

Persist across reboots via udev:

code

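# /etc/udev/rules.d/60-iosched.rules
# Match whole NVMe namespaces only: partitions (nvme0n1p1) have no queue/ directory.
ACTION=="add|change", SUBSYSTEM=="block", ENV{DEVTYPE}=="disk", KERNEL=="nvme[0-9]*n[0-9]*", ATTR{queue/scheduler}="none"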
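To confirm which scheduler is actually active after a reboot, read the sysfs file back: the kernel lists the available schedulers and wraps the active one in square brackets. A small sketch of that check follows; the function names are hypothetical, not part of any standard tool.

```shell
#!/bin/sh
# Sketch only: active_scheduler and report_schedulers are hypothetical names.

# active_scheduler LINE: print the bracketed (active) entry from a
# queue/scheduler line such as "[none] mq-deadline kyber bfq".
# Prints nothing if the line has no bracketed entry.
active_scheduler() {
    printf '%s\n' "$1" | sed -n 's/.*\[\([^]]*\)].*/\1/p'
}

# report_schedulers: print "device: scheduler" for every block device in sysfs.
report_schedulers() {
    for f in /sys/block/*/queue/scheduler; do
        [ -r "$f" ] || continue
        dev=${f#/sys/block/}
        dev=${dev%%/*}
        printf '%s: %s\n' "$dev" "$(active_scheduler "$(cat "$f")")"
    done
}
```

Calling report_schedulers on a host prints one line per device, which makes it easy to spot a disk the udev rule missed.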

Read-ahead #

The kernel reads ahead of the actual read request, anticipating sequential access. For random-access workloads (databases), aggressive read-ahead wastes IO.

sh.sh

blockdev --setra 256 /dev/nvme0n1     # 256 blocks of 512 bytes = 128KB

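For Postgres with random access patterns, we set read-ahead to 256 sectors, lower than the 4096 that some distros and tuning profiles apply. The stock kernel default is already 128KB (256 sectors), so check the current value with blockdev --getra before changing anything. On our hosts this reduced IO load by ~10% with no read latency impact.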

For sequential workloads (log writes, large file copies), keep higher read-ahead.

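Persist with a udev rule that sets ATTR{queue/read_ahead_kb} (128 for the value above), or via a boot script. Read-ahead is a block-device setting, not a filesystem one, so tune2fs can't store it.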
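One trap here is units: blockdev --setra and --getra count 512-byte sectors, while /sys/block/<dev>/queue/read_ahead_kb counts KiB. The two helpers below (hypothetical names, plain shell arithmetic) convert between them so a value from one interface can be sanity-checked against the other.

```shell
#!/bin/sh
# Sketch only: helper names are made up for illustration.
# blockdev --setra/--getra use 512-byte sectors;
# /sys/block/<dev>/queue/read_ahead_kb uses KiB.

# ra_sectors_to_kib SECTORS: convert a blockdev read-ahead value to KiB.
ra_sectors_to_kib() { echo $(( $1 * 512 / 1024 )); }

# ra_kib_to_sectors KIB: convert a read_ahead_kb value to 512-byte sectors.
ra_kib_to_sectors() { echo $(( $1 * 1024 / 512 )); }
```

For example, ra_sectors_to_kib 256 gives 128, matching the comment on the blockdev line above.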

Filesystem alignment #

For NVMe SSDs and modern filesystems, alignment is usually correct out of the box. Worth checking with new disks:

sh.sh

parted /dev/nvme0n1 align-check optimal 1

Misaligned filesystems can be ~10-30% slower. Modern tooling avoids this; legacy systems might have issues.
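If parted isn't installed, the same check can be approximated from sysfs, which reports each partition's start offset in 512-byte units. This sketch assumes a 1 MiB alignment target, which is what current partitioning tools use by default; the function name is hypothetical.

```shell
#!/bin/sh
# Sketch only: is_aligned is a hypothetical helper.

# is_aligned START_SECTOR [ALIGN_BYTES]: succeed if the partition's byte offset
# is a multiple of ALIGN_BYTES (default 1 MiB). START_SECTOR is in 512-byte
# units, which is what /sys/block/<dev>/<part>/start reports.
is_aligned() {
    align=${2:-1048576}
    [ $(( $1 * 512 % align )) -eq 0 ]
}
```

Usage would look like is_aligned "$(cat /sys/block/nvme0n1/nvme0n1p1/start)" && echo aligned. A start sector of 2048 passes; the legacy start sector of 63 does not.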

Block sizes #

mkfs.ext4 -b <block-size> controls block size. Larger blocks = less metadata overhead but more space waste for small files.

For databases that mostly do 8KB or 16KB IO (Postgres pages are 8KB by default), ext4's default 4KB block size is fine. Don't try to "optimize" by matching block sizes; the kernel handles it.

Hugepages #

For Postgres specifically, transparent hugepages (THP) used to be recommended off. Modern Postgres with modern kernel handles THP fine. We leave it on default.

For some workloads (HotSpot JVM, custom databases), explicit hugepages help. We don't use them in our stack.

Filesystem cache and dirty pages #

code

vm.dirty_ratio = 15
vm.dirty_background_ratio = 5
vm.dirty_expire_centisecs = 1000

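Mentioned in the Linux performance post; relevant here too. These control when dirty pages get flushed to disk. The kernel defaults (dirty_ratio 20, dirty_background_ratio 10, dirty_expire_centisecs 3000) are too lazy for high-write workloads: on a large-memory host they let gigabytes of dirty pages pile up, and when the kernel finally flushes, it stalls everything.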

For our database server, lower dirty thresholds reduced write latency variance significantly. p99 write latency went from spiky (sometimes 200ms+) to steady (~30ms).
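Because the thresholds are percentages, it helps to translate them into actual sizes for a given host. The sketch below does that arithmetic and reads the current dirty total from /proc/meminfo. The function names are hypothetical, and the 64 GiB figure in the note is only an example; the kernel applies the ratio to available memory rather than total RAM, so treat the result as an upper bound.

```shell
#!/bin/sh
# Sketch only: helper names are made up for illustration.

# dirty_threshold_mib RATIO MEM_MIB: MiB of dirty pages allowed at RATIO percent.
# Approximation: the kernel applies the ratio to available (free + reclaimable)
# memory, not total RAM, so real thresholds are somewhat lower.
dirty_threshold_mib() { echo $(( $1 * $2 / 100 )); }

# current_dirty_kib [FILE]: current dirty page total in KiB.
# FILE defaults to /proc/meminfo.
current_dirty_kib() { awk '/^Dirty:/ { print $2 }' "${1:-/proc/meminfo}"; }
```

On a hypothetical 64 GiB host, dirty_threshold_mib 20 65536 comes out at 13107 MiB for the kernel default, versus 9830 MiB at 15 and 3276 MiB for a background ratio of 5. Watching current_dirty_kib during a write burst shows how close the host gets to those limits.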

Per-filesystem stripe size #

For RAID or LVM-striped volumes, the stripe size matters. ext4 wants to know about it via mkfs options:

sh.sh

mkfs.ext4 -E stride=16,stripe-width=32 /dev/...

Where stride = chunk_size / block_size and stripe-width = stride * num_data_disks. If the filesystem doesn't know the stripe geometry, writes can fall awkwardly across stripes, hurting performance.
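The arithmetic is easy to get wrong by hand, so here is a small sketch that derives both values from the array geometry. The function name is hypothetical; the inputs are the RAID chunk size and filesystem block size in KiB, plus the number of data disks (parity disks don't count, so a 3-disk RAID 5 has 2).

```shell
#!/bin/sh
# Sketch only: ext4_stripe_opts is a hypothetical helper.

# ext4_stripe_opts CHUNK_KIB BLOCK_KIB DATA_DISKS: print the -E argument.
#   stride       = chunk_size / block_size
#   stripe-width = stride * num_data_disks
ext4_stripe_opts() {
    stride=$(( $1 / $2 ))
    echo "stride=${stride},stripe-width=$(( stride * $3 ))"
}
```

With a 64KB chunk, 4KB blocks, and 2 data disks, ext4_stripe_opts 64 4 2 prints stride=16,stripe-width=32, the values used in the mkfs.ext4 line above.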

Most cloud-managed disks aren't user-RAIDed (the cloud handles the redundancy at a lower layer), so this is less common. For specific high-IOPS setups, it matters.

Disk monitoring #

Once you've tuned, monitor:

IOPS (read and write separately)
Throughput (MB/s)
Latency p50/p99
Queue depth
Utilization percentage

Tools:

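iostat -x 1: per-device stats. r_await and w_await are the average read and write latency in milliseconds, including time spent queued (older sysstat versions show a single combined await). Ignore svctm: it was never reliable, and recent sysstat releases removed it. aqu-sz and %util cover queue depth and utilization.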
iotop: which processes are doing IO. Useful for "why is the disk pegged."
nvme smart-log: NVMe-specific health (wear leveling, error counts, temperature). NVMe SSDs have a finite write endurance; SMART tells you how close to it you are.

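For Prometheus, node_exporter exposes node_disk_* metrics that capture most of this. They're counters, so latency comes out as an average (the rate of node_disk_read_time_seconds_total divided by the rate of node_disk_reads_completed_total), not a percentile; p50/p99 need a histogram source.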
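The average latency these tools report is derived from the counters in /proc/diskstats: time spent on IO divided by IOs completed over an interval. A rough sketch of that calculation from two saved snapshots follows; avg_await_ms is a hypothetical name, and it combines reads and writes into a single figure.

```shell
#!/bin/sh
# Sketch only: avg_await_ms is a hypothetical helper.

# avg_await_ms BEFORE AFTER DEV: mean latency in ms per completed IO between
# two saved copies of /proc/diskstats. Columns used: $3 device name,
# $4 reads completed, $7 ms spent reading, $8 writes completed, $11 ms spent writing.
avg_await_ms() {
    awk -v dev="$3" '
        NR == FNR {
            # First file: remember the starting counters for DEV.
            if ($3 == dev) { ios0 = $4 + $8; ms0 = $7 + $11 }
            next
        }
        $3 == dev {
            # Second file: compute deltas and divide.
            ios = ($4 + $8) - ios0
            ms = ($7 + $11) - ms0
            if (ios > 0) { printf "%.2f\n", ms / ios } else { print "0.00" }
        }
    ' "$1" "$2"
}
```

A typical run is cat /proc/diskstats > a; sleep 10; cat /proc/diskstats > b; avg_await_ms a b nvme0n1. Like iostat, this is an average, so it will hide the spikes a p99 would show.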

Specific filesystem incidents we've debugged #

Production issues:

Postgres slowing down dramatically every 5 minutes. The vm.dirty_* ratios were at default. Every 5 minutes the kernel would flush GB of dirty pages, IO saturated, queries stalled. Tuned the ratios; problem gone.

Database server's NVMe wearing out faster than expected. Heavy write amplification because of how data=ordered writes journal entries. We moved logs to a separate disk and kept data on the main NVMe. Per-disk write rate dropped, wear leveling improved.

File operations slow on a large filesystem. Discovered the filesystem was 80%+ full. ext4, like most journaling and copy-on-write filesystems, gets slower as it fills up because the allocator has to work with fragmented free space. The right answer wasn't "tune more"; it was "free up space" or "extend the filesystem."

XFS error during a kernel upgrade. A specific kernel had an XFS regression. Rolled back the kernel; reported upstream. Subsequent kernel fixed it. Lesson: don't rush kernel upgrades on database hosts.

Cloud / managed storage specifics #

For AWS EBS:

gp3 is the default for most workloads. Configure IOPS and throughput separately from size.
io2 for very high-IOPS needs. Significantly more expensive.
Don't use gp2 anymore (older generation); gp3 is cheaper and faster.

For specific high-IOPS database workloads, we've moved from gp2 to gp3 with provisioned IOPS. Cost is similar; performance is meaningfully better.

For high-throughput sequential workloads, st1 (HDD-based, throughput-optimized) is much cheaper than SSD. Used for our log archival; not for anything latency-sensitive.
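Moving an existing volume to gp3 and setting IOPS and throughput independently is a single AWS CLI call. The sketch below only prints the command so it can be reviewed before running. The function name, the volume ID, and the IOPS and throughput numbers in the note are placeholders, not our production values.

```shell
#!/bin/sh
# Sketch only: gp3_modify_cmd is a hypothetical helper that prints, but does
# not run, the AWS CLI call.

# gp3_modify_cmd VOLUME_ID IOPS THROUGHPUT_MIBPS
gp3_modify_cmd() {
    printf 'aws ec2 modify-volume --volume-id %s --volume-type gp3 --iops %s --throughput %s\n' "$1" "$2" "$3"
}
```

For example, gp3_modify_cmd vol-0123456789abcdef0 6000 250 prints the full command for a placeholder volume; pipe it to sh only after checking the numbers against what the workload actually needs.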

What we don't bother with #

Tweaks that don't earn their place for our workloads:

Custom kernel I/O policies via cgroups. Useful for multi-tenant scenarios; we don't have meaningful tenancy at this layer.

Filesystem-specific compression (e.g., btrfs/zfs compression). Operational complexity > the storage savings for us.

Tuning swap aggressively. We mostly avoid swap (set swappiness=10); the workloads are sized to fit in RAM. When we hit swap, we add RAM, not tune.

Multi-pathing for redundancy at the host level. Cloud-managed disks handle this internally.

What I'd tell a team starting #

Default filesystem (ext4 or xfs) with noatime is the baseline. Don't tune unless you have a specific reason.

For NVMe, set IO scheduler to none. Free win on most modern systems.

Watch dirty page accumulation. Default thresholds cause periodic stalls under heavy writes.

Tune read-ahead for your workload. Random access: lower; sequential: higher.

Monitor IO latency, not just throughput. Throughput numbers can hide latency spikes.

Don't disable journaling or barriers. Speed isn't worth data loss.

Keep 15%+ free. Filesystems get slow when full.
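The free-space rule is the easiest one to automate. A minimal sketch, assuming POSIX-format df output on stdin and mount points without spaces; over_threshold is a hypothetical name, and 85 is simply the inverse of the 15% guideline.

```shell
#!/bin/sh
# Sketch only: over_threshold is a hypothetical helper.

# over_threshold PCT: read `df -P` output on stdin and print the mount points
# at or above PCT percent used. Columns: $5 is Capacity, $6 is the mount point.
over_threshold() {
    awk -v limit="$1" '
        NR > 1 {
            use = $5
            sub(/%/, "", use)
            if (use + 0 >= limit + 0) print $6 " " use "%"
        }
    '
}
```

With GNU df, df -P -t ext4 -t xfs | over_threshold 85 lists the filesystems that need attention; no output means everything has at least 15% free.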

Disk performance tuning is one of those areas where defaults are reasonable for most workloads and noticeably wrong for specific ones. The wins are real — 10-20% on the right workload — but the discipline is in measuring before and after, not following a checklist blindly. The cargo-cult version of disk tuning makes things slower more often than faster.

About Kiril Urbonas
