ZFS on Linux: Six Months of Production Use

Migrated our build server array from ext4+mdadm to ZFS on Linux six months ago. Here’s what I learned.

Why ZFS

Checksumming catches silent data corruption (we found 14 affected files on the old array)
Snapshots are cheap and instant (100ms for a 10TB dataset)
Compression often makes things faster: less I/O in exchange for a little more CPU
Send/receive for efficient replication
No separate mdadm/LVM layer to debug

Pool design

For the build server, 6 x 4TB NVMe in RAIDZ2:

zpool create -o ashift=12 \
  -O compression=zstd-3 \
  -O atime=off \
  -O xattr=sa \
  -O acltype=posixacl \
  -O recordsize=1M \
  buildpool raidz2 nvme0n1 nvme1n1 nvme2n1 nvme3n1 nvme4n1 nvme5n1

Notes:

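ashift=12 for 4K-sector drives (almost all modern ones). ashift is log2 of the sector size (2^12 = 4096), and it is fixed per vdev at creation time, so set it explicitly up front
compression=zstd-3 is a good balance: it compresses most data better than lz4, for some extra CPU (zstd needs OpenZFS 2.0 or later)
atime=off avoids a metadata write on every file read, which is pure write amplification on read-heavy workloads
xattr=sa stores extended attributes in the inode (as system attributes) instead of in hidden directories, which is faster
acltype=posixacl enables POSIX ACLs (getfacl/setfacl)
recordsize=1M for large-file workloads (databases want smaller records matching their page size, typically 8K-16K)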
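To check what a drive reports before picking ashift, here is a small sketch. ashift_for_size and ashift_for_device are hypothetical helpers, not part of ZFS. They read the Linux sysfs file /sys/block/<dev>/queue/physical_block_size and take log2 of the value.

```shell
#!/usr/bin/env bash
# Hypothetical helpers (not part of ZFS): derive ashift = log2(sector size).

# Print log2 of a power-of-two byte size, e.g. 4096 -> 12.
ashift_for_size() {
  local size=$1 bits=0
  # ashift must come from an exact power of two; reject anything else.
  if (( size <= 0 || (size & (size - 1)) != 0 )); then
    echo "not a power of two: $size" >&2
    return 1
  fi
  while (( size > 1 )); do
    size=$(( size >> 1 ))
    bits=$(( bits + 1 ))
  done
  echo "$bits"
}

# Look up the physical sector size Linux reports for a device such as nvme0n1.
ashift_for_device() {
  local dev=$1
  ashift_for_size "$(cat "/sys/block/$dev/queue/physical_block_size")"
}
```

For example, ashift_for_size 4096 prints 12 and ashift_for_size 512 prints 9. Some drives report 512-byte sectors even though they work in 4K internally. That is one reason to pass ashift=12 explicitly rather than trust the reported value.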

Memory usage

The ARC (Adaptive Replacement Cache) uses up to half of RAM by default. For dedicated storage servers, bump the cap to around 80%:

echo "options zfs zfs_arc_max=25769803776" > /etc/modprobe.d/zfs.conf  # 24 GiB (24 * 1024^3 bytes)

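On memory-constrained hosts running other workloads, cap it lower. The modprobe option is only read when the zfs module loads, so it takes effect at the next boot. If the module is loaded from the initramfs, regenerate the initramfs as well. To apply the limit without a reboot, write the same byte value to /sys/module/zfs/parameters/zfs_arc_max.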
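Rather than working out the byte value by hand, a short sketch can derive it from the host's RAM. arc_max_bytes and arc_option_line are hypothetical helper names. The only assumption about the system is that MemTotal in /proc/meminfo is given in KiB, which is how Linux reports it.

```shell
#!/usr/bin/env bash
# Hypothetical helpers: size zfs_arc_max as a percentage of RAM.

# mem_kb is in KiB (the unit of MemTotal in /proc/meminfo); prints bytes.
arc_max_bytes() {
  local mem_kb=$1 pct=$2
  echo $(( mem_kb * 1024 * pct / 100 ))
}

# Build the modprobe line for this host without writing it anywhere,
# so it can be reviewed before it goes into /etc/modprobe.d/zfs.conf.
arc_option_line() {
  local pct=${1:-80}
  local mem_kb
  mem_kb=$(awk '/^MemTotal:/ { print $2 }' /proc/meminfo)
  echo "options zfs zfs_arc_max=$(arc_max_bytes "$mem_kb" "$pct")"
}
```

For example, arc_max_bytes 33554432 75 prints 25769803776: 75% of 32 GiB is exactly the 24 GiB value used above. MemTotal is slightly smaller than the installed RAM because the kernel reserves some memory, so on a real host the result lands a little below a round number.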

Snapshots and send/receive

Snapshot daily:

zfs snapshot buildpool/data@daily-$(date +%F)
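To run that daily from cron, an /etc/cron.d entry along these lines should work. The file name and the 00:15 schedule are assumptions. Two details tend to bite:

- Cron's default PATH often lacks the sbin directories, where the zfs binary usually lives.
- % is special in crontab lines (cron turns it into a newline), so it has to be escaped as \%.

```
# /etc/cron.d/zfs-daily-snapshot (hypothetical file name)
PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
# m  h  dom mon dow  user  command
15 0 * * * root zfs snapshot buildpool/data@daily-$(date +\%F)
```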

Replicate incrementally:

zfs send -i @daily-$(date -d yesterday +%F) buildpool/data@daily-$(date +%F) | ssh backup zfs recv backup/data

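The first send has to be a full (non-incremental) one and is slow. After that, incremental sends only transfer the blocks that changed between the two snapshots, so they are very fast.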
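Hard-coding yesterday's snapshot name breaks as soon as a day is missed. A sketch that instead picks the two newest snapshots from zfs list output is more forgiving. incremental_send_cmd is a hypothetical helper. It assumes the snapshots arrive oldest first (zfs list -s creation) and that the backup host is reachable as backup, as in the command above. It only prints the pipeline, so it can be reviewed before running.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print an incremental send/recv pipeline for the two
# newest snapshots. Reads snapshot names on stdin, oldest first, e.g. from:
#   zfs list -H -t snapshot -o name -s creation -d 1 buildpool/data
# Nothing is executed; the command is only printed for review.
incremental_send_cmd() {
  local target=$1          # dataset on the backup host, e.g. backup/data
  local -a snaps
  mapfile -t snaps         # one snapshot name per line
  local n=${#snaps[@]}
  if (( n < 2 )); then
    echo "need at least two snapshots for an incremental send" >&2
    return 1
  fi
  local prev=${snaps[n-2]} last=${snaps[n-1]}
  # -i accepts the short @name form when the source is in the same dataset.
  echo "zfs send -i @${prev#*@} $last | ssh backup zfs recv $target"
}
```

Usage: zfs list -H -t snapshot -o name -s creation -d 1 buildpool/data | incremental_send_cmd backup/data. This assumes the older of the two snapshots has already been received on the backup side. zfs recv rejects an incremental stream whose source snapshot it does not have.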

Performance tuning

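Make sure async destroy is enabled (zpool get feature@async_destroy buildpool). Without it, zfs destroy -r on a large dataset can block for minutes; with it, the space is freed in the background. Pools created by recent OpenZFS releases have it enabled by default.
Set recordsize to match your workload's I/O size. A mismatch causes read/write amplification: with 1M records, an 8K random write makes ZFS rewrite the whole record. recordsize is set per dataset (zfs set recordsize=16K <dataset>) and only affects files created after the change.
Watch zpool iostat -v buildpool 5 (per-device stats every 5 seconds) during production load, and look for devices that are much busier than their peers.
Schedule scrubs (zpool scrub buildpool) weekly or monthly, depending on pool size.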
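A rough way to see why recordsize matters: ZFS checksums and compresses whole records. An uncached I/O smaller than a record therefore still reads the full record, and a partial write rewrites it. The sketch below computes that worst-case ratio. to_bytes and worst_case_amplification are hypothetical helpers. The estimate ignores compression, the ARC, and files smaller than recordsize, which are stored as a single smaller block.

```shell
#!/usr/bin/env bash
# Hypothetical helpers: worst-case I/O amplification for a given recordsize.

# Convert sizes like 8K, 128K or 1M to bytes (powers of 1024, as ZFS uses).
to_bytes() {
  local v=$1
  case $v in
    *K) echo $(( ${v%K} * 1024 )) ;;
    *M) echo $(( ${v%M} * 1024 * 1024 )) ;;
    *)  echo $(( v )) ;;
  esac
}

# Bytes touched per byte requested when an I/O is smaller than one record.
# Ignores compression, the ARC, and files smaller than recordsize.
worst_case_amplification() {
  local rec io
  rec=$(to_bytes "$1")
  io=$(to_bytes "$2")
  if (( io >= rec )); then
    echo 1
  else
    echo $(( rec / io ))
  fi
}
```

With the recordsize=1M used for the build pool, worst_case_amplification 1M 8K prints 128. A database doing uncached 8K random I/O on that dataset could touch up to 128 times the data it asked for, which is why databases get 8K-16K records.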

Things that bit me

Forgot to set canmount=off on a dataset; auto-mounted at wrong path
Used acltype=posix on older ZFS version; should be posixacl
Snapshot accumulation: didn’t set retention, disk filled up silently
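zfs set readonly=on on a parent is only inherited by children that don't set readonly locally (and zfs set has no -r). Find the overrides with zfs get -r -s local readonly <parent>, then clear them with zfs inherit readonly <child>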
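Snapshot retention is easy to script. The sketch below keeps the newest N daily- snapshots and prints a zfs destroy command for each older one. prune_daily_snapshots is a hypothetical helper. It assumes the daily-YYYY-MM-DD naming used earlier and input sorted oldest first. It only prints, so the output can be checked before piping it to sh.

```shell
#!/usr/bin/env bash
# Hypothetical retention helper: keep the newest N "daily-" snapshots and print
# a zfs destroy command for each older one. Nothing is destroyed here.
# Input on stdin: snapshot names oldest first, e.g. from
#   zfs list -H -t snapshot -o name -s creation -d 1 buildpool/data
prune_daily_snapshots() {
  local keep=$1
  local -a daily=()
  local name
  while IFS= read -r name || [[ -n $name ]]; do
    # Only touch snapshots made by the daily job; leave manual ones alone.
    if [[ $name == *@daily-* ]]; then
      daily+=("$name")
    fi
  done
  local excess=$(( ${#daily[@]} - keep ))
  local i
  for (( i = 0; i < excess; i++ )); do
    echo "zfs destroy ${daily[i]}"
  done
}
```

Usage: zfs list -H -t snapshot -o name -s creation -d 1 buildpool/data | prune_daily_snapshots 30. The retention count of 30 is an arbitrary example. Destroying a snapshot that has clones or holds will fail, so review the printed list before running it.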

Would I do it again?

Yes. The data integrity guarantees alone justify the learning curve. Snapshots for rollback during deploys are a game-changer.
