Migrated our build server array from ext4+mdadm to ZFS on Linux six months ago. Here’s what I learned.

Why ZFS

  • Checksumming catches silent data corruption (we found 14 affected files on the old array)
  • Snapshots are cheap and instant (100ms for a 10TB dataset)
  • Compression often makes things faster: less disk I/O in exchange for some CPU
  • Send/receive for efficient replication
  • No separate mdadm/LVM layer to debug

Pool design

For the build server, 6 x 4TB NVMe in RAIDZ2:

zpool create -o ashift=12 \
  -O compression=zstd-3 \
  -O atime=off \
  -O xattr=sa \
  -O acltype=posixacl \
  -O recordsize=1M \
  buildpool raidz2 nvme0n1 nvme1n1 nvme2n1 nvme3n1 nvme4n1 nvme5n1

Notes:

  • ashift=12 for 4K sector drives (almost all modern ones)
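  • compression=zstd-3: good balance of ratio and speed; compresses better than lz4 on most data at a higher CPU cost (needs OpenZFS 2.0 or later)
  • atime=off skips the metadata write that every read would otherwise trigger, which matters most on read-heavy workloads
  • xattr=sa stores extended attributes as system attributes alongside the file's metadata instead of in hidden directories (faster)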
  • recordsize=1M for large-file workloads (databases want smaller, 8K-16K)
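
For a rough sense of what this layout yields, here is a small sketch of the arithmetic. It assumes RAIDZ usable space is roughly (disks - parity) x disk size and ignores padding, metadata, and reserved space, so real numbers come out lower. The function names are made up for illustration.

```shell
#!/usr/bin/env bash
# Rough RAIDZ capacity estimate: (disks - parity) * size per disk.
# Assumption: ignores allocation padding, metadata and reserved space.
# raidz_usable_tb <disks> <size_tb> <parity>
raidz_usable_tb() {
  local disks=$1 size_tb=$2 parity=$3
  echo $(( (disks - parity) * size_tb ))
}

# ashift is the base-2 logarithm of the sector size the pool assumes.
ashift_to_bytes() {
  echo $(( 1 << $1 ))
}

raidz_usable_tb 6 4 2    # 6 x 4TB in RAIDZ2
ashift_to_bytes 12       # ashift=12
```

This prints 16 (TB, before overhead) and 4096 (bytes per sector).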

Memory usage

The ARC (adaptive replacement cache) can grow to half of RAM by default. For dedicated storage servers, raise the cap to around 80%. The value is in bytes and is read when the zfs module loads:

echo "options zfs zfs_arc_max=25769803776" > /etc/modprobe.d/zfs.conf  # 24 GiB

On memory-constrained hosts running other workloads, cap lower.
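
To avoid working out the byte count by hand, something like the following sketch can derive it from a percentage. The helper names are hypothetical, and it only prints the modprobe line for review rather than writing the file. It assumes a Linux host with /proc/meminfo.

```shell
#!/usr/bin/env bash
# Compute a zfs_arc_max value in bytes as a percentage of total RAM.
# arc_max_bytes <total_ram_kib> <percent>
arc_max_bytes() {
  local total_kib=$1 percent=$2
  echo $(( total_kib * 1024 * percent / 100 ))
}

# Total RAM in KiB as reported by the kernel (Linux only).
total_ram_kib() {
  awk '/^MemTotal:/ { print $2 }' /proc/meminfo
}

# Print the modprobe option line for this host at the given percentage.
# Writing it to /etc/modprobe.d/zfs.conf is deliberately left as a manual step.
arc_modprobe_line() {
  echo "options zfs zfs_arc_max=$(arc_max_bytes "$(total_ram_kib)" "$1")"
}

# Example: a host with 32 GiB of RAM (33554432 KiB) capped at 75%.
arc_max_bytes 33554432 75
```

The example prints 25769803776, the same byte count used above. The value can also be written to /sys/module/zfs/parameters/zfs_arc_max to apply it without reloading the module.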

Snapshots and send/receive

Snapshot daily:

zfs snapshot buildpool/data@daily-$(date +%F)

Replicate incrementally:

zfs send -i @yesterday buildpool/data@today | ssh backup zfs recv backup/data

The first full send is slow; subsequent incremental sends only transfer blocks changed since the previous snapshot, so they are much faster.
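
Tying the two commands together, a daily job could look roughly like this sketch. It reuses the dataset, host, and daily-DATE naming from above. The function names are made up, it relies on GNU date for "yesterday", and it assumes yesterday's snapshot already exists on both sides (the very first run needs a full send instead).

```shell
#!/usr/bin/env bash
# Sketch: daily snapshot plus incremental replication.
DATASET="buildpool/data"       # source dataset from the examples above
REMOTE_HOST="backup"           # ssh host from the example above
REMOTE_DATASET="backup/data"   # receiving dataset from the example above

# Build the full snapshot name for a dataset and a YYYY-MM-DD date.
snap_name() {
  printf '%s@daily-%s\n' "$1" "$2"
}

# Take today's snapshot and send the delta since yesterday's.
replicate_daily() {
  local today yesterday
  today=$(date +%F)
  yesterday=$(date -d yesterday +%F)   # GNU date syntax (assumption: Linux)
  zfs snapshot "$(snap_name "$DATASET" "$today")" || return 1
  # Subshell with pipefail so a failed send is not hidden by a successful ssh.
  (
    set -o pipefail
    zfs send -i "@daily-$yesterday" "$(snap_name "$DATASET" "$today")" \
      | ssh "$REMOTE_HOST" zfs recv "$REMOTE_DATASET"
  )
}
```

Nothing runs until replicate_daily is called, for example from a cron job or systemd timer.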

Performance tuning

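  • Check that the async_destroy pool feature is enabled (zpool get feature@async_destroy buildpool). Without it, zfs destroy -r on a large dataset can block for minutes while space is reclaimed; with it, reclamation happens in the background. The -d flag is a separate thing: it only defers destruction of snapshots that are still held or cloned. Newer OpenZFS releases handle large destroys better.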
  • Set proper recordsize for your workload. Wrong choice causes read/write amplification.
  • Watch zpool iostat -v during production load. Look for unbalanced devices.
  • Schedule scrubs weekly or monthly depending on pool size.
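
To put a number on the recordsize point, here is a back-of-the-envelope sketch. It assumes the worst case: a small update to a file larger than recordsize, with nothing cached, so ZFS has to read and rewrite the whole record. The function name is hypothetical.

```shell
#!/usr/bin/env bash
# Worst-case amplification when a small I/O touches one full record:
# ZFS reads and writes whole records, so factor = recordsize / I/O size.
# Assumption: uncached partial-record update on a file larger than recordsize.
# record_amplification <recordsize_bytes> <io_size_bytes>
record_amplification() {
  local recordsize=$1 io_size=$2
  echo $(( recordsize / io_size ))
}

record_amplification $((1024 * 1024)) $((8 * 1024))   # 8K pages on recordsize=1M
record_amplification $((16 * 1024)) $((8 * 1024))     # 8K pages on recordsize=16K
```

This prints 128 and 2, which is why a database on recordsize=1M hurts while large sequential build artifacts do not.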

Things that bit me

  • Forgot to set canmount=off on a dataset; auto-mounted at wrong path
  • Used acltype=posix on an older ZFS version that only accepts posixacl
  • Snapshot accumulation: didn’t set retention, disk filled up silently
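  • zfs set readonly=on didn’t reach children that had their own local readonly value. zfs set has no -r, so clear the overrides with zfs inherit -r readonly or set each dataset explicitly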
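
For the snapshot accumulation problem, a retention job along these lines would have saved me. This is a minimal sketch, not a replacement for a proper tool: the function names are made up, it only looks at snapshots named @daily-... on a single dataset, and it destroys without asking, so test it on something disposable first.

```shell
#!/usr/bin/env bash
# Read snapshot names (oldest first) on stdin and print all but the newest N.
# snapshots_to_prune <keep>
snapshots_to_prune() {
  awk -v keep="$1" '{ name[NR] = $0 }
    END { for (i = 1; i <= NR - keep; i++) print name[i] }'
}

# Destroy daily snapshots of one dataset beyond the newest <keep>.
# prune_daily <dataset> <keep>
prune_daily() {
  # -H: no header, -s creation: oldest first, -d 1: this dataset only
  zfs list -H -t snapshot -o name -s creation -d 1 "$1" \
    | grep '@daily-' \
    | snapshots_to_prune "$2" \
    | while read -r snap; do
        echo "destroying $snap"
        zfs destroy "$snap"
      done
}
```

For example, prune_daily buildpool/data 14 would keep the two most recent weeks of dailies.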

Would I do it again?

Yes. The data integrity guarantees alone justify the learning curve. Snapshots for rollback during deploys are a game-changer.