Migrated our build server array from ext4+mdadm to ZFS on Linux six months ago. Here’s what I learned.
Why ZFS
- Checksumming catches silent data corruption (we found 14 affected files on the old array)
- Snapshots are cheap and instant (100ms for a 10TB dataset)
- Compression often makes things faster: less I/O in exchange for a little more CPU
- Send/receive for efficient replication
- No separate mdadm/LVM layer to debug
Pool design
For the build server, 6 x 4TB NVMe in RAIDZ2:
zpool create -o ashift=12 \
-O compression=zstd-3 \
-O atime=off \
-O xattr=sa \
-O acltype=posixacl \
-O recordsize=1M \
buildpool raidz2 nvme0n1 nvme1n1 nvme2n1 nvme3n1 nvme4n1 nvme5n1
Notes:
- ashift=12 for 4K-sector drives (almost all modern ones)
- compression=zstd-3: good balance, better compression ratio than lz4 for most data
- atime=off avoids a metadata write on every read, which adds up on read-heavy workloads
- xattr=sa stores extended attributes as system attributes in the dnode, ZFS's inode equivalent (faster)
- recordsize=1M for large-file workloads (databases want smaller, 8K-16K)
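ashift is the base-2 logarithm of the sector size, so 12 means 2^12 = 4096 bytes. Here is a small sketch of that arithmetic. The helper name ashift_for_sector is made up for illustration, and the device name is simply the first one from the pool above.

```shell
#!/usr/bin/env bash
# Sketch: map a sector size in bytes to the matching ashift value.
# ashift_for_sector is a hypothetical helper, not a ZFS command.

ashift_for_sector() {
    local size=$1 n=0
    # Find the smallest n such that 2^n >= size.
    while [ $(( 1 << n )) -lt "$size" ]; do
        n=$(( n + 1 ))
    done
    echo "$n"
}

# On Linux the kernel exposes the physical sector size in sysfs.
# nvme0n1 is taken from the pool layout above; adjust for your host.
f=/sys/block/nvme0n1/queue/physical_block_size
if [ -r "$f" ]; then
    echo "suggested ashift: $(ashift_for_sector "$(cat "$f")")"
fi
```

A 512-byte sector maps to 9 and an 8K sector to 13, using the same rule.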
Memory usage
ARC (adaptive replacement cache) uses half of RAM by default. For dedicated storage servers, bump to 80%:
echo "options zfs zfs_arc_max=25769803776" > /etc/modprobe.d/zfs.conf # 24 GB
On memory-constrained hosts running other workloads, cap lower.
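To avoid hand-computing the byte value, something like the following can derive it from total RAM. Treat it as a sketch: arc_max_bytes is a name I made up, the 80 is just the percentage mentioned above, and it prints the modprobe line instead of writing it so you can check it first.

```shell
#!/usr/bin/env bash
# Sketch: compute zfs_arc_max (bytes) as a percentage of total RAM.
# arc_max_bytes is a hypothetical helper, not part of ZFS.

arc_max_bytes() {
    local mem_kib=$1 percent=$2
    # /proc/meminfo reports MemTotal in KiB; convert to bytes, then scale.
    echo $(( mem_kib * 1024 * percent / 100 ))
}

if [ -r /proc/meminfo ]; then
    mem_kib=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)
    # Print the line that would go in /etc/modprobe.d/zfs.conf.
    echo "options zfs zfs_arc_max=$(arc_max_bytes "$mem_kib" 80)"
fi
```

As a sanity check, 80% of a hypothetical 30 GiB host comes out to 25769803776, the 24 GB figure used above. The modprobe setting applies when the module is next loaded; the same value can also be written to /sys/module/zfs/parameters/zfs_arc_max to change it at runtime.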
Snapshots and send/receive
Snapshot daily:
zfs snapshot buildpool/data@daily-$(date +%F)
Replicate incrementally:
zfs send -i @yesterday buildpool/data@today | ssh backup zfs recv backup/data
The first full send is slow; subsequent incremental sends transfer only the blocks that changed between the two snapshots, so they are much faster.
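With date-stamped snapshot names, the @yesterday and @today placeholders have to be filled in from the calendar. A rough sketch of how that could look follows. replication_cmd is a hypothetical helper, the host and target names are the ones from the example above, and it only prints the pipeline rather than running it.

```shell
#!/usr/bin/env bash
# Sketch: build the incremental send pipeline for daily-YYYY-MM-DD snapshots.
# replication_cmd is a hypothetical helper; it prints the command, it does not run it.

replication_cmd() {
    local dataset=$1 prev=$2 curr=$3 host=$4 target=$5
    printf 'zfs send -i @%s %s@%s | ssh %s zfs recv %s\n' \
        "$prev" "$dataset" "$curr" "$host" "$target"
}

today=$(date +%F)
yesterday=$(date -d yesterday +%F)   # GNU date syntax, fine on Linux

replication_cmd buildpool/data "daily-$yesterday" "daily-$today" backup backup/data
```

Printing first makes it easy to eyeball the snapshot names before anything touches the backup host. This assumes yesterday's snapshot exists on both sides; if a day was skipped, the incremental send will fail rather than silently sending the wrong range.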
Performance tuning
- Enable async destroy: zfs destroy -r without -d can block for minutes on large pools. Newer kernels handle this better.
- Set proper recordsize for your workload. Wrong choice causes read/write amplification.
- Watch zpool iostat -v during production load. Look for unbalanced devices.
- Schedule scrubs weekly or monthly depending on pool size.
Things that bit me
- Forgot to set canmount=off on a dataset; auto-mounted at wrong path
- Used acltype=posix on older ZFS version; should be posixacl
- Snapshot accumulation: didn't set retention, disk filled up silently
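- zfs set readonly=on didn't take effect on children that had their own local readonly setting; the property is inherited otherwise. zfs set has no -r flag, so clear the overrides with zfs inherit -r readonly or set it per dataset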
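For the snapshot accumulation problem, a retention pass can be as simple as "keep the newest N, destroy the rest". Below is a minimal sketch of the selection step. snapshots_to_prune is a made-up helper, and the retention count of 14 in the usage comment is an arbitrary assumption, not what I actually run.

```shell
#!/usr/bin/env bash
# Sketch: pick which snapshots to destroy, keeping the newest N.
# snapshots_to_prune is a hypothetical helper, not a ZFS command.
# Input: snapshot names on stdin, one per line, oldest first.
# Output: every name except the newest $1.

snapshots_to_prune() {
    local keep=$1
    awk -v keep="$keep" '
        { lines[NR] = $0 }
        END { for (i = 1; i <= NR - keep; i++) print lines[i] }
    '
}

# Possible usage (commented out so the sketch has no side effects):
#   zfs list -H -t snapshot -o name -s creation -d 1 buildpool/data \
#     | grep '@daily-' \
#     | snapshots_to_prune 14 \
#     | xargs -r -n1 zfs destroy
```

Sorting by creation time rather than by name means the selection still works if the naming scheme changes. It is worth running the pipeline without the final xargs stage first to see what would be removed.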
Would I do it again?
Yes. The data integrity guarantees alone justify the learning curve. Snapshots for rollback during deploys are a game-changer.