@pryce-turner
Last active October 23, 2023 16:41
Ethereum POS Staking on ZFS

Staking on ZFS

Intro

I always staked on ZFS before the merge, using a number of SATA SSDs in a simple stripe configuration and adding more as my space requirements increased. The merge imposed additional load on my disks that made my setup no longer appropriate; this sent me down a long road of testing and optimization. Let me say this up front: there are definitely more performant setups than ZFS. I've heard of very good results using mdadm and a simple ext4 filesystem (XFS also works). However, there are so many useful features baked into ZFS (compression, snapshots) and the ergonomics are so good that I was compelled to make this work for my (aging) setup.

Benchmark

I settled on a single fio benchmark for comparing my different setups, based on sar/iostat analyses of working setups. It is as follows: sudo fio --name=randrw --rw=randrw --direct=1 --ioengine=libaio --bs=4k --numjobs=8 --rwmixread=20 --size=1G --runtime=600 --group_reporting. This will lay down several files and perform random reads and writes against them. I always deleted these files between tests, although that may not be necessary. The read/write mix was based on my execution client (Erigon) being fairly write heavy; I imagine it's similar for other EL clients. Please note that this benchmark will never perfectly capture the IO demands of your setup; it's just a synthetic test to use as a reference.
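
For reference, running the benchmark from inside the dataset under test and cleaning up afterwards might look roughly like this (the /tank/erigon mountpoint is just an assumed example path):

# Mountpoint is an assumption; run from inside the dataset you want to test.
cd /tank/erigon
sudo fio --name=randrw --rw=randrw --direct=1 --ioengine=libaio --bs=4k \
  --numjobs=8 --rwmixread=20 --size=1G --runtime=600 --group_reporting
# fio names its test files after the job, so they're easy to remove between runs.
rm -f randrw.*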

In my experience, you'll want the above benchmark to produce, at minimum: read: IOPS=6500, BW=25MiB/s; write: IOPS=25k, BW=100MiB/s.

Pool Geometry

Pool geometry (stripes, mirrors, raidz, etc.) makes a huge difference in what performance/redundancy profile you're trying to achieve. You could write a book about this, so I'll just say that if pure performance is your concern, a simple flat stripe with no parity will be your best bet. However, I think making mirror pairs and striping them would also be very performant and give you easier disaster recovery. I took a lot of the performance recommendations from this excellent article.
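
For illustration, the two geometries might be created roughly like this (pool and device names are placeholders):

# Option 1: flat stripe, no parity - maximum performance, no redundancy.
zpool create -o ashift=12 tank /dev/sda /dev/sdb /dev/sdc /dev/sdd
# Option 2: striped mirror pairs - easier disaster recovery at half the usable capacity.
zpool create -o ashift=12 tank mirror /dev/sda /dev/sdb mirror /dev/sdc /dev/sdd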

Pool properties

ZFS controls almost all of its tuning through properties on the ZFS dataset. I used the following successfully (example commands follow the list):

  • recordsize=4K
  • compression=lz4
  • atime=off
  • xattr=sa
  • redundant_metadata=most
  • primarycache=metadata (This will slow down the fio benchmark but in theory should be faster with Erigon, since it handles its own caching)
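
As a sketch, applying these properties to a dataset would look something like the following (the tank/erigon name is just an example):

# Dataset name is an assumption; adjust to your pool/dataset.
zfs create tank/erigon
zfs set recordsize=4K tank/erigon
zfs set compression=lz4 tank/erigon
zfs set atime=off tank/erigon
zfs set xattr=sa tank/erigon
zfs set redundant_metadata=most tank/erigon
zfs set primarycache=metadata tank/erigon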

I got most of these from another great article that goes over a lot of the "why" behind these recommendations. I would stay away from touching the checksum and sync properties, as you may deeply regret it down the road.

Observations / Things to watch out for

  • My original pool had pathological performance issues that only arose when it was stressed by the merge. I still don't know how those came about, but my recommendation would be to set up everything (geometry, properties) before adding any data to the pool and then not change it.
  • If you have the know-how, having a backup validator on AWS or the like gives you a lot more freedom to experiment and pay attention to the details. Don't rush these things, there's a lot on the line.
  • Keep detailed notes about all these tests. The numbers start to add up and you'll start second-guessing yourself. I use Trello for all my projects and love it.
  • If you're syncing from scratch on Erigon, I recommend setting --db.pagesize=16K in the Erigon command and setting recordsize=16K to take advantage of ZFS compression, as shown in the sketch below. It may sound counter-intuitive, but compression presents such a minor compute overhead compared to your IO latency that you'll actually get a performance boost from the disks needing to address fewer sectors than they would if the data were uncompressed.
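
As a rough sketch of that last point (dataset name and datadir path are assumptions, and your usual Erigon flags go alongside):

# Match the MDBX page size to the dataset recordsize before a fresh sync.
zfs set recordsize=16K tank/erigon
erigon --datadir=/tank/erigon --db.pagesize=16K
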
@yorickdowne

On the low end, Commit cycle in=124.343244ms, and on the high end, Commit cycle in=2.240448914s.

It's extremely variable, but mostly moves around 1s. 0.5 quantile for engine_newPayloadV1 is 1.06 seconds, 0.9 quantile is 2.13 seconds, and the slowest is 2.86 seconds.

@yorickdowne

While I don't know how to improve the performance, I know what doesn't help: The kernel parameters at https://www.percona.com/blog/mysql-zfs-performance-update/ make it far worse with Erigon

@yorickdowne

--state.cache 2048 does not improve the commit cycle

@pryce-turner
Author

I managed to get caught up and attesting for a while with a 4K recordsize, but then it began to stumble and fall again, also correlated with free space fragmentation. I agree with you that it's likely not feasible with SATA SSDs, unless we're missing something big. I'd be very curious to see how it performs on NVMe drives...

@yorickdowne

yorickdowne commented Nov 13, 2022

Roger. Moving this over to NVMe then.

Edit: I made a mistake while moving to a new pool and lost my synced DB. Hang loose while this fresh-syncs on NVMe. Won't be very fast as I'm testing on a quad-core.

@mwpastore

FYI: My ZFS testing was done on a pair of high-end Gen3 NVMe SSDs.

@abhishektvz

Hi, I see that the thread hasn't ended with a conclusion. @pryce-turner @mwpastore @yorickdowne, do we have a clear winner when it comes to performance with ext4? If yes, is there any tuning that needs to be done there?

@pryce-turner
Author

@mwpastore were you not able to get it synced on NVMe? Dang, I'm surprised since it was so close on SATA. I'll do some tweaking of my own when I have access to NVMe drives again... until then I'm back and fairly stable on mdadm + XFS.

@yorickdowne were you able to get synced up on NVMe?

@abhishektvz hey friend, thanks for the bump!

@yorickdowne

My sync should finish Monday. Once it is synced I'll get quantiles for NewPayloadV1

@abhishektvz

@mwpastore I might have found the culprit behind the fragmentation: Erigon uses the BitTorrent protocol to sync, and BitTorrent is notoriously known for fragmentation issues due to its random I/O.

https://www.reddit.com/r/zfs/comments/tq6ka7/comment/i2fnmae/?utm_source=share&utm_medium=web2x&context=3

@abhishektvz

--snap.stop Workaround to stop producing new snapshots, if you meet some snapshots-related critical bug (default: false)

--snapshots Default: use snapshots "true" for BSC, Mainnet and Goerli. use snapshots "false" in all other cases (default: true)

We might want to set --snapshots to false.

@abhishektvz

If we absolutely want snapshots, we can just symlink datadir/snapshots to another drive as well.
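
Something like this, with Erigon stopped first (paths are assumptions):

# Move the snapshots dir to another drive and leave a symlink behind.
mv /tank/erigon/snapshots /bulk/erigon-snapshots
ln -s /bulk/erigon-snapshots /tank/erigon/snapshots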

@mwpastore

> I might have found the culprit behind the fragmentation: Erigon uses the BitTorrent protocol to sync, and BitTorrent is notoriously known for fragmentation issues due to its random I/O.

I don't think snapshot I/O is significant after the initial sync—and I think it only uses BitTorrent protocol to download existing snapshots, not make new snapshots from the working set.

@abhishektvz

Yeah.

> I might have found the culprit behind the fragmentation: Erigon uses the BitTorrent protocol to sync, and BitTorrent is notoriously known for fragmentation issues due to its random I/O.

> I don't think snapshot I/O is significant after the initial sync, and I think it only uses the BitTorrent protocol to download existing snapshots, not make new snapshots from the working set.

Yes, but since it's downloading snapshots using torrent, it will download them in random order, and that could be the fragmentation that @pryce-turner was seeing.

@pryce-turner
Author

@abhishektvz it's not a bad theory, but almost all of my testing was done by copying the entire erigon datadir between drives. So everything would have gotten laid down contiguously at first, which is confirmed by the initial 0% FRAG. Catching up a few thousand blocks and staying in sync caused the FRAG to go up.

I'm still not necessarily convinced that's where all the performance degradation is even coming from, but it would be nice to rule it out...

@abhishektvz

Yep, and another thing we might need to check is whether MDBX, the KV database used by Erigon, is causing this. We could try running a separate instance and drive random I/O against it: https://github.com/erthink/libmdbx

@yorickdowne

NVMe does not really help.

rpc_duration_seconds{method="engine_newPayloadV1",success="success",quantile="0.5"} 1.237691304
rpc_duration_seconds{method="engine_newPayloadV1",success="success",quantile="0.9"} 2.115248718
rpc_duration_seconds{method="engine_newPayloadV1",success="success",quantile="0.97"} 2.399096843
rpc_duration_seconds{method="engine_newPayloadV1",success="success",quantile="0.99"} 2.399096843
rpc_duration_seconds{method="engine_newPayloadV1",success="success",quantile="1"} 2.399096843

@yorickdowne

I stopped Erigon, copied the snapshots dir so it wasn't fragmented, and started Erigon again. It seems to make a difference, but it's still abysmal.

rpc_duration_seconds{method="engine_newPayloadV1",success="success",quantile="0.5"} 1.150537948
rpc_duration_seconds{method="engine_newPayloadV1",success="success",quantile="0.9"} 1.4375216370000001
rpc_duration_seconds{method="engine_newPayloadV1",success="success",quantile="0.97"} 1.613267295
rpc_duration_seconds{method="engine_newPayloadV1",success="success",quantile="0.99"} 1.613267295
rpc_duration_seconds{method="engine_newPayloadV1",success="success",quantile="1"} 1.613267295
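
The copy step above, roughly (service name and paths are assumptions):

# Rewrite the snapshots dir contiguously by copying it while Erigon is stopped.
systemctl stop erigon
cp -a /tank/erigon/snapshots /tank/erigon/snapshots.new
rm -rf /tank/erigon/snapshots && mv /tank/erigon/snapshots.new /tank/erigon/snapshots
systemctl start erigon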

@pryce-turner
Author

Yeesh, yeah that's not great. Apparently OpenZFS is trending towards features that are more SSD focused, so maybe there will be a big release at some point that fixes this issue for our use case... it doesn't seem viable at the moment though.

@abhishektvz I'd be curious to see the results of that experiment!

@yorickdowne how are you generating those stats btw?

@yorickdowne

@pryce-turner Curl the metrics interface and grep for engine_newPayloadV1
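
For example, assuming Erigon was started with --metrics on the default address (adjust host, port, and path to your setup):

# Address and path are assumptions for a default --metrics configuration.
curl -s http://127.0.0.1:6060/debug/metrics/prometheus | grep engine_newPayloadV1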

@pryce-turner
Author

> @pryce-turner Curl the metrics interface and grep for engine_newPayloadV1

Awesome, thanks!

@yorickdowne

yorickdowne commented Jan 5, 2023

The 4erigon_2 branch has improved results.

ZFS settings

recordsize             16k
compression            lz4
atime                  off
xattr                  sa
primarycache           all
logbias                throughput
sync                   standard
relatime               off

I get compressratio 1.38x with lz4. This can likely be run without issue on zstd-fast to get more compression.
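
For reference, checking the ratio and switching would look roughly like this (dataset name is an assumption; the new compression only applies to newly written blocks):

zfs get compressratio tank/erigon
zfs set compression=zstd-fast tank/erigon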

Quantiles are better, but not "amazing" with WRITE_MAP=true.

rpc_duration_seconds{method="engine_newPayloadV1",success="success",quantile="0.5"} 0.948885932
rpc_duration_seconds{method="engine_newPayloadV1",success="success",quantile="0.9"} 1.3364602159999999
rpc_duration_seconds{method="engine_newPayloadV1",success="success",quantile="0.97"} 1.405461775
rpc_duration_seconds{method="engine_newPayloadV1",success="success",quantile="0.99"} 1.5774590640000001
rpc_duration_seconds{method="engine_newPayloadV1",success="success",quantile="1"} 1.5774590640000001

There's an fsync happening. If I yolo this and set sync=disabled I get different values:

rpc_duration_seconds{method="engine_newPayloadV1",success="success",quantile="0.5"} 0.125063721
rpc_duration_seconds{method="engine_newPayloadV1",success="success",quantile="0.9"} 1.192553632
rpc_duration_seconds{method="engine_newPayloadV1",success="success",quantile="0.97"} 4.362780238
rpc_duration_seconds{method="engine_newPayloadV1",success="success",quantile="0.99"} 4.362780238
rpc_duration_seconds{method="engine_newPayloadV1",success="success",quantile="1"} 4.362780238

With sync disabled and write_map off as well:

rpc_duration_seconds{method="engine_newPayloadV1",success="success",quantile="0.5"} 0.116620899
rpc_duration_seconds{method="engine_newPayloadV1",success="success",quantile="0.9"} 0.833232106
rpc_duration_seconds{method="engine_newPayloadV1",success="success",quantile="0.97"} 1.035332535
rpc_duration_seconds{method="engine_newPayloadV1",success="success",quantile="0.99"} 1.035332535
rpc_duration_seconds{method="engine_newPayloadV1",success="success",quantile="1"} 1.035332535

More testing is needed, but so far:

  • The new branch / mdbx version helps
  • ZFS has an issue with sync writes, which is well documented and known
  • write_map is not necessary, and possibly not helpful

I am redoing this test with zstd-fast

@pryce-turner
Author

Thanks @yorickdowne, that's good stuff. Interesting to see that disabling sync makes things that much worse... unless I'm reading it wrong. Is this all still on your SATAs or have you switched to NVMe?

@yorickdowne

This is NVMe, and disabling sync made it better, not worse.

That said I've retested this, and it's still abysmal. I'll stop testing, and may resume once OpenZFS 2.2 has been released and is available on Ubuntu 22.04.

The OpenZFS leadership meeting of 1/31 had "Will 2.2 be branched at some point soon?" on the agenda; the video recording is not (yet) on YouTube, so I don't know what was said.

@pryce-turner
Author

Gotcha - thanks for clarifying. Is there any optimization in particular for 2.2 you're hoping for, or just a new version that might be better?

@yorickdowne

Direct I/O, and Docker overlayfs support.

@pryce-turner
Author

Got it - cheers

@j4ys0n

j4ys0n commented Sep 1, 2023

What did y'all land on for the ZFS config? I've got a Geth archive node on ZFS that I need to do something with soon; it's filling up quite quickly these days.

# zfs get all nvme1
NAME   PROPERTY              VALUE                  SOURCE
nvme1  used                  21.5T                  -
nvme1  available             80.1G                  -
nvme1  compressratio         1.00x                  -
nvme1  recordsize            128K                   default
nvme1  compression           off                    default

Setup is currently 12x 4TB FireCuda drives in the ZFS equivalent of RAID10. As you can see above, it's almost full. The strange thing is, ZFS tells me ~80GB is free, while Proxmox tells me ~257GB is free - but the compression ratio is 1.00x. I could have sworn compression was on when I created it, but it looks like I'm wrong. Anyway, I'm running out of space. And for the record, I'm using Teku in conjunction with Geth; that's on a separate RAID10 NVMe volume and is using around 220GB.

This node in particular isn't staking, but I do have a few staking nodes as well (on different servers), so I'd likely apply the same methodology. Here's what I'm thinking, in no particular order:

  1. Get bigger drives. The 8TB Corsair drives have come down in price a bit, though that's still pretty spendy and I'd rather not.
  2. Reconfigure from RAID10 to draid2 or draid3 and just keep going.
  3. Leave the volume as is and use Erigon. It's been a while, but last time I played with it, syncing an archive node had a few issues.
  4. Reconfigure the volume and use Erigon - which I have a feeling is the winner, but I'm not sure.

I do have a pretty big SATA volume on this machine as well; I ran the archive node on that for a bit, but it killed a few of the drives (IronWolf 4TB) and the warranty process was a nightmare. The NVMe volume has been pretty worry-free so far.

On researching ZFS's newer draid vdevs, it seems like block size / record size plays a big factor in the performance of that particular configuration.

@pryce-turner
Author

Hey @j4ys0n, sorry only just getting to this... Are you having performance issues or are you just running out of space and want to do things as best as possible when setting up your new array?

As far as what we landed on with ZFS (apart from "don't do it" 😅), I don't think it has deviated too much from the original recommendations. Bring your recordsize down and try to match it to the db.pagesize (not sure how to do that in Geth, but that's mentioned in the original for Erigon). You should be able to turn compression on with very little overhead.

As far as geometry goes, it kinda depends on how much downtime you can tolerate. If I were you I would get fewer, larger NVMe drives (to reduce the stripe width). I'd make a fast and loose pool with basic striping (no redundancy, max performance) and then get some much cheaper TBs in another pool, which can even be spinners, to back up to. The beauty of ZFS, as you well know (and COW in general), is that the snapshot/replication to the backup pool will be super fast since it's only updating changed blocks. Since you have 2 copies of the data anyway (raid10 mirrors), you might as well save some money and get a performance boost. Again, the main downside is downtime if a drive dies in your main stripe. My 2c.
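
A rough sketch of that snapshot/replication flow (pool and dataset names are assumptions):

# Initial full copy to the backup pool, then incremental sends of only the changed blocks.
zfs snapshot tank/geth@base
zfs send tank/geth@base | zfs recv backup/geth
zfs snapshot tank/geth@daily1
zfs send -i @base tank/geth@daily1 | zfs recv backup/geth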

I can't comment on draid performance; it looks super cool and I'm sure it's stable, it just hasn't been out in the wild long enough. Hope that helps! There's also an awesome channel in the Erigon Discord for ZFS optimizations, FYI. Good luck!

@yorickdowne

yorickdowne commented Sep 6, 2023

Draid isn't any faster than raidz, whether on a per-vdev or multi-vdev basis. The point of draid is a fast hot standby, for setups that would otherwise use several raidz vdevs in the pool.

For the DB, though, you don't want raidz; performance is going to be even worse. A mirror setup is the way to go for speed.

Now, with ZFS you can replace one drive in the mirror with a larger one, then replace the other, and you have the higher capacity. That's one option.
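
Roughly like this (pool and device names are assumptions; wait for each resilver to finish before replacing the next drive):

# The extra capacity becomes available once both mirror members are larger.
zpool set autoexpand=on tank
zpool replace tank /dev/old1 /dev/new1
zpool replace tank /dev/old2 /dev/new2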
