@pryce-turner
Last active October 23, 2023 16:41
Ethereum POS Staking on ZFS

Staking on ZFS

Intro

I always staked on ZFS before the merge, using a number of SATA SSDs in a simple stripe configuration, adding more as my space requirements increased. The merge imposed additional load on my disks that meant my setup was no longer appropriate; this sent me down a long road of testing and optimization. Let me say this up front: there are definitely more performant setups for this than ZFS. I've heard of very good results using mdadm and a simple ext4 filesystem (XFS also works). However, there are so many useful features baked into ZFS (compression, snapshots) and the ergonomics are so good that I was compelled to make this work for my (aging) setup.

Benchmark

I settled on a single fio benchmark for comparing my different setups, based on sar/iostat analyses of working setups. It is as follows:

sudo fio --name=randrw --rw=randrw --direct=1 --ioengine=libaio --bs=4k --numjobs=8 --rwmixread=20 --size=1G --runtime=600 --group_reporting

This will lay down several files (one 1G file per job, in the current directory) and perform random reads and writes at a 20% read / 80% write mix. I always deleted these files between tests, although that may not be necessary. The read/write mix was based on my execution client (Erigon) being fairly write heavy; I imagine it's similar for other EL clients. Please note that this benchmark will never perfectly capture the IO demands of your setup. It's just a synthetic test to use as a reference.

In my experience, at minimum, you'll want to produce the following from the above benchmark:

  • read: IOPS=6500, BW=25MiB/s
  • write: IOPS=25k, BW=100MiB/s
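To compare a run against that floor, a small shell sketch like the one below can help. The function names and the idea of passing the measured IOPS as arguments are my own assumptions, not something fio provides; only the two thresholds come from the text. It also shows the arithmetic linking IOPS and bandwidth at a 4k block size.

```shell
#!/bin/sh
# Hypothetical helpers: these function names are illustrative, not part of fio.
# The thresholds are the minimums quoted above for the 4k randrw benchmark.
MIN_READ_IOPS=6500
MIN_WRITE_IOPS=25000

# Convert IOPS at a given block size (in KiB) to whole MiB/s (rounded down).
iops_to_mib() {
    # $1 = IOPS, $2 = block size in KiB
    echo $(( $1 * $2 / 1024 ))
}

# Print "ok" if both measured values meet the minimums, otherwise "too slow".
check_fio_result() {
    # $1 = measured read IOPS, $2 = measured write IOPS
    if [ "$1" -ge "$MIN_READ_IOPS" ] && [ "$2" -ge "$MIN_WRITE_IOPS" ]; then
        echo "ok"
    else
        echo "too slow"
    fi
}

# Example with made-up measurements (replace with the numbers fio reports):
check_fio_result 7000 26000
# 6500 IOPS at 4 KiB works out to roughly 25 MiB/s:
iops_to_mib 6500 4
```

The numbers passed to `check_fio_result` here are invented for illustration; read the real ones off the `read:` and `write:` lines of your own fio output.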

Pool Geometry

Pool geometry (stripes, mirrors, raidz, etc.) makes a huge difference to the performance/redundancy profile you can achieve. You could write a book about this, so I'll just say that if pure performance is your concern, a simple flat stripe with no parity will be your best bet. However, I think making mirror pairs and striping them would also be very performant and give you easier disaster recovery. I took a lot of the performance recommendations from this excellent article.
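As a rough illustration of the two geometries mentioned, the sketch below only prints the `zpool create` command for each layout rather than running it. The pool name `tank`, the device names, and the helper function names are placeholders I made up.

```shell
#!/bin/sh
# Dry-run sketch: these functions only PRINT zpool commands, they do not run them.
# Pool name "tank" and the device names below are placeholders.

# Flat stripe: every disk is its own top-level vdev, so there is no redundancy.
stripe_cmd() {
    pool=$1; shift
    echo "zpool create $pool $*"
}

# Striped mirrors: disks are consumed in pairs and each pair becomes a mirror vdev.
# A leftover odd disk is ignored by this sketch.
mirror_cmd() {
    pool=$1; shift
    cmd="zpool create $pool"
    while [ "$#" -ge 2 ]; do
        cmd="$cmd mirror $1 $2"
        shift 2
    done
    echo "$cmd"
}

stripe_cmd tank sda sdb sdc sdd
mirror_cmd tank sda sdb sdc sdd
```

With four disks, the stripe gives you the capacity of all four and dies with any one of them; the striped mirrors give you the capacity of two and survive one failure per pair.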

Pool properties

ZFS controls almost all of its tuning through properties on the dataset. I used the following successfully:

  • recordsize=4K
  • compression=lz4
  • atime=off
  • xattr=sa
  • redundant_metadata=most
  • primarycache=metadata (This will slow down the fio benchmark but in theory should be faster in Erigon since it handles its own caching)

I got most of these from another great article that goes over a lot of the "why" behind these recommendations. I would leave the checksum and sync properties at their defaults; changing them may buy some speed, but you may deeply regret it down the road.
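A minimal sketch of applying the list above: the function prints one `zfs set` command per property so they can be reviewed first. The dataset name `tank/erigon` and the helper name are assumptions for illustration.

```shell
#!/bin/sh
# Dry-run sketch: prints one `zfs set` command per property instead of running it.
# "tank/erigon" is a placeholder dataset name; the helper name is hypothetical.
PROPS="recordsize=4K compression=lz4 atime=off xattr=sa redundant_metadata=most primarycache=metadata"

print_zfs_props() {
    # $1 = dataset to apply the properties to
    for prop in $PROPS; do
        echo "zfs set $prop $1"
    done
}

print_zfs_props tank/erigon
# To apply for real, review the output and then pipe it into a root shell:
#   print_zfs_props tank/erigon | sudo sh
```

Note that recordsize and compression only affect data written after the change, which is one more reason to set everything before loading data.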

Observations / Things to watch out for

  • My original pool had pathological performance issues that only arose when it was stressed from the merge. I still don't know how those came about, but my recommendation would be to set up everything (geometry, properties) before adding any data to the pool and then leave it unchanged.
  • If you have the know-how, having a backup validator on AWS or the like gives you a lot more freedom to experiment and pay attention to the details. Don't rush these things, there's a lot on the line.
  • Keep detailed notes about all these tests. The numbers start to add up and you'll start second-guessing yourself. I use Trello for all my projects and love it.
  • If you're syncing from scratch on Erigon, I recommend setting --db.pagesize=16K in the Erigon command and setting recordsize=16K to take advantage of ZFS compression. It may sound counter-intuitive, but compression presents such a minor compute overhead compared to your IO latency that you'll actually get a performance boost, because the disks need to address fewer sectors than they would if the data were uncompressed.
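Since the last point depends on the database page size and the dataset recordsize agreeing, here is a small sketch that compares the two. The helper names are hypothetical, and the sizes are passed in by hand rather than read from a live system.

```shell
#!/bin/sh
# Hypothetical helper: converts sizes like "16K", "16k" or "16384" to bytes.
to_bytes() {
    case $1 in
        *[Kk]) echo $(( ${1%[Kk]} * 1024 )) ;;
        *[Mm]) echo $(( ${1%[Mm]} * 1024 * 1024 )) ;;
        *)     echo "$1" ;;
    esac
}

# Print "match" if the two sizes are equal once converted to bytes.
sizes_match() {
    # $1 = database page size, $2 = ZFS recordsize
    if [ "$(to_bytes "$1")" -eq "$(to_bytes "$2")" ]; then
        echo "match"
    else
        echo "mismatch"
    fi
}

sizes_match 16K 16384
```

On a live system the second argument could come from `zfs get -Hp -o value recordsize <dataset>`, which prints the value in bytes.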
@pryce-turner
Author

Thanks @yorickdowne, that's good stuff. Interesting to see that disabling sync makes things that much worse... unless I'm reading it wrong. Is this all still on your SATAs or have you switched to NVMe?

@yorickdowne

This is NVMe, and disabling sync made it better, not worse.

That said I've retested this, and it's still abysmal. I'll stop testing, and may resume once OpenZFS 2.2 has been released and is available on Ubuntu 22.04.

Leadership meeting of 1/31 had "Will 2.2 be branched at some point soon?" - the video recording is not (yet) on Youtube so I don't know what was said.

@pryce-turner
Author

Gotcha - thanks for clarifying. Is there any optimization in particular for 2.2 you're hoping for, or just a new version that might be better?

@yorickdowne

Directio, and docker overlayfs support

@pryce-turner
Author

Got it - cheers

@j4ys0n

j4ys0n commented Sep 1, 2023

what did y'all land on for the zfs config? i've got a geth archive node on zfs that i need to do something with soon. it's filling up quite quickly these days.

# zfs get all nvme1
NAME   PROPERTY              VALUE                  SOURCE
nvme1  used                  21.5T                  -
nvme1  available             80.1G                  -
nvme1  compressratio         1.00x                  -
nvme1  recordsize            128K                   default
nvme1  compression           off                    default

setup is currently 12x 4tb firecuda drives in the zfs equivalent of raid10. as you can see above, it's almost full. strange thing is, zfs tells me ~80gb is free, while proxmox tells me ~257gb is free - but compression ratio is 1.0. i could have sworn compression was on when i created it, but looks like i'm wrong. anyway, running out of space. and for the record, i'm using teku in conjunction with geth. that's on a separate raid10 nvme volume and is using around 220gb.

this node in particular isn't staking, but i do have a few staking nodes also (on different servers). so i'd likely apply the same methodology. here's what i'm thinking, in no particular order.

  1. get bigger drives. the 8tb corsair drives have come down in price a bit. though, that's still pretty spendy and i'd rather not.
  2. reconfigure from raid10 to draid2 or draid3 and just keep going.
  3. leave the volume as is and use erigon. it's been a while but last time i played with it syncing an archive node had a few issues.
  4. reconfigure the volume and use erigon - which i have a feeling is the winner, but i'm not sure.

i do have a pretty big sata volume on this machine also, i ran the archive node on that for a bit, but it killed a few of the drives (ironwolf 4tb) and the warranty process was a nightmare. the nvme volume has been pretty worry free so far.

on researching zfs's newer draid vdevs, it seems like block size / record size plays a big factor in performance of that particular configuration.

@pryce-turner
Author

Hey @j4ys0n, sorry only just getting to this... Are you having performance issues or are you just running out of space and want to do things as best as possible when setting up your new array?

As far as what we landed on with zfs (apart from don't do it 😅) I think it hasn't deviated too much from the original recommendations. Bring your recordsize down and try to match it with the db.pagesize (not sure how to do that in geth but that's mentioned in the original for erigon). You should be able to turn compression on with very little overhead.

As far as geometry goes.. it kinda depends on how much downtime you can tolerate. If I were you I would get fewer, larger NVMe drives (to reduce the stripe width). I'd make a fast and loose pool with basic striping (no redundancy, max performance) and then get some much cheaper TBs in another pool, can even be spinners, to back up to. The beauty of ZFS as you well know (and COW in general) is that the snapshot/replication to the backup pool will be super fast since it's only updating changed blocks. Since you have 2 copies of the data anyways (raid10 mirrors) you might as well save some money and get a performance boost. Again, main downside is downtime if a drive dies in your main stripe. My 2c.

I can't comment on draid performance, looks super cool and I'm sure is stable, just hasn't been out in the wild long enough. Hope that helps! There's also an awesome channel in the erigon discord for zfs optimizations fyi. Good luck!
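For anyone wanting to try the fast-pool-plus-backup-pool idea described above, this is a rough dry-run sketch of one incremental backup cycle. The dataset and snapshot names and the helper name are placeholders; the function only prints the commands.

```shell
#!/bin/sh
# Dry-run sketch: prints the commands for one incremental backup cycle.
# Dataset and snapshot names are placeholders; the helper name is hypothetical.
replication_cmds() {
    # $1 = source dataset, $2 = backup dataset
    # $3 = previous snapshot name, $4 = new snapshot name
    echo "zfs snapshot $1@$4"
    echo "zfs send -i $1@$3 $1@$4 | zfs receive $2"
}

replication_cmds fast/erigon backup/erigon monday tuesday
```

The very first cycle has no previous snapshot, so it needs a full send (`zfs send fast/erigon@monday | zfs receive backup/erigon`); after that only the changed blocks travel.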

@yorickdowne

yorickdowne commented Sep 6, 2023

Draid isn’t any faster than raidz - on a per vdev or several vdev basis. The point of draid is fast hot standby, for setups that would otherwise use several raidz vdevs in the pool.

For the DB though you don’t want raidz, performance is going to be even worse. Mirror setup is the way for speed.

Now with zfs you can replace one drive in the mirror, replace the other, and you have the higher capacity. That’s one option.
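A minimal sketch of that replace-one-then-the-other approach for a single mirror pair, printed as a dry run. Pool and device names are placeholders, the helper name is hypothetical, and it assumes the pool's `autoexpand` property is used to pick up the extra capacity.

```shell
#!/bin/sh
# Dry-run sketch: prints the commands for growing one mirror pair in place.
# Pool and device names are placeholders; the helper name is hypothetical.
grow_mirror_cmds() {
    # $1 = pool, $2 = old disk A, $3 = new disk A, $4 = old disk B, $5 = new disk B
    echo "zpool set autoexpand=on $1"
    # Replace the first side, then wait for the resilver to finish
    # (check with `zpool status`) before touching the second side.
    echo "zpool replace $1 $2 $3"
    echo "zpool replace $1 $4 $5"
}

grow_mirror_cmds tank nvme0n1 nvme4n1 nvme1n1 nvme5n1
```

The extra space only shows up once both sides of the mirror are on the larger drives.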
