@pryce-turner
Last active October 23, 2023 16:41
Ethereum POS Staking on ZFS

Staking on ZFS

Intro

I always staked on ZFS before the merge, using a number of SATA SSDs in a simple stripe configuration and adding more as my space requirements increased. The merge imposed additional load on my disks that made my setup no longer appropriate; this sent me down a long road of testing and optimization. Let me say this up front: there are definitely more performant setups than ZFS. I've heard of very good results using mdadm and a simple ext4 filesystem (XFS also works). However, there are so many useful features baked into ZFS (compression, snapshots) and the ergonomics are so good that I was compelled to make this work for my (aging) setup.

Benchmark

I settled on a single fio benchmark for comparing my different setups, based on sar/iostat analyses of working setups. It is as follows: sudo fio --name=randrw --rw=randrw --direct=1 --ioengine=libaio --bs=4k --numjobs=8 --rwmixread=20 --size=1G --runtime=600 --group_reporting. This will lay down several files and perform random reads and writes against them. I always deleted these files between tests, although that may not be necessary. The read/write mix was based on my execution client (Erigon) being fairly write heavy; I imagine it's similar for other EL clients. Please note that this benchmark will never perfectly capture the IO demands of your setup; it's just a synthetic test to use as a reference.
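
For reference, running the benchmark from inside the dataset under test and cleaning up afterwards might look roughly like this (the /tank/erigon mountpoint is just an assumed example path):

# Mountpoint is an assumption; run from inside the dataset you want to test.
cd /tank/erigon
sudo fio --name=randrw --rw=randrw --direct=1 --ioengine=libaio --bs=4k \
  --numjobs=8 --rwmixread=20 --size=1G --runtime=600 --group_reporting
# fio names its test files after the job, so they're easy to remove between runs.
rm -f randrw.*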

In my experience, you'll want the above benchmark to produce, at minimum: read: IOPS=6500, BW=25MiB/s; write: IOPS=25k, BW=100MiB/s.

Pool Geometry

Pool geometry (stripes, mirrors, raidz, etc.) makes a huge difference in what performance/redundancy profile you're trying to achieve. You could write a book about this, so I'll just say that if pure performance is your concern, a simple flat stripe with no parity will be your best bet. However, I think making mirror pairs and striping them would also be very performant and give you easier disaster recovery. I took a lot of the performance recommendations from this excellent article.
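
For illustration, the two geometries might be created roughly like this (pool and device names are placeholders):

# Option 1: flat stripe, no parity - maximum performance, no redundancy.
zpool create -o ashift=12 tank /dev/sda /dev/sdb /dev/sdc /dev/sdd
# Option 2: striped mirror pairs - easier disaster recovery at half the usable capacity.
zpool create -o ashift=12 tank mirror /dev/sda /dev/sdb mirror /dev/sdc /dev/sdd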

Pool properties

ZFS controls almost all of its tuning through properties on the ZFS dataset. I used the following successfully (example commands follow the list):

  • recordsize=4K
  • compression=lz4
  • atime=off
  • xattr=sa
  • redundant_metadata=most
  • primarycache=metadata (This will slow down the fio benchmark but in theory should be faster with Erigon, since it handles its own caching)
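
As a sketch, applying these properties to a dataset would look something like the following (the tank/erigon name is just an example):

# Dataset name is an assumption; adjust to your pool/dataset.
zfs create tank/erigon
zfs set recordsize=4K tank/erigon
zfs set compression=lz4 tank/erigon
zfs set atime=off tank/erigon
zfs set xattr=sa tank/erigon
zfs set redundant_metadata=most tank/erigon
zfs set primarycache=metadata tank/erigon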

I got most of these from another great article that goes over a lot of the "why" behind these recommendations. I would stay away from touching the checksum and sync properties, as you may deeply regret it down the road.

Observations / Things to watch out for

  • My original pool had pathological performance issues that only arose when it was stressed by the merge. I still don't know how those came about, but my recommendation would be to set up everything (geometry, properties) before adding any data to the pool and then not change it.
  • If you have the know-how, having a backup validator on AWS or the like gives you a lot more freedom to experiment and pay attention to the details. Don't rush these things, there's a lot on the line.
  • Keep detailed notes about all these tests. The numbers start to add up and you'll start second-guessing yourself. I use Trello for all my projects and love it.
  • If you're syncing from scratch on Erigon, I recommend setting --db.pagesize=16K in the Erigon command and setting recordsize=16K to take advantage of ZFS compression, as shown in the sketch below. It may sound counter-intuitive, but compression presents such a minor compute overhead compared to your IO latency that you'll actually get a performance boost from the disks needing to address fewer sectors than they would if the data were uncompressed.
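
As a rough sketch of that last point (dataset name and datadir path are assumptions, and your usual Erigon flags go alongside):

# Match the MDBX page size to the dataset recordsize before a fresh sync.
zfs set recordsize=16K tank/erigon
erigon --datadir=/tank/erigon --db.pagesize=16K
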
@yorickdowne

On the low end, Commit cycle in=124.343244ms, and on the high end, Commit cycle in=2.240448914s.

It's extremely variable, but mostly moves around 1s. 0.5 quantile for engine_newPayloadV1 is 1.06 seconds, 0.9 quantile is 2.13 seconds, and the slowest is 2.86 seconds.

@yorickdowne

While I don't know how to improve the performance, I know what doesn't help: The kernel parameters at https://www.percona.com/blog/mysql-zfs-performance-update/ make it far worse with Erigon

@yorickdowne

--state.cache 2048 does not improve the commit cycle

@pryce-turner
Author

I managed to get caught up and attesting for a while with a 4K recordsize, but then it began to stumble and fall again, also correlated with free space fragmentation. I agree with you that it's likely not feasible with SATA SSDs, unless we're missing something big. I'd be very curious to see how it performs on NVMe drives...

@yorickdowne

yorickdowne commented Nov 13, 2022

Roger. Moving this over to NVMe then.

Edit: I made a mistake while moving to a new pool and lost my synced DB. Hang loose while this fresh-syncs on NVMe. Won't be very fast as I'm testing on a quad-core.

@mwpastore

FYI: My ZFS testing was done on a pair of high-end Gen3 NVMe SSDs.

@abhishektvz

Hi, I see that the thread hasn't ended with a conclusion. @pryce-turner @mwpastore @yorickdowne, do we have a clear winner when it comes to performance with ext4? If yes, is there any tuning that needs to be done there?

@pryce-turner
Author

@mwpastore were you not able to get it synced on NVMe? Dang, I'm surprised since it was so close on SATA. I'll do some tweaking of my own when I have access to NVMe drives again... until then I'm back and fairly stable on mdadm + XFS.

@yorickdowne were you able to get synced up on NVMe?

@abhishektvz hey friend, thanks for the bump!

@yorickdowne

My sync should finish Monday. Once it is synced I'll get quantiles for NewPayloadV1

@abhishektvz

@mwpastore I might have found the culprit behind the fragmentation: Erigon uses the BitTorrent protocol to sync, and BitTorrent is notoriously known for fragmentation issues due to its random I/O.

https://www.reddit.com/r/zfs/comments/tq6ka7/comment/i2fnmae/?utm_source=share&utm_medium=web2x&context=3

@abhishektvz

--snap.stop Workaround to stop producing new snapshots, if you meet some snapshots-related critical bug (default: false)

--snapshots Default: use snapshots "true" for BSC, Mainnet and Goerli. use snapshots "false" in all other cases (default: true)

We might want to set --snapshots to false.

@abhishektvz

If we absolutely want snapshots, we can just symlink datadir/snapshots to another drive as well.
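
Something like this, with Erigon stopped first (paths are assumptions):

# Move the snapshots dir to another drive and leave a symlink behind.
mv /tank/erigon/snapshots /bulk/erigon-snapshots
ln -s /bulk/erigon-snapshots /tank/erigon/snapshots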

@mwpastore

> I might have found the culprit behind the fragmentation: Erigon uses the BitTorrent protocol to sync, and BitTorrent is notoriously known for fragmentation issues due to its random I/O.

I don't think snapshot I/O is significant after the initial sync—and I think it only uses BitTorrent protocol to download existing snapshots, not make new snapshots from the working set.

@abhishektvz

Yeah.

> I might have found the culprit behind the fragmentation: Erigon uses the BitTorrent protocol to sync, and BitTorrent is notoriously known for fragmentation issues due to its random I/O.

> I don't think snapshot I/O is significant after the initial sync, and I think it only uses the BitTorrent protocol to download existing snapshots, not make new snapshots from the working set.

Yes, but since it's downloading snapshots using torrent, it will download them in random order, and that could be the fragmentation that @pryce-turner was seeing.

@pryce-turner
Author

@abhishektvz it's not a bad theory, but almost all of my testing was done by copying the entire erigon datadir between drives. So everything would have gotten laid down contiguously at first, which is confirmed by the initial 0% FRAG. Catching up a few thousand blocks and staying in sync caused the FRAG to go up.

I'm still not necessarily convinced that's where all the performance degradation is even coming from, but it would be nice to rule it out...

@abhishektvz

Yep, and another thing we might need to check is whether MDBX, the KV database used by Erigon, is causing this. We could try running a separate instance and drive random I/O against it: https://github.com/erthink/libmdbx

@yorickdowne

NVMe does not really help.

rpc_duration_seconds{method="engine_newPayloadV1",success="success",quantile="0.5"} 1.237691304
rpc_duration_seconds{method="engine_newPayloadV1",success="success",quantile="0.9"} 2.115248718
rpc_duration_seconds{method="engine_newPayloadV1",success="success",quantile="0.97"} 2.399096843
rpc_duration_seconds{method="engine_newPayloadV1",success="success",quantile="0.99"} 2.399096843
rpc_duration_seconds{method="engine_newPayloadV1",success="success",quantile="1"} 2.399096843

@yorickdowne

I stopped Erigon, copied the snapshots dir so it wasn't fragmented, and started Erigon again. It seems to make a difference, but it's still abysmal.

rpc_duration_seconds{method="engine_newPayloadV1",success="success",quantile="0.5"} 1.150537948
rpc_duration_seconds{method="engine_newPayloadV1",success="success",quantile="0.9"} 1.4375216370000001
rpc_duration_seconds{method="engine_newPayloadV1",success="success",quantile="0.97"} 1.613267295
rpc_duration_seconds{method="engine_newPayloadV1",success="success",quantile="0.99"} 1.613267295
rpc_duration_seconds{method="engine_newPayloadV1",success="success",quantile="1"} 1.613267295
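
The copy step above, roughly (service name and paths are assumptions):

# Rewrite the snapshots dir contiguously by copying it while Erigon is stopped.
systemctl stop erigon
cp -a /tank/erigon/snapshots /tank/erigon/snapshots.new
rm -rf /tank/erigon/snapshots && mv /tank/erigon/snapshots.new /tank/erigon/snapshots
systemctl start erigon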

@pryce-turner
Author

Yeesh, yeah that's not great. Apparently OpenZFS is trending towards features that are more SSD focused, so maybe there will be a big release at some point that fixes this issue for our use case... it doesn't seem viable at the moment though.

@abhishektvz I'd be curious to see the results of that experiment!

@yorickdowne how are you generating those stats btw?

@yorickdowne

@pryce-turner Curl the metrics interface and grep for engine_newPayloadV1
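
For example, assuming Erigon was started with --metrics on the default address (adjust host, port, and path to your setup):

# Address and path are assumptions for a default --metrics configuration.
curl -s http://127.0.0.1:6060/debug/metrics/prometheus | grep engine_newPayloadV1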

@pryce-turner
Author

> @pryce-turner Curl the metrics interface and grep for engine_newPayloadV1

Awesome, thanks!

@yorickdowne

yorickdowne commented Jan 5, 2023

The 4erigon_2 branch has improved results.

ZFS settings

recordsize             16k
compression            lz4
atime                  off
xattr                  sa
primarycache           all
logbias                throughput
sync                   standard
relatime               off

I get compressratio 1.38x with lz4. This can likely be run without issue on zstd-fast to get more compression.
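
For reference, checking the ratio and switching would look roughly like this (dataset name is an assumption; the new compression only applies to newly written blocks):

zfs get compressratio tank/erigon
zfs set compression=zstd-fast tank/erigon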

Quantiles are better, but not "amazing" with WRITE_MAP=true.

rpc_duration_seconds{method="engine_newPayloadV1",success="success",quantile="0.5"} 0.948885932
rpc_duration_seconds{method="engine_newPayloadV1",success="success",quantile="0.9"} 1.3364602159999999
rpc_duration_seconds{method="engine_newPayloadV1",success="success",quantile="0.97"} 1.405461775
rpc_duration_seconds{method="engine_newPayloadV1",success="success",quantile="0.99"} 1.5774590640000001
rpc_duration_seconds{method="engine_newPayloadV1",success="success",quantile="1"} 1.5774590640000001

There's an fsync happening. If I yolo this and set sync=disabled I get different values:

rpc_duration_seconds{method="engine_newPayloadV1",success="success",quantile="0.5"} 0.125063721
rpc_duration_seconds{method="engine_newPayloadV1",success="success",quantile="0.9"} 1.192553632
rpc_duration_seconds{method="engine_newPayloadV1",success="success",quantile="0.97"} 4.362780238
rpc_duration_seconds{method="engine_newPayloadV1",success="success",quantile="0.99"} 4.362780238
rpc_duration_seconds{method="engine_newPayloadV1",success="success",quantile="1"} 4.362780238

With sync disabled and write_map off as well:

rpc_duration_seconds{method="engine_newPayloadV1",success="success",quantile="0.5"} 0.116620899
rpc_duration_seconds{method="engine_newPayloadV1",success="success",quantile="0.9"} 0.833232106
rpc_duration_seconds{method="engine_newPayloadV1",success="success",quantile="0.97"} 1.035332535
rpc_duration_seconds{method="engine_newPayloadV1",success="success",quantile="0.99"} 1.035332535
rpc_duration_seconds{method="engine_newPayloadV1",success="success",quantile="1"} 1.035332535

More testing is needed, but so far:

  • The new branch / mdbx version helps
  • ZFS has an issue with sync writes, which is well documented and known
  • write_map is not necessary, and possibly not helpful

I am redoing this test with zstd-fast

@pryce-turner
Author

Thanks @yorickdowne, that's good stuff. Interesting to see that disabling sync makes things that much worse... unless I'm reading it wrong. Is this all still on your SATAs or have you switched to NVMe?

@yorickdowne

This is NVMe, and disabling sync made it better, not worse.

That said I've retested this, and it's still abysmal. I'll stop testing, and may resume once OpenZFS 2.2 has been released and is available on Ubuntu 22.04.

The OpenZFS leadership meeting of 1/31 had "Will 2.2 be branched at some point soon?" on the agenda; the video recording is not (yet) on YouTube, so I don't know what was said.

@pryce-turner
Author

Gotcha - thanks for clarifying. Is there any optimization in particular for 2.2 you're hoping for, or just a new version that might be better?

@yorickdowne

Direct I/O, and Docker overlayfs support.

@pryce-turner
Author

Got it - cheers

@j4ys0n

j4ys0n commented Sep 1, 2023

What did y'all land on for the ZFS config? I've got a Geth archive node on ZFS that I need to do something with soon; it's filling up quite quickly these days.

# zfs get all nvme1
NAME   PROPERTY              VALUE                  SOURCE
nvme1  used                  21.5T                  -
nvme1  available             80.1G                  -
nvme1  compressratio         1.00x                  -
nvme1  recordsize            128K                   default
nvme1  compression           off                    default

Setup is currently 12x 4TB FireCuda drives in the ZFS equivalent of RAID10. As you can see above, it's almost full. The strange thing is, ZFS tells me ~80GB is free, while Proxmox tells me ~257GB is free - but the compression ratio is 1.00x. I could have sworn compression was on when I created it, but it looks like I'm wrong. Anyway, I'm running out of space. And for the record, I'm using Teku in conjunction with Geth; that's on a separate RAID10 NVMe volume and is using around 220GB.

This node in particular isn't staking, but I do have a few staking nodes as well (on different servers), so I'd likely apply the same methodology. Here's what I'm thinking, in no particular order:

  1. Get bigger drives. The 8TB Corsair drives have come down in price a bit, though that's still pretty spendy and I'd rather not.
  2. Reconfigure from RAID10 to draid2 or draid3 and just keep going.
  3. Leave the volume as is and use Erigon. It's been a while, but last time I played with it, syncing an archive node had a few issues.
  4. Reconfigure the volume and use Erigon - which I have a feeling is the winner, but I'm not sure.

I do have a pretty big SATA volume on this machine as well; I ran the archive node on that for a bit, but it killed a few of the drives (IronWolf 4TB) and the warranty process was a nightmare. The NVMe volume has been pretty worry-free so far.

On researching ZFS's newer draid vdevs, it seems like block size / record size plays a big factor in the performance of that particular configuration.

@pryce-turner
Author

Hey @j4ys0n, sorry only just getting to this... Are you having performance issues or are you just running out of space and want to do things as best as possible when setting up your new array?

As far as what we landed on with ZFS (apart from "don't do it" 😅), I don't think it has deviated too much from the original recommendations. Bring your recordsize down and try to match it to the db.pagesize (not sure how to do that in Geth, but that's mentioned in the original for Erigon). You should be able to turn compression on with very little overhead.

As far as geometry goes, it kinda depends on how much downtime you can tolerate. If I were you I would get fewer, larger NVMe drives (to reduce the stripe width). I'd make a fast and loose pool with basic striping (no redundancy, max performance) and then get some much cheaper TBs in another pool, which can even be spinners, to back up to. The beauty of ZFS, as you well know (and COW in general), is that the snapshot/replication to the backup pool will be super fast since it's only updating changed blocks. Since you have 2 copies of the data anyway (raid10 mirrors), you might as well save some money and get a performance boost. Again, the main downside is downtime if a drive dies in your main stripe. My 2c.
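
A rough sketch of that snapshot/replication flow (pool and dataset names are assumptions):

# Initial full copy to the backup pool, then incremental sends of only the changed blocks.
zfs snapshot tank/geth@base
zfs send tank/geth@base | zfs recv backup/geth
zfs snapshot tank/geth@daily1
zfs send -i @base tank/geth@daily1 | zfs recv backup/geth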

I can't comment on draid performance; it looks super cool and I'm sure it's stable, it just hasn't been out in the wild long enough. Hope that helps! There's also an awesome channel in the Erigon Discord for ZFS optimizations, FYI. Good luck!

@yorickdowne

yorickdowne commented Sep 6, 2023

Draid isn't any faster than raidz, whether on a per-vdev or multi-vdev basis. The point of draid is a fast hot standby, for setups that would otherwise use several raidz vdevs in the pool.

For the DB, though, you don't want raidz; performance is going to be even worse. A mirror setup is the way to go for speed.

Now, with ZFS you can replace one drive in the mirror with a larger one, then replace the other, and you have the higher capacity. That's one option.
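
Roughly like this (pool and device names are assumptions; wait for each resilver to finish before replacing the next drive):

# The extra capacity becomes available once both mirror members are larger.
zpool set autoexpand=on tank
zpool replace tank /dev/old1 /dev/new1
zpool replace tank /dev/old2 /dev/new2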
