
@yorickdowne, last active June 1, 2023 02:58

Great and less great SSDs for Ethereum nodes

Overview

Syncing an Ethereum node is largely reliant on IOPS (I/O operations per second). Budget SSDs will struggle to an extent, and some won't be able to sync at all.

This document aims to snapshot some known good and known bad models.

For size, 2TB comes recommended as of mid-2022. 1TB can work for now but is getting tight.

At a high level, QLC and DRAMless drives are far slower than "mainstream" SSDs.

IOPS-wise, it's likely Geth, then Besu, then Nethermind, in ascending order of IOPS requirements. I am not quite sure where Erigon fits these days.

Other than a slow SSD model, these are things that can slow IOPS down:

  • Heat. Check with smartctl -x; the SSD should stay below 50C so it does not throttle. See the quick checks after this list.
  • TRIM not being allowed. This can happen with some hardware RAID controllers, as well as on macOS with non-Apple SSDs.
  • On SATA, the controller in UEFI/BIOS set to anything other than AHCI. Set it to AHCI for good performance.
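
A quick way to check temperature and TRIM, assuming an NVMe drive at /dev/nvme0n1 and the database volume mounted at /srv/ethereum (adjust the device and path to your own setup):

# Drive temperature - should stay below roughly 50C under load
sudo smartctl -x /dev/nvme0n1 | grep -i temperature

# Manually TRIM the mounted filesystem; an error here suggests TRIM is not being passed through
sudo fstrim -v /srv/ethereum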

If you haven't already, turn off atime on your DB volume; it'll increase SSD lifetime and speed things up a little bit.
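
For example, an /etc/fstab entry with atime disabled might look like this - the device and mount point below are placeholders, not a recommendation:

/dev/nvme0n1p1  /srv/ethereum  ext4  defaults,noatime  0  2

An already-mounted volume can also be switched on the fly with mount -o remount,noatime /srv/ethereum.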

The drive lists are ordered by interface and alphabetically by vendor name, not by preference. The lists are by no means exhaustive. @mwpastore linked a filterable spreadsheet in the comments below that covers a far greater variety of drives and their characteristics.

The Good

"Mainstream" and "Performance" drive models that can sync mainnet execution layer clients in a reasonable amount of time. Use M.2 NVMe if your machine supports it.

Note that in some cases older "Performance" PCIe 4 drives can be bought at a lower price than a PCIe 3 "Mainstream" drive - shop around.

  • Often on sale: Samsung 970 EVO Plus, SK Hynix P31 Gold
  • Higher TBW than most: Seagate Firecuda 530, WD Red SN700
  • Lowest power draw: SK Hynix P31 Gold - great choice for Rock5 B and other low-power devices

We've started crowd-sourcing some IOPS numbers. If you want to join the fun, run the following in a directory on the drive you want to test and give us the read and write IOPS it reports. Don't forget to rm test afterwards.

fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 --name=test --filename=test --bs=4k --iodepth=64 --size=150G --readwrite=randrw --rwmixread=75
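
The two numbers to report are in fio's end-of-run summary, on lines that look roughly like the following (the values here are made up purely for illustration):

  read: IOPS=68.3k, BW=267MiB/s
  write: IOPS=22.8k, BW=89.1MiB/s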

Hardware

M.2 NVMe "Mainstream" - TLC, DRAM, PCIe 3

  • AData XPG Gammix S11/SX8200 Pro. Several hardware revisions. It's slower than some QLC drives. 68k/22k r/w IOPS
  • AData XPG Gammix S50 Lite
  • HP EX950
  • Mushkin Pilot-E
  • Samsung 970 EVO Plus 2TB, pre-rework (firmware 2B2QEXM7). 140k/46k r/w IOPS
  • Samsung 970 EVO Plus 2TB, post-rework (firmware 3B2QEXM7 or 4B2QEXM7). In testing this syncs just as quickly as the pre-rework drive
  • SK Hynix P31 Gold
  • WD Black SN750 (but not SN750 SE)
  • WD Red SN700

2.5" SATA "Mainstream" - TLC, DRAM

  • Crucial MX500 SATA, 46k/15k r/w IOPS
  • Samsung 860 EVO SATA
  • Samsung 870 EVO SATA, 63k/20k r/w IOPS
  • WD Blue 3D NAND SATA

Honorable Pi4 mention:

  • Samsung T5 USB - works but is slow; avoid it if at all possible and go for M.2 NVMe instead, with a Rock5 B or CM4. To clarify: if you stay with the Pi4, a T5 USB and a USB M.2 NVMe adapter should perform roughly the same, so choose either. Consider going for NVMe and a USB adapter so you can upgrade to a Rock5 B in future.

M.2 NVMe "Performance" - TLC, DRAM, PCIe 4 or 5

  • ADATA XPG Gammix S70
  • Corsair Force MP600
  • Crucial P5 Plus
  • Kingston KC2000 / KC3000 / Fury Renegade
  • Mushkin Redline Vortex
  • Sabrent Rocket 4 Plus
  • Samsung 980 Pro (not 980) - a firmware update to 5B2QGXA7 is necessary to keep them from dying if they are on firmware 3B2QGXA7; see the firmware check after this list. Samsung's bootable update Linux is a bit broken, so you may want to flash from your own Linux install.
  • Samsung 990 Pro - there are reports of the 990 Pro rapidly losing health. A firmware update to 1B2QJXD7 is meant to stop the rapid degradation, but won't reverse any that happened on earlier firmware.
  • Seagate Firecuda 530, 428k/143k r/w IOPS
  • SK Hynix P41 Platinum / Solidigm P44 Pro
  • WD Black SN850
  • WD Black SN850X, 101k/33k r/w IOPS
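
For the Samsung firmware notes above, the installed firmware version can be checked with smartctl before deciding whether an update is needed - assuming the drive is /dev/nvme0n1:

sudo smartctl -a /dev/nvme0n1 | grep -i firmware

Compare the reported "Firmware Version" against the versions listed above; the update itself is still applied with Samsung's update image or from Linux, as noted.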

Data center SSD drives will also work well.

Cloud

  • Any baremetal/dedicated server service
  • AWS i3en.2xlarge
  • AWS gp3 w/ >=10k IOPS provisioned and an m6i/a.xlarge
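
As an illustration of the gp3 option, a minimal AWS CLI sketch - the size, IOPS, throughput, and availability zone here are example values only:

aws ec2 create-volume --availability-zone us-east-1a --volume-type gp3 --size 2000 --iops 10000 --throughput 250

An existing gp3 volume can be raised to the same settings with aws ec2 modify-volume --volume-id <id> --iops 10000 --throughput 250.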

The Bad

These "Budget" drive models are reportedly too slow to sync (all) mainnet execution layer clients.

Hardware

  • AData S40G/SX8100 4TB, QLC - the 2TB model is TLC and should be fine; 4TB is reportedly too slow
  • Crucial P1, QLC - users report it can't sync Nethermind
  • Crucial P2 and P3 (Plus), QLC and DRAMless - users report they can't sync Nethermind
  • Kingston NV1 - probably QLC and DRAMless and thus too slow on 2TB, but could be "anything" as Kingston do not guarantee specific components.
  • Kingston NV2 - like NV1 no guaranteed components
  • WD Green SN350, QLC and DRAMless
  • Anything both QLC and DRAMless will likely not be able to sync at all or not be able to consistently keep up with "chain head"
  • Crucial BX500 SATA, HP S650 SATA, probably most SATA budget drives
  • Samsung 980, DRAMless - unsure, this may belong in "Ugly". If you have one and can say for sure, please come to the ethstaker Discord.
  • Samsung T7 USB, even with current firmware

Cloud

  • Contabo SSD
  • Netcup VPS Servers - reportedly able to sync Geth but not Nethermind

The Ugly

"Budget" drive models that reportedly can sync mainnet execution layer clients, if slowly.

Note that QLC drives usually have a markedly lower TBW than TLC, and will fail earlier.
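
To keep an eye on wear, NVMe drives expose endurance counters that smartctl can read - a minimal check, assuming the drive is /dev/nvme0n1:

sudo smartctl -a /dev/nvme0n1 | grep -iE 'percentage used|data units written'

"Percentage Used" is the drive's own estimate of consumed endurance, and "Data Units Written" can be compared against the model's rated TBW.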

Hardware

  • Corsair MP400, QLC
  • Inland Professional 3D NAND, QLC
  • Intel 660p, QLC. It's faster than some "mainstream" drives. 98k/33k r/w IOPS
  • Seagate Barracuda Q5, QLC
  • WD Black SN770, DRAMless
  • Samsung 870 QVO SATA, QLC

Cloud

  • Contabo NVMe - fast enough but not enough space. 800 GiB is not sufficient.
  • Netcup RS Servers. Reportedly fast enough to sync Nethermind or Geth; still no speed demon.
@uniyj commented Oct 17, 2022

Suggested addition: WD Black SN770 (note: SN750 seems to be its predecessor) also works fine.

@mwpastore commented Nov 1, 2022

NewMaxx is a great resource for this stuff, in particular this flowchart, this written guide, and this filterable spreadsheet.

The spreadsheet tends to be the most up-to-date and often gives useful information in the notes, e.g. single- or double-sided, whether some components have been upgraded or downgraded over time a la Samsung 970 EVO+, whether certain capacities are significantly faster or slower, if the 4TB model is QLC even though the other capacities are TLC, etc.

@yorickdowne (Author)

Thanks for sharing! That sheet is great. Filtering for DRAM yes, TLC, and "2TB/4TB" should do the trick.

@mwpastore commented Nov 7, 2022

It's starting to look like write latency and IOPS at QD1 are the most important drive specs. That would explain why e.g. Erigon doesn't sync very quickly even on my SK hynix P41 that can do ~1.3M IOPS at high queue depths and thread counts. It also explains why drives based on QLC NAND tend to be a poor choice (because QLC has absolutely terrible write latency).

EDIT: Actually, the pattern is lots of QD1 reads while assembling the batch, then a big write at a higher queue depth. So read latency and IOPS at QD1 are important, as is the ability of the drive to recover from that big write and resume reading at full speed.
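
For anyone who wants to measure this directly, a QD1 random-read variant of the fio command from the gist above would look something like this (smaller file size as an example; delete the test file afterwards, and note that --gtod_reduce is omitted so fio also reports latency):

fio --randrepeat=1 --ioengine=libaio --direct=1 --name=qd1test --filename=qd1test --bs=4k --iodepth=1 --numjobs=1 --size=16G --readwrite=randread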

Additionally, it seems like there's a lot of on-disk reorganization happening during syncing. That would explain why we're seeing an extreme rate of fragmentation on ZFS (and probably other CoW filesystems). It also explains why DRAM-less drives tend to be a poor choice (because the FTL can't keep up with the high rate of P/E cycles).

It would be interesting to try syncing with an SSD based on 3D XPoint. I have an older Optane 905p on hand, unfortunately it's only 960GB so I would have to get creative. While Optane is probably cost-prohibitive for most, if it works well it could be feasible to use a smaller Optane drive as a writeback cache for a larger capacity of cheap QLC NAND. That's effectively how Intel's own H10 and H20 products work.

@yorickdowne (Author) commented Nov 9, 2022

Erigon in particular is gated by CPU performance, because it does a full sync. I've seen it sync in under 4 days on a machine with a lot of RAM and cores; and take weeks on a more "consumer" setup.

You are absolutely right that there's a lot of reorg. That's because state shifts, and with it the trie shifts. There was an article on how this behaves with Geth. While Erigon has a great many optimizations, it also cannot escape the fundamental nature of Ethereum.

@mwpastore commented Nov 14, 2022

I just hit a new personal best, syncing a mainnet full node (not archive) from scratch using Erigon and Lighthouse in roughly 48 hours after downloading snapshots.

  • Erigon 2.30.0: with --externalcl --prune=htcr
  • Lighthouse 3.2.1: with checkpoint sync against my primary Lighthouse instance, without --sync-historic-states
  • CPU: i5-13600K (P-cores only, stock settings)
  • RAM: 32GB DDR4-3200 CL16 CR2 1:1 (Gear 1), no swap
  • OS: Ubuntu Server 22.04.1 LTS
  • Storage:
    • Erigon:
      • snapshots: Samsung 970 EVO+ 2TB
      • temp: Samsung 970 EVO+ 2TB
      • everything else: Intel Optane 905p 960GB
    • Lighthouse: Samsung 970 EVO+ 2TB
    • LVM2, all LVs are linear and formatted with XFS

I think the high IPC of these Raptor Cove P-cores helped a ton, and the 905p was able to keep up with the sustained, high, random read IOPS at low queue depths (up to 11K) toward the end of the initial sync. It was still hitting 36–42 blk/s at block 159* (vs ~20 at the same point on my 5950X and SK hynix P41 2TB). The 905p also seems to be able to transition more quickly between the batch assembly and commit phases, although not much more so than a high-end Gen4 NAND flash drive like I expected.

What I'll do now is move Erigon's main LV to the 970 EVO+ and repeat the exact same sync. That will help me isolate how much of the speedup is from the Optane drive vs. the rest of the hardware. Specifically, per your notes on CPU performance, I'd like to know the difference between syncing with Raptor Cove vs Zen 3 P-cores.

UPDATE: After ~24 hours I'm at block 127* and already dropping below 40 blk/s. So far at least it seems like the 905p did speed things up measurably on otherwise the same hardware, but it remains to be seen how significantly it affected total sync time.

UPDATE 2: After ~48 hours I'm at block 153* and already dropping below 30 blk/s. The system appears to be I/O-bound per PSI; sustained random read IOPS at low queue depths seems to be peaking around 3.7K on the EVO+, or about a third of what I was seeing on the 905p at roughly the same point in the sync process.

UPDATE 3: It finished syncing last night/overnight and unfortunately I didn't check it before I went to bed and wasn't able to capture the exact time in my tmux scrollback buffer. Based on estimated block speed I would guess between 6⅓ and 9½ hours after the 48 hour mark, plus an hour or two for the remaining phases.

So in conclusion if you're largely I/O-bound as you get into blocks from 2021 and later then yes you could benefit from a drive that features high random read IOPS at low queue depths. In late 2022 that translates to an 8–12 hour speedup (48 hours vs 56–60 hours) with low-end 3D XPoint vs high-end TLC NAND on Raptor Cove. Whether it makes sense to spend hundreds and potentially thousands of dollars more on storage for such a speedup is highly questionable.

The next test would be to see if I can achieve a similar speedup with a smaller allocation of the 905p configured as a writeback cache. That would open up the possibility of pairing a relatively-cheap e.g. P1600X with some less expensive storage.

@mwpastore

42 hours with a 110GiB slice of the 905p configured as writeback cache! That exceeds my wildest expectations.

I would still like to play around with writeback vs. writethrough modes but I think my question has been answered: Yes, Optane can significantly speed up the last leg of the initial sync once it's no longer CPU-bound. If you're running on Zen 3 or Golden Cove or newer and want to fully-optimize Erigon re-sync performance, I think it's worth considering.

Newegg has Optane P1600X 118GB drives on sale for $76 a pop. This model doesn't have the absolute best peak specs but it should absolutely destroy QD1 random read. At that price point I will be adding one to my node to complement the two MX500 2TB SATA drives I have in there already.

I'll stop spamming this thread but please let me know if you have questions. Cheers and happy validating!

@yorickdowne (Author)

Very cool, thanks for all the testing! When you say writeback cache, what exactly is the setup you're using?

@mwpastore commented Nov 21, 2022

When you say writeback cache, what exactly is the setup you're using?

LVM2. You create a logical volume, tell LVM2 it's a cache volume, then attach it to another LV. Something like this:

# Carve out the cache data LV and a small metadata LV on the fast drive
lvcreate -L 110G -n cache_1 vgeth /dev/nvme0n1
lvcreate -L 110M -n cache_1_meta vgeth /dev/nvme0n1
# Combine them into a cache pool, then attach it to the Erigon LV in writeback mode
lvconvert --type cache-pool --poolmetadata vgeth/cache_1_meta vgeth/cache_1
lvconvert --type cache --cachepool vgeth/cache_1 --cachemode writeback vgeth/erigon-mainnet

There are easier ways to do it; I use this procedure which includes some extra steps that allow me to optionally put the metadata on a different PV than the data, or use a non-linear topology for the data and/or metadata.
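
If the cache ever needs to be removed again - for example to re-test without it - LVM can flush and detach it in one step, using the same volume names as above:

lvconvert --uncache vgeth/erigon-mainnet

lvs -a lists the cache data and metadata volumes and is a quick way to confirm whether the cache is still attached.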

@BvL13 commented Apr 6, 2023

Awesome list. At @AvadoDServer we switched to the Kingston Fury Renegade with heatsink for the 2, 4, 6, 10, and 12 TB versions.

@dreadedhamish

Question about testing - using this command:
fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 --name=test --filename=test --bs=4k --iodepth=64 --size=150G --readwrite=randrw --rwmixread=75
On one drive I'm troubleshooting, this will take 36 hours. I'm unsure of the mechanism of the test - would I need to wait for the test to complete to get accurate results, or would stopping it after 2 hours give me results that are close enough?

@yorickdowne (Author)

If it takes that long then that’s a very slow drive. You can absolutely just abandon it and mark the drive as slow.
