I was looking after a Dell PowerEdge R420 server with a hardware RAID card in it, a PERC8 H710P Mini. Attached to it were two WD Blue 3D NAND SATA SSDs in RAID-1, serving a write-intensive database: Geth, in this case.
After 11 months of running without problems, I had reason to "resync" Geth, and the server could not keep up with the required IOPS.
Symptoms were "Database compacting, degraded performance" messages for hours, without recovery, and read/write latency in excess of 20/70 ms, as measured by the r_await and w_await columns of sudo iostat -mdx.
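Pulling just those two columns out of iostat's extended output is easier with a small filter that finds them by header name, because the column order changed between sysstat versions. This is a minimal sketch assuming a POSIX shell with awk; the function name iostat_await is hypothetical.

```shell
# Hypothetical helper: print r_await and w_await (ms) per device from "iostat -x" output.
# Columns are found by header name because their position differs between sysstat versions.
iostat_await() {
  awk '
    # Header row: newer sysstat prints "Device", older versions print "Device:"
    $1 ~ /^Device:?$/ {
      r = 0; w = 0
      for (i = 1; i <= NF; i++) {
        if ($i == "r_await") r = i
        if ($i == "w_await") w = i
      }
      next
    }
    # Data rows: only printed once a header containing both columns has been seen
    r && w && NF >= w { printf "%s r_await=%s w_await=%s\n", $1, $r, $w }
  '
}
```

For example, sudo iostat -mdx -y 5 | iostat_await prints one line per device every five seconds; -y skips the first report, which averages over the whole time since boot.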
The cause was that the RAID controller does not support TRIM for the SSDs, so their performance degraded over time.
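One way to see whether TRIM can reach a disk is to check whether the kernel reports discard support for it, which lsblk --discard shows in its DISC-MAX column. The sketch below reads the same sysfs value directly. The function name trim_status is hypothetical, and the SYSFS_ROOT override is an assumption added only so it can be tested against a fake directory tree.

```shell
# Hypothetical helper: report whether the kernel sees discard (TRIM) support for each
# named block device, by reading /sys/block/<dev>/queue/discard_max_bytes.
# A value of 0 means discards are not accepted by the device or by the controller in front of it.
# SYSFS_ROOT is an assumption added only so the function can be tested against a fake tree.
trim_status() {
  root=${SYSFS_ROOT:-/sys}
  for dev in "$@"; do
    f="$root/block/$dev/queue/discard_max_bytes"
    if [ ! -r "$f" ]; then
      echo "$dev: unknown (cannot read $f)"
    elif [ "$(cat "$f")" != 0 ]; then
      echo "$dev: discard supported"
    else
      echo "$dev: no discard support"
    fi
  done
}
```

Usage: trim_status sda sdb. Behind the PERC in RAID mode, the virtual disk would be expected to report no discard support. Once the controller is cross-flashed, the raw SSDs should report it.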
To solve this, I decided to cross-flash the RAID controller to HBA ("IT") mode. It would lose all RAID capability, but I could then use Linux software RAID and have TRIM support.
Very, very carefully, cross-flash the PERC to "IT" mode to make it an HBA. Removing the battery first is a good idea. Keep in mind that the H710P Mini, for example, can also be found in non-blade servers, and its instructions differ from those for the H710 Mini. Getting this wrong can brick the controller. If your server is set to UEFI boot, make sure to also flash the optional UEFI boot ROM to the controller.
After this, the system should still boot from your primary disk, but you have lost RAID, so the next step is to wipe the drives and set up software RAID.
Boot into a Linux Live ISO and wipe the Dell RAID markers from the drives, as well as all partition information. This destroys all data on these SSDs.
# /dev/sda: zero the last 4096 sectors (2 MiB, Dell RAID markers), then the first 4096 (partition table)
sudo dd if=/dev/zero of=/dev/sda bs=512 count=4096 seek=$(( $(sudo blockdev --getsz /dev/sda) - 4096 ))
sudo dd if=/dev/zero of=/dev/sda bs=512 count=4096
# /dev/sdb: same again for the second drive
sudo dd if=/dev/zero of=/dev/sdb bs=512 count=4096 seek=$(( $(sudo blockdev --getsz /dev/sdb) - 4096 ))
sudo dd if=/dev/zero of=/dev/sdb bs=512 count=4096
Verify with sudo wipefs /dev/sda (and likewise /dev/sdb) and sudo dmraid -r that the markers are gone.
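The wipe can also be wrapped in a pair of functions that zero both ends of a drive and then read those regions back. This is a sketch under stated assumptions: the function names are hypothetical, it relies on GNU dd and GNU cmp as found on common Linux live ISOs, and it also accepts a regular file so the procedure can be rehearsed on a scratch image before touching real disks. It does not replace the wipefs and dmraid -r checks.

```shell
# Hypothetical helpers for the wipe step: zero the first and last 4096 512-byte sectors
# (2 MiB each) of a drive, like the dd commands above, and read the result back.
# Regular files are accepted too, so the procedure can be rehearsed on a scratch image first.
SECTORS=4096  # 512-byte sectors to clear at each end of the drive

# Size of a block device or regular file, in 512-byte sectors.
dev_sectors() {
  if [ -b "$1" ]; then
    blockdev --getsz "$1"
  else
    echo $(( $(wc -c < "$1") / 512 ))
  fi
}

# Zero the start (partition table) and the end (where the Dell RAID markers are expected).
# DESTROYS DATA on the target.
wipe_ends() {
  total=$(dev_sectors "$1") || return 1
  if [ "$total" -lt $(( 2 * SECTORS )) ]; then
    echo "$1: too small" >&2
    return 1
  fi
  # conv=notrunc stops dd from truncating a regular file; it has no effect on a block device
  dd if=/dev/zero of="$1" bs=512 count="$SECTORS" conv=notrunc,fsync status=none || return 1
  dd if=/dev/zero of="$1" bs=512 count="$SECTORS" seek=$(( total - SECTORS )) \
    conv=notrunc,fsync status=none
}

# Succeeds only if both ends read back as zeros (GNU cmp -n bounds the comparison).
check_ends_zeroed() {
  total=$(dev_sectors "$1") || return 1
  bytes=$(( SECTORS * 512 ))
  dd if="$1" bs=512 count="$SECTORS" status=none |
    cmp -s -n "$bytes" - /dev/zero || return 1
  dd if="$1" bs=512 count="$SECTORS" skip=$(( total - SECTORS )) status=none |
    cmp -s -n "$bytes" - /dev/zero
}
```

As root on the live system, wipe_ends /dev/sda && check_ends_zeroed /dev/sda && echo wiped handles the first drive; repeat for /dev/sdb. Like the raw dd commands, this erases the drive.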
Boot into an Ubuntu Server install ISO and follow instructions to set up software RAID-1.
Lastly, once you have booted into your fresh Ubuntu Server install, run sudo fstrim -av.
After this, your SSDs should be as performant as they were when new, or nearly so. In my testing, read/write latency as measured by sudo iostat -mdx during Geth sync came back down to 0.8/2.5 ms, or thereabouts.
Broadcom's latest 9400/9500 series controllers do support TRIM. For Dell, that is the PERC11 line (H755, H755N, H750, H350 and H355), I believe.
PERC9 (H330, H730, H730P, H730P MX and H830 cards) does not support TRIM in RAID mode. It can in JBOD mode, but at that point you're better off cross-flashing it to a plain HBA. Ditto PERC10 (H345, H740P, H745, H745P MX and H840 cards).
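Software RAID-1 in Linux has roughly the same write performance as a single disk, since every write goes to both drives. It can up to double read throughput when several reads run in parallel, as with multi-file reads; a single sequential read stream is still served by one drive.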