microraptor/ecc.md

## ecc.md

      
ECC memory is usually found in servers rather than desktop workstations. However, Ryzen supports it, unlike Intel desktop CPUs. The exception is the non-Pro Ryzen APUs (those with integrated graphics), which don't have ECC support. The mainboard also has to support it, and luckily most Asrock and Asus boards and some Gigabyte boards do. Please note that ECC functionality has nothing directly to do with memory being registered: memory for Ryzen still has to be unregistered/unbuffered, as in a normal desktop build.
The most common type of ECC memory has single-error correction and double-error detection (SECDED). Correctable 1-bit errors are corrected automatically. If an uncorrectable 2-bit error occurs, Linux will kill the process the memory is assigned to, while Windows goes straight to a BSOD stating an uncorrectable error occurred.
A small benefit compared to non-ECC memory is that occasional 1-bit errors have no effect, instead of possibly corrupting data silently. However, errors should be incredibly rare for memory of either kind running at stable clocks. The huge benefit of ECC over non-ECC is that it's easy to tell when your memory is failing, either from the error logs or from outright error messages in the case of 2-bit errors. Failing non-ECC memory might go unnoticed for a while, and troubleshooting the resulting random errors has probably driven way too many people crazy.
Typical ECC memory has 9 chips instead of 8, with the extra chip holding the parity data, and therefore slightly higher hardware costs. Additionally, not as many consumer ECC UDIMMs are produced. In contrast to performance non-ECC UDIMMs, the ECC ones are also usually rated conservatively in both clock speed and timings. Presumably they are not binned as thoroughly, and it is up to you to see how far you can push the speeds. Because of this and the small market, I found it quite difficult to compare prices.
I picked up two M391A1K43BB1-CRC 8GB DDR4-2400 17-17-17 sticks on eBay. The 16GB variant is M391A2K43BB1-CRC. They are Samsung B-dies, but don't clock as well as you might expect. At 1.35 V DRAM voltage I am running them stable at 3466-18-18-18-36 1T with geardown, although I didn't experiment much with tightening the timings.
I have an Asus Prime X470-Pro with a Ryzen 5 3600, and ECC was enabled automatically. This can be confirmed in various apps, by running this command, in memtest86, or in Linux by running sudo dmidecode --type memory, journalctl -k | grep -i edac or edac-util -s. The detection mode can be displayed with cat /sys/devices/system/edac/mc/mc0/rank?/dimm_edac_mode. On my system the command edac-util -v, which is supposed to display a count of corrected and uncorrected errors, didn't work and always displayed 0. However, the number of corrected (1-bit) and uncorrected (2-bit) errors since boot can still be displayed with these two commands:
cat /sys/devices/system/edac/mc/mc0/rank*/dimm_ce_count
cat /sys/devices/system/edac/mc/mc0/rank*/dimm_ue_count
Hints for Bash newbies: Use tab to autocomplete while typing words and paths, as well as the up and down arrows to go through the command history. Don't manually type out this ridiculous path. The * in the path automatically expands to all available ranks when running the command.
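If you would rather see the mode and both counters side by side for every rank, the reads above can be wrapped in a small function. This is only a sketch: the function name list_edac_counts and the optional base-directory argument are my own inventions for illustration, and it assumes your EDAC driver exposes the same mc*/rank* layout as on my system.

```shell
# List ECC mode plus corrected (ce) and uncorrected (ue) counts per rank.
# list_edac_counts is a hypothetical helper name, not an existing tool.
# The optional argument overrides the sysfs base directory, which is
# only useful for trying the function against a test directory.
list_edac_counts() {
    local base="${1:-/sys/devices/system/edac/mc}" dir mode found=0
    for dir in "$base"/mc*/rank*; do
        # If nothing matches, the glob stays a literal pattern; skip it.
        [ -r "$dir/dimm_ce_count" ] || continue
        found=1
        mode=$(cat "$dir/dimm_edac_mode" 2>/dev/null || echo unknown)
        printf '%s mode=%s ce=%s ue=%s\n' \
            "${dir#"$base"/}" "$mode" \
            "$(cat "$dir/dimm_ce_count")" \
            "$(cat "$dir/dimm_ue_count")"
    done
    [ "$found" -eq 1 ] || echo "no EDAC rank counters found under $base" >&2
}

# Usage on a real system:
#   list_edac_counts
```

Each output line names the memory controller and rank it came from, which makes it easier to tell which stick is producing errors.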
Corrected errors are also logged as 'Hardware Error' in the kernel log and uncorrected ones as 'Memory failure' in Ubuntu. To see all past ECC errors in the log these commands can be used:
journalctl -rt kernel | grep -F '[Hardware Error]' | less
journalctl -rt kernel | grep -F 'Memory failure' | less
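For a quick tally instead of paging through the log, the two greps can be combined into one pass. The sketch below only counts matching lines; count_ecc_log_lines is a made-up name. Keep in mind that a single error usually spans several '[Hardware Error]' lines and that the kernel uses the same marker for other machine check reports, so treat the numbers as a rough indicator rather than an exact error count.

```shell
# Count kernel log lines carrying the two markers mentioned above.
# Reads log text on stdin, so it works with journalctl output or a saved log.
# count_ecc_log_lines is a hypothetical helper name.
count_ecc_log_lines() {
    awk '
        index($0, "[Hardware Error]") { hw++ }
        index($0, "Memory failure")   { mf++ }
        END {
            printf "hardware_error_lines=%d memory_failure_lines=%d\n", hw, mf
        }
    '
}

# Usage on a real system:
#   journalctl -t kernel | count_ecc_log_lines
```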
Rasdaemon also worked fine for me and can be used to log errors for eternity. After installing the service the current numbers are shown with ras-mc-ctl --summary.
In Windows I was not able to detect corrected errors. I would be thankful if anybody knows how I could log them. Usually they are logged as WHEA events in the system log of the Event Viewer, but that doesn't seem to work on my system (there are a lot of reports that this works with Asrock boards, though). At least uncorrected errors in Windows are easy to see in the moment, thanks to the large error message on the rather obvious BSOD.
When overclocking ECC memory, it is important to remember that not all stability testing tools will detect single-bit errors on ECC memory, since the tool might look for wrong memory values, which your system will never deliver. I used the following method and recommend an Ubuntu live USB stick if you don't have a Linux distro handy. If WHEA logging of corrected errors works for you in Windows, you can do the same there. The method can also be used to confirm that ECC error handling works correctly, by raising the clock speed until the memory becomes unstable.
First disable any swap, so only physical memory is used: sudo swapoff -a. I am using stressapptest, because it was the fastest to produce memory errors for me. In Ubuntu it can be installed with sudo apt install stressapptest. Run it via: sudo stressapptest -M 14500 -s 300. 14500 is the amount of memory to test in MB, which works fine with my 16GB but might have to be adjusted for you. 300 is the number of seconds to run the test and can be adjusted to your liking. To monitor for errors, run these commands in parallel, each in its own terminal window:
journalctl -fkx
watch cat /sys/devices/system/edac/mc/mc0/rank*/dimm_ce_count
watch cat /sys/devices/system/edac/mc/mc0/rank*/dimm_ue_count
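If you don't want to keep three terminals open, the same check can be scripted: take the counter totals before the test, run the test, and print the difference afterwards. The sketch below also suggests a value for -M instead of hard-coding 14500. Everything in it is an assumption of mine rather than part of stressapptest or the EDAC tools: the helper names, the EDAC_BASE override variable, and the choice of 90% of MemAvailable as the test size.

```shell
# Sketch only: EDAC_BASE, sum_edac, suggest_test_mb and run_and_report are
# hypothetical names, not part of any existing tool.

# Sum one counter type ("ce" or "ue") over all ranks of all memory controllers.
sum_edac() {
    local kind="$1" base="${EDAC_BASE:-/sys/devices/system/edac/mc}" total=0 f
    for f in "$base"/mc*/rank*/dimm_"$kind"_count; do
        # An unmatched glob stays a literal pattern and is skipped here.
        [ -r "$f" ] && total=$(( total + $(cat "$f") ))
    done
    echo "$total"
}

# Suggest a test size in MB: 90% of MemAvailable (the 90% is an assumption).
suggest_test_mb() {
    awk '/^MemAvailable:/ { print int($2 * 9 / 10 / 1024) }' "${1:-/proc/meminfo}"
}

# Run the given command, then print how much the error totals grew meanwhile.
run_and_report() {
    local ce_before ue_before
    ce_before=$(sum_edac ce)
    ue_before=$(sum_edac ue)
    "$@"
    echo "corrected: +$(( $(sum_edac ce) - ce_before )), uncorrected: +$(( $(sum_edac ue) - ue_before ))"
}

# Usage on a real system (left as comments so that pasting the block
# does not start a stress test by accident):
#   sudo swapoff -a
#   run_and_report sudo stressapptest -M "$(suggest_test_mb)" -s 300
```

A non-zero corrected count after a run at raised clocks means the memory is unstable but ECC is doing its job. The kernel log is still worth checking afterwards for details.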
Hint: to boot straight into the UEFI settings, use the command systemctl reboot --firmware-setup. Something similar can be done in Windows by holding Shift while clicking on Restart.
As with non-ECC memory, I also recommend testing stability with Passmark's memtest86, which boots from its own USB stick. The default settings with 4 passes take about 3 hours on my system.