Turn off AER logging of corrected-severity events for NVMe devices

Motherboard: Asus Pro WS WRX80E-SAGE SE WIFI
Card: Asus HYPER M.2 X16 GEN 4 CARD
NVMe: 4x Samsung SSD 980 PRO 1TB
OS: Linux fedora 5.16.12-200.fc35.x86_64

AER (Advanced Error Reporting) logs excessively:

dmesg

nvme 0000:44:00.0: AER: aer_status: 0x00000001, aer_mask: 0x00000000
nvme 0000:44:00.0:    [ 0] RxErr                  (First)
nvme 0000:44:00.0: AER: aer_layer=Physical Layer, aer_agent=Receiver ID
nvme 0000:44:00.0: AER: aer_status: 0x00000001, aer_mask: 0x00000000
nvme 0000:44:00.0:    [ 0] RxErr                  (First)
nvme 0000:44:00.0: AER: aer_layer=Physical Layer, aer_agent=Receiver ID

{2085}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 514
{2085}[Hardware Error]: It has been corrected by h/w and requires no further action
{2085}[Hardware Error]: event severity: corrected
{2085}[Hardware Error]:  Error 0, type: corrected
{2085}[Hardware Error]:   section_type: PCIe error
{2085}[Hardware Error]:   port_type: 0, PCIe end point
{2085}[Hardware Error]:   version: 0.2
{2085}[Hardware Error]:   command: 0x0406, status: 0x0010
{2085}[Hardware Error]:   device_id: 0000:44:00.0
{2085}[Hardware Error]:   slot: 0
{2085}[Hardware Error]:   secondary_bus: 0x00
{2085}[Hardware Error]:   vendor_id: 0x144d, device_id: 0xa80a
{2085}[Hardware Error]:   class_code: 010802
{2085}[Hardware Error]:   bridge: secondary_status: 0x0000, control: 0x0000

Note the device id in the logs; in this case it is 0000:44:00.0. Similar logs appear for the other three NVMe disks on the same card, with device ids 0000:43:00.0, 0000:42:00.0, and 0000:41:00.0. For each device id (for example, 0000:44:00.0), turn off correctable error reporting by clearing bit 0 of the 16-bit register at CAP_EXP+0x8 (the PCI Express Device Control register), if it is set. Read the current value and, if bit 0 is set, XOR it with 0x1 to clear it.

setpci -v -s 0000:44:00.0 CAP_EXP+0x8.w
0000:44:00.0 (cap 10 @70) @78 = 2937

The value is 0x2937, so bit 0 is set. Toggle it: 0x2937 XOR 0x1 = 0x2936
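The arithmetic can be checked in the shell. XOR flips bit 0, so it clears the bit only when the bit is currently set; applied again to 0x2936 it would set the bit back. A bitwise AND with the inverted mask clears the bit regardless of its current state. A minimal sketch, where the function names `toggle_bit0` and `clear_bit0` are made up for illustration:

```shell
#!/bin/sh
# Both helpers take a 16-bit register value in hex (without the 0x prefix,
# as setpci prints it) and print the result as four hex digits.

# XOR with 0x1 flips bit 0 whatever its current state.
toggle_bit0() {
    printf '%04x\n' $(( (0x$1 ^ 0x1) & 0xffff ))
}

# AND with the inverted mask always leaves bit 0 cleared.
clear_bit0() {
    printf '%04x\n' $(( 0x$1 & ~0x1 & 0xffff ))
}

toggle_bit0 2937   # prints 2936
toggle_bit0 2936   # prints 2937 (bit 0 set again)
clear_bit0 2937    # prints 2936
clear_bit0 2936    # prints 2936
```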

setpci -v -s 0000:44:00.0 CAP_EXP+0x8.w=0x2936
0000:44:00.0 (cap 10 @70) @78 2936

The device id and the register value may differ on other systems, so read the current value before writing a new one.
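Because the register value can differ per device, a masked write avoids hard-coding it. setpci accepts `value:mask` and then changes only the bits that are set in the mask (a read-modify-write), so `CAP_EXP+0x8.w=0:1` should clear bit 0 and leave the other bits alone. A sketch under the assumption that the four addresses from the logs above are the devices of interest; `clear_corr_reporting` and the `SETPCI` variable are hypothetical names, and the real call needs root:

```shell
#!/bin/sh
# Clear bit 0 of the register at CAP_EXP+0x8 on each PCI address given.
# SETPCI is an illustrative override; it defaults to the real setpci binary.
clear_corr_reporting() {
    for dev in "$@"; do
        # value 0 with mask 1: only bit 0 is written, other bits are kept
        "${SETPCI:-setpci}" -v -s "$dev" CAP_EXP+0x8.w=0:1 || return 1
    done
}

# On real hardware (as root), with the addresses from the logs above:
# clear_corr_reporting 0000:41:00.0 0000:42:00.0 0000:43:00.0 0000:44:00.0
```

Unlike the XOR approach, this form is safe to run repeatedly: a second run leaves bit 0 cleared instead of setting it again.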


ktonini commented Aug 22, 2022

Thank you so much for this!

It needed to be run on every boot so I threw it into a .service based on this -

[Unit]
Description=Fix for AER's excessive logging for NVME devices
After=systemd-modules-load.service

[Service]
Type=oneshot
-ExecStart=/usr/bin/setpci -v -s 0000:42:00.0 CAP_EXP+0x8.w=0x2936
-ExecStart=/usr/bin/setpci -v -s 0000:43:00.0 CAP_EXP+0x8.w=0x2936
+ExecStart=/bin/sh -c '/usr/bin/setpci -v -s 0000:40:00.0 CAP_EXP+0x8.w=0x2936; /usr/bin/setpci -v -s 0000:40:01.1 CAP_EXP+0x8.w=0x2936; /usr/bin/setpci -v -s 0000:41:00.0 CAP_EXP+0x8.w=0x2936; /usr/bin/setpci -v -s 0000:42:00.0 CAP_EXP+0x8.w=0x2936; /usr/bin/setpci -v -s 0000:44:00.0 CAP_EXP+0x8.w=0x2936;'
RemainAfterExit=yes

[Install]
-WantedBy=network.target
+WantedBy=multi-user.target

Pretty sure WantedBy is wrong, but I'm not totally proficient in systemd services so I'll have to do some digging. Works for now.

Update:

I was still getting errors and realized only the last ExecStart is run. Here is an updated service file -

[Unit]
Description=Fix for AER's excessive logging for NVME devices
After=systemd-modules-load.service

[Service]
Type=oneshot
ExecStart=/bin/sh -c '/usr/bin/setpci -v -s 0000:40:00.0 CAP_EXP+0x8.w=0x2936; /usr/bin/setpci -v -s 0000:40:01.1 CAP_EXP+0x8.w=0x2936; /usr/bin/setpci -v -s 0000:41:00.0 CAP_EXP+0x8.w=0x2936; /usr/bin/setpci -v -s 0000:42:00.0 CAP_EXP+0x8.w=0x2936; /usr/bin/setpci -v -s 0000:44:00.0 CAP_EXP+0x8.w=0x2936;'
RemainAfterExit=yes

[Install]
WantedBy=multi-user.target
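To check after boot whether the service took effect, the register can be read back and bit 0 tested. This is only a sketch: `corr_reporting_enabled` and `report_devices` are hypothetical helpers, and the addresses in the example are the ones from the unit file above. As a side note, systemd.service(5) describes multiple ExecStart= lines as allowed for Type=oneshot units, so if errors persist, a per-device check like this may help narrow down which address still has the bit set.

```shell
#!/bin/sh
# Succeeds (exit status 0) if bit 0 is set in the given hex value.
corr_reporting_enabled() {
    [ $(( 0x$1 & 0x1 )) -ne 0 ]
}

# Report the state of each device address passed as an argument.
# Needs pciutils; reading this register typically needs root.
report_devices() {
    for dev in "$@"; do
        # Without -v, setpci prints just the hex value of the register.
        val=$(setpci -s "$dev" CAP_EXP+0x8.w) || continue
        if corr_reporting_enabled "$val"; then
            echo "$dev: correctable error reporting still enabled ($val)"
        else
            echo "$dev: correctable error reporting disabled ($val)"
        fi
    done
}

# Example (addresses taken from the unit file above):
# report_devices 0000:41:00.0 0000:42:00.0 0000:44:00.0
```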

@Allen9168

My Samsung PM1733 is also encountering this issue. I know that changing Gen4 to Gen3 in the BIOS can solve this problem, but a few of my servers' BIOS do not support this setting. I am unsure if there are any other solutions.
