Temporary fix for AER's excessive `severity=Corrected` logging for Intel Wireless (Avell G1513 Fire V3) (Arch Linux)
silly gist hack, why do we need you? :(

How to use

Drop the .service file into /etc/systemd/system/, and then activate the script via systemctl:

# systemctl daemon-reload
# systemctl enable fix-intel_wifi_aer-avell_g1513_fire_v3.service
# systemctl start fix-intel_wifi_aer-avell_g1513_fire_v3.service

This will effectively disable the "corrected" severity logging for the device, and save you loads of (logging) disk space. :)

Reasoning

Sorry for the poor explanation, future self. I'm kinda tired right now. I don't even know if all of this is correct. :(

When AER logs errors too aggressively, the cause is usually buggy hardware or a buggy driver. The most common recommendation is to disable AER entirely with a kernel parameter such as pci=noaer. If you know that the affected device is fine, and that its driver has an unfixed bug that doesn't affect normal usage, you can instead disable reporting for specific severity levels by setting the flags directly on the device with setpci, rather than disabling AER globally.

For more info on setpci, please see its docs.

AER (Advanced Error Reporting) is a PCIe capability. Linux adds support for it through a kernel module that is started sometime during systemd-modules-load.service's execution. The AER driver initializes reporting for PCIe devices at startup, so it's important that we only reset the flags AFTER systemd's module loading service.

According to the AER module's source code, the four error-reporting flags (Correctable, Non-Fatal, Fatal and Unsupported Request) are always enabled when AER is enabled for a device:

// From `/usr/include/uapi/linux/pci_regs.h`
#define PCI_EXP_DEVCTL		8	/* Device Control */
#define  PCI_EXP_DEVCTL_CERE	0x0001	/* Correctable Error Reporting En. */
#define  PCI_EXP_DEVCTL_NFERE	0x0002	/* Non-Fatal Error Reporting Enable */
#define  PCI_EXP_DEVCTL_FERE	0x0004	/* Fatal Error Reporting Enable */
#define  PCI_EXP_DEVCTL_URRE	0x0008	/* Unsupported Request Reporting En. */

// From `source/drivers/pci/pcie/aer/aerdrv_core.c`
#define	PCI_EXP_AER_FLAGS	(PCI_EXP_DEVCTL_CERE | PCI_EXP_DEVCTL_NFERE | \
				 PCI_EXP_DEVCTL_FERE | PCI_EXP_DEVCTL_URRE)

int pci_enable_pcie_error_reporting(struct pci_dev *dev)
{
	if (pcie_aer_get_firmware_first(dev))
		return -EIO;

	if (!dev->aer_cap)
		return -EIO;

	return pcie_capability_set_word(dev, PCI_EXP_DEVCTL, PCI_EXP_AER_FLAGS);
}
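
To make the bit layout concrete, here is a minimal sketch that decodes a Device Control value into the four flag names quoted from pci_regs.h above. The helper name decode_devctl is made up for this example and is not part of pciutils or the kernel.

```shell
# Decode the four error-reporting enable bits of a PCIe Device Control value.
# Argument: the register value in hex, without the 0x prefix (e.g. 000f).
# decode_devctl is a hypothetical helper name used only for illustration.
decode_devctl() {
    value=$(( 0x$1 ))
    out=""
    [ $(( value & 0x1 )) -ne 0 ] && out="$out CERE"   # Correctable
    [ $(( value & 0x2 )) -ne 0 ] && out="$out NFERE"  # Non-Fatal
    [ $(( value & 0x4 )) -ne 0 ] && out="$out FERE"   # Fatal
    [ $(( value & 0x8 )) -ne 0 ] && out="$out URRE"   # Unsupported Request
    echo "${out# }"
}

decode_devctl 000f   # prints: CERE NFERE FERE URRE
decode_devctl 000e   # prints: NFERE FERE URRE
```

Bits above 0x8 belong to other Device Control fields and are simply ignored by this sketch.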

Inspecting the kernel's source code some more, one can find that PCI_EXP_DEVCTL is an offset into the device's PCI Express capability structure, and that the position of this structure (dev->pcie_cap) is itself an offset into the device's configuration space. If you follow the implementation of pcie_capability_set_word and the functions it calls, you end up in the pcie_capability_write_* helpers, which add the two offsets together. The dword variant is shown here:

// From `source/drivers/pci/access.c`

int pcie_capability_write_dword(struct pci_dev *dev, int pos, u32 val)
{
	if (pos & 3)
		return -EINVAL;

	if (!pcie_capability_reg_implemented(dev, pos))
		return 0;

	return pci_write_config_dword(dev, pci_pcie_cap(dev) + pos, val);
}

// From `/usr/include/linux/pci.h`

static inline int pcie_capability_set_word(struct pci_dev *dev, int pos,
					   u16 set)
{
	return pcie_capability_clear_and_set_word(dev, pos, 0, set);
}

static inline int pci_pcie_cap(struct pci_dev *dev)
{
	return dev->pcie_cap;
}
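
The address arithmetic in pci_pcie_cap(dev) + pos can be reproduced by hand. The sketch below assumes the PCI Express capability sits at offset 0x40, which is what lspci reports for this particular root port ("Capabilities: [40] Express Root Port"); other devices will differ. The helper name pcie_reg_addr is made up for this example.

```shell
# Compute the absolute config-space offset of a PCIe capability register.
# $1: offset of the PCI Express capability (dev->pcie_cap), e.g. 0x40
# $2: offset of the register inside the capability, e.g. 8 for PCI_EXP_DEVCTL
# pcie_reg_addr is a hypothetical helper name used only for illustration.
pcie_reg_addr() {
    printf '0x%02x\n' $(( $1 + $2 ))
}

pcie_reg_addr 0x40 8   # prints: 0x48
```

This matches the "(cap 10 @40) @48" that setpci -v prints for the same device.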

Depending on the machine's setup, setpci --dumpregs may list CAP_EXP among the available register names. This name refers to the dev->pcie_cap offset. To inspect how AER is configured, one needs the vendor/device ID or the bus/slot/function address of the affected device. AER's log messages already contain this information. Below is an example, from which we can take two different identifiers for the device: 8086:a114 (vendor/device ID) and 0000:00:1c.4 (domain/bus/slot/function).

# dmesg | tail -n 4
[ 4455.385233] pcieport 0000:00:1c.4: AER: Corrected error received: id=00e4
[ 4455.385242] pcieport 0000:00:1c.4: PCIe Bus Error: severity=Corrected, type=Physical Layer, id=00e4(Receiver ID)
[ 4455.385250] pcieport 0000:00:1c.4:   device [8086:a114] error status/mask=00000001/00002000
[ 4455.385254] pcieport 0000:00:1c.4:    [ 0] Receiver Error         (First)
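
If the log is long, both identifiers can be pulled out mechanically. This is a rough sketch that assumes the "pcieport <address>: device [vendor:device]" line format shown above; the helper name aer_ids is made up, and other kernel versions may format these lines differently.

```shell
# Print "<domain:bus:slot.func> <vendor:device>" for every AER
# "device [xxxx:yyyy]" line read from stdin.
# aer_ids is a hypothetical helper name used only for illustration.
aer_ids() {
    sed -n 's/.*pcieport \([0-9a-f:.]*\): *device \[\([0-9a-f]*:[0-9a-f]*\)\].*/\1 \2/p'
}

# Example with the log line from above:
printf '%s\n' '[ 4455.385250] pcieport 0000:00:1c.4:   device [8086:a114] error status/mask=00000001/00002000' | aer_ids
# prints: 0000:00:1c.4 8086:a114
```

On a live system, something like dmesg | aer_ids | sort -u should list each reporting port once.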

To check which device is affected, use lshw or lspci:

[flisboac@sonic ~]$ sudo lspci -v -s 00:1c.4
00:1c.4 PCI bridge: Intel Corporation Sunrise Point-H PCI Express Root Port #5 (rev f1) (prog-if 00 [Normal decode])
	Flags: bus master, fast devsel, latency 0, IRQ 124
	Bus: primary=00, secondary=03, subordinate=03, sec-latency=0
	I/O behind bridge: None
	Memory behind bridge: df200000-df2fffff [size=1M]
	Prefetchable memory behind bridge: None
	Capabilities: [40] Express Root Port (Slot+), MSI 00
	Capabilities: [80] MSI: Enable+ Count=1/1 Maskable- 64bit-
	Capabilities: [90] Subsystem: Device 1d05:1021
	Capabilities: [a0] Power Management version 3
	Capabilities: [100] Advanced Error Reporting
	Capabilities: [140] Access Control Services
	Capabilities: [220] #19
	Kernel driver in use: pcieport
	Kernel modules: shpchp

In this case, the reporting device is a PCIe root port, so the error may originate from a device attached to that port. One can check which device is attached to said port with lshw:

# lshw -numeric
sonic
    description: Notebook
    product: 1513 (To be filled by O.E.M.)
    vendor: Avell High Performance
    version: To be filled by O.E.M.
    serial: To be filled by O.E.M.
    width: 4294967295 bits
    capabilities: smbios-3.0 dmi-3.0 smp vsyscall32
    configuration: boot=normal chassis=notebook family=To be filled by O.E.M. sku=To be filled by O.E.M. uuid=00020003-0004-0005-0006-000700080009
  *-core
       description: Motherboard
       physical id: 0
       version: 0.1
       serial: To be filled by O.E.M.
       slot: To be filled by O.E.M.

       (... lshw is so verbose ...)

     *-pci
          description: Host bridge
          product: Skylake Host Bridge/DRAM Registers [8086:1910]
          vendor: Intel Corporation [8086]
          physical id: 100
          bus info: pci@0000:00:00.0
          version: 07
          width: 32 bits
          clock: 33MHz
          configuration: driver=skl_uncore
          resources: irq:0

	(... lshw is so verbose ...)

        *-pci:2
             description: PCI bridge
             product: Sunrise Point-H PCI Express Root Port #5 [8086:A114]
             vendor: Intel Corporation [8086]
             physical id: 1c.4
             bus info: pci@0000:00:1c.4
             version: f1
             width: 32 bits
             clock: 33MHz
             capabilities: pci pciexpress msi pm normal_decode bus_master cap_list
             configuration: driver=pcieport
             resources: irq:124 memory:df200000-df2fffff
           *-network
                description: Wireless interface
                product: Wireless 7265 [8086:95A]
                vendor: Intel Corporation [8086]
                physical id: 0
                bus info: pci@0000:03:00.0
                logical name: wlp3s0
                version: 48
                serial: 64:80:99:f3:9d:d7
                width: 64 bits
                clock: 33MHz
                capabilities: pm msi pciexpress bus_master cap_list ethernet physical wireless
                configuration: broadcast=yes driver=iwlwifi driverversion=4.10.13-1-ARCH firmware=17.459231.0 ip=192.168.1.26 latency=0 link=yes multicast=yes wireless=IEEE 802.11
                resources: irq:137 memory:df200000-df201fff
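
A less verbose alternative to lshw is to read the root port's secondary bus number from the lspci -v output shown earlier and then list that bus. This is only a sketch: the helper name secondary_bus is made up, and it assumes the "Bus: primary=..., secondary=..." line format printed above.

```shell
# Extract the secondary bus number from "lspci -v" output read on stdin.
# secondary_bus is a hypothetical helper name used only for illustration.
secondary_bus() {
    sed -n 's/.*secondary=\([0-9a-f]*\).*/\1/p'
}

# Example with the "Bus:" line from the lspci output above:
printf '%s\n' 'Bus: primary=00, secondary=03, subordinate=03, sec-latency=0' | secondary_bus
# prints: 03

# On a live system (not run here), the devices behind the port could then
# be listed with:
#   lspci -s "$(lspci -v -s 00:1c.4 | secondary_bus):"
```

Bus 03 is where lshw shows the wireless card (pci@0000:03:00.0).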

Summarizing, CAP_EXP is the base register, and we do some pointer arithmetic with it: we offset CAP_EXP by PCI_EXP_DEVCTL and write the proper flags there as a single word. Just remember that PCI_EXP_* offsets may be defined in decimal, while setpci only accepts hexadecimal numbers (with or without the 0x prefix), so some base conversion may be needed. That's not the case for PCI_EXP_DEVCTL, since 8 is the same in both bases.
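
The base conversion can be left to printf. A small sketch, where setpci_reg is a made-up helper name and 12 is just an arbitrary decimal offset chosen to show a case where the two bases differ:

```shell
# Build a setpci register expression for a 16-bit (word) register inside
# the PCI Express capability, given the register offset in decimal.
# setpci_reg is a hypothetical helper name used only for illustration.
setpci_reg() {
    printf 'CAP_EXP+0x%x.w\n' "$1"
}

setpci_reg 8    # prints: CAP_EXP+0x8.w
setpci_reg 12   # prints: CAP_EXP+0xc.w
```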

So, to read the current configuration:

[flisboac@sonic ~]$ sudo setpci -v -d 8086:a114 CAP_EXP+0x8.w
0000:00:1c.4 (cap 10 @40) @48 = 000f

000f tells us that all four error-reporting flags are set. Correctable error reporting is bit 0 of that word, so we just need to write 000e to disable only the Corrected severity reporting:

[flisboac@sonic ~]$ sudo setpci -v -d 8086:a114 CAP_EXP+0x8.w=0x0e
0000:00:1c.4 (cap 10 @40) @48 000e

And that's it!
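
Not every device reads back 000f; the comments below report values such as 002f and 201f, because the same word carries other Device Control fields. A safer approach is to clear only bit 0 of whatever value is currently set. A minimal sketch, with clear_cere being a made-up helper name:

```shell
# Clear only bit 0 (Correctable Error Reporting Enable) of a Device Control
# value and keep every other bit as it was.
# Argument: the current value in hex, without the 0x prefix.
# clear_cere is a hypothetical helper name used only for illustration.
clear_cere() {
    printf '%04x\n' $(( 0x$1 & ~0x1 ))
}

clear_cere 000f   # prints: 000e
clear_cere 002f   # prints: 002e
clear_cere 201f   # prints: 201e

# On a live system (not run here, needs root, and assumes the selector
# matches exactly one device):
#   cur=$(setpci -d 8086:a114 CAP_EXP+0x8.w)
#   setpci -d 8086:a114 CAP_EXP+0x8.w="$(clear_cere "$cur")"
```

The setpci man page also describes a value:mask form that should achieve the same thing in one step (something like CAP_EXP+0x8.w=0:1), but I haven't tested it here.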

[Unit]
Description=Fix for AER's excessive logging for Intel Wireless (Avell G1513 Fire V3)
After=systemd-modules-load.service
[Service]
Type=oneshot
# Change the vendor:device ID to match your hardware (or select by bus/slot/function with -s instead)
ExecStart=/usr/bin/setpci -v -d 8086:a114 CAP_EXP+0x8.w=0xe
RemainAfterExit=yes
[Install]
WantedBy=network.target
@jult

jult commented Sep 24, 2019

Could you help me out with this AER filling up syslog on a server I maintain? Here's a snippet;

Sep 23 11:09:05 silent kernel: pcieport 0000:00:1b.4: AER: Multiple Corrected error received: 0000:00:1b.4
Sep 23 11:09:05 silent kernel: pcieport 0000:00:1b.4: can't find device of ID00dc
Sep 23 11:09:05 silent kernel: pcieport 0000:00:1b.4: AER: Corrected error received: 0000:00:1b.4
Sep 23 11:09:05 silent kernel: pcieport 0000:00:1b.4: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
Sep 23 11:09:05 silent kernel: pcieport 0000:00:1b.4:   device [8086:a32c] error status/mask=00000001/00002000
Sep 23 11:09:05 silent kernel: pcieport 0000:00:1b.4:    [ 0] RxErr                  (First)
Sep 23 11:09:05 silent kernel: pcieport 0000:00:1b.4: AER: Corrected error received: 0000:00:1b.4
Sep 23 11:09:05 silent kernel: pcieport 0000:00:1b.4: can't find device of ID00dc
Sep 23 11:09:05 silent kernel: pcieport 0000:00:1b.4: AER: Corrected error received: 0000:00:1b.4
Sep 23 11:09:05 silent kernel: pcieport 0000:00:1b.4: can't find device of ID00dc
Sep 23 11:09:05 silent kernel: pcieport 0000:00:1b.4: AER: Corrected error received: 0000:00:1b.4
Sep 23 11:09:05 silent kernel: pcieport 0000:00:1b.4: can't find device of ID00dc
Sep 23 11:09:05 silent kernel: pcieport 0000:00:1b.4: AER: Multiple Corrected error received: 0000:00:1b.4
Sep 23 11:09:05 silent kernel: pcieport 0000:00:1b.4: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
Sep 23 11:09:05 silent kernel: pcieport 0000:00:1b.4:   device [8086:a32c] error status/mask=00000001/00002000
Sep 23 11:09:05 silent kernel: pcieport 0000:00:1b.4:    [ 0] RxErr                  (First)

and this is the device throwing out the errors, or the device in this pci-e slot causes it (an LSI HBA storage card);

root@test~# lspci -v -s 00:1b.4
00:1b.4 PCI bridge: Intel Corporation Cannon Lake PCH PCI Express Root Port #21 (rev f0) (prog-if 00 [Normal decode])
        Flags: bus master, fast devsel, latency 0, IRQ 125
        Bus: primary=00, secondary=05, subordinate=05, sec-latency=0
        I/O behind bridge: 00006000-00006fff [size=4K]
        Memory behind bridge: 91100000-912fffff [size=2M]
        Prefetchable memory behind bridge: None
        Capabilities: [40] Express Root Port (Slot+), MSI 00
        Capabilities: [80] MSI: Enable+ Count=1/1 Maskable- 64bit-
        Capabilities: [90] Subsystem: ASRock Incorporation Device a32c
        Capabilities: [a0] Power Management version 3
        Capabilities: [100] Advanced Error Reporting
        Capabilities: [140] Access Control Services
        Capabilities: [150] Precision Time Measurement
        Capabilities: [220] Secondary PCI Express <?>
        Capabilities: [250] Downstream Port Containment
        Kernel driver in use: pcieport

so then I did

root@test~# setpci -v -d 8086:a32c CAP_EXP+0x8.w
0000:00:1b.4 (cap 10 @40) @48 = 002f

assuming CAP_EXP is the correct register (probably not), I tried

root@test~# setpci -v -d 8086:a32c CAP_EXP+0x8.w=0x2e
0000:00:1b.4 (cap 10 @40) @48 002e
root@test~# setpci -v -d 8086:a32c CAP_EXP+0x8.w
0000:00:1b.4 (cap 10 @40) @48 = 002e

Would that work?

@fpeterschmitt

fpeterschmitt commented Dec 28, 2019

Works like a charm on a Thinkpad E480 with device 8086:9D1A. Thanks a lot!

@tweinreich

@Brainiarc7: You are my personal hero of today! Thank you so much for sharing this fix. It would probably have taken hours for me to find out how to fix this.

It worked for the following setup:

  • Hystou Mini PC (the "scooter computer" known from Coding Horror's blog article)
  • Responsible Interface: Realtek RTL8821AE 802.11ac PCIe Wireless Network Adapter

Now all I need is to integrate it into my ansible playbook and never (TM) think about it again ;-P

Have a great weekend.

@Brainiarc7
Author

Brainiarc7 commented Feb 14, 2020 via email

@GTMxCode

GTMxCode commented Aug 18, 2020

Hey, I really appreciate you taking the time to write this up. I just sank my teeth into this very same error (the bus and port numbers are even identical) that I've been getting running the zen kernel w/ Arch on an ASUS Z270E/7700K.

I noticed a couple of differences in the way that specific pcieport/pcibus being assigned compared to the others, but one main thing I noted was that the between the device being assigned and taken over the initrd finishes. Now is that just an artifact due to the logging or is the memory allocated getting messed up when the rd dumps?

I'm still pretty new and trying to figure out some other things, like why ASUS drivers for Eee PCs are loaded, and the fact that I noticed messages like shpchp not being supported by the _OS, but a few lines later it loads that module and then throws a warning saying it was unable to run it. There are a few others that end up failing or dropping to fallbacks, and they all centered around this particular error. Finding this leads me to believe it's just a coincidence of user space startup and not related, but I still wonder: even the message about resorting to using the ACPI bridge doesn't sound right. It implies to me that the proper method of mapping and assigning the ports and buses isn't available, and whatever it's doing with ACPI is a last resort? What IS the ideal method? There's not much sense in reaching for sub-5-second boots. What also puzzles me is that I have both Wi-Fi and Bluetooth disabled entirely in the BIOS/UEFI. I found this after searching for whatever PCI-MSI aerdrv is, as that's what the IRQs map to... I think I have more questions now than I did before haha.

I can be more detailed if anyone would like, I just don't have access to my logs at this very moment.

Either way, even with your great explanation, which helps immensely to point me in the right direction, I've still got some digging to do. Only a fool would blindly write to their hardware... right?

Cheers bud.

edit:
Eeeeeee what I said above wasn't a dig at the guy who basically did that.. lol, I seem to love putting my foot in my own mouth.. nothing was meant by it, honest.

@jefstath

Just wanted to say thanks for this write-up. I experienced the same problem with a USB 3.0 host controller (1B73:1100) connected to a PCIe root port at the same address as you described:

Aug 12 11:08:46 pbcl-dsk9 kernel: [ 9387.060092] pcieport 0000:00:1c.4: device [8086:a114] error status/mask=00000001/00002000

Disabling the correctable error reporting bit did the job.

Thanks again!

@briantbutton

Sir! You are a gentleman and a scholar. I add my thanks to the others above. Salud!

@khromov

khromov commented Jan 16, 2023

I have an Intel Wireless 7265 chipset and all the commands worked fine up until the last one that was supposed to fix the problem, it errors:

root@k-NucBox5:/home/k# setpci -v -d 8086:4db9 CAP_EXP+0x8.w=0x2e
pcilib: sysfs_write: write failed: Operation not permitted
0000:00:1c.0 (cap 10 @40) @48 002e

Any solutions appreciated!

PS. The device in question is running Secure Boot, perhaps that is the issue. Can it be solved?

@Brainiarc7

@ttsiodras

Thank you! Works perfectly for an n5095 laptop I just bought, with an "rtw_8822ce"-supported wifi/bluetooth.
In my case reading the state gave me

# setpci -v -d 10ec:c822 CAP_EXP+0x8.w
0000:01:00.0 (cap 10 @70) @78 = 201f

...so I needed to adapt it like so:

# setpci -v -d 10ec:c822 CAP_EXP+0x8.w=0x201e
