@mcastelino
Last active January 24, 2023 23:50
ACPI QEMU PCI Discovery, Enumeration and Hotplug

Overview

This document describes the elements and mechanisms involved in the discovery and hotplug of PCI devices, including:

  • QEMU framework
  • ACPI Tables and Methods
  • Linux Kernel functions and tables

The logic pertaining to the GED-interrupt-based ACPI eventing is specific to NEMU. The rest of this document is generic.

QEMU Hotplug Registration

QEMU registers the PCI bus for hotplug through the qdev framework, using the following function:

qbus_set_hotplug_handler(BUS(pci_bus), dev, NULL);

ACPI Support for hotplug

The ACPI DSDT table is populated so that QEMU and the Linux guest OS can interact with each other.

In virt_acpi_init() we invoke the following:

  • qbus_set_hotplug_handler() as mentioned above
  • acpi_pcihp_init() which sets up the ACPI resources used for hotplug for a particular bus
  • acpi_pcihp_reset() which enables hotplug on all child buses under this bus

In acpi_pcihp_init() QEMU allocates an IO Range starting at 0xae00 with size 0x0014:

#define ACPI_PCIHP_ADDR 0xae00
#define ACPI_PCIHP_SIZE 0x0014
#define PCI_UP_BASE 0x0000
#define PCI_DOWN_BASE 0x0004
#define PCI_EJ_BASE 0x0008
#define PCI_RMV_BASE 0x000c
#define PCI_SEL_BASE 0x0010

Note: This IO Range is used by ACPI to manage hotplug for this bus, but it is not a PCI resource (like an IO BAR).

QEMU registers read and write handlers for this address range. Any guest OS access to the range invokes the pci_read or pci_write callback, which allows QEMU to detect the access and take the appropriate action.
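
As a rough illustration of how an access to this window can be decoded, the sketch below maps a guest I/O port to the register that contains it, using the offsets defined above. The pcihp_reg enum and the pcihp_port() and pcihp_decode() helpers are hypothetical names made up for this example; they are not QEMU functions.

```c
#include <assert.h>
#include <stdint.h>

#define ACPI_PCIHP_ADDR 0xae00
#define ACPI_PCIHP_SIZE 0x0014
#define PCI_UP_BASE     0x0000
#define PCI_DOWN_BASE   0x0004
#define PCI_EJ_BASE     0x0008
#define PCI_RMV_BASE    0x000c
#define PCI_SEL_BASE    0x0010

/* Hypothetical register identifiers, used only in this example. */
enum pcihp_reg {
    PCIHP_REG_NONE = -1,    /* port is outside the hotplug window */
    PCIHP_REG_UP,
    PCIHP_REG_DOWN,
    PCIHP_REG_EJ,
    PCIHP_REG_RMV,
    PCIHP_REG_SEL
};

/* Absolute I/O port of a register, given its offset inside the window. */
uint16_t pcihp_port(unsigned offset)
{
    assert(offset + 4 <= ACPI_PCIHP_SIZE);  /* a 4-byte register must fit */
    return (uint16_t)(ACPI_PCIHP_ADDR + offset);
}

/* Map an absolute I/O port to the 32-bit register that contains it. */
enum pcihp_reg pcihp_decode(uint16_t port)
{
    if (port < ACPI_PCIHP_ADDR || port >= ACPI_PCIHP_ADDR + ACPI_PCIHP_SIZE) {
        return PCIHP_REG_NONE;
    }
    /* Registers are 4 bytes wide: round the offset down to a multiple of 4. */
    switch ((port - ACPI_PCIHP_ADDR) & ~0x3) {
    case PCI_UP_BASE:   return PCIHP_REG_UP;
    case PCI_DOWN_BASE: return PCIHP_REG_DOWN;
    case PCI_EJ_BASE:   return PCIHP_REG_EJ;
    case PCI_RMV_BASE:  return PCIHP_REG_RMV;
    case PCI_SEL_BASE:  return PCIHP_REG_SEL;
    default:            return PCIHP_REG_NONE;
    }
}
```

For instance, pcihp_port(PCI_EJ_BASE) gives 0xae08 and pcihp_decode(0xae10) gives PCIHP_REG_SEL.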

ACPI PCI DSDT

The DSDT table communicates the following information to the guest OS:

  • The location of this I/O aperture
  • The layout of this I/O aperture (i.e. the fields that are defined within this aperture)
  • Methods that read from and write to this aperture

Note: The OS only cares about the aperture to the extent that it ensures that this range is not used by itself or allocated to other devices. All other interactions with this IO range are contained within the methods defined by ACPI, so the OS does not need to be modified if we wish to extend this implementation. The OS also scans the ACPI tables for specific methods (_EJ0) implemented by Devices to detect that ACPI hotplug is implemented by a device.

   Scope (\_SB.PCI0)
    {
        OperationRegion (PCST, SystemIO, 0xAE00, 0x08)
        Field (PCST, DWordAcc, NoLock, WriteAsZeros)
        {
            PCIU,   32,
            PCID,   32
        }

        OperationRegion (SEJ, SystemIO, 0xAE08, 0x04)
        Field (SEJ, DWordAcc, NoLock, WriteAsZeros)
        {
            B0EJ,   32
        }

        OperationRegion (BNMR, SystemIO, 0xAE10, 0x04)
        Field (BNMR, DWordAcc, NoLock, WriteAsZeros)
        {
            BNUM,   32
        }

        Mutex (BLCK, 0x00)
        Method (PCEJ, 2, NotSerialized)
        {
            Acquire (BLCK, 0xFFFF)
            BNUM = Arg0
            B0EJ = (One << Arg1)
            Release (BLCK)
            Return (Zero)
        }

        Name (BSEL, Zero)
        Device (S00)
        {
            Name (_ADR, Zero)  // _ADR: Address
        }
        Device (S08)
        {
            Name (_ADR, 0x00010000)  // _ADR: Address
            Name (_SUN, One)  // _SUN: Slot User Number
            Method (_EJ0, 1, NotSerialized)  // _EJx: Eject Device
            {
                PCEJ (BSEL, _SUN)
            }
        }
        ...

ACPI PCI DSDT Field Mapping to QEMU

Correlating the two tables we have

QEMU                       DSDT
ACPI_PCIHP_ADDR 0xae00  -> OperationRegion (PCST, SystemIO, 0xAE00, 0x08)
PCI_UP_BASE     0x0000  -> PCIU,   32
PCI_DOWN_BASE   0x0004  -> PCID,   32
PCI_EJ_BASE     0x0008  -> B0EJ
PCI_RMV_BASE    0x000c  -> ???
PCI_SEL_BASE    0x0010  -> BNUM

Where the PCIU (PCI UP) and PCID (PCI DOWN) values are 32-bit masks describing the 32 slots on the host bridge, one bit per slot. The DSDT table defines them under the PCST operation region.
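
To make the bit-per-slot convention concrete, here is a minimal sketch in C. It assumes the standard ACPI encoding of _ADR for PCI devices (device number in the high word, function number in the low word); all helper names are hypothetical and exist only for this illustration.

```c
#include <assert.h>
#include <stdint.h>

/* ACPI _ADR for a PCI device: device number in the high word,
 * function number in the low word. */
uint32_t adr_device(uint32_t adr)   { return adr >> 16; }
uint32_t adr_function(uint32_t adr) { return adr & 0xffff; }

/* Bit that represents a slot in the PCIU, PCID and B0EJ values. */
uint32_t slot_mask(unsigned slot)
{
    assert(slot < 32);              /* the masks are 32 bits wide */
    return UINT32_C(1) << slot;
}

/* Nonzero if the given slot is flagged in a value read from PCIU or PCID. */
int slot_is_set(uint32_t mask, unsigned slot)
{
    return (mask & slot_mask(slot)) != 0;
}
```

With the S08 device from the DSDT excerpt (_ADR 0x00010000, _SUN One), adr_device() yields 1 and slot_mask(1) yields 0x02, which is the value PCEJ writes to B0EJ for that slot (One << _SUN).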

The up and down masks are also mapped in QEMU to the virt platform data structures as follows

typedef struct AcpiPciHpPciStatus {
    uint32_t up;
    uint32_t down;
    uint32_t hotplug_enable;
} AcpiPciHpPciStatus;

typedef struct AcpiPciHpState {
    AcpiPciHpPciStatus acpi_pcihp_pci_status[ACPI_PCIHP_MAX_HOTPLUG_BUS];
    uint32_t hotplug_select;
    PCIBus *root;
    MemoryRegion io;
    bool legacy_piix;
    uint16_t io_base;
    uint16_t io_len;
} AcpiPciHpState;

typedef struct VirtAcpiState {
    SysBusDevice parent_obj;

    AcpiCpuHotplug cpuhp;
    CPUHotplugState cpuhp_state;

    MemHotplugState memhp_state;
    qemu_irq *gsi;

    AcpiPciHpState pcihp_state;
    PCIBus *pci_bus; /* Note: This currently supports a single bus */

    MemoryRegion sleep_iomem;
    MemoryRegion reset_iomem;
} VirtAcpiState;

ACPI GED Event for PCI Hotplug

QEMU notifies the OS of hotplug/unplug events using the ACPI interrupts set up through the GED device. The ACPI table for the GED device maps each event to an interrupt, and each interrupt to the actions the OS needs to perform on receipt of that event.

            Device (\_SB.GED)
            {
                Name (_HID, "ACPI0013" /* Generic Event Device */)  // _HID: Hardware ID
                Name (_UID, Zero)  // _UID: Unique ID
                Name (_CRS, ResourceTemplate ()  // _CRS: Current Resource Settings
                {
                    ...
                    Interrupt (ResourceConsumer, Level, ActiveHigh, Exclusive, ,, )
                    {
                        0x00000012,
                    }
                })
                Method (_EVT, 1, Serialized)  // _EVT: Event
                {
                    Local0 = One
                    While ((Local0 == One))
                    {
                        Local0 = Zero
                        If ((Arg0 == 0x10))
                        {
                            \_SB.CPUS.CSCN ()
                        }
                        ....
                        ElseIf ((Arg0 == 0x12))
                        {
                            Acquire (\_SB.PCI0.BLCK, 0xFFFF)
                            \_SB.PCI0.PCNT ()
                            Release (\_SB.PCI0.BLCK)
                        }
                    }
                }
            }

As a result of this table, when QEMU injects an interrupt that is mapped to the GED device (interrupt number 0x12 in the case above), the OS invokes the method that the ACPI table defines for that interrupt. That method is \_SB.PCI0.PCNT in the case above.
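
The dispatch performed by _EVT can be pictured with the small C model below. It is only a sketch of the two branches visible in the excerpt: the ged_calls structure and ged_evt() function are hypothetical, the counters merely stand in for running the ACPI methods, and the branches elided in the excerpt are not modelled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Counters standing in for the ACPI methods that _EVT invokes. */
struct ged_calls {
    unsigned cscn;   /* times \_SB.CPUS.CSCN would run */
    unsigned pcnt;   /* times \_SB.PCI0.PCNT would run (with BLCK held) */
};

/* Model of _EVT (Arg0): dispatch on the interrupt number that fired.
 * Returns 1 if a branch shown in the excerpt matched, 0 otherwise. */
int ged_evt(uint32_t interrupt, struct ged_calls *calls)
{
    assert(calls != NULL);
    switch (interrupt) {
    case 0x10:
        calls->cscn++;
        return 1;
    case 0x12:
        calls->pcnt++;
        return 1;
    default:
        /* Branches elided ("....") in the excerpt are not modelled. */
        return 0;
    }
}
```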

ACPI PCI Hotplug Method

The PCNT method is defined in the DSDT for each bus (PCI host bridge). It performs the actions required in the OS on receipt of a hotplug event.

        Method (PCNT, 0, NotSerialized)
        {
            BNUM = Zero
            DVNT (PCIU, One)
            DVNT (PCID, 0x03)
        }
	   …
        Method (DVNT, 2, NotSerialized)
        {
            If ((Arg0 & 0x02))
            {
                Notify (S08, Arg1)
            }
            …
        }

Here the OS, using the ACPI method, will

  • Write 0 to BNUM. This write reaches QEMU and invokes pci_write, which records the selected bus:

s->hotplug_select = s->legacy_piix ? ACPI_PCIHP_BSEL_DEFAULT : data;

    The selected bus is then used when there is a write to B0EJ:

      if (s->hotplug_select >= ACPI_PCIHP_MAX_HOTPLUG_BUS) {
          break;
      }
      acpi_pcihp_eject_slot(s, s->hotplug_select, data);

  • Call DVNT (PCIU, One), which calls Notify() with argument One for every slot that is marked as UP.
  • Call DVNT (PCID, 0x03), which calls Notify() with argument 0x03 for every slot that is marked as DOWN.

Notify() is a built-in ACPI operator implemented by the OS; its details are covered in a subsequent section.
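
The effect of PCNT and DVNT can be sketched in C as follows. This is a simplified model rather than generated AML: it assumes every one of the 32 slots has a device object to notify, and the notification structure and the dvnt() and pcnt() functions are hypothetical names used only here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NOTIFY_DEVICE_CHECK  0x01   /* value One passed by PCNT for PCIU */
#define NOTIFY_EJECT_REQUEST 0x03   /* value 0x03 passed by PCNT for PCID */

/* One Notify (Sxx, value) call, recorded instead of executed. */
struct notification {
    unsigned slot;
    unsigned value;
};

/* Model of DVNT: record one Notify() per slot whose bit is set in mask.
 * Returns the number of entries written; entries beyond cap are dropped. */
size_t dvnt(uint32_t mask, unsigned value, struct notification *out, size_t cap)
{
    size_t n = 0;
    assert(out != NULL);
    for (unsigned slot = 0; slot < 32; slot++) {
        if ((mask & (UINT32_C(1) << slot)) && n < cap) {
            out[n].slot = slot;
            out[n].value = value;
            n++;
        }
    }
    return n;
}

/* Model of PCNT: device checks for PCIU first, then eject requests for PCID. */
size_t pcnt(uint32_t pciu, uint32_t pcid, struct notification *out, size_t cap)
{
    size_t n = dvnt(pciu, NOTIFY_DEVICE_CHECK, out, cap);
    n += dvnt(pcid, NOTIFY_EJECT_REQUEST, out + n, cap - n);
    return n;
}
```

For example, pcnt(0x02, 0x10, ...) records a device check for slot 1 followed by an eject request for slot 4.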

QEMU Handling of Hotplug field accesses

When the OS calls these methods, QEMU detects the reads of PCI_UP_BASE and PCI_DOWN_BASE and returns the current values. Reading PCI_UP_BASE also resets the UP state internally.

    case PCI_UP_BASE:
        val = s->acpi_pcihp_pci_status[bsel].up;
        s->acpi_pcihp_pci_status[bsel].up = 0;  /* cleared once read */
        break;
    case PCI_DOWN_BASE:
        val = s->acpi_pcihp_pci_status[bsel].down;
        break;

The values 0x01 and 0x03 passed to DVNT correspond to the following ACPI notify definitions:

ACPI_NOTIFY_BUS_CHECK        	(u8) 0x00
ACPI_NOTIFY_DEVICE_CHECK     	(u8) 0x01
ACPI_NOTIFY_EJECT_REQUEST   	(u8) 0x03
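
The register behaviour described in this section and in the PCNT walkthrough can be condensed into a small self-contained model. This is a sketch, not QEMU code: MAX_HOTPLUG_BUS is an assumed stand-in for ACPI_PCIHP_MAX_HOTPLUG_BUS, the legacy_piix case is left out, and the last_eject_* fields exist only so the example can show which eject request would be passed on.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PCI_UP_BASE     0x0000
#define PCI_DOWN_BASE   0x0004
#define PCI_EJ_BASE     0x0008
#define PCI_SEL_BASE    0x0010
#define MAX_HOTPLUG_BUS 256   /* assumption: stands in for ACPI_PCIHP_MAX_HOTPLUG_BUS */

struct bus_status {
    uint32_t up;
    uint32_t down;
};

struct pcihp_model {
    struct bus_status status[MAX_HOTPLUG_BUS];
    uint32_t hotplug_select;    /* last value written to BNUM */
    uint32_t last_eject_bus;    /* bookkeeping for this example only */
    uint32_t last_eject_slots;  /* bookkeeping for this example only */
};

/* Guest read of a hotplug register on the currently selected bus. */
uint32_t pcihp_read(struct pcihp_model *s, uint32_t offset)
{
    uint32_t val = 0;
    assert(s != NULL);
    if (s->hotplug_select >= MAX_HOTPLUG_BUS) {
        return 0;
    }
    switch (offset) {
    case PCI_UP_BASE:
        val = s->status[s->hotplug_select].up;
        s->status[s->hotplug_select].up = 0;   /* cleared once read */
        break;
    case PCI_DOWN_BASE:
        val = s->status[s->hotplug_select].down;
        break;
    default:
        break;
    }
    return val;
}

/* Guest write: BNUM selects the bus, B0EJ requests an eject on it. */
void pcihp_write(struct pcihp_model *s, uint32_t offset, uint32_t data)
{
    assert(s != NULL);
    switch (offset) {
    case PCI_SEL_BASE:
        s->hotplug_select = data;
        break;
    case PCI_EJ_BASE:
        if (s->hotplug_select >= MAX_HOTPLUG_BUS) {
            break;                              /* invalid bus: ignore */
        }
        s->last_eject_bus = s->hotplug_select;
        s->last_eject_slots = data;
        break;
    default:
        break;
    }
}
```

Because the UP value is cleared on read, a second PCNT run without a new hotplug event would see PCIU as zero and would not notify the same slot again.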

Linux Bus Detection and Registration

acpi_pci_root_add() -> pci_acpi_scan_root() -> acpi_pci_root_create()
The pci_root_handler scan handler registers acpi_pci_root_add as the attach callback for the following device IDs:
static const struct acpi_device_id root_device_ids[] = {
        {"PNP0A03", 0},
        {"", 0},
};

Which maps to the DSDT entry corresponding to

    Device (\_SB.PCI0)
    {
        Name (_HID, EisaId ("PNP0A08") /* PCI Express Bus */)  // _HID: Hardware ID
        Name (_CID, EisaId ("PNP0A03") /* PCI Bus */)  // _CID: Compatible ID
        Name (_ADR, Zero)  // _ADR: Address

As a result, acpi_pci_root_add is invoked for all host bridges.

The PCI Host Bridge capabilities are discovered based on the following _OSC support and control bits:
/* PCI Host Bridge _OSC: Capabilities DWORD 2: Support Field */
#define OSC_PCI_EXT_CONFIG_SUPPORT              0x00000001
#define OSC_PCI_ASPM_SUPPORT                    0x00000002
#define OSC_PCI_CLOCK_PM_SUPPORT                0x00000004
#define OSC_PCI_SEGMENT_GROUPS_SUPPORT          0x00000008
#define OSC_PCI_MSI_SUPPORT                     0x00000010
#define OSC_PCI_SUPPORT_MASKS                   0x0000001f

/* PCI Host Bridge _OSC: Capabilities DWORD 3: Control Field */
#define OSC_PCI_EXPRESS_NATIVE_HP_CONTROL       0x00000001
#define OSC_PCI_SHPC_NATIVE_HP_CONTROL          0x00000002
#define OSC_PCI_EXPRESS_PME_CONTROL             0x00000004
#define OSC_PCI_EXPRESS_AER_CONTROL             0x00000008
#define OSC_PCI_EXPRESS_CAPABILITY_CONTROL      0x00000010
#define OSC_PCI_CONTROL_MASKS                   0x0000001f

To which we report

        Method (_OSC, 4, NotSerialized)  // _OSC: Operating System Capabilities
        {
            CreateDWordField (Arg3, Zero, CDW1)
            If ((Arg0 == ToUUID ("33db4d5b-1ff7-401c-9657-7441c03dd766") /* PCI Host Bridge Device */))
            {
                CreateDWordField (Arg3, 0x04, CDW2)
                CreateDWordField (Arg3, 0x08, CDW3)
                SUPP = CDW2 /* \_SB_.PCI0._OSC.CDW2 */
                CTRL = CDW3 /* \_SB_.PCI0._OSC.CDW3 */
                CTRL = (CTRL & 0x1F)
                If ((Arg1 != One))
                {
                    CDW1 = (CDW1 | 0x08)  // Note: Unknown revision
                }

                If ((CDW3 != CTRL))
                {
                    CDW1 = (CDW1 | 0x10)  // Note: Capabilities bits were masked
                }

                CDW3 = CTRL /* \_SB_.PCI0.CTRL */
                Return (Arg3)
            }
            Else
            {
                CDW1 = (CDW1 | 0x04)
                Return (Arg3)
            }
        }

This seems to indicate that we support MSI but not PCIe native hotplug. Note: We need to check the kernel output when PCIe native hotplug is enabled in our kernel, as both cannot be enabled at the same time.

[    0.176004] acpi PNP0A08:00: _OSC: OS supports [ExtendedConfig Segments MSI]
[    0.176515] acpi PNP0A08:00: _OSC: not requesting OS control; OS requires [ExtendedConfig ASPM ClockPM MSI]
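
The arithmetic in the _OSC method above can be followed more easily in C. The sketch below mirrors the ASL step by step; the osc_buf structure, the pci_osc() function and the uuid_matches flag (which replaces the ToUUID comparison) are hypothetical. The error bit names follow the ones Linux uses for the status bits in the first DWORD.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define OSC_INVALID_UUID_ERROR      0x04
#define OSC_INVALID_REVISION_ERROR  0x08
#define OSC_CAPABILITIES_MASK_ERROR 0x10
#define OSC_PCI_CONTROL_MASKS       0x1f

/* The three DWORDs of the capabilities buffer passed as Arg3. */
struct osc_buf {
    uint32_t cdw1;   /* status bits returned to the OS */
    uint32_t cdw2;   /* support field */
    uint32_t cdw3;   /* control field */
};

/* Model of _OSC: revision is Arg1, buf is Arg3 and is updated in place. */
void pci_osc(int uuid_matches, uint32_t revision, struct osc_buf *buf)
{
    uint32_t ctrl;
    assert(buf != NULL);
    if (!uuid_matches) {
        buf->cdw1 |= OSC_INVALID_UUID_ERROR;
        return;
    }
    ctrl = buf->cdw3 & OSC_PCI_CONTROL_MASKS;    /* CTRL = (CTRL & 0x1F) */
    if (revision != 1) {
        buf->cdw1 |= OSC_INVALID_REVISION_ERROR;
    }
    if (buf->cdw3 != ctrl) {
        buf->cdw1 |= OSC_CAPABILITIES_MASK_ERROR; /* some bits were masked */
    }
    buf->cdw3 = ctrl;
}
```

A request for control bits 0x3f, for example, comes back as 0x1f with the capabilities-masked bit set in the first DWORD.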

Linux Hotplug Notify handling

Whether a device is handled through ACPI PCI hotplug is decided by the following logic in acpiphp_add_context():

if ((acpi_pci_check_ejectable(pbus, handle) || is_dock_device(adev))
&& !(pdev && pdev->is_hotplug_bridge && pciehp_is_native(pdev))) 

So an ejectable device (i.e. one with an _EJ0 method or a nonzero _RMV) that is not a native pciehp device, as determined by OSC_PCI_EXPRESS_NATIVE_HP_CONTROL being set on its host bridge, will be handled via ACPI PCI hotplug.

acpi_pci_check_ejectable()
        if (!acpi_has_method(handle, "_ADR"))
                return 0;
        if (acpi_has_method(handle, "_EJ0"))
                return 1;
        status = acpi_evaluate_integer(handle, "_RMV", NULL, &removable);
        if (ACPI_SUCCESS(status) && removable)
                return 1;
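
The decision can be restated as a small self-contained C model. It is a sketch only: the acpi_node structure reduces the ACPI namespace lookups to flags, the function names are hypothetical, and the pdev checks from the kernel condition are folded into the is_hotplug_bridge and pciehp_native arguments.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical summary of what the ACPI namespace says about one device. */
struct acpi_node {
    int has_adr;     /* an _ADR object exists */
    int has_ej0;     /* an _EJ0 method exists */
    int has_rmv;     /* _RMV evaluated successfully */
    int rmv_value;   /* value returned by _RMV */
};

/* Mirrors the checks quoted from acpi_pci_check_ejectable(). */
int node_is_ejectable(const struct acpi_node *n)
{
    assert(n != NULL);
    if (!n->has_adr) {
        return 0;
    }
    if (n->has_ej0) {
        return 1;
    }
    if (n->has_rmv && n->rmv_value) {
        return 1;
    }
    return 0;
}

/* Mirrors the condition quoted from acpiphp_add_context(). */
int handled_by_acpiphp(const struct acpi_node *n, int is_dock,
                       int is_hotplug_bridge, int pciehp_native)
{
    return (node_is_ejectable(n) || is_dock) &&
           !(is_hotplug_bridge && pciehp_native);
}
```

With the DSDT excerpt shown earlier, S08 (which has _ADR and _EJ0) is ejectable, while S00 (which has only _ADR) is not.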

Linux Notify handling

Notify() is a standard ACPI operator. When the AML executes it, the guest OS is called with the target device and the notify value. For PCI hotplug this call is handled by acpiphp_hotplug_notify() in drivers/pci/hotplug/acpiphp_glue.c.

The notify handler is registered as part of acpiphp_init_context() through the following call sequence:

pci_acpi_scan_root()
pci_create_root_bus()
pci_register_host_bridge()
pcibios_add_bus()
acpi_pci_add_bus()
acpiphp_enumerate_slots()
acpiphp_init_context()

This results in the invocation of hotplug_event(), which performs the bus check, device check or eject logic depending on the value passed to the ACPI Notify() call.

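static void hotplug_event(u32 type, struct acpiphp_context *context)
{
        switch (type) {
        case ACPI_NOTIFY_BUS_CHECK:
                /* bus check handling elided */
                break;
        case ACPI_NOTIFY_DEVICE_CHECK:
                /* device check handling elided */
                break;
        case ACPI_NOTIFY_EJECT_REQUEST:
                /* eject handling elided */
                break;
        }
}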

The DVNT calls in PCNT map to these cases as follows

            DVNT (PCIU, One)	-> ACPI_NOTIFY_DEVICE_CHECK
            DVNT (PCID, 0x03)	-> ACPI_NOTIFY_EJECT_REQUEST

/*
 * Standard notify values
 */
#define ACPI_NOTIFY_BUS_CHECK           (u8) 0x00
#define ACPI_NOTIFY_DEVICE_CHECK        (u8) 0x01
#define ACPI_NOTIFY_DEVICE_WAKE         (u8) 0x02
#define ACPI_NOTIFY_EJECT_REQUEST       (u8) 0x03
#define ACPI_NOTIFY_DEVICE_CHECK_LIGHT  (u8) 0x04
#define ACPI_NOTIFY_FREQUENCY_MISMATCH  (u8) 0x05
#define ACPI_NOTIFY_BUS_MODE_MISMATCH   (u8) 0x06
#define ACPI_NOTIFY_POWER_FAULT         (u8) 0x07
#define ACPI_NOTIFY_CAPABILITIES_CHECK  (u8) 0x08
#define ACPI_NOTIFY_DEVICE_PLD_CHECK    (u8) 0x09
#define ACPI_NOTIFY_RESERVED            (u8) 0x0A
#define ACPI_NOTIFY_LOCALITY_UPDATE     (u8) 0x0B
#define ACPI_NOTIFY_SHUTDOWN_REQUEST    (u8) 0x0C
#define ACPI_NOTIFY_AFFINITY_UPDATE     (u8) 0x0D
#define ACPI_NOTIFY_MEMORY_UPDATE       (u8) 0x0E

Hotplug Summary

To summarize:

  • ACPI hotplug tables are created by QEMU and populated with the IO Aperture, Field definitions and methods.
  • The VM is started. The DSDT table is provisioned with the methods and the devices to describe how the PCI bus needs to be rescanned.
  • The Operating system scans the DSDT table and registers the (notify) handlers for each bus.
  • We also define the GED device with a set of interrupts mapped to it and the associated methods to invoke.
  • For each GED interrupt we map an ACPI method. In the case of PCI, this is PCNT.
  • When hotplugging a device through the QEMU monitor, qdev_device_add() gets called, which invokes the hotplug handler that will trigger the interrupt associated with the event for PCI hotplug.
  • Upon reception of the interrupt, the guest OS will invoke the ACPI method defined through the DSDT table that is associated with that particular bus.
  • PCNT, which is the method associated with the interrupt, invokes DVNT on PCIU and PCID.
  • The method will trigger a scan of every slot marked as UP and eject every slot marked as DOWN from the PCI bus. The guest OS will probe new drivers for every PCI device discovered.
  • DVNT calls ACPI Notify().
  • Notify() is implemented in the Linux kernel, which performs the discovery within the OS.
  • PCNT also results in reads from and writes to the IO Fields. These are handled by synchronous QEMU callback functions, which lets QEMU know that the ACPI event has been processed/handled within the kernel.