VFIO

Not KVM bound. The VFIO API deconstructs a device into regions, IRQs, etc. The userspace application (QEMU, cloud-hypervisor, etc.) is responsible for reconstructing it into a device for e.g. a guest VM to consume.

Boot with intel_iommu=on.

IOMMU groups

Devices are grouped together for isolation, IOMMU capability and platform topology reasons. The grouping is not configurable.

These are the IOMMU groups on the host:

$ ls /sys/kernel/iommu_groups/
0  1  10  11  12  13  19  2  3  4  5  6  7  8  9

VFIO Objects

Groups

VFIO groups map one-to-one to IOMMU groups: VFIO group <-> IOMMU group.

When binding a device to the vfio-pci kernel driver, the kernel creates the corresponding group under /dev/vfio. Let's follow a simple example:

$ lspci -v
[...]
01:00.0 Unassigned class [ff00]: Realtek Semiconductor Co., Ltd. RTS525A PCI Express Card Reader (rev 01)
	Subsystem: Dell Device 07e6
	Flags: bus master, fast devsel, latency 0, IRQ 133
	Memory at dc300000 (32-bit, non-prefetchable) [size=4K]
	Capabilities: [80] Power Management version 3
	Capabilities: [90] MSI: Enable+ Count=1/1 Maskable- 64bit+
	Capabilities: [b0] Express Endpoint, MSI 00
	Capabilities: [100] Advanced Error Reporting
	Capabilities: [148] Device Serial Number 00-00-00-01-00-4c-e0-00
	Capabilities: [158] Latency Tolerance Reporting
	Capabilities: [160] L1 PM Substates
	Kernel driver in use: rtsx_pci
	Kernel modules: rtsx_pci
[...]

We have a PCI card reader.

$ readlink /sys/bus/pci/devices/0000\:01\:00.0/iommu_group
../../../../kernel/iommu_groups/12

It belongs to IOMMU group 12.

$ ls /sys/bus/pci/devices/0000\:01\:00.0/iommu_group/devices/
0000:01:00.0

It is alone in that group: there is a single PCI device in IOMMU group 12.

Next we need to unbind this device from its host kernel driver (rtsx_pci) and have the vfio-pci driver take it over, so that we can control it from userspace and from our VMM.

# Add the vfio-pci driver
$ modprobe vfio_pci

# Get the device VID/PID
$ lspci -n -s 0000:01:00.0
01:00.0 ff00: 10ec:525a (rev 01)

# Unbind it from its default driver
$ echo 0000:01:00.0 > /sys/bus/pci/devices/0000\:01\:00.0/driver/unbind

# Have vfio-pci drive it
$ echo 10ec 525a > /sys/bus/pci/drivers/vfio-pci/new_id

The whole IOMMU group this device belongs to is now driven by the vfio-pci driver. As a consequence, vfio-pci created a VFIO group:

$ ls /dev/vfio/12 
/dev/vfio/12

Userspace now has full access to the whole IOMMU group and all the devices belonging to it.

Next we need to:

  • Create a VFIO container
  • Add our VFIO group to this container
  • Map and control our device

VFIO Group API

Container

A VFIO container is a collection of VFIO groups logically bound together. Linking VFIO groups together through a VFIO container makes sense when a userspace application is going to access several VFIO groups: it is more efficient to share page tables between groups and avoid TLB thrashing.

A VFIO container by itself is not very useful, but the VFIO API for a given group is not accessible until it's added to a container.

Once added to a container, all devices from a given group can be mapped and controlled by userspace.

VFIO Container API

  • Create a container: container_fd = open("/dev/vfio/vfio", O_RDWR);
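
A minimal sketch of the container and group setup in C, assuming the group number 12 from the example above and the type-1 IOMMU backend (error handling mostly omitted):

#include <fcntl.h>
#include <sys/ioctl.h>
#include <linux/vfio.h>

int main(void)
{
	int container_fd, group_fd;
	struct vfio_group_status status = { .argsz = sizeof(status) };

	/* Create a container. */
	container_fd = open("/dev/vfio/vfio", O_RDWR);
	if (ioctl(container_fd, VFIO_GET_API_VERSION) != VFIO_API_VERSION)
		return 1;
	if (!ioctl(container_fd, VFIO_CHECK_EXTENSION, VFIO_TYPE1_IOMMU))
		return 1;

	/* Open the VFIO group created when binding the device to vfio-pci. */
	group_fd = open("/dev/vfio/12", O_RDWR);
	ioctl(group_fd, VFIO_GROUP_GET_STATUS, &status);
	if (!(status.flags & VFIO_GROUP_FLAGS_VIABLE))
		return 1; /* Not all devices in the group are bound to vfio drivers. */

	/* Add our VFIO group to the container, then set the IOMMU model. */
	ioctl(group_fd, VFIO_GROUP_SET_CONTAINER, &container_fd);
	ioctl(container_fd, VFIO_SET_IOMMU, VFIO_TYPE1_IOMMU);

	return 0;
}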

Device

A VFIO device is represented by a file descriptor. This file descriptor is returned by the VFIO_GROUP_GET_DEVICE_FD ioctl on the device's group. The ioctl takes the device address as an argument: segment:bus:device.function. In our example, this is 0000:01:00.0.

Each VFIO device resource is represented by a VFIO region, and the device file descriptor gives access to the VFIO regions.
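
As a sketch, assuming the group_fd set up in the container example above:

#include <sys/ioctl.h>
#include <linux/vfio.h>

/* Returns a VFIO device file descriptor, assuming group_fd was already
 * added to a container and the IOMMU model was set. */
int get_device_fd(int group_fd)
{
	/* The argument is the device address: segment:bus:device.function. */
	return ioctl(group_fd, VFIO_GROUP_GET_DEVICE_FD, "0000:01:00.0");
}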

VFIO Device API

  • VFIO_DEVICE_GET_INFO: Gets the device flags and number of associated regions and irqs.
struct vfio_device_info {
	__u32	argsz;
	__u32	flags;
#define VFIO_DEVICE_FLAGS_RESET	(1 << 0)	/* Device supports reset */
#define VFIO_DEVICE_FLAGS_PCI	(1 << 1)	/* vfio-pci device */
#define VFIO_DEVICE_FLAGS_PLATFORM (1 << 2)	/* vfio-platform device */
#define VFIO_DEVICE_FLAGS_AMBA  (1 << 3)	/* vfio-amba device */
#define VFIO_DEVICE_FLAGS_CCW	(1 << 4)	/* vfio-ccw device */
#define VFIO_DEVICE_FLAGS_AP	(1 << 5)	/* vfio-ap device */
	__u32	num_regions;	/* Max region index + 1 */
	__u32	num_irqs;	/* Max IRQ index + 1 */
};
  • VFIO_DEVICE_GET_REGION_INFO: Gets the size, the offset into the device file descriptor and the access flags (read, write, mmap) for a given region index.
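
A sketch querying a device and walking its regions, assuming device_fd was obtained with VFIO_GROUP_GET_DEVICE_FD as above:

#include <stdio.h>
#include <sys/ioctl.h>
#include <linux/vfio.h>

void dump_device(int device_fd)
{
	struct vfio_device_info dev_info = { .argsz = sizeof(dev_info) };
	unsigned int i;

	ioctl(device_fd, VFIO_DEVICE_GET_INFO, &dev_info);
	printf("regions: %u irqs: %u\n", dev_info.num_regions, dev_info.num_irqs);

	for (i = 0; i < dev_info.num_regions; i++) {
		struct vfio_region_info reg = {
			.argsz = sizeof(reg),
			.index = i,
		};

		if (ioctl(device_fd, VFIO_DEVICE_GET_REGION_INFO, &reg) < 0)
			continue;

		/* reg.offset is where this region lives within device_fd:
		 * it can be read/written there, or mmap'ed when FLAG_MMAP is set. */
		printf("region %u: size 0x%llx offset 0x%llx flags 0x%x\n",
		       i, (unsigned long long)reg.size,
		       (unsigned long long)reg.offset, reg.flags);
	}
}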

Regions

Each VFIO region associated with a VFIO device represents a device resource (BARs, configuration space, etc.).

VFIO and KVM

VFIO is not bound to KVM and can be used outside of the hardware virtualization context.

However, when using VFIO for assigning host devices into a KVM-based guest, we need to let KVM know about the VFIO groups we're adding or removing. The KVM API to do so is the device API. There is a VFIO KVM device type (KVM_DEV_TYPE_VFIO) and there must be one single KVM VFIO device per guest. It is not a representation of the VFIO devices we want to assign to the guest, but rather a VFIO KVM device entry point.

Any VFIO group is added to or removed from this pseudo device by setting the corresponding VFIO KVM device attribute (KVM_DEV_VFIO_GROUP_ADD or KVM_DEV_VFIO_GROUP_DEL).
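
A minimal sketch of that flow, assuming a vm_fd obtained from KVM_CREATE_VM and the group_fd from the VFIO examples above:

#include <sys/ioctl.h>
#include <linux/kvm.h>

int kvm_add_vfio_group(int vm_fd, int group_fd)
{
	struct kvm_create_device create = { .type = KVM_DEV_TYPE_VFIO };
	struct kvm_device_attr attr = {
		.group = KVM_DEV_VFIO_GROUP,
		.attr = KVM_DEV_VFIO_GROUP_ADD,
		.addr = (__u64)(unsigned long)&group_fd,
	};

	/* One single KVM VFIO pseudo device per guest. */
	if (ioctl(vm_fd, KVM_CREATE_DEVICE, &create) < 0)
		return -1;

	/* create.fd now holds the KVM VFIO device file descriptor. */
	return ioctl(create.fd, KVM_SET_DEVICE_ATTR, &attr);
}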

Interrupts

write_config_register() -> device.msi_enable() -> add_msi_routing()

When the guest programs the device with an MSI vector, and we have an interrupt event for the device, we do the following:

  • enable_msi() calls the VFIO_DEVICE_SET_IRQS() ioctl to have the kernel write to an eventfd whenever the programmed MSI is triggered (see the sketch after this list). As the eventfd has previously been associated with a guest IRQ (through register_irqfd()), the MSI triggered from the physical device will generate a guest interrupt. VFIO_DEVICE_SET_IRQS() sets an interrupt handler for the device physical interrupt, in both the MSI and legacy cases. The interrupt handler only writes to the eventfd file descriptor passed through the API. This ioctl also indirectly enables posted interrupts by calling into the irqbypass kernel API.

  • add_msi_routing() sets a GSI routing entry to map the guest IRQ to the programmed MSI vector, so that the guest handles the MSI vector and not the VMM-chosen IRQ (a sketch follows the summary below).
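
A sketch of the VFIO_DEVICE_SET_IRQS() plus irqfd association, assuming device_fd, vm_fd and a VMM-chosen gsi (hypothetical parameter names):

#include <string.h>
#include <sys/eventfd.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>
#include <linux/vfio.h>

int enable_msi(int device_fd, int vm_fd, unsigned int gsi)
{
	int efd = eventfd(0, EFD_CLOEXEC);
	char buf[sizeof(struct vfio_irq_set) + sizeof(int)];
	struct vfio_irq_set *irq_set = (struct vfio_irq_set *)buf;
	struct kvm_irqfd irqfd = { .fd = efd, .gsi = gsi };

	/* Have VFIO signal the eventfd whenever MSI vector 0 fires. */
	memset(buf, 0, sizeof(buf));
	irq_set->argsz = sizeof(buf);
	irq_set->flags = VFIO_IRQ_SET_DATA_EVENTFD | VFIO_IRQ_SET_ACTION_TRIGGER;
	irq_set->index = VFIO_PCI_MSI_IRQ_INDEX;
	irq_set->start = 0;
	irq_set->count = 1;
	memcpy(irq_set->data, &efd, sizeof(int));
	if (ioctl(device_fd, VFIO_DEVICE_SET_IRQS, irq_set) < 0)
		return -1;

	/* Tie the same eventfd to a guest IRQ (the register_irqfd() step). */
	return ioctl(vm_fd, KVM_IRQFD, &irqfd);
}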

To summarize, when a VFIO/physical device triggers an interrupt, there are two cases:

  1. The guest is running

    • The device writes to the programmed MSI vector.
    • As the guest is running, this triggers a posted, remapped interrupt directly into the guest.
  2. The guest is not running

    • The device writes to the programmed MSI vector.
    • This triggers a host interrupt.
    • VFIO catches that interrupt.
    • VFIO writes to the eventfd the VMM gave it when calling the VFIO_DEVICE_SET_IRQS() ioctl.
    • KVM receives the eventfd write.
    • KVM remaps the IRQ linked to the eventfd to a guest MSI vector. This has been set by the add_msi_routing() call (KVM GSI routing table).
    • KVM injects the MSI interrupt into the guest.
    • The guest handles the interrupt the next time it's scheduled.
    • (Should we handle the eventfd write from the VMM to force a guest run?)
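
A possible sketch of the add_msi_routing() step mentioned above, assuming vm_fd, the VMM-chosen gsi, and the MSI address/data the guest programmed into the device (hypothetical parameter names):

#include <stdlib.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

int add_msi_routing(int vm_fd, unsigned int gsi, __u64 addr, __u32 data)
{
	struct kvm_irq_routing *routing;
	int ret;

	/* A single MSI routing entry for this example; a real VMM sets the
	 * whole GSI routing table at once. */
	routing = calloc(1, sizeof(*routing) +
			    sizeof(struct kvm_irq_routing_entry));
	routing->nr = 1;
	routing->entries[0].gsi = gsi;
	routing->entries[0].type = KVM_IRQ_ROUTING_MSI;
	routing->entries[0].u.msi.address_lo = addr & 0xffffffff;
	routing->entries[0].u.msi.address_hi = addr >> 32;
	routing->entries[0].u.msi.data = data;

	ret = ioctl(vm_fd, KVM_SET_GSI_ROUTING, routing);
	free(routing);
	return ret;
}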

Example

Good set of C examples at https://github.com/awilliam/tests

Deconstructing and Reconstructing

When binding a physical device to the vfio-pci driver, we're essentially deconstructing it. In other words, we're splitting it apart into separate resources (VFIO regions). At that point the device is no longer usable as-is and is not driven by any functional host driver.

The idea behind VFIO is to completely or partially reconstruct the device in userspace. In a VM/virtualization context, the VMM reassembles those separate resources into a guest device by:

  • Building and emulating the guest device PCI configuration space. It is up to the VMM to decide what it wants to expose from the physical device configuration space.

  • Emulating all BAR MMIO reads and writes. The guest will, for example, set up DMA transfers by writing to specific offsets in specific BARs, and the VMM is responsible for trapping those writes and translating them into VFIO API calls that program the physical device accordingly.

  • Setting the IOMMU interrupt remapping, based on the device interrupt information given by the VFIO API (To be Documented)

  • Setting the DMA remapping, mostly by adding the whole guest physical memory to the IOMMU table. This is again done through the VFIO API, as sketched below. When the driver in the guest programs a DMA transfer, the VMM translates that into physical device programming via VFIO calls. The DMA transfer then starts (the PCI device becomes a memory bus master) and uses a guest physical address as either a source or destination IOVA (I/O virtual address). The IOMMU translates that IOVA into the host physical pages backing the host virtual mapping the VMM programmed through the VFIO DMA mapping API.
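
A sketch of that DMA mapping step, assuming the container_fd from the earlier example and a guest RAM area mapped in the VMM at guest_ram, placed at guest physical address 0 (hypothetical parameter names):

#include <sys/ioctl.h>
#include <linux/vfio.h>

int map_guest_memory(int container_fd, void *guest_ram, __u64 ram_size)
{
	/* IOVA == guest physical address: the device DMAs to guest addresses
	 * and the IOMMU redirects them into the VMM's guest RAM mapping. */
	struct vfio_iommu_type1_dma_map dma_map = {
		.argsz = sizeof(dma_map),
		.flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE,
		.vaddr = (__u64)(unsigned long)guest_ram,
		.iova = 0,
		.size = ram_size,
	};

	return ioctl(container_fd, VFIO_IOMMU_MAP_DMA, &dma_map);
}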

References

Debugging

VFIO kernel traces:

$ trace-cmd record -p function_graph -l vfio_*
$ trace-cmd report

KVM events and functions:

$ trace-cmd record -p function_graph -l kvm_*
$ trace-cmd report
$ trace-cmd record -e kvm_*
$ trace-cmd report