OCP Virtualization Nvidia Pass-Through

Official Docs:

https://docs.openshift.com/container-platform/4.11/virt/virtual_machines/advanced_vm_management/virt-configuring-pci-passthrough.html

https://www.youtube.com/watch?v=lud0C-K3ya0

https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/openshift/nvidia-gpu-operator-openshift-virtualization-vgpu-enablement.html#nvidia-gpu-operator-with-openshift-virtualization

https://kubevirt.io/user-guide/virtual_machines/host-devices/

There is this too, but I haven't tried it:
https://github.com/NVIDIA/kubevirt-gpu-device-plugin

My Journey...

Links in chronological order (Or skip to the end 😏):
https://developer.nvidia.com/blog/gpu-containers-runtime/
https://www.fastcompression.com/pub/2020/CNS20856.pdf
https://developer.download.nvidia.com/compute/DevZone/C/html_x64/6_Advanced/simpleHyperQ/doc/HyperQ.pdf
https://developer.nvidia.com/blog/maximizing-gromacs-throughput-with-multiple-simulations-per-gpu-using-mps-and-mig/
https://developer.nvidia.com/blog/improving-gpu-utilization-in-kubernetes/
http://www.bytefold.com/sharing-gpu-in-kubernetes/

Operator Specific: https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/platform-support.html

My GitHub issue and gist on MPS:
NVIDIA/gpu-operator#420
https://gist.github.com/singlecheeze/d9b1f0b02b650e4499ac5e72937d7256

Work done by Amazon to effectively use vGPU and MPS:
https://aws.amazon.com/blogs/opensource/virtual-gpu-device-plugin-for-inference-workload-in-kubernetes/
HAPPY READING 😀

If the Nvidia Operator is installed:

Label GPU Node(s)

[dave@lenovo ~]$ oc label node r730ocp3.localdomain --overwrite nvidia.com/gpu.workload.config=vm-passthrough
node/r730ocp3.localdomain labeled
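
To verify the label, you can show it as a node column (adjust node names to your environment):

oc get nodes -L nvidia.com/gpu.workload.config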

Disable sandboxDevicePlugin/sandboxWorkloads/vfioManager in the Nvidia Operator ClusterPolicy
Note: I think the Nvidia Operator has the ability to manage the VFIO devices itself, but the docs are really slim, so I just disable it to keep the pods out of CrashLoopBackOff.

  sandboxDevicePlugin:
    enabled: false
  validator:
    plugin:
      env:
        - name: WITH_WORKLOAD
          value: 'true'
  nodeStatusExporter:
    enabled: true
  daemonsets: {}
  sandboxWorkloads:
    defaultWorkload: vm-passthrough
    enabled: false
  vgpuManager:
    enabled: false
  vfioManager:
    enabled: false
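
These fields live in the operator's ClusterPolicy. One way to make the edit (assuming the default resource name gpu-cluster-policy; check yours with oc get clusterpolicy):

oc edit clusterpolicy gpu-cluster-policy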

Log in to the OCP hosts and list PCI devices:

[core@r730ocp3 ~]$ lspci -nnk -d 10de: | grep -E 'VGA compatible controller|3D controller'
03:00.0 3D controller [0302]: NVIDIA Corporation GP104GL [Tesla P4] [10de:1bb3] (rev a1)
82:00.0 VGA compatible controller [0300]: NVIDIA Corporation GP104 [GeForce GTX 1070 Ti] [10de:1b82] (rev a1)

[core@r730ocp4 ~]$ lspci -nnk -d 10de: | grep -E 'VGA compatible controller|3D controller'
82:00.0 VGA compatible controller [0300]: NVIDIA Corporation GP104 [GeForce GTX 1070 Ti] [10de:1b82] (rev a1)

[core@r730ocp5 ~]$ lspci -nnk -d 10de: | grep -E 'VGA compatible controller|3D controller'
82:00.0 VGA compatible controller [0300]: NVIDIA Corporation GP104 [GeForce GTX 1070 Ti] [10de:1b82] (rev a1)

[core@trt2ocp1 ~]$ lspci -nnk -d 10de: | grep -E 'VGA compatible controller|3D controller'
04:00.0 VGA compatible controller [0300]: NVIDIA Corporation GP104 [GeForce GTX 1070 Ti] [10de:1b82] (rev a1)
05:00.0 VGA compatible controller [0300]: NVIDIA Corporation GP104 [GeForce GTX 1070 Ti] [10de:1b82] (rev a1)
06:00.0 VGA compatible controller [0300]: NVIDIA Corporation GP104 [GeForce GTX 1080] [10de:1b80] (rev a1)
07:00.0 VGA compatible controller [0300]: NVIDIA Corporation GP104 [GeForce GTX 1080] [10de:1b80] (rev a1)
08:00.0 VGA compatible controller [0300]: NVIDIA Corporation GP104 [GeForce GTX 1080] [10de:1b80] (rev a1)
0c:00.0 VGA compatible controller [0300]: NVIDIA Corporation GP104 [GeForce GTX 1070 Ti] [10de:1b82] (rev a1)
0d:00.0 VGA compatible controller [0300]: NVIDIA Corporation GP104 [GeForce GTX 1080] [10de:1b80] (rev a1)
0e:00.0 VGA compatible controller [0300]: NVIDIA Corporation GP104 [GeForce GTX 1080] [10de:1b80] (rev a1)
0f:00.0 VGA compatible controller [0300]: NVIDIA Corporation GP104 [GeForce GTX 1080] [10de:1b80] (rev a1)

[core@trt2ocp2 ~]$ lspci -nnk -d 10de: | grep -E 'VGA compatible controller|3D controller'
04:00.0 VGA compatible controller [0300]: NVIDIA Corporation GP104 [GeForce GTX 1070 Ti] [10de:1b82] (rev a1)
05:00.0 VGA compatible controller [0300]: NVIDIA Corporation GP104 [GeForce GTX 1070 Ti] [10de:1b82] (rev a1)
06:00.0 VGA compatible controller [0300]: NVIDIA Corporation GP104 [GeForce GTX 1080] [10de:1b80] (rev a1)
07:00.0 VGA compatible controller [0300]: NVIDIA Corporation GP104 [GeForce GTX 1080] [10de:1b80] (rev a1)
08:00.0 VGA compatible controller [0300]: NVIDIA Corporation GP104 [GeForce GTX 1080] [10de:1b80] (rev a1)
0c:00.0 VGA compatible controller [0300]: NVIDIA Corporation GP104 [GeForce GTX 1070 Ti] [10de:1b82] (rev a1)
0d:00.0 3D controller [0302]: NVIDIA Corporation GA102GL [A40] [10de:2235] (rev a1)
0e:00.0 VGA compatible controller [0300]: NVIDIA Corporation GP104 [GeForce GTX 1080] [10de:1b80] (rev a1)
0f:00.0 VGA compatible controller [0300]: NVIDIA Corporation GP104 [GeForce GTX 1080] [10de:1b80] (rev a1)

Enable the IOMMU driver on the hosts and bind the cards to be passed through to the vfio-pci driver:

Note: As of OCP 4.11.17/OCP Virt 4.11.1, any accompanying device, such as the audio controller sometimes present on consumer graphics cards (e.g. the NVIDIA Corporation GP104 High Definition Audio Controller [10de:10f0]), must be included in the devices passed to the vfio-pci driver.
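
To find the audio function device IDs on a host (the audio controller shows up as a separate PCI function next to the GPU), something like this will list them:

lspci -nn -d 10de: | grep -i audio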

Create a Butane file for the worker nodes

variant: openshift
version: 4.11.0
metadata:
  name: nvidia-iommu-vfio-worker-trt2
  labels:
    machineconfiguration.openshift.io/role: worker
openshift:
  kernel_arguments:
    - intel_iommu=on
storage:
  files:
    - path: /etc/modprobe.d/vfio.conf
      mode: 0644
      overwrite: true
      contents:
        inline: |
          options vfio-pci ids=10de:1b82,10de:1b80,10de:2235,10de:10f0
    - path: /etc/modules-load.d/vfio-pci.conf
      mode: 0644
      overwrite: true
      contents:
        inline: vfio-pci

Create a Butane file for the master nodes (if they are schedulable too)

variant: openshift
version: 4.11.0
metadata:
  name: nvidia-iommu-vfio-master-r730
  labels:
    machineconfiguration.openshift.io/role: master
openshift:
  kernel_arguments:
    - intel_iommu=on
storage:
  files:
    - path: /etc/modprobe.d/vfio.conf
      mode: 0644
      overwrite: true
      contents:
        inline: |
          options vfio-pci ids=10de:1bb3,10de:1b82,10de:10f0
    - path: /etc/modules-load.d/vfio-pci.conf
      mode: 0644
      overwrite: true
      contents:
        inline: vfio-pci

Build the MachineConfigs

[dave@lenovo]$ butane nvidia-iommu-vfio-master-r730.bu -o nvidia-iommu-vfio-master-r730.yaml

[dave@lenovo]$ cat nvidia-iommu-vfio-master-r730.yaml 
# Generated by Butane; do not edit
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: master
  name: nvidia-iommu-vfio-master-r730
spec:
  config:
    ignition:
      version: 3.2.0
    storage:
      files:
        - contents:
            compression: ""
            source: data:,options%20vfio-pci%20ids%3D10de%3A1bb3%2C10de%3A1b82%0A
          mode: 420
          overwrite: true
          path: /etc/modprobe.d/vfio.conf
        - contents:
            compression: ""
            source: data:,vfio-pci
          mode: 420
          overwrite: true
          path: /etc/modules-load.d/vfio-pci.conf
  kernelArguments:
    - intel_iommu=on
[dave@lenovo worker]$ butane nvidia-iommu-vfio-worker-trt2.bu -o nvidia-iommu-vfio-worker-trt2.yaml

[dave@lenovo worker]$ cat nvidia-iommu-vfio-worker-trt2.yaml
# Generated by Butane; do not edit
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: worker
  name: nvidia-iommu-vfio-worker-trt2
spec:
  config:
    ignition:
      version: 3.2.0
    storage:
      files:
        - contents:
            compression: ""
            source: data:,options%20vfio-pci%20ids%3D10de%3A1b82%2C10de%3A1b80%2C10de%3A2235%0A
          mode: 420
          overwrite: true
          path: /etc/modprobe.d/vfio.conf
        - contents:
            compression: ""
            source: data:,vfio-pci
          mode: 420
          overwrite: true
          path: /etc/modules-load.d/vfio-pci.conf
  kernelArguments:
    - intel_iommu=on

Apply the MachineConfigs (this will reboot the hosts)

[dave@lenovo worker]$ oc create -f nvidia-iommu-vfio-worker-trt2.yaml 
machineconfig.machineconfiguration.openshift.io/nvidia-iommu-vfio-worker-trt2 created

[dave@lenovo worker]$ oc create -f nvidia-iommu-vfio-master-r730.yaml 
machineconfig.machineconfiguration.openshift.io/nvidia-iommu-vfio-master-r730 created
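
The MachineConfigPools will roll the change out node by node; you can watch the reboots progress with:

oc get machineconfigpool -w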

Once the hosts reboot, verify IOMMU and VFIO
Note: Disregard the HD Audio device entries. If the kernel driver in use is listed as nvidia, go back and check the PCI device IDs: the Nvidia Operator is installed and its host driver is controlling the device, which will not allow passthrough to work.

[core@r730ocp4 ~]$ lspci -nnk -d 10de:
82:00.0 VGA compatible controller [0300]: NVIDIA Corporation GP104 [GeForce GTX 1070 Ti] [10de:1b82] (rev a1)
	Subsystem: eVga.com. Corp. Device [3842:5671]
	Kernel driver in use: vfio-pci
	Kernel modules: nouveau

[core@r730ocp4 ~]$ dmesg | grep -i -e DMAR -e NVIDIA
[    0.000000] DMAR: IOMMU enabled
[    0.001004] DMAR: Host address width 46
[    0.002001] DMAR: DRHD base: 0x000000fbffc000 flags: 0x0
[    0.003005] DMAR: dmar0: reg_base_addr fbffc000 ver 1:0 cap d2078c106f0466 ecap f020df
[    0.004001] DMAR: DRHD base: 0x000000c7ffc000 flags: 0x1
[    0.005004] DMAR: dmar1: reg_base_addr c7ffc000 ver 1:0 cap d2078c106f0466 ecap f020df
[    0.006001] DMAR: ATSR flags: 0x0
[    0.007002] DMAR: ATSR flags: 0x0
[    0.008003] DMAR-IR: IOAPIC id 10 under DRHD base  0xfbffc000 IOMMU 0
[    0.009001] DMAR-IR: IOAPIC id 8 under DRHD base  0xc7ffc000 IOMMU 1
[    0.010001] DMAR-IR: IOAPIC id 9 under DRHD base  0xc7ffc000 IOMMU 1
[    0.011001] DMAR-IR: HPET id 0 under DRHD base 0xc7ffc000
[    0.012001] DMAR-IR: x2apic is disabled because BIOS sets x2apic opt out bit.
[    0.012002] DMAR-IR: Use 'intremap=no_x2apic_optout' to override the BIOS setting.
[    0.015122] DMAR-IR: Enabled IRQ remapping in xapic mode
[    5.373467] DMAR: No RMRR found
[    5.376976] DMAR: dmar0: Using Queued invalidation
[    5.382327] DMAR: dmar1: Using Queued invalidation
[    6.476324] DMAR: Intel(R) Virtualization Technology for Directed I/O
[   30.146298] input: HDA NVidia HDMI/DP,pcm=3 as /devices/pci0000:80/0000:80:02.0/0000:82:00.1/sound/card0/input5
[   30.146348] input: HDA NVidia HDMI/DP,pcm=7 as /devices/pci0000:80/0000:80:02.0/0000:82:00.1/sound/card0/input6
[   30.146386] input: HDA NVidia HDMI/DP,pcm=8 as /devices/pci0000:80/0000:80:02.0/0000:82:00.1/sound/card0/input7
[   30.146424] input: HDA NVidia HDMI/DP,pcm=9 as /devices/pci0000:80/0000:80:02.0/0000:82:00.1/sound/card0/input8
[   30.146459] input: HDA NVidia HDMI/DP,pcm=10 as /devices/pci0000:80/0000:80:02.0/0000:82:00.1/sound/card0/input9
[   30.146495] input: HDA NVidia HDMI/DP,pcm=11 as /devices/pci0000:80/0000:80:02.0/0000:82:00.1/sound/card0/input10
[   30.146533] input: HDA NVidia HDMI/DP,pcm=12 as /devices/pci0000:80/0000:80:02.0/0000:82:00.1/sound/card0/input11
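
Optionally, confirm the vfio modules loaded as well:

lsmod | grep vfio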

Add GPU resources to the HyperConverged CR:

Update the HyperConverged Custom Resource so that all GPU/vGPU devices in your cluster are permitted and can be assigned to OpenShift Virtualization VMs.
Note: pciDeviceSelector does not match the pciVendorSelector field shown in some docs. You can name the devices whatever you'd like, since assignment is done by the pciDeviceSelector. It also helps if all of your nodes have cards populated in the same slot; mine do not, which is why the list of device selectors is so long. This list includes card device selectors from both my R730 and TRT2 servers.

apiVersion: hco.kubevirt.io/v1beta1
kind: HyperConverged
metadata:
  name: kubevirt-hyperconverged
  namespace: openshift-cnv
spec:
  permittedHostDevices:
    pciHostDevices:
    - pciDeviceSelector: "10de:1bb3"
      resourceName: "nvidia.com/GP104GL_Tesla_P4"
    - pciDeviceSelector: "10de:1b82"
      resourceName: "nvidia.com/GP104_GeForce_GTX_1070_Ti"
    - pciDeviceSelector: "10de:1b80"
      resourceName: "nvidia.com/GP104_GeForce_GTX_1080"
    - pciDeviceSelector: "10de:2235"
      resourceName: "nvidia.com/GA102GL_A40"

Apply the edit to kubevirt-hyperconverged

[dave@lenovo]$ oc edit hyperconverged kubevirt-hyperconverged -n openshift-cnv
hyperconverged.hco.kubevirt.io/kubevirt-hyperconverged edited
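
Once the edit is applied, the new resource names should appear under the nodes' Capacity/Allocatable; a quick check (using one of my node names):

oc describe node r730ocp3.localdomain | grep nvidia.com/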

Add GPU to Virtual Machine:

Note: You can add the device via YAML or the OCP GUI; if using the GUI, make sure to click the little check mark!

spec:
  template:
    spec:
      domain:
        devices:
          hostDevices:
            - deviceName: nvidia.com/GP104GL_Tesla_P4
              name: gpu1

Note: The Nvidia drivers must now be installed inside the VM.
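
Once the guest driver is installed, a quick sanity check from inside the VM (assuming a Linux guest):

lspci -nnk | grep -iA3 nvidia
nvidia-smi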

Troubleshooting:

VM is not Schedulable
Note: You may encounter the condition below if the Nvidia Operator is still trying to claim the GPU or if the device name was requested with different case.

status:
  conditions:
    - lastProbeTime: null
      lastTransitionTime: '2022-11-22T18:05:26Z'
      message: >-
        0/5 nodes are available: 4 node(s) didn't match Pod's node
        affinity/selector, 5 Insufficient nvidia.com/GP104GL_Tesla_P4.
        preemption: 0/5 nodes are available: 1 No preemption victims found for
        incoming pod, 4 Preemption is not helpful for scheduling.
      reason: Unschedulable
      status: 'False'
      type: PodScheduled
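
One way to confirm, from the cluster side, which kernel driver currently owns the GPU on a suspect node (a sketch using a debug pod; substitute your node name):

oc debug node/r730ocp3.localdomain -- chroot /host lspci -nnk -d 10de: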

The node has no allocatable devices (Notice device name case):

[dave@lenovo worker]$ oc describe node r730ocp3.localdomain
Name:               r730ocp3.localdomain
Roles:              master,worker
Capacity:
  cpu:                                   48
  devices.kubevirt.io/kvm:               1k
  devices.kubevirt.io/sev:               1k
  devices.kubevirt.io/tun:               1k
  devices.kubevirt.io/vhost-net:         1k
  ephemeral-storage:                     975688684Ki
  hugepages-1Gi:                         0
  hugepages-2Mi:                         0
  k8s.kuartis.com/vgpu:                  0
  memory:                                197807548Ki
  nvidia.com/GP104GL_TESLA_P4:           0
  nvidia.com/GP104_GEFORCE_GTX_1070_TI:  0
  nvidia.com/gpu:                        2
  pods:                                  250
Allocatable:
  cpu:                                   47500m
  devices.kubevirt.io/kvm:               1k
  devices.kubevirt.io/sev:               0
  devices.kubevirt.io/tun:               1k
  devices.kubevirt.io/vhost-net:         1k
  ephemeral-storage:                     899194689686
  hugepages-1Gi:                         0
  hugepages-2Mi:                         0
  k8s.kuartis.com/vgpu:                  0
  memory:                                196656572Ki
  nvidia.com/GP104GL_TESLA_P4:           0
  nvidia.com/GP104_GEFORCE_GTX_1070_TI:  0
  nvidia.com/gpu:                        0
  pods:                                  250

After removing the Nvidia Operator and rebooting the node:

Capacity:
  cpu:                                   48
  devices.kubevirt.io/kvm:               1k
  devices.kubevirt.io/sev:               1k
  devices.kubevirt.io/tun:               1k
  devices.kubevirt.io/vhost-net:         1k
  ephemeral-storage:                     975688684Ki
  hugepages-1Gi:                         0
  hugepages-2Mi:                         0
  k8s.kuartis.com/vgpu:                  0
  memory:                                197807548Ki
  nvidia.com/GP104GL_TESLA_P4:           0
  nvidia.com/GP104GL_Tesla_P4:           1
  nvidia.com/GP104_GEFORCE_GTX_1070_TI:  0
  nvidia.com/GP104_GeForce_GTX_1070_Ti:  1
  nvidia.com/gpu:                        0
  pods:                                  250
Allocatable:
  cpu:                                   47500m
  devices.kubevirt.io/kvm:               1k
  devices.kubevirt.io/sev:               0
  devices.kubevirt.io/tun:               1k
  devices.kubevirt.io/vhost-net:         1k
  ephemeral-storage:                     899194689686
  hugepages-1Gi:                         0
  hugepages-2Mi:                         0
  k8s.kuartis.com/vgpu:                  0
  memory:                                196656572Ki
  nvidia.com/GP104GL_TESLA_P4:           0
  nvidia.com/GP104GL_Tesla_P4:           1
  nvidia.com/GP104_GEFORCE_GTX_1070_TI:  0
  nvidia.com/GP104_GeForce_GTX_1070_Ti:  1
  nvidia.com/gpu:                        0
  pods:                                  250