OCP Virtualization Nvidia Pass-Through

Official Docs:

https://docs.openshift.com/container-platform/4.11/virt/virtual_machines/advanced_vm_management/virt-configuring-pci-passthrough.html

https://www.youtube.com/watch?v=lud0C-K3ya0

https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/openshift/nvidia-gpu-operator-openshift-virtualization-vgpu-enablement.html#nvidia-gpu-operator-with-openshift-virtualization

https://kubevirt.io/user-guide/virtual_machines/host-devices/

There is this too, but I haven't tried it:
https://github.com/NVIDIA/kubevirt-gpu-device-plugin

My Journey...

Links in chronological order (Or skip to the end 😏):
https://developer.nvidia.com/blog/gpu-containers-runtime/
https://www.fastcompression.com/pub/2020/CNS20856.pdf
https://developer.download.nvidia.com/compute/DevZone/C/html_x64/6_Advanced/simpleHyperQ/doc/HyperQ.pdf
https://developer.nvidia.com/blog/maximizing-gromacs-throughput-with-multiple-simulations-per-gpu-using-mps-and-mig/
https://developer.nvidia.com/blog/improving-gpu-utilization-in-kubernetes/
http://www.bytefold.com/sharing-gpu-in-kubernetes/

Operator Specific: https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/platform-support.html

My GitHub issue and gist on MPS:
NVIDIA/gpu-operator#420
https://gist.github.com/singlecheeze/d9b1f0b02b650e4499ac5e72937d7256

Work done by Amazon to effectively use vGPU and MPS:
https://aws.amazon.com/blogs/opensource/virtual-gpu-device-plugin-for-inference-workload-in-kubernetes/
HAPPY READING 😀

If the Nvidia Operator is installed:

Label GPU Node(s)

[dave@lenovo ~]$ oc label node r730ocp3.localdomain --overwrite nvidia.com/gpu.workload.config=vm-passthrough
node/r730ocp3.localdomain labeled
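
To verify the label, you can show it as a node column (adjust node names to your environment):

oc get nodes -L nvidia.com/gpu.workload.config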

Disable sandboxDevicePlugin/sandboxWorkloads/vfioManager in the Nvidia Operator ClusterPolicy
Note: I think the Nvidia Operator has the ability to manage the VFIO devices itself, but the docs are really slim, so I just disable it to keep the pods out of CrashLoopBackOff.

  sandboxDevicePlugin:
    enabled: false
  validator:
    plugin:
      env:
        - name: WITH_WORKLOAD
          value: 'true'
  nodeStatusExporter:
    enabled: true
  daemonsets: {}
  sandboxWorkloads:
    defaultWorkload: vm-passthrough
    enabled: false
  vgpuManager:
    enabled: false
  vfioManager:
    enabled: false
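
These fields live in the operator's ClusterPolicy. One way to make the edit (assuming the default resource name gpu-cluster-policy; check yours with oc get clusterpolicy):

oc edit clusterpolicy gpu-cluster-policy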

Log in to the OCP hosts and list PCI devices:

[core@r730ocp3 ~]$ lspci -nnk -d 10de: | grep -E 'VGA compatible controller|3D controller'
03:00.0 3D controller [0302]: NVIDIA Corporation GP104GL [Tesla P4] [10de:1bb3] (rev a1)
82:00.0 VGA compatible controller [0300]: NVIDIA Corporation GP104 [GeForce GTX 1070 Ti] [10de:1b82] (rev a1)

[core@r730ocp4 ~]$ lspci -nnk -d 10de: | grep -E 'VGA compatible controller|3D controller'
82:00.0 VGA compatible controller [0300]: NVIDIA Corporation GP104 [GeForce GTX 1070 Ti] [10de:1b82] (rev a1)

[core@r730ocp5 ~]$ lspci -nnk -d 10de: | grep -E 'VGA compatible controller|3D controller'
82:00.0 VGA compatible controller [0300]: NVIDIA Corporation GP104 [GeForce GTX 1070 Ti] [10de:1b82] (rev a1)

[core@trt2ocp1 ~]$ lspci -nnk -d 10de: | grep -E 'VGA compatible controller|3D controller'
04:00.0 VGA compatible controller [0300]: NVIDIA Corporation GP104 [GeForce GTX 1070 Ti] [10de:1b82] (rev a1)
05:00.0 VGA compatible controller [0300]: NVIDIA Corporation GP104 [GeForce GTX 1070 Ti] [10de:1b82] (rev a1)
06:00.0 VGA compatible controller [0300]: NVIDIA Corporation GP104 [GeForce GTX 1080] [10de:1b80] (rev a1)
07:00.0 VGA compatible controller [0300]: NVIDIA Corporation GP104 [GeForce GTX 1080] [10de:1b80] (rev a1)
08:00.0 VGA compatible controller [0300]: NVIDIA Corporation GP104 [GeForce GTX 1080] [10de:1b80] (rev a1)
0c:00.0 VGA compatible controller [0300]: NVIDIA Corporation GP104 [GeForce GTX 1070 Ti] [10de:1b82] (rev a1)
0d:00.0 VGA compatible controller [0300]: NVIDIA Corporation GP104 [GeForce GTX 1080] [10de:1b80] (rev a1)
0e:00.0 VGA compatible controller [0300]: NVIDIA Corporation GP104 [GeForce GTX 1080] [10de:1b80] (rev a1)
0f:00.0 VGA compatible controller [0300]: NVIDIA Corporation GP104 [GeForce GTX 1080] [10de:1b80] (rev a1)

[core@trt2ocp2 ~]$ lspci -nnk -d 10de: | grep -E 'VGA compatible controller|3D controller'
04:00.0 VGA compatible controller [0300]: NVIDIA Corporation GP104 [GeForce GTX 1070 Ti] [10de:1b82] (rev a1)
05:00.0 VGA compatible controller [0300]: NVIDIA Corporation GP104 [GeForce GTX 1070 Ti] [10de:1b82] (rev a1)
06:00.0 VGA compatible controller [0300]: NVIDIA Corporation GP104 [GeForce GTX 1080] [10de:1b80] (rev a1)
07:00.0 VGA compatible controller [0300]: NVIDIA Corporation GP104 [GeForce GTX 1080] [10de:1b80] (rev a1)
08:00.0 VGA compatible controller [0300]: NVIDIA Corporation GP104 [GeForce GTX 1080] [10de:1b80] (rev a1)
0c:00.0 VGA compatible controller [0300]: NVIDIA Corporation GP104 [GeForce GTX 1070 Ti] [10de:1b82] (rev a1)
0d:00.0 3D controller [0302]: NVIDIA Corporation GA102GL [A40] [10de:2235] (rev a1)
0e:00.0 VGA compatible controller [0300]: NVIDIA Corporation GP104 [GeForce GTX 1080] [10de:1b80] (rev a1)
0f:00.0 VGA compatible controller [0300]: NVIDIA Corporation GP104 [GeForce GTX 1080] [10de:1b80] (rev a1)

Enable the IOMMU driver on the hosts and bind the cards to be passed through to the vfio-pci driver:

Note: As of OCP 4.11.17/OCP Virt 4.11.1, any accompanying device, such as the audio controller sometimes present on consumer graphics cards (e.g. the NVIDIA Corporation GP104 High Definition Audio Controller [10de:10f0]), must be included in the devices passed to the vfio-pci driver.
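
To find the audio function device IDs on a host (the audio controller shows up as a separate PCI function next to the GPU), something like this will list them:

lspci -nn -d 10de: | grep -i audio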

Create a Butane file for the worker nodes

variant: openshift
version: 4.11.0
metadata:
  name: nvidia-iommu-vfio-worker-trt2
  labels:
    machineconfiguration.openshift.io/role: worker
openshift:
  kernel_arguments:
    - intel_iommu=on
storage:
  files:
    - path: /etc/modprobe.d/vfio.conf
      mode: 0644
      overwrite: true
      contents:
        inline: |
          options vfio-pci ids=10de:1b82,10de:1b80,10de:2235,10de:10f0
    - path: /etc/modules-load.d/vfio-pci.conf
      mode: 0644
      overwrite: true
      contents:
        inline: vfio-pci

Create a Butane file for the master nodes (if they are schedulable too)

variant: openshift
version: 4.11.0
metadata:
  name: nvidia-iommu-vfio-master-r730
  labels:
    machineconfiguration.openshift.io/role: master
openshift:
  kernel_arguments:
    - intel_iommu=on
storage:
  files:
    - path: /etc/modprobe.d/vfio.conf
      mode: 0644
      overwrite: true
      contents:
        inline: |
          options vfio-pci ids=10de:1bb3,10de:1b82,10de:10f0
    - path: /etc/modules-load.d/vfio-pci.conf
      mode: 0644
      overwrite: true
      contents:
        inline: vfio-pci

Build the MachineConfigs

[dave@lenovo]$ butane nvidia-iommu-vfio-master-r730.bu -o nvidia-iommu-vfio-master-r730.yaml

[dave@lenovo]$ cat nvidia-iommu-vfio-master-r730.yaml 
# Generated by Butane; do not edit
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: master
  name: nvidia-iommu-vfio-master-r730
spec:
  config:
    ignition:
      version: 3.2.0
    storage:
      files:
        - contents:
            compression: ""
            source: data:,options%20vfio-pci%20ids%3D10de%3A1bb3%2C10de%3A1b82%0A
          mode: 420
          overwrite: true
          path: /etc/modprobe.d/vfio.conf
        - contents:
            compression: ""
            source: data:,vfio-pci
          mode: 420
          overwrite: true
          path: /etc/modules-load.d/vfio-pci.conf
  kernelArguments:
    - intel_iommu=on
[dave@lenovo worker]$ butane nvidia-iommu-vfio-worker-trt2.bu -o nvidia-iommu-vfio-worker-trt2.yaml

[dave@lenovo worker]$ cat nvidia-iommu-vfio-worker-trt2.yaml
# Generated by Butane; do not edit
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: worker
  name: nvidia-iommu-vfio-worker-trt2
spec:
  config:
    ignition:
      version: 3.2.0
    storage:
      files:
        - contents:
            compression: ""
            source: data:,options%20vfio-pci%20ids%3D10de%3A1b82%2C10de%3A1b80%2C10de%3A2235%0A
          mode: 420
          overwrite: true
          path: /etc/modprobe.d/vfio.conf
        - contents:
            compression: ""
            source: data:,vfio-pci
          mode: 420
          overwrite: true
          path: /etc/modules-load.d/vfio-pci.conf
  kernelArguments:
    - intel_iommu=on

Apply the MachineConfigs (this will reboot the hosts)

[dave@lenovo worker]$ oc create -f nvidia-iommu-vfio-worker-trt2.yaml 
machineconfig.machineconfiguration.openshift.io/nvidia-iommu-vfio-worker-trt2 created

[dave@lenovo worker]$ oc create -f nvidia-iommu-vfio-master-r730.yaml 
machineconfig.machineconfiguration.openshift.io/nvidia-iommu-vfio-master-r730 created
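
The MachineConfigPools will roll the change out node by node; you can watch the reboots progress with:

oc get machineconfigpool -w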

Once the hosts reboot, verify IOMMU and VFIO
Note: Disregard the HD Audio device entries. If the kernel driver in use is listed as nvidia, go back and check the PCI device IDs: the Nvidia Operator is installed and its host driver is controlling the device, which will not allow passthrough to work.

[core@r730ocp4 ~]$ lspci -nnk -d 10de:
82:00.0 VGA compatible controller [0300]: NVIDIA Corporation GP104 [GeForce GTX 1070 Ti] [10de:1b82] (rev a1)
	Subsystem: eVga.com. Corp. Device [3842:5671]
	Kernel driver in use: vfio-pci
	Kernel modules: nouveau

[core@r730ocp4 ~]$ dmesg | grep -i -e DMAR -e NVIDIA
[    0.000000] DMAR: IOMMU enabled
[    0.001004] DMAR: Host address width 46
[    0.002001] DMAR: DRHD base: 0x000000fbffc000 flags: 0x0
[    0.003005] DMAR: dmar0: reg_base_addr fbffc000 ver 1:0 cap d2078c106f0466 ecap f020df
[    0.004001] DMAR: DRHD base: 0x000000c7ffc000 flags: 0x1
[    0.005004] DMAR: dmar1: reg_base_addr c7ffc000 ver 1:0 cap d2078c106f0466 ecap f020df
[    0.006001] DMAR: ATSR flags: 0x0
[    0.007002] DMAR: ATSR flags: 0x0
[    0.008003] DMAR-IR: IOAPIC id 10 under DRHD base  0xfbffc000 IOMMU 0
[    0.009001] DMAR-IR: IOAPIC id 8 under DRHD base  0xc7ffc000 IOMMU 1
[    0.010001] DMAR-IR: IOAPIC id 9 under DRHD base  0xc7ffc000 IOMMU 1
[    0.011001] DMAR-IR: HPET id 0 under DRHD base 0xc7ffc000
[    0.012001] DMAR-IR: x2apic is disabled because BIOS sets x2apic opt out bit.
[    0.012002] DMAR-IR: Use 'intremap=no_x2apic_optout' to override the BIOS setting.
[    0.015122] DMAR-IR: Enabled IRQ remapping in xapic mode
[    5.373467] DMAR: No RMRR found
[    5.376976] DMAR: dmar0: Using Queued invalidation
[    5.382327] DMAR: dmar1: Using Queued invalidation
[    6.476324] DMAR: Intel(R) Virtualization Technology for Directed I/O
[   30.146298] input: HDA NVidia HDMI/DP,pcm=3 as /devices/pci0000:80/0000:80:02.0/0000:82:00.1/sound/card0/input5
[   30.146348] input: HDA NVidia HDMI/DP,pcm=7 as /devices/pci0000:80/0000:80:02.0/0000:82:00.1/sound/card0/input6
[   30.146386] input: HDA NVidia HDMI/DP,pcm=8 as /devices/pci0000:80/0000:80:02.0/0000:82:00.1/sound/card0/input7
[   30.146424] input: HDA NVidia HDMI/DP,pcm=9 as /devices/pci0000:80/0000:80:02.0/0000:82:00.1/sound/card0/input8
[   30.146459] input: HDA NVidia HDMI/DP,pcm=10 as /devices/pci0000:80/0000:80:02.0/0000:82:00.1/sound/card0/input9
[   30.146495] input: HDA NVidia HDMI/DP,pcm=11 as /devices/pci0000:80/0000:80:02.0/0000:82:00.1/sound/card0/input10
[   30.146533] input: HDA NVidia HDMI/DP,pcm=12 as /devices/pci0000:80/0000:80:02.0/0000:82:00.1/sound/card0/input11
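
Optionally, confirm the vfio modules loaded as well:

lsmod | grep vfio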

Add GPU resources to the HyperConverged CR:

Update the HyperConverged Custom Resource so that all GPU/vGPU devices in your cluster are permitted and can be assigned to OpenShift Virtualization VMs.
Note: pciDeviceSelector does not match the pciVendorSelector field shown in some docs. You can name the devices whatever you'd like, since assignment is done by the pciDeviceSelector. It also helps if all of your nodes have cards populated in the same slot; mine do not, which is why the list of device selectors is so long. This list includes card device selectors from both my R730 and TRT2 servers.

apiVersion: hco.kubevirt.io/v1beta1
kind: HyperConverged
metadata:
  name: kubevirt-hyperconverged
  namespace: openshift-cnv
spec:
  permittedHostDevices:
    pciHostDevices:
    - pciDeviceSelector: "10de:1bb3"
      resourceName: "nvidia.com/GP104GL_Tesla_P4"
    - pciDeviceSelector: "10de:1b82"
      resourceName: "nvidia.com/GP104_GeForce_GTX_1070_Ti"
    - pciDeviceSelector: "10de:1b80"
      resourceName: "nvidia.com/GP104_GeForce_GTX_1080"
    - pciDeviceSelector: "10de:2235"
      resourceName: "nvidia.com/GA102GL_A40"

Apply the edit to kubevirt-hyperconverged

[dave@lenovo]$ oc edit hyperconverged kubevirt-hyperconverged -n openshift-cnv
hyperconverged.hco.kubevirt.io/kubevirt-hyperconverged edited
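
Once the edit is applied, the new resource names should appear under the nodes' Capacity/Allocatable; a quick check (using one of my node names):

oc describe node r730ocp3.localdomain | grep nvidia.com/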

Add GPU to Virtual Machine:

Note: You can add the device via YAML or the OCP GUI; if using the GUI, make sure to click the little check mark!

spec:
  template:
    spec:
      domain:
        devices:
          hostDevices:
            - deviceName: nvidia.com/GP104GL_Tesla_P4
              name: gpu1

Note: The Nvidia drivers must now be installed inside the VM.
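
Once the guest driver is installed, a quick sanity check from inside the VM (assuming a Linux guest):

lspci -nnk | grep -iA3 nvidia
nvidia-smi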

Troubleshooting:

VM is not Schedulable
Note: You may encounter the condition below if the Nvidia Operator is still trying to claim the GPU or if the device name was requested with different case.

status:
  conditions:
    - lastProbeTime: null
      lastTransitionTime: '2022-11-22T18:05:26Z'
      message: >-
        0/5 nodes are available: 4 node(s) didn't match Pod's node
        affinity/selector, 5 Insufficient nvidia.com/GP104GL_Tesla_P4.
        preemption: 0/5 nodes are available: 1 No preemption victims found for
        incoming pod, 4 Preemption is not helpful for scheduling.
      reason: Unschedulable
      status: 'False'
      type: PodScheduled
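
One way to confirm, from the cluster side, which kernel driver currently owns the GPU on a suspect node (a sketch using a debug pod; substitute your node name):

oc debug node/r730ocp3.localdomain -- chroot /host lspci -nnk -d 10de: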

The node has no allocatable devices (Notice device name case):

[dave@lenovo worker]$ oc describe node r730ocp3.localdomain
Name:               r730ocp3.localdomain
Roles:              master,worker
Capacity:
  cpu:                                   48
  devices.kubevirt.io/kvm:               1k
  devices.kubevirt.io/sev:               1k
  devices.kubevirt.io/tun:               1k
  devices.kubevirt.io/vhost-net:         1k
  ephemeral-storage:                     975688684Ki
  hugepages-1Gi:                         0
  hugepages-2Mi:                         0
  k8s.kuartis.com/vgpu:                  0
  memory:                                197807548Ki
  nvidia.com/GP104GL_TESLA_P4:           0
  nvidia.com/GP104_GEFORCE_GTX_1070_TI:  0
  nvidia.com/gpu:                        2
  pods:                                  250
Allocatable:
  cpu:                                   47500m
  devices.kubevirt.io/kvm:               1k
  devices.kubevirt.io/sev:               0
  devices.kubevirt.io/tun:               1k
  devices.kubevirt.io/vhost-net:         1k
  ephemeral-storage:                     899194689686
  hugepages-1Gi:                         0
  hugepages-2Mi:                         0
  k8s.kuartis.com/vgpu:                  0
  memory:                                196656572Ki
  nvidia.com/GP104GL_TESLA_P4:           0
  nvidia.com/GP104_GEFORCE_GTX_1070_TI:  0
  nvidia.com/gpu:                        0
  pods:                                  250

After removing the Nvidia Operator and rebooting the node:

Capacity:
  cpu:                                   48
  devices.kubevirt.io/kvm:               1k
  devices.kubevirt.io/sev:               1k
  devices.kubevirt.io/tun:               1k
  devices.kubevirt.io/vhost-net:         1k
  ephemeral-storage:                     975688684Ki
  hugepages-1Gi:                         0
  hugepages-2Mi:                         0
  k8s.kuartis.com/vgpu:                  0
  memory:                                197807548Ki
  nvidia.com/GP104GL_TESLA_P4:           0
  nvidia.com/GP104GL_Tesla_P4:           1
  nvidia.com/GP104_GEFORCE_GTX_1070_TI:  0
  nvidia.com/GP104_GeForce_GTX_1070_Ti:  1
  nvidia.com/gpu:                        0
  pods:                                  250
Allocatable:
  cpu:                                   47500m
  devices.kubevirt.io/kvm:               1k
  devices.kubevirt.io/sev:               0
  devices.kubevirt.io/tun:               1k
  devices.kubevirt.io/vhost-net:         1k
  ephemeral-storage:                     899194689686
  hugepages-1Gi:                         0
  hugepages-2Mi:                         0
  k8s.kuartis.com/vgpu:                  0
  memory:                                196656572Ki
  nvidia.com/GP104GL_TESLA_P4:           0
  nvidia.com/GP104GL_Tesla_P4:           1
  nvidia.com/GP104_GEFORCE_GTX_1070_TI:  0
  nvidia.com/GP104_GeForce_GTX_1070_Ti:  1
  nvidia.com/gpu:                        0
  pods:                                  250