https://www.youtube.com/watch?v=lud0C-K3ya0
https://kubevirt.io/user-guide/virtual_machines/host-devices/
There is this too, but I haven't tried it:
https://github.com/NVIDIA/kubevirt-gpu-device-plugin
Links in chronological order (Or skip to the end 😏):
https://developer.nvidia.com/blog/gpu-containers-runtime/
https://www.fastcompression.com/pub/2020/CNS20856.pdf
https://developer.download.nvidia.com/compute/DevZone/C/html_x64/6_Advanced/simpleHyperQ/doc/HyperQ.pdf
https://developer.nvidia.com/blog/maximizing-gromacs-throughput-with-multiple-simulations-per-gpu-using-mps-and-mig/
https://developer.nvidia.com/blog/improving-gpu-utilization-in-kubernetes/
http://www.bytefold.com/sharing-gpu-in-kubernetes/
Operator Specific: https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/platform-support.html
My GitHub issues and gists on MPS:
NVIDIA/gpu-operator#420
https://gist.github.com/singlecheeze/d9b1f0b02b650e4499ac5e72937d7256
Work done by Amazon to effectively use vGPU and MPS:
https://aws.amazon.com/blogs/opensource/virtual-gpu-device-plugin-for-inference-workload-in-kubernetes/
HAPPY READING 😀
Label GPU Node(s)
[dave@lenovo ~]$ oc label node r730ocp3.localdomain --overwrite nvidia.com/gpu.workload.config=vm-passthrough
node/r730ocp3.localdomain labeled
Disable sandboxDevicePlugin, sandboxWorkloads, and vfioManager in the NVIDIA Operator ClusterPolicy
Note: I think the NVIDIA Operator has the ability to manage the VFIO devices itself, but the docs are really slim, so I just disable it to prevent the pods from going into CrashLoopBackOff
sandboxDevicePlugin:
  enabled: false
validator:
  plugin:
    env:
      - name: WITH_WORKLOAD
        value: 'true'
nodeStatusExporter:
  enabled: true
daemonsets: {}
sandboxWorkloads:
  defaultWorkload: vm-passthrough
  enabled: false
vgpuManager:
  enabled: false
vfioManager:
  enabled: false
[core@r730ocp3 ~]$ lspci -nnk -d 10de: | grep -E 'VGA compatible controller|3D controller'
03:00.0 3D controller [0302]: NVIDIA Corporation GP104GL [Tesla P4] [10de:1bb3] (rev a1)
82:00.0 VGA compatible controller [0300]: NVIDIA Corporation GP104 [GeForce GTX 1070 Ti] [10de:1b82] (rev a1)
[core@r730ocp4 ~]$ lspci -nnk -d 10de: | grep -E 'VGA compatible controller|3D controller'
82:00.0 VGA compatible controller [0300]: NVIDIA Corporation GP104 [GeForce GTX 1070 Ti] [10de:1b82] (rev a1)
[core@r730ocp5 ~]$ lspci -nnk -d 10de: | grep -E 'VGA compatible controller|3D controller'
82:00.0 VGA compatible controller [0300]: NVIDIA Corporation GP104 [GeForce GTX 1070 Ti] [10de:1b82] (rev a1)
[core@trt2ocp1 ~]$ lspci -nnk -d 10de: | grep -E 'VGA compatible controller|3D controller'
04:00.0 VGA compatible controller [0300]: NVIDIA Corporation GP104 [GeForce GTX 1070 Ti] [10de:1b82] (rev a1)
05:00.0 VGA compatible controller [0300]: NVIDIA Corporation GP104 [GeForce GTX 1070 Ti] [10de:1b82] (rev a1)
06:00.0 VGA compatible controller [0300]: NVIDIA Corporation GP104 [GeForce GTX 1080] [10de:1b80] (rev a1)
07:00.0 VGA compatible controller [0300]: NVIDIA Corporation GP104 [GeForce GTX 1080] [10de:1b80] (rev a1)
08:00.0 VGA compatible controller [0300]: NVIDIA Corporation GP104 [GeForce GTX 1080] [10de:1b80] (rev a1)
0c:00.0 VGA compatible controller [0300]: NVIDIA Corporation GP104 [GeForce GTX 1070 Ti] [10de:1b82] (rev a1)
0d:00.0 VGA compatible controller [0300]: NVIDIA Corporation GP104 [GeForce GTX 1080] [10de:1b80] (rev a1)
0e:00.0 VGA compatible controller [0300]: NVIDIA Corporation GP104 [GeForce GTX 1080] [10de:1b80] (rev a1)
0f:00.0 VGA compatible controller [0300]: NVIDIA Corporation GP104 [GeForce GTX 1080] [10de:1b80] (rev a1)
[core@trt2ocp2 ~]$ lspci -nnk -d 10de: | grep -E 'VGA compatible controller|3D controller'
04:00.0 VGA compatible controller [0300]: NVIDIA Corporation GP104 [GeForce GTX 1070 Ti] [10de:1b82] (rev a1)
05:00.0 VGA compatible controller [0300]: NVIDIA Corporation GP104 [GeForce GTX 1070 Ti] [10de:1b82] (rev a1)
06:00.0 VGA compatible controller [0300]: NVIDIA Corporation GP104 [GeForce GTX 1080] [10de:1b80] (rev a1)
07:00.0 VGA compatible controller [0300]: NVIDIA Corporation GP104 [GeForce GTX 1080] [10de:1b80] (rev a1)
08:00.0 VGA compatible controller [0300]: NVIDIA Corporation GP104 [GeForce GTX 1080] [10de:1b80] (rev a1)
0c:00.0 VGA compatible controller [0300]: NVIDIA Corporation GP104 [GeForce GTX 1070 Ti] [10de:1b82] (rev a1)
0d:00.0 3D controller [0302]: NVIDIA Corporation GA102GL [A40] [10de:2235] (rev a1)
0e:00.0 VGA compatible controller [0300]: NVIDIA Corporation GP104 [GeForce GTX 1080] [10de:1b80] (rev a1)
0f:00.0 VGA compatible controller [0300]: NVIDIA Corporation GP104 [GeForce GTX 1080] [10de:1b80] (rev a1)
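The `lspci -nn` listings above are the source for the vendor:device ID lists used in the vfio.conf files below. As a sketch (a hypothetical helper, not part of any tooling used here), the IDs can be scraped out of that output programmatically:

```python
import re

def vfio_ids(lspci_output: str) -> str:
    """Collect unique [vendor:device] IDs from `lspci -nn` output
    (first-seen order) into an `options vfio-pci ids=` modprobe line."""
    ids = []
    # Device IDs appear as e.g. `[10de:1b82] (rev a1)` at the end of each line.
    for m in re.finditer(r"\[([0-9a-f]{4}:[0-9a-f]{4})\]\s*\(rev", lspci_output):
        if m.group(1) not in ids:
            ids.append(m.group(1))
    return "options vfio-pci ids=" + ",".join(ids)

sample = (
    "04:00.0 VGA compatible controller [0300]: NVIDIA Corporation GP104 "
    "[GeForce GTX 1070 Ti] [10de:1b82] (rev a1)\n"
    "0d:00.0 3D controller [0302]: NVIDIA Corporation GA102GL [A40] [10de:2235] (rev a1)\n"
)
print(vfio_ids(sample))  # options vfio-pci ids=10de:1b82,10de:2235
```

Duplicate IDs collapse to one entry, which is why eight GTX 1080s still yield a single `10de:1b80` in the list.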
Note: As of OCP 4.11.17/OCP Virt 4.11.1, any accompanying device, like the audio controller sometimes present on consumer graphics cards (e.g. the NVIDIA Corporation GP104 High Definition Audio Controller [10de:10f0]), must be included in the device IDs passed to the vfio-pci driver
Create a Butane file for the worker nodes
variant: openshift
version: 4.11.0
metadata:
  name: nvidia-iommu-vfio-worker-trt2
  labels:
    machineconfiguration.openshift.io/role: worker
openshift:
  kernel_arguments:
    - intel_iommu=on
storage:
  files:
    - path: /etc/modprobe.d/vfio.conf
      mode: 0644
      overwrite: true
      contents:
        inline: |
          options vfio-pci ids=10de:1b82,10de:1b80,10de:2235,10de:10f0
    - path: /etc/modules-load.d/vfio-pci.conf
      mode: 0644
      overwrite: true
      contents:
        inline: vfio-pci
Create a Butane file for the master nodes (if they are schedulable too)
variant: openshift
version: 4.11.0
metadata:
  name: nvidia-iommu-vfio-master-r730
  labels:
    machineconfiguration.openshift.io/role: master
openshift:
  kernel_arguments:
    - intel_iommu=on
storage:
  files:
    - path: /etc/modprobe.d/vfio.conf
      mode: 0644
      overwrite: true
      contents:
        inline: |
          options vfio-pci ids=10de:1bb3,10de:1b82,10de:10f0
    - path: /etc/modules-load.d/vfio-pci.conf
      mode: 0644
      overwrite: true
      contents:
        inline: vfio-pci
Build the MachineConfigs
[dave@lenovo]$ butane nvidia-iommu-vfio-master-r730.bu -o nvidia-iommu-vfio-master-r730.yaml
[dave@lenovo]$ cat nvidia-iommu-vfio-master-r730.yaml
# Generated by Butane; do not edit
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: master
  name: nvidia-iommu-vfio-master-r730
spec:
  config:
    ignition:
      version: 3.2.0
    storage:
      files:
        - contents:
            compression: ""
            source: data:,options%20vfio-pci%20ids%3D10de%3A1bb3%2C10de%3A1b82%0A
          mode: 420
          overwrite: true
          path: /etc/modprobe.d/vfio.conf
        - contents:
            compression: ""
            source: data:,vfio-pci
          mode: 420
          overwrite: true
          path: /etc/modules-load.d/vfio-pci.conf
  kernelArguments:
    - intel_iommu=on
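Note how Butane percent-encodes each inline file into an Ignition `data:` URL in the `source:` field. As an illustration of that encoding (not a required step), the round trip can be checked with Python's standard `urllib`:

```python
from urllib.parse import quote, unquote

# `source:` value from the generated MachineConfig above.
source = "data:,options%20vfio-pci%20ids%3D10de%3A1bb3%2C10de%3A1b82%0A"

# Decoding recovers the inline file contents Butane started from...
options = unquote(source[len("data:,"):])
print(repr(options))  # 'options vfio-pci ids=10de:1bb3,10de:1b82\n'

# ...and re-encoding (everything except unreserved characters) round-trips.
assert "data:," + quote(options, safe="") == source
```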
[dave@lenovo worker]$ butane nvidia-iommu-vfio-worker-trt2.bu -o nvidia-iommu-vfio-worker-trt2.yaml
[dave@lenovo worker]$ cat nvidia-iommu-vfio-worker-trt2.yaml
# Generated by Butane; do not edit
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: worker
  name: nvidia-iommu-vfio-worker-trt2
spec:
  config:
    ignition:
      version: 3.2.0
    storage:
      files:
        - contents:
            compression: ""
            source: data:,options%20vfio-pci%20ids%3D10de%3A1b82%2C10de%3A1b80%2C10de%3A2235%0A
          mode: 420
          overwrite: true
          path: /etc/modprobe.d/vfio.conf
        - contents:
            compression: ""
            source: data:,vfio-pci
          mode: 420
          overwrite: true
          path: /etc/modules-load.d/vfio-pci.conf
  kernelArguments:
    - intel_iommu=on
Apply the MachineConfigs (This will reboot hosts)
[dave@lenovo worker]$ oc create -f nvidia-iommu-vfio-worker-trt2.yaml
machineconfig.machineconfiguration.openshift.io/nvidia-iommu-vfio-worker-trt2 created
[dave@lenovo worker]$ oc create -f nvidia-iommu-vfio-master-r730.yaml
machineconfig.machineconfiguration.openshift.io/nvidia-iommu-vfio-master-r730 created
Once the hosts reboot, verify IOMMU and VFIO
Note: Disregard the HD Audio device entries. If the kernel driver in use is listed as nvidia, go back and check the PCI device IDs: the NVIDIA Operator is installed and the host driver is controlling the device, which will not allow passthrough to work.
[core@r730ocp4 ~]$ lspci -nnk -d 10de:
82:00.0 VGA compatible controller [0300]: NVIDIA Corporation GP104 [GeForce GTX 1070 Ti] [10de:1b82] (rev a1)
Subsystem: eVga.com. Corp. Device [3842:5671]
Kernel driver in use: vfio-pci
Kernel modules: nouveau
[core@r730ocp4 ~]$ dmesg | grep -i -e DMAR -e NVIDIA
[ 0.000000] DMAR: IOMMU enabled
[ 0.001004] DMAR: Host address width 46
[ 0.002001] DMAR: DRHD base: 0x000000fbffc000 flags: 0x0
[ 0.003005] DMAR: dmar0: reg_base_addr fbffc000 ver 1:0 cap d2078c106f0466 ecap f020df
[ 0.004001] DMAR: DRHD base: 0x000000c7ffc000 flags: 0x1
[ 0.005004] DMAR: dmar1: reg_base_addr c7ffc000 ver 1:0 cap d2078c106f0466 ecap f020df
[ 0.006001] DMAR: ATSR flags: 0x0
[ 0.007002] DMAR: ATSR flags: 0x0
[ 0.008003] DMAR-IR: IOAPIC id 10 under DRHD base 0xfbffc000 IOMMU 0
[ 0.009001] DMAR-IR: IOAPIC id 8 under DRHD base 0xc7ffc000 IOMMU 1
[ 0.010001] DMAR-IR: IOAPIC id 9 under DRHD base 0xc7ffc000 IOMMU 1
[ 0.011001] DMAR-IR: HPET id 0 under DRHD base 0xc7ffc000
[ 0.012001] DMAR-IR: x2apic is disabled because BIOS sets x2apic opt out bit.
[ 0.012002] DMAR-IR: Use 'intremap=no_x2apic_optout' to override the BIOS setting.
[ 0.015122] DMAR-IR: Enabled IRQ remapping in xapic mode
[ 5.373467] DMAR: No RMRR found
[ 5.376976] DMAR: dmar0: Using Queued invalidation
[ 5.382327] DMAR: dmar1: Using Queued invalidation
[ 6.476324] DMAR: Intel(R) Virtualization Technology for Directed I/O
[ 30.146298] input: HDA NVidia HDMI/DP,pcm=3 as /devices/pci0000:80/0000:80:02.0/0000:82:00.1/sound/card0/input5
[ 30.146348] input: HDA NVidia HDMI/DP,pcm=7 as /devices/pci0000:80/0000:80:02.0/0000:82:00.1/sound/card0/input6
[ 30.146386] input: HDA NVidia HDMI/DP,pcm=8 as /devices/pci0000:80/0000:80:02.0/0000:82:00.1/sound/card0/input7
[ 30.146424] input: HDA NVidia HDMI/DP,pcm=9 as /devices/pci0000:80/0000:80:02.0/0000:82:00.1/sound/card0/input8
[ 30.146459] input: HDA NVidia HDMI/DP,pcm=10 as /devices/pci0000:80/0000:80:02.0/0000:82:00.1/sound/card0/input9
[ 30.146495] input: HDA NVidia HDMI/DP,pcm=11 as /devices/pci0000:80/0000:80:02.0/0000:82:00.1/sound/card0/input10
[ 30.146533] input: HDA NVidia HDMI/DP,pcm=12 as /devices/pci0000:80/0000:80:02.0/0000:82:00.1/sound/card0/input11
Update the HyperConverged custom resource so that all GPU/vGPU devices in your cluster are permitted and can be assigned to OpenShift Virtualization VMs
Note: pciDeviceSelector does not match some docs, which call it pciVendorSelector. Also, you can name the devices whatever you'd like, since assignment is done by the pciDeviceSelector. Additionally, it helps if all of your nodes are populated with cards in the same slots. Mine are not, and that is why the list of device selectors is so long: it includes card device selectors from both my R730 and TRT2 servers.
apiVersion: hco.kubevirt.io/v1beta1
kind: HyperConverged
metadata:
  name: kubevirt-hyperconverged
  namespace: openshift-cnv
spec:
  permittedHostDevices:
    pciHostDevices:
      - pciDeviceSelector: "10de:1bb3"
        resourceName: "nvidia.com/GP104GL_Tesla_P4"
      - pciDeviceSelector: "10de:1b82"
        resourceName: "nvidia.com/GP104_GeForce_GTX_1070_Ti"
      - pciDeviceSelector: "10de:1b80"
        resourceName: "nvidia.com/GP104_GeForce_GTX_1080"
      - pciDeviceSelector: "10de:2235"
        resourceName: "nvidia.com/GA102GL_A40"
Apply the edit to kubevirt-hyperconverged
[dave@lenovo]$ oc edit hyperconverged kubevirt-hyperconverged -n openshift-cnv
hyperconverged.hco.kubevirt.io/kubevirt-hyperconverged edited
Note: Apply via YAML or the OCP GUI; if using the GUI, make sure to click the little check mark!
spec:
  template:
    spec:
      domain:
        devices:
          hostDevices:
            - deviceName: nvidia.com/GP104GL_Tesla_P4
              name: gpu1
Note: Drivers must now be loaded inside the VM:
VM is not Schedulable
Note: You may encounter the condition below if the NVIDIA Operator is still trying to claim the GPU, or if the device name was created with different letter case.
status:
  conditions:
    - lastProbeTime: null
      lastTransitionTime: '2022-11-22T18:05:26Z'
      message: >-
        0/5 nodes are available: 4 node(s) didn't match Pod's node
        affinity/selector, 5 Insufficient nvidia.com/GP104GL_Tesla_P4.
        preemption: 0/5 nodes are available: 1 No preemption victims found for
        incoming pod, 4 Preemption is not helpful for scheduling.
      reason: Unschedulable
      status: 'False'
      type: PodScheduled
The node has no allocatable devices (notice the device name case):
[dave@lenovo worker]$ oc describe node r730ocp3.localdomain
Name: r730ocp3.localdomain
Roles: master,worker
Capacity:
  cpu: 48
  devices.kubevirt.io/kvm: 1k
  devices.kubevirt.io/sev: 1k
  devices.kubevirt.io/tun: 1k
  devices.kubevirt.io/vhost-net: 1k
  ephemeral-storage: 975688684Ki
  hugepages-1Gi: 0
  hugepages-2Mi: 0
  k8s.kuartis.com/vgpu: 0
  memory: 197807548Ki
  nvidia.com/GP104GL_TESLA_P4: 0
  nvidia.com/GP104_GEFORCE_GTX_1070_TI: 0
  nvidia.com/gpu: 2
  pods: 250
Allocatable:
  cpu: 47500m
  devices.kubevirt.io/kvm: 1k
  devices.kubevirt.io/sev: 0
  devices.kubevirt.io/tun: 1k
  devices.kubevirt.io/vhost-net: 1k
  ephemeral-storage: 899194689686
  hugepages-1Gi: 0
  hugepages-2Mi: 0
  k8s.kuartis.com/vgpu: 0
  memory: 196656572Ki
  nvidia.com/GP104GL_TESLA_P4: 0
  nvidia.com/GP104_GEFORCE_GTX_1070_TI: 0
  nvidia.com/gpu: 0
  pods: 250
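The mismatch is easy to miss because the scheduler compares extended resource names as exact, case-sensitive strings. A toy illustration of why the Pod above stays unschedulable:

```python
def schedulable(allocatable: dict, resource: str, want: int = 1) -> bool:
    """Extended resource names are opaque, case-sensitive keys: a request is
    satisfiable only if the exact name has enough allocatable capacity."""
    return allocatable.get(resource, 0) >= want

# Allocatable resources from the `oc describe node` output above.
node = {
    "nvidia.com/GP104GL_TESLA_P4": 0,            # stale all-caps name
    "nvidia.com/GP104_GEFORCE_GTX_1070_TI": 0,
    "nvidia.com/gpu": 0,
}

# The VM requests the mixed-case name from the HyperConverged CR.
print(schedulable(node, "nvidia.com/GP104GL_Tesla_P4"))  # False -> Unschedulable
```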
After removing the NVIDIA Operator and rebooting the node:
Capacity:
  cpu: 48
  devices.kubevirt.io/kvm: 1k
  devices.kubevirt.io/sev: 1k
  devices.kubevirt.io/tun: 1k
  devices.kubevirt.io/vhost-net: 1k
  ephemeral-storage: 975688684Ki
  hugepages-1Gi: 0
  hugepages-2Mi: 0
  k8s.kuartis.com/vgpu: 0
  memory: 197807548Ki
  nvidia.com/GP104GL_TESLA_P4: 0
  nvidia.com/GP104GL_Tesla_P4: 1
  nvidia.com/GP104_GEFORCE_GTX_1070_TI: 0
  nvidia.com/GP104_GeForce_GTX_1070_Ti: 1
  nvidia.com/gpu: 0
  pods: 250
Allocatable:
  cpu: 47500m
  devices.kubevirt.io/kvm: 1k
  devices.kubevirt.io/sev: 0
  devices.kubevirt.io/tun: 1k
  devices.kubevirt.io/vhost-net: 1k
  ephemeral-storage: 899194689686
  hugepages-1Gi: 0
  hugepages-2Mi: 0
  k8s.kuartis.com/vgpu: 0
  memory: 196656572Ki
  nvidia.com/GP104GL_TESLA_P4: 0
  nvidia.com/GP104GL_Tesla_P4: 1
  nvidia.com/GP104_GEFORCE_GTX_1070_TI: 0
  nvidia.com/GP104_GeForce_GTX_1070_Ti: 1
  nvidia.com/gpu: 0
  pods: 250