Proxmox 7.2-7 + NVIDIA vGPU with the GRID 14.2 driver + RTX A5000

We have an RTX A5000 board.

I made the following configuration to get the nvidia-smi command working on the host:

!!! Disclaimer !!!

The information below is not my own; I found all the commands ready-made on the internet. Read all the steps carefully before executing anything: I researched each command beforehand to understand what it does, checked whether my hardware and current configuration are supported, and made a backup first.

This tutorial is incomplete. I only got the GPU configuration working on the host; the mdevctl types command that displays the profiles returns nothing, although according to other tutorials I found it should. That's all I have so far.

Packages

Make sure to add the community PVE repository and remove the enterprise repository (you can skip this step if you have a valid enterprise subscription):

echo "deb http://download.proxmox.com/debian/pve bullseye pve-no-subscription" >> /etc/apt/sources.list
rm /etc/apt/sources.list.d/pve-enterprise.list

Update and upgrade

apt update
apt dist-upgrade

Install dependencies

apt update && apt upgrade -y
apt install -y build-essential pve-headers-`uname -r` dkms jq cargo mdevctl unzip uuid

Configure PCI-Passthrough + IOMMU

https://pve.proxmox.com/wiki/Pci_passthrough

nano /etc/default/grub

For Intel CPUs, edit this line:

GRUB_CMDLINE_LINUX_DEFAULT="quiet intel_iommu=on iommu=pt"

For AMD CPUs, edit this line:

GRUB_CMDLINE_LINUX_DEFAULT="quiet amd_iommu=on iommu=pt"

Save the file and update GRUB:

update-grub
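
To verify that the parameters were actually applied after the reboot later in this tutorial, check the running kernel command line:

cat /proc/cmdline

The intel_iommu (or amd_iommu) and iommu=pt flags should appear in the output.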

systemd-boot

The kernel parameters have to be appended to the command line in the file /etc/kernel/cmdline, so open it in your favorite editor:

nano /etc/kernel/cmdline

On a clean installation the file might look similar to this:

root=ZFS=rpool/ROOT/pve-1 boot=zfs

On Intel systems, append this at the end

intel_iommu=on iommu=pt

For AMD, use this

amd_iommu=on iommu=pt

After editing the file, it should look similar to this

root=ZFS=rpool/ROOT/pve-1 boot=zfs intel_iommu=on iommu=pt

Now, save and exit from the editor using Ctrl+O and then Ctrl+X and then apply your changes:

proxmox-boot-tool refresh
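
If you are not sure whether your host boots via GRUB or systemd-boot, proxmox-boot-tool can tell you (on a typical ZFS install the ESPs are configured for systemd-boot; otherwise follow the GRUB instructions above):

proxmox-boot-tool status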

Load VFIO modules at boot

nano /etc/modules-load.d/modules.conf

Insert these lines

vfio
vfio_iommu_type1
vfio_pci
vfio_virqfd

Create a couple of files in modprobe.d

echo "options vfio_iommu_type1 allow_unsafe_interrupts=1" > /etc/modprobe.d/iommu_unsafe_interrupts.conf
echo "options kvm ignore_msrs=1" > /etc/modprobe.d/kvm.conf
echo "blacklist nouveau" >> /etc/modprobe.d/blacklist.conf

Update initramfs

update-initramfs -u -k all

Reboot Proxmox

reboot

And verify that IOMMU is enabled

dmesg | grep -e DMAR -e IOMMU

Example output

[    1.121863] pci 0000:c0:00.2: AMD-Vi: IOMMU performance counters supported
[    1.121888] pci 0000:80:00.2: AMD-Vi: IOMMU performance counters supported
[    1.121906] pci 0000:40:00.2: AMD-Vi: IOMMU performance counters supported
[    1.121927] pci 0000:00:00.2: AMD-Vi: IOMMU performance counters supported
[    1.148549] pci 0000:c0:00.2: AMD-Vi: Found IOMMU cap 0x40
[    1.148566] pci 0000:80:00.2: AMD-Vi: Found IOMMU cap 0x40
[    1.148575] pci 0000:40:00.2: AMD-Vi: Found IOMMU cap 0x40
[    1.148582] pci 0000:00:00.2: AMD-Vi: Found IOMMU cap 0x40
[    1.150154] perf/amd_iommu: Detected AMD IOMMU #0 (2 banks, 4 counters/bank).
[    1.150162] perf/amd_iommu: Detected AMD IOMMU #1 (2 banks, 4 counters/bank).
[    1.150170] perf/amd_iommu: Detected AMD IOMMU #2 (2 banks, 4 counters/bank).
[    1.150180] perf/amd_iommu: Detected AMD IOMMU #3 (2 banks, 4 counters/bank).
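
You can also confirm that the VFIO modules were loaded at boot:

lsmod | grep vfio

All four modules from /etc/modules-load.d/modules.conf should appear in the output.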

Nvidia Driver

As of the time of writing (August 2022), the latest available GRID driver is 14.2, with vGPU driver version 510.85.03. You can check for the latest version here. I cannot guarantee that newer versions will work; this tutorial only covers 14.2 (510.85.03).

The file you are looking for is called NVIDIA-GRID-Linux-KVM-510.85.03-510.85.02-513.46.zip, you can get it from the download portal by downloading version 14.2 for Linux KVM.

After downloading, extract the archive and copy the file NVIDIA-Linux-x86_64-510.85.03-vgpu-kvm.run into the /root/ folder on your Proxmox host, then make it executable:

chmod +x ./NVIDIA-Linux-x86_64-510.85.03-vgpu-kvm.run

Installing the driver

./NVIDIA-Linux-x86_64-510.85.03-vgpu-kvm.run --dkms

The installer will ask "Would you like to register the kernel module sources with DKMS? This will allow DKMS to automatically build a new module, if you install a different kernel later." Answer Yes.

Installation of the NVIDIA Accelerated Graphics Driver for Linux-x86_64 (version: 510.85.03) is now complete.

Click Ok to exit the installer.
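
Before rebooting, you can optionally confirm that DKMS registered the module (assuming you answered Yes above):

dkms status

It should list an nvidia module built for your running kernel.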

To finish the installation, reboot.

reboot

Finishing touches

Wait for your server to reboot, then type this into the shell to check if the driver install worked

nvidia-smi

You should get an output similar to this one

Sat Aug 13 12:40:00 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.85.03    Driver Version: 510.85.03    CUDA Version: N/A      |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA RTX A5000    On   | 00000000:41:00.0 Off |                  Off |
| 30%   33C    P8    24W / 230W |      0MiB / 24564MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

To test the nvidia-smi vgpu command:

nvidia-smi vgpu

Example output

Sat Aug 13 12:38:01 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.85.03              Driver Version: 510.85.03                 |
|---------------------------------+------------------------------+------------+
| GPU  Name                       | Bus-Id                       | GPU-Util   |
|      vGPU ID     Name           | VM ID     VM Name            | vGPU-Util  |
|=================================+==============================+============|
|   0  NVIDIA RTX A5000           | 00000000:41:00.0             |   0%       |
+---------------------------------+------------------------------+------------+

List the supported vGPU types:

nvidia-smi vgpu -s

Example output:

GPU 00000000:41:00.0
    NVIDIA RTXA5000-1B
    NVIDIA RTXA5000-2B
    NVIDIA RTXA5000-1Q
    NVIDIA RTXA5000-2Q
    NVIDIA RTXA5000-3Q
    NVIDIA RTXA5000-4Q
    NVIDIA RTXA5000-6Q
    NVIDIA RTXA5000-8Q
    NVIDIA RTXA5000-12Q
    NVIDIA RTXA5000-24Q
    NVIDIA RTXA5000-1A
    NVIDIA RTXA5000-2A
    NVIDIA RTXA5000-3A
    NVIDIA RTXA5000-4A
    NVIDIA RTXA5000-6A
    NVIDIA RTXA5000-8A
    NVIDIA RTXA5000-12A
    NVIDIA RTXA5000-24A
    NVIDIA RTXA5000-4C
    NVIDIA RTXA5000-6C
    NVIDIA RTXA5000-8C
    NVIDIA RTXA5000-12C
    NVIDIA RTXA5000-24C

Here you see the many different vGPU types your GPU offers, split into 4 distinct classes:

Type  Intended purpose
A     Virtual Applications (vApps)
B     Virtual Desktops (vPC)
C     AI/Machine Learning/Training (vCS or vWS)
Q     Virtual Workstations (vWS)

The type Q profile is most likely the one you want, since it makes it possible to fully utilize the GPU over a remote desktop (e.g. Parsec). The next step is selecting the right Q profile for your GPU, which depends primarily on the amount of VRAM it offers. My RTX A5000 has 24GB of VRAM and I want to create 6 vGPUs, so I choose the profile with 4GB of VRAM (24GB / 6 vGPUs = 4GB): RTXA5000-4Q.
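
The same arithmetic for a few other splits, as a quick sanity check (the result has to match one of the profile sizes listed above):

echo $(( 24 / 6 ))    # 4  -> RTXA5000-4Q, 6 vGPUs
echo $(( 24 / 4 ))    # 6  -> RTXA5000-6Q, 4 vGPUs
echo $(( 24 / 2 ))    # 12 -> RTXA5000-12Q, 2 vGPUs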

GRID vGPU User Guide

The nvidia-smi command works normally now.

Now I need help to proceed with creating the profiles and getting the Proxmox host to find the vGPU.
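
One possible explanation for the missing profiles (an assumption on my part, based on NVIDIA's vGPU documentation, not something I have verified on this host): the RTX A5000 is an Ampere board, and on GPUs that support SR-IOV the virtual functions must be enabled before any mdev types appear. The vGPU driver ships a helper script for this:

/usr/lib/nvidia/sriov-manage -e ALL

After running it, mdevctl types should hopefully list the same profiles that nvidia-smi vgpu -s shows above.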

Create vGPU Profiles with UUID

#TODO

...
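
Until then, here is a rough sketch of how profile creation generally works with mdevctl, assuming the types do show up (the type ID nvidia-558 below is a placeholder, not a verified value; take the real one from the mdevctl types output, and the PCI address is the one nvidia-smi reports):

# Generate a UUID for the new vGPU instance (the uuid tool was installed
# with the dependencies earlier)
UUID=$(uuid)

# Start an instance of the chosen profile on the GPU; the type ID is a placeholder
mdevctl start -u "$UUID" -p 0000:41:00.0 -t nvidia-558

# Persist the instance so it is recreated automatically on boot
mdevctl define --auto -u "$UUID"

# Verify that the device exists
mdevctl list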

Adding a vGPU to a Proxmox VM

There is only one thing you have to do from the command line: open the VM config file and give the VM a UUID. For that you need your VM ID; in this example I'm using 100.

nano /etc/pve/qemu-server/<VM-ID>.conf

So with the VM ID 100, I have to do this:

nano /etc/pve/qemu-server/100.conf

In that file, you have to add a new line at the end:

args: -uuid 00000000-0000-0000-0000-00000000XXXX

Replace XXXX with your VM ID, left-padded with zeros so the last UUID group keeps 12 digits. With my VM ID 100 I have to use this line:

args: -uuid 00000000-0000-0000-0000-000000000100

Save and exit from the editor. That's all you have to do from the terminal.
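
If you want to generate that line for an arbitrary VM ID instead of padding by hand, a small printf one-liner does it (just a convenience, not required by the tutorial):

VMID=100
printf 'args: -uuid 00000000-0000-0000-0000-%012d\n' "$VMID"

This prints exactly the args line shown above for VM ID 100.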
