Consumer-grade GPU passthrough in an OpenStack system (NVIDIA GPUs)
Assumptions

This guide assumes you have GTX 980 cards in your system (PCI IDs 10de:13c0 and 10de:0fbb per card); just add more IDs for other cards to make it more generic. It also assumes nova uses qemu-kvm (qemu-system-x86_64) as the virtualization hypervisor, which appears to be the default on OpenStack Newton when installed using openstack-ansible.

We assume OpenStack Newton is pre-installed and that we are working on a Nova compute node. This has been tested on an Ubuntu 16.04 system where I installed OpenStack AIO version 14.0.0 (different from the git tag used in the instructions!): http://docs.openstack.org/developer/openstack-ansible/developer-docs/quickstart-aio.html

Prepare the system for GPU passthrough (set up IOMMU/vfio/...)

Note: This is heavily based on information from https://wiki.archlinux.org/index.php/PCI_passthrough_via_OVMF#Enabling_IOMMU adapted for Ubuntu 16.04

  1. Ensure SR-IOV and VT-d are enabled in your system BIOS.

  2. Add intel_iommu=on to the kernel command line (in /etc/default/grub).

  3. run

    $ update-grub
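As a sketch, the relevant line in /etc/default/grub would then look something like this (illustrative only; merge intel_iommu=on into whatever options your system already sets):

```shell
# /etc/default/grub (illustrative; keep your distribution's existing options)
GRUB_CMDLINE_LINUX_DEFAULT="quiet splash intel_iommu=on"
```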
    
  4. Blacklist snd_hda_intel (which might grab the audio portion of the GPU on the host), and blacklist all other potential GPU host drivers while we are at it; nouveau in particular must not bind to the cards. Edit /etc/modprobe.d/blacklist.conf:

    blacklist snd_hda_intel
    blacklist amd76x_edac
    blacklist vga16fb
    blacklist nouveau
    blacklist rivafb
    blacklist nvidiafb
    blacklist rivatv
    
  5. Make the vfio-pci module hold on to the devices we might want to pass through (plus any devices in the same IOMMU group). This mostly means each GPU and its audio device (even though we don't pass through the audio device). In this case PCI ID 10de:13c0 is the main GPU and 10de:0fbb is its HDMI audio interface. Create /etc/modprobe.d/vfio.conf:

    # (GTX980 and its audio controller)
    options vfio-pci ids=10de:13c0,10de:0fbb
    

    Note: you can find all NVIDIA cards with their vendor:device IDs in your system using something like this:

    $ lspci -nn | grep NVIDIA
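As an aside, the vendor:device IDs can also be pulled out of saved lspci -nn output programmatically. The following is purely an illustrative sketch (the SAMPLE text and the nvidia_pci_ids helper are made up for the example, not part of the guide's commands):

```python
# Illustrative sketch: extract [vendor:device] IDs of NVIDIA devices
# from saved `lspci -nn` output. SAMPLE and nvidia_pci_ids are made
# up for this example.
import re

SAMPLE = """\
05:00.0 VGA compatible controller [0300]: NVIDIA Corporation GM204 [GeForce GTX 980] [10de:13c0] (rev a1)
05:00.1 Audio device [0403]: NVIDIA Corporation GM204 High Definition Audio Controller [10de:0fbb] (rev a1)
"""

def nvidia_pci_ids(lspci_output):
    """Return the unique vendor:device IDs of NVIDIA devices, in order."""
    ids = []
    for line in lspci_output.splitlines():
        if "NVIDIA" not in line:
            continue
        # the class code (e.g. [0300]) has no colon, so this only matches IDs
        m = re.search(r"\[([0-9a-f]{4}:[0-9a-f]{4})\]", line)
        if m and m.group(1) not in ids:
            ids.append(m.group(1))
    return ids

# joined with commas, the result is suitable for the vfio-pci ids= option
print(",".join(nvidia_pci_ids(SAMPLE)))  # -> 10de:13c0,10de:0fbb
```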
    
  6. Make sure vfio-pci gets loaded as early as possible by editing /etc/modules-load.d/modules.conf and adding vfio-pci to the list.

  7. Update the initrd to apply these changes at boot by running

    $ update-initramfs -u
    
  8. Reboot the system in order to activate the intel_iommu=on kernel option.

  9. Now make sure the GPUs and their audio interfaces are "in use" by vfio-pci and not by any other module. The output should look something like this:

    root@stack:~# lspci -nnk -d 10de:13c0
    05:00.0 VGA compatible controller [0300]: NVIDIA Corporation GM204 [GeForce GTX 980] [10de:13c0] (rev a1)
      Subsystem: eVga.com. Corp. GM204 [GeForce GTX 980] [3842:2980]
      Kernel driver in use: vfio-pci
      Kernel modules: nvidiafb, nouveau
    84:00.0 VGA compatible controller [0300]: NVIDIA Corporation GM204 [GeForce GTX 980] [10de:13c0] (rev a1)
      Subsystem: eVga.com. Corp. GM204 [GeForce GTX 980] [3842:2980]
      Kernel driver in use: vfio-pci
      Kernel modules: nvidiafb, nouveau
    root@stack:~# lspci -nnk -d 10de:0fbb
    05:00.1 Audio device [0403]: NVIDIA Corporation GM204 High Definition Audio Controller [10de:0fbb] (rev a1)
      Subsystem: eVga.com. Corp. GM204 High Definition Audio Controller [3842:2980]
      Kernel driver in use: vfio-pci
      Kernel modules: snd_hda_intel
    84:00.1 Audio device [0403]: NVIDIA Corporation GM204 High Definition Audio Controller [10de:0fbb] (rev a1)
      Subsystem: eVga.com. Corp. GM204 High Definition Audio Controller [3842:2980]
      Kernel driver in use: vfio-pci
      Kernel modules: snd_hda_intel
    
  10. Your system should now be ready for PCI passthrough of its GPUs.

Configure Nova on the compute node and controller

Note: This is based on information from http://docs.openstack.org/admin-guide/compute-pci-passthrough.html

  1. Add the following to /etc/nova/nova.conf on the controller, api and compute hosts (create more aliases for other GPU models as needed):

    [DEFAULT]
    ...
    pci_alias = { "vendor_id":"10de", "product_id":"13c0", "device_type":"type-PCI", "name":"gtx980" }
    pci_passthrough_whitelist = { "vendor_id": "10de", "product_id": "13c0" }
    ...
    

    In the same file, append ,PciPassthroughFilter to the scheduler_default_filters option:

    # add this to scheduler_default_filters in /etc/nova/nova.conf
    scheduler_default_filters = ..... ,PciPassthroughFilter
    
  2. Restart nova-compute, nova-api and nova-scheduler, depending on the node:

    $ systemctl restart nova-api
    $ systemctl restart nova-scheduler
    $ systemctl restart nova-compute
    
  3. Then configure a flavor as usual and finally add the GPU requirement to it (1x gtx980 in this example):

    $ openstack flavor set m1.large.1gtx980 --property "pci_passthrough:alias"="gtx980:1"
    

    In this example gtx980 is the alias name chosen above and :1 means the flavor requests one of this resource. So to make it a 2-GPU flavor it would be gtx980:2.
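The name:count syntax of the property value can be illustrated with a tiny parser (a sketch only; parse_alias_spec is made up for this example and is not nova code):

```python
# Illustrative sketch of the "name:count" format used in the
# pci_passthrough:alias property (e.g. "gtx980:2" = two GTX980s).
# parse_alias_spec is made up for this example; it is NOT nova's parser.

def parse_alias_spec(spec):
    """Split an alias spec into (alias_name, requested_count)."""
    name, _, count = spec.partition(":")
    return name, int(count) if count else 1

print(parse_alias_spec("gtx980:1"))  # -> ('gtx980', 1)
print(parse_alias_spec("gtx980:2"))  # -> ('gtx980', 2)
```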

  4. Now GPU passthrough should work. There is one last step to make NVIDIA consumer-grade GPUs usable in VMs. Apparently the NVIDIA driver checks whether it runs inside a VM and refuses to start if it does. This seems to be a "bug" that NVIDIA probably does not intend to fix. In any case, KVM (here through qemu-kvm) can be configured to hide the fact that the VM is running in KVM. I do not think this can be changed directly in OpenStack/libvirtd, but one way of injecting the correct options is to install a wrapper script around qemu:

    1. Rename /usr/bin/qemu-system-x86_64 to /usr/bin/qemu-system-x86_64.orig and deploy this wrapper as /usr/bin/qemu-system-x86_64 on the nova compute host.

      #!/usr/bin/python
      
      import os
      import sys
      
      new_args = []
      
      # only change the "-cpu" options (inject kvm=off and hv_vendor_id=MyFake_KVM)
      for i in range(len(sys.argv)):
          if i<=1: 
              new_args.append(sys.argv[i])
              continue
          if sys.argv[i-1] != "-cpu":
              new_args.append(sys.argv[i])
              continue
      
          subargs = sys.argv[i].split(",")
      
          subargs.insert(1,"kvm=off")
          subargs.insert(2,"hv_vendor_id=MyFake_KVM")
      
          new_arg = ",".join(subargs)
      
          new_args.append(new_arg)
      
      os.execv('/usr/bin/qemu-system-x86_64.orig', new_args)
    2. Add /usr/bin/qemu-system-x86_64.orig to /etc/apparmor.d/abstractions/libvirt-qemu as

      /usr/bin/qemu-system-x86_64.orig rmix,
      

      and reload apparmor

      $ systemctl reload apparmor
      

This should be it. You should now be able to create GPU instances in your OpenStack cluster.
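As a closing aside, the -cpu rewrite performed by the wrapper script above can be sanity-checked in isolation. This sketch re-implements the same loop as a function (the function name and the sample argv are made up for the check; the real wrapper applies the same transformation to sys.argv and then execs the original binary):

```python
# Standalone sanity check of the wrapper's -cpu rewrite logic.
# rewrite_cpu_args and the sample argv are made up for this check.

def rewrite_cpu_args(argv):
    """Inject kvm=off and a fake hv_vendor_id into the value that
    follows each "-cpu" flag; leave every other argument untouched."""
    new_args = []
    for i, arg in enumerate(argv):
        if i >= 2 and argv[i - 1] == "-cpu":
            subargs = arg.split(",")
            subargs.insert(1, "kvm=off")
            subargs.insert(2, "hv_vendor_id=MyFake_KVM")
            arg = ",".join(subargs)
        new_args.append(arg)
    return new_args

print(rewrite_cpu_args(["qemu-system-x86_64", "-cpu", "host,topoext=on", "-m", "8192"]))
# -> ['qemu-system-x86_64', '-cpu', 'host,kvm=off,hv_vendor_id=MyFake_KVM,topoext=on', '-m', '8192']
```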

vaskokj commented Jan 17, 2017

My sys.argv does not have the -cpu flag. If I manually add it, I get "qemu-system-x86_64: Unable to find CPU definition: kvm=off".

If I print out sys.argv, I see these parameters being passed to my python script:

"-S -no-user-config -nodefaults -nographic -M none -qmp unix:/var/lib/libvirt/qemu/capabilities.monitor.sock,server,nowait -pidfile /var/lib/libvirt/qemu/capabilties.pidfile -daemonize"

I tried new_args.append('-cpu kvm=off') with no luck.

ryanmickler commented Jul 28, 2017

OpenStack no longer passes the -cpu arg directly to qemu-system-x86_64, so you'll need to manually edit the libvirt XML with virsh and then reboot the instance. This is the only way I could get it to work.

ryanmickler commented Jul 28, 2017

also, i think you mean
/usr/bin/qemu-system-x86_64.orig rmix
as the line you need to add

olivier-dj commented Feb 2, 2018

Since the Pike release, for the OpenStack config part, you can simply set the property img_hide_hypervisor_id=true on your images created with glance (or as a common image property). The nova.conf has to be filled in a bit differently, and for the scheduler it seems the doc isn't up to date regarding some warnings from the nova-conductor service.
I made CUDA work for my GTX 1080 Ti cards on guests this way.
Also, some troubleshooting for the passthrough: disabling the framebuffer may help (I encountered this problem with AMD cards and disabled it before trying NVIDIA). Check whether it is enabled with ls -l /dev/fb*, and check the grub options to disable it (nofb, for example).

frippe75 commented Mar 2, 2018

This document was so spot on, thanks! Used it for OpenStack/Newton on CentOS 7.4 (GTX 1050 Ti, el cheapo).
There were only a few expected differences in file locations, but now... nvidia-smi finally returns info. Thanks!

hebimg commented May 7, 2018

Hi, regarding the last step needed to make NVIDIA consumer-grade GPUs usable in VMs: I edited /usr/bin/qemu-system-x86_64 on the nova compute host as follows.

    #!/usr/bin/python

    import os
    import sys

    new_args = []

    # only change the "-cpu" options (inject kvm=off and hv_vendor_id=MyFake_KVM)
    for i in range(len(sys.argv)):
        if i <= 1:
            new_args.append(sys.argv[i])
            continue
        if sys.argv[i-1] != "-cpu":
            new_args.append(sys.argv[i])
            continue

        subargs = sys.argv[i].split(",")

        subargs.insert(1, "kvm=off")
        subargs.insert(2, "hv_vendor_id=MyFake_KVM")

        new_arg = ",".join(subargs)

        new_args.append(new_arg)

    os.execv('/usr/bin/qemu-system-x86_64.orig', new_args)

But I found that /usr/bin/qemu-system-x86_64 cannot be run by OpenStack. How can I fix this? Or is there a mistake in my configuration?

mgariepy commented May 9, 2018

Just to let you know: if you are using OpenStack Pike or later, you can use the GPU directly after installation.

You only need to set this metadata on your image:
img_hide_hypervisor_id='true'

schmilmo commented May 28, 2018

Could someone tell me what changes are needed for RHEL/CentOS?
