@justindav1s
Last active May 14, 2018 20:41
Host a RHEL7.4 Guest VM on Ubuntu 18.04 with PCI passthrough for NVIDIA GPU for Deep Learning on nvidia-docker and Openshift
Links
https://medium.com/@calerogers/gpu-virtualization-with-kvm-qemu-63ca98a6a172
http://vfio.blogspot.co.uk/2015/05/vfio-gpu-how-to-series-part-1-hardware.html
#####HOST#######
Ubuntu 18.04 with functioning NVIDIA 1050 Ti.
nvidia-docker and GPU-based TensorFlow both work well on the host.
justin@justin-ubuntu:~$ lspci
00:00.0 Host bridge: Intel Corporation Intel Kaby Lake Host Bridge (rev 05)
00:01.0 PCI bridge: Intel Corporation Skylake PCIe Controller (x16) (rev 05)
00:08.0 System peripheral: Intel Corporation Skylake Gaussian Mixture Model
00:14.0 USB controller: Intel Corporation 200 Series PCH USB 3.0 xHCI Controller
00:14.2 Signal processing controller: Intel Corporation 200 Series PCH Thermal Subsystem
00:16.0 Communication controller: Intel Corporation 200 Series PCH CSME HECI #1
00:17.0 SATA controller: Intel Corporation 200 Series PCH SATA controller [AHCI mode]
00:1c.0 PCI bridge: Intel Corporation 200 Series PCH PCI Express Root Port #1 (rev f0)
00:1c.4 PCI bridge: Intel Corporation 200 Series PCH PCI Express Root Port #5 (rev f0)
00:1c.6 PCI bridge: Intel Corporation 200 Series PCH PCI Express Root Port #7 (rev f0)
00:1d.0 PCI bridge: Intel Corporation 200 Series PCH PCI Express Root Port #9 (rev f0)
00:1f.0 ISA bridge: Intel Corporation 200 Series PCH LPC Controller (Z270)
00:1f.2 Memory controller: Intel Corporation 200 Series PCH PMC
00:1f.3 Audio device: Intel Corporation 200 Series PCH HD Audio
00:1f.4 SMBus: Intel Corporation 200 Series PCH SMBus Controller
00:1f.6 Ethernet controller: Intel Corporation Ethernet Connection (2) I219-V
01:00.0 VGA compatible controller: NVIDIA Corporation GP107 [GeForce GTX 1050 Ti] (rev a1)
01:00.1 Audio device: NVIDIA Corporation GP107GL High Definition Audio Controller (rev a1)
03:00.0 USB controller: ASMedia Technology Inc. Device 2142
04:00.0 Network controller: Realtek Semiconductor Co., Ltd. RTL8192EE PCIe Wireless Network Adapter
05:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller SM961/PM961
justin@justin-ubuntu:~$ nvidia-smi
Sat May 12 09:11:46 2018
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 390.48                 Driver Version: 390.48                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 105...  Off  | 00000000:01:00.0  On |                  N/A |
|  0%   38C    P8    N/A /  90W |   3814MiB /  4038MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      1726      G   /usr/lib/xorg/Xorg                            24MiB |
|    0      1821      G   /usr/bin/gnome-shell                          48MiB |
|    0      2939      G   /usr/lib/xorg/Xorg                           203MiB |
|    0      3083      G   /usr/bin/gnome-shell                         177MiB |
|    0      3568      G   ...-token=03EC8ADDFA5A61CA5607DDD3A8C603D3    63MiB |
|    0      4607      G   gnome-control-center                           1MiB |
|    0     23089      C   /usr/bin/python                             3281MiB |
+-----------------------------------------------------------------------------+
justin@justin-ubuntu:~$
root@justin-ubuntu:/etc/initramfs-tools# lspci -nnk | grep -i nvidia
01:00.0 VGA compatible controller [0300]: NVIDIA Corporation GP107 [GeForce GTX 1050 Ti] [10de:1c82] (rev a1)
Kernel driver in use: nvidia
Kernel modules: nvidiafb, nouveau, nvidia_drm, nvidia
01:00.1 Audio device [0403]: NVIDIA Corporation GP107GL High Definition Audio Controller [10de:0fb9] (rev a1)
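The bracketed vendor:device pairs in this output (10de:1c82 for the GPU, 10de:0fb9 for its audio function) are the IDs that vfio-pci is given later in these notes. A minimal sketch for pulling them out automatically; nvidia_pci_ids is a made-up helper name, not part of pciutils :

```shell
# Print the vendor:device IDs of all NVIDIA functions found in
# `lspci -nn` output read from stdin, joined with commas.
# (nvidia_pci_ids is a hypothetical helper name.)
nvidia_pci_ids() {
    grep -i 'nvidia' \
        | grep -oE '\[[0-9a-fA-F]{4}:[0-9a-fA-F]{4}\]' \
        | tr -d '[]' \
        | paste -sd, -
}
```

Usage on the host would be something like: echo "options vfio-pci ids=$(lspci -nn | nvidia_pci_ids)"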
CHANGES
root@justin-ubuntu:/etc/default# diff grub.backup grub
12c12
< GRUB_CMDLINE_LINUX=""
---
> GRUB_CMDLINE_LINUX="intel_iommu=on iommu=pt rd.driver.pre=vfio-pci"
root@justin-ubuntu:/etc/initramfs-tools# diff modules.backup modules
11a12,17
>
> # Added for PCI passthrough for NVidia card
> vfio
> vfio_iommu_type1
> vfio_pci
> vfio_virqfd
New File :
root@justin-ubuntu:/etc/modprobe.d# cat /etc/modprobe.d/local.conf
options vfio-pci ids=10de:1c82,10de:0fb9
options vfio-pci disable_vga=1
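Passthrough relies on the IOMMU that intel_iommu=on enables in the GRUB change. After rebooting with the new kernel command line, it can be worth checking which IOMMU group the GPU and its audio function landed in, since the devices in a group are generally handed to the guest together. A rough sketch, assuming the standard sysfs layout under /sys/kernel/iommu_groups; list_iommu_groups and its optional root argument are hypothetical names introduced here :

```shell
# List every PCI device together with its IOMMU group number.
# $1: directory to scan (hypothetical parameter, defaults to the sysfs path)
list_iommu_groups() {
    root=${1:-/sys/kernel/iommu_groups}
    for dev in "$root"/*/devices/*; do
        [ -e "$dev" ] || continue        # nothing matched: IOMMU is probably off
        group=${dev%/devices/*}          # strip "/devices/<pci address>"
        echo "group ${group##*/}: ${dev##*/}"
    done
}
```

If list_iommu_groups prints nothing, the IOMMU is most likely not enabled in the firmware or on the kernel command line.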
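Rebuild the initramfs and the GRUB config, then reboot :
update-initramfs -u
update-grub
reboot
After the reboot, both GPU functions should be bound to vfio-pci :
lspci -nnk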
...
01:00.0 VGA compatible controller [0300]: NVIDIA Corporation GP107 [GeForce GTX 1050 Ti] [10de:1c82] (rev a1)
Subsystem: Micro-Star International Co., Ltd. [MSI] GP107 [GeForce GTX 1050 Ti] [1462:3351]
Kernel driver in use: vfio-pci
Kernel modules: nvidiafb, nouveau
01:00.1 Audio device [0403]: NVIDIA Corporation GP107GL High Definition Audio Controller [10de:0fb9] (rev a1)
Subsystem: Micro-Star International Co., Ltd. [MSI] GP107GL High Definition Audio Controller [1462:3351]
Kernel driver in use: vfio-pci
Kernel modules: snd_hda_intel
...
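To check the binding without reading the whole listing, the lspci -nnk output can be filtered per slot. A small sketch; driver_in_use is a hypothetical helper, and the slot address 01:00.0 is specific to this machine :

```shell
# Print the kernel driver bound to PCI slot $1, reading `lspci -nnk`
# output from stdin. Prints nothing if no driver is bound.
# (driver_in_use is a hypothetical helper name.)
driver_in_use() {
    awk -v slot="$1" '
        /^[0-9a-f:]+\.[0-9a-f] / { current = $1 }   # device header line
        current == slot && /Kernel driver in use:/ { print $NF }
    '
}
```

Usage: lspci -nnk | driver_in_use 01:00.0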
#######GUEST : RHEL7.4 VM config with virt-manager#######
In virt-manager, from "Add Hardware" : add a PCI Host Device for the GPU, and another for its associated sound device
Boot VM
Before attempting to install the NVIDIA driver, prevent the driver from discovering that it is running on KVM.
On the host, run :
virsh edit <vm_name>
and add :
<kvm>
<hidden state='on'/>
</kvm>
inside the features element, so that it looks like this :
<features>
<acpi/>
<apic/>
<kvm>
<hidden state='on'/>
</kvm>
<vmport state='off'/>
</features>
Register the guest with subscription-manager, then set up some repos for later :
subscription-manager repos --enable=rhel-7-server-extras-rpms
subscription-manager repos --disable=rhel-7-server-htb-rpms
Update everything :
yum -y update
Install the build tools and utilities needed later :
yum install gcc kernel-devel wget pciutils yum-utils
Enable EPEL :
wget http://dl.fedoraproject.org/pub/epel/7/x86_64/Packages/e/epel-release-7-11.noarch.rpm
rpm -Uvh epel-release*rpm
reboot
To see PCI devices :
lspci -nnk
Initially the GPU is claimed by nouveau, the default open-source driver for NVIDIA cards :
...
00:08.0 VGA compatible controller [0300]: NVIDIA Corporation GP107 [GeForce GTX 1050 Ti] [10de:1c82] (rev a1)
Subsystem: Micro-Star International Co., Ltd. [MSI] Device [1462:3351]
Kernel driver in use: nouveau
Kernel modules: nouveau
00:09.0 Audio device [0403]: NVIDIA Corporation GP107GL High Definition Audio Controller [10de:0fb9] (rev a1)
Subsystem: Micro-Star International Co., Ltd. [MSI] Device [1462:3351]
Kernel driver in use: snd_hda_intel
Kernel modules: snd_hda_intel
...
Remove the nouveau driver and install the NVIDIA driver
Follow this :
https://access.redhat.com/solutions/1155663
or perhaps better, this :
https://blog.openshift.com/use-gpus-with-device-plugin-in-openshift-3-9/
Edit /etc/default/grub and add the following to the GRUB_CMDLINE_LINUX line:
modprobe.blacklist=nouveau
# grub2-mkconfig -o /boot/grub2/grub.cfg
# reboot
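The edit to /etc/default/grub can also be scripted. A sketch, assuming GRUB_CMDLINE_LINUX sits on a single line with a double-quoted value and that the parameter contains no | or & characters; add_grub_param is a hypothetical helper, and it leaves a copy of the original file at <file>.backup :

```shell
# Append kernel parameter $2 to GRUB_CMDLINE_LINUX in file $1,
# unless it is already present.
# (add_grub_param is a hypothetical helper name.)
add_grub_param() {
    file=$1
    param=$2
    if grep '^GRUB_CMDLINE_LINUX=' "$file" | grep -qF -- "$param"; then
        return 0                         # already there, nothing to do
    fi
    # Keep a backup, then insert the parameter before the closing quote.
    cp "$file" "$file.backup"
    sed -E "s|^(GRUB_CMDLINE_LINUX=\"[^\"]*)\"|\1 $param\"|" "$file.backup" > "$file"
}
```

Usage: add_grub_param /etc/default/grub modprobe.blacklist=nouveau, followed by the grub2-mkconfig and reboot steps.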
After blacklisting nouveau, no kernel driver is bound to the GPU :
...
00:08.0 VGA compatible controller [0300]: NVIDIA Corporation GP107 [GeForce GTX 1050 Ti] [10de:1c82] (rev a1)
Subsystem: Micro-Star International Co., Ltd. [MSI] Device [1462:3351]
Kernel modules: nouveau
00:09.0 Audio device [0403]: NVIDIA Corporation GP107GL High Definition Audio Controller [10de:0fb9] (rev a1)
Subsystem: Micro-Star International Co., Ltd. [MSI] Device [1462:3351]
Kernel driver in use: snd_hda_intel
Kernel modules: snd_hda_intel
...
After building and installing the NVIDIA driver, and rebooting :
00:08.0 VGA compatible controller [0300]: NVIDIA Corporation GP107 [GeForce GTX 1050 Ti] [10de:1c82] (rev a1)
Subsystem: Micro-Star International Co., Ltd. [MSI] Device [1462:3351]
Kernel driver in use: nvidia
Kernel modules: nouveau, nvidia_drm, nvidia
00:09.0 Audio device [0403]: NVIDIA Corporation GP107GL High Definition Audio Controller [10de:0fb9] (rev a1)
Subsystem: Micro-Star International Co., Ltd. [MSI] Device [1462:3351]
Kernel driver in use: snd_hda_intel
Kernel modules: snd_hda_intel
[justin@localhost ~]$ nvidia-smi
Sat May 12 14:33:36 2018
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 390.48                 Driver Version: 390.48                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 105...  Off  | 00000000:00:08.0 Off |                  N/A |
|  0%   41C    P0    N/A /  90W |      0MiB /  4040MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
Doing useful ML stuff with the GPU, two options :
####1. Install nvidia-docker
nvidia-docker requires docker-ce :
https://stackoverflow.com/questions/42981114/install-docker-ce-17-03-on-rhel7
docker-ce requires pigz
https://centos.pkgs.org/7/epel-x86_64/pigz-2.3.4-1.el7.x86_64.rpm.html
wget http://dl.fedoraproject.org/pub/epel/7/x86_64/Packages/e/epel-release-7-11.noarch.rpm
rpm -Uvh epel-release*rpm
yum install pigz
yum install yum-utils
yum-config-manager --add-repo https://download.docker.com/linux/centos/docker-ce.repo
subscription-manager repos --enable=rhel-7-server-extras-rpms
yum install docker-ce
https://github.com/NVIDIA/nvidia-docker
# Add the package repositories
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.repo | sudo tee /etc/yum.repos.d/nvidia-docker.repo
# Install nvidia-docker2 and reload the Docker daemon configuration
sudo yum install -y nvidia-docker2
systemctl enable docker
systemctl start docker
docker run --runtime=nvidia --rm nvidia/cuda nvidia-smi
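If the docker run test fails, one thing worth checking is whether the nvidia runtime was registered with the Docker daemon. A sketch that looks for it in the Runtimes line of docker info output; has_nvidia_runtime is a hypothetical helper name :

```shell
# Succeeds if `docker info` output on stdin lists an "nvidia" runtime.
# (has_nvidia_runtime is a hypothetical helper name.)
has_nvidia_runtime() {
    grep -E '^ *Runtimes:' | grep -qw 'nvidia'
}
```

Usage: docker info 2>/dev/null | has_nvidia_runtime && echo "nvidia runtime registered"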
###Tensorflow
https://github.com/tensorflow/tensorflow/tree/master/tensorflow/tools/docker
Run :
nvidia-docker run -it -p 8888:8888 tensorflow/tensorflow:latest-gpu
This gives you a Jupyter notebook on port 8888.
To use tflearn, run this in the first notebook cell :
import sys
!{sys.executable} -m pip install tflearn
####2. OpenShift with NVIDIA device plugin
For GPU-enabled OpenShift, follow this :
https://blog.openshift.com/use-gpus-with-device-plugin-in-openshift-3-9/
I got this error when deploying the caffe2 pod :
Error from server (Forbidden): error when creating "caffe2.yaml": pods "caffe2" is forbidden: unable to validate against any security context constraint: [provider restricted: .spec.securityContext.hostNetwork: Invalid value: true: Host network is not allowed to be used spec.containers[0].securityContext.privileged: Invalid value: true: Privileged containers are not allowed spec.containers[0].securityContext.hostNetwork: Invalid value: true: Host network is not allowed to be used spec.containers[0].securityContext.containers[0].hostPort: Invalid value: 8888: Host ports are not allowed to be used]
Solution
https://adam.younglogic.com/2017/06/creating-a-privileged-container-in-openshift/
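The error shows the pod was validated against the restricted SCC only. One common way to allow privileged pods, offered here as an assumption rather than a summary of the linked post, is to grant the privileged SCC to the service account that runs the pod. The sketch below only prints the command so it can be reviewed first; scc_grant_cmd is a hypothetical helper, and the project and service account names are placeholders :

```shell
# Print (not run) the command that lets service account $2 in project $1
# use the "privileged" SCC. Both arguments are placeholders.
# (scc_grant_cmd is a hypothetical helper name.)
scc_grant_cmd() {
    echo "oc adm policy add-scc-to-user privileged -z $2 -n $1"
}
```

Usage: scc_grant_cmd myproject default prints the oc command, which then has to be run by a cluster admin.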
Deploying Jupyter/Tensorflow/GPU
https://github.com/justindav1s/openshift-ansible-on-openstack/tree/master/nvidia