Installing NVIDIA Driver & CUDA inside an LXC container running Ubuntu 16.04 on a neuroscience computing server.
Introduction: I was trying to run some neuroscience image processing commands that use an NVIDIA GPU. The challenge is that most of our computation runs inside an LXC container running Ubuntu 16.04 (the host runs Ubuntu 16.04 as well). Installing the NVIDIA driver on the host is not so hard, but doing it inside the LXC container is much more challenging.
I already have an unprivileged container running, so I will not repeat the steps to create an LXC container here.
Our graphics card is NVIDIA GeForce GTX 1080 Ti.
Here are the main steps:
- Install NVIDIA driver on the host
- Install NVIDIA driver in the container. The driver version in the container has to be exactly the same as the one on the host.
- Install CUDA & other GPU-related libraries in the container.
I found this page
which mostly followed the instructions on this page:
(see Section 4, Runfile Installation).
And here is what I did (mostly following the steps in Section 4.2 from the link above, but listing only the steps I actually performed):
- Install gcc and other essential packages on the host:
sudo apt install build-essential software-properties-common
- Download the CUDA Toolkit run file
- Follow the instructions in Section 4.3.5 to blacklist the Nouveau driver.
- Reboot athena into text-only mode; I found this page helpful: https://askubuntu.com/questions/870221/booting-into-text-mode-in-16-04/870226 (From here on, I use the KVM environment.)
- Ran into this error:
huangk04@athena:~/Downloads$ sudo sh cuda_9.0.176_384.81_linux-run
[sudo] password for huangk04:
Sorry, user huangk04 is not allowed to execute '/bin/sh cuda_9.0.176_384.81_linux-run' as root on athena.mssm.edu.
- Got past that error by typing sudo su first, and then the installer runs.
- The root partition /dev/mapper/vg01-lv.root is too small for CUDA, so I will install CUDA in /data/cuda-9.0 and make symbolic links to /usr/local/cuda-9.0; the CUDA samples also go in /data/cuda-9.0/samples/. I also need to specify a temp file directory because of the same disk space issue. So here is the command I executed:
sh cuda_9.0.176_384.81_linux-run --tmpdir=/data/tmp
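The symlinking mentioned above amounts to something like this (my reconstruction of the step, not a verbatim record; paths are the ones given above):

```shell
# point the standard CUDA location at the install on the big /data partition
sudo ln -s /data/cuda-9.0 /usr/local/cuda-9.0
```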
(Mmm... I probably did not need to install CUDA Toolkit on the host...)
- Add the graphics driver PPA (I verified that driver version 384.98 is supported on Ubuntu 16.04), update, and then install driver version 384:
sudo add-apt-repository ppa:graphics-drivers/ppa
sudo apt-get update
sudo apt-get install nvidia-384
- Reboot, and then (still on the host) type nvidia-smi to confirm that the driver version is indeed 384.98.
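Since the driver version in the container must later match the host's exactly, it can be handy to extract just the version string from the nvidia-smi output rather than eyeballing the banner. A small sketch (the helper function is my own; it just greps the banner line):

```shell
# extract_driver_version: pull the version number out of nvidia-smi's banner.
# Usage: nvidia-smi | extract_driver_version
extract_driver_version() {
  grep -o 'Driver Version: [0-9.]*' | awk '{print $3}'
}

# works on the banner line nvidia-smi prints, e.g.:
printf '| NVIDIA-SMI 384.98  Driver Version: 384.98 |\n' | extract_driver_version
# -> 384.98
```

Running the same pipeline on the host and in the container makes the version comparison scriptable.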
To set up the GPU driver in the container, Prantik forwarded this tutorial to me: https://medium.com/@MARatsimbazafy/journey-to-deep-learning-nvidia-gpu-passthrough-to-lxc-container-97d0bc474957
- On the host, edit the file /etc/modules-load.d/modules.conf to load the NVIDIA kernel modules at boot (not sure if this is necessary), and then rebuild the initramfs:
sudo update-initramfs -u
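The module lines for /etc/modules-load.d/modules.conf probably looked like this (the exact module names are my assumption; check lsmod on the host for the set actually loaded there):

```
# load the NVIDIA kernel modules at boot (names are my best guess)
nvidia
nvidia_uvm
```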
- Set the login runlevel back to graphical.target (and another reboot is required):
sudo systemctl set-default graphical.target
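To verify that the change took effect after the reboot:

```shell
systemctl get-default   # should print graphical.target
```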
- Edit the file /home/huangk04/.local/share/lxc/athena_box/config and add the following lines to it:
# GPU Passthrough config
lxc.cgroup.devices.allow = c 195:* rwm
lxc.cgroup.devices.allow = c 243:* rwm
lxc.mount.entry = /dev/nvidia0 dev/nvidia0 none bind,optional,create=file
lxc.mount.entry = /dev/nvidiactl dev/nvidiactl none bind,optional,create=file
lxc.mount.entry = /dev/nvidia-uvm dev/nvidia-uvm none bind,optional,create=file
lxc.mount.entry = /dev/nvidia-modeset dev/nvidia-modeset none bind,optional,create=file
lxc.mount.entry = /dev/nvidia-uvm-tools dev/nvidia-uvm-tools none bind,optional,create=file
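The device major numbers in the allow lines (195 and 243) are host-specific; on another machine they can be read off the /dev/nvidia* nodes instead of copied blindly. A sketch (the helper function is mine; stat's %t format prints the major number in hex):

```shell
# print one lxc.cgroup.devices.allow entry per character-device major number
allow_line() {
  printf 'lxc.cgroup.devices.allow = c %d:* rwm\n' "$1"
}

allow_line 195   # -> lxc.cgroup.devices.allow = c 195:* rwm
allow_line 243   # -> lxc.cgroup.devices.allow = c 243:* rwm

# On a real host, read the majors from the device nodes; stat prints them
# in hex (e.g. c3 for 195), so convert in bash with base-16 arithmetic:
#   for h in $(stat -c '%t' /dev/nvidia* | sort -u); do allow_line "$((16#$h))"; done
```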
However, passing the GPU device in the LXC config file led to the following error when I tried to start the container:
huangk04@athena:~$ lxc-start -n athena_box -d
lxc-start: tools/lxc_start.c: main: 366 The container failed to start.
lxc-start: tools/lxc_start.c: main: 368 To get more details, run the container in foreground mode.
lxc-start: tools/lxc_start.c: main: 370 Additional information can be obtained by setting the --logfile and --logpriority options.
I found that I can't start my container if I specify anything in the config file that tries to modify the cgroup settings, like trying to get access to the /dev/nvidia* devices on the host.
It looks like a cgroups issue with LXC on Ubuntu 16.04 (maybe somehow related to systemd, but I don't really understand what that means). What's more confusing is that Ubuntu has a package cgmanager that manages cgroups (by being a wrapper that sends calls over dbus?), but when I tried to install it by typing
sudo apt update
sudo apt install cgmanager
it showed that I installed version 0.39-2ubuntu5. But the version of cgm I got is
huangk04@athena:~$ cgm --version
0.29
Seems like a bug in cgmanager. Anyway, I found some instructions (e.g., https://www.berthon.eu/2015/lxc-unprivileged-containers-on-ubuntu-14-04-lts/) saying that I could move all processes in my current shell to a specific cgroup with access to the devices that I need, and then I may be able to start the container. So here is what I tried:
sudo cgm create all $USER
sudo cgm chown all $USER $(id -u) $(id -g)
sudo cgm movepid all $USER $$
In fact, the second and third commands actually threw errors. But they did have an effect on what I see in /proc/self/cgroup. Before these commands, it looks like this:
huangk04@athena:~$ cat /proc/self/cgroup
11:cpuset:/
10:net_cls,net_prio:/
9:cpu,cpuacct:/user.slice
8:perf_event:/
7:memory:/user/huangk04/0
6:devices:/user.slice
5:freezer:/user/huangk04/0
4:hugetlb:/
3:blkio:/user.slice
2:pids:/user.slice/user-10354.slice
1:name=systemd:/user.slice/user-10354.slice/session-6.scope
and after the three commands above (the first one probably only needs to be run once), I see
huangk04@athena:~$ cat /proc/self/cgroup
11:cpuset:/huangk04
10:net_cls,net_prio:/huangk04
9:cpu,cpuacct:/user.slice/huangk04
8:perf_event:/huangk04
7:memory:/user/huangk04/0
6:devices:/user.slice/huangk04
5:freezer:/user/huangk04/0
4:hugetlb:/huangk04
3:blkio:/user.slice/huangk04
2:pids:/user.slice/user-10354.slice/huangk04
1:name=systemd:/user.slice/user-10354.slice/session-6.scope
and now the container starts. I suspect that it is a bug in cgmanager that throws errors even though the commands worked, which could be related to the inconsistent version numbers I saw above. Also, sudo cgm chown and sudo cgm movepid are not persistent, meaning that I will need to rerun these commands whenever I restart the container (most likely from a different shell).
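Since these commands have to be rerun, a quick scripted check that the movepid step actually took effect is useful: count the controllers in /proc/self/cgroup whose path now ends in the per-user sub-cgroup. A sketch (the helper function is mine):

```shell
# count cgroup controllers whose path ends in the given sub-cgroup name;
# reads /proc/self/cgroup unless a file is given (handy for testing)
moved_count() {  # usage: moved_count <name> [<cgroup-file>]
  grep -c "/$1\$" "${2:-/proc/self/cgroup}"
}
```

In the shell that ran the cgm commands, `moved_count "$USER"` should report a non-zero number of moved controllers; in a fresh shell it reports 0.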
- Install the NVIDIA driver in the container as well (so we have the nvidia-smi command in the container): First, download the driver runfile NVIDIA-Linux-x86_64-384.98.run (again, the version in the container must match the version on the host, 384.98). Then do the following (courtesy of this website: https://qiita.com/yanoshi/items/75b0fc6b65df49fc2263):
cd ~/Downloads  # or wherever the runfile is
chmod a+x NVIDIA-Linux-x86_64-384.98.run
sudo ./NVIDIA-Linux-x86_64-384.98.run --no-kernel-module
And then follow the prompts to install the driver. After that, I can see the GPU info in the container by typing
root@xenial:/usr/local/cuda-9.0/samples/1_Utilities/deviceQuery# nvidia-smi
Tue Nov 21 02:35:05 2017
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.98                 Driver Version: 384.98                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 108...  Off  | 00000000:02:00.0 Off |                  N/A |
| 23%   38C    P0    59W / 250W |      0MiB / 11172MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
- Install CUDA in the container, following the same steps 2 to 7 as on the host.
- Run a CUDA test: go to the samples directory /usr/local/cuda-9.0/samples/1_Utilities/deviceQuery, type make to compile an executable, and run the executable ./deviceQuery, which produced the following output:
root@xenial:/usr/local/cuda-9.0/samples/1_Utilities/deviceQuery# ./deviceQuery
./deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: "GeForce GTX 1080 Ti"
  CUDA Driver Version / Runtime Version          9.0 / 9.0
  CUDA Capability Major/Minor version number:    6.1
  Total amount of global memory:                 11172 MBytes (11715084288 bytes)
  (28) Multiprocessors, (128) CUDA Cores/MP:     3584 CUDA Cores
  GPU Max Clock rate:                            1582 MHz (1.58 GHz)
  Memory Clock rate:                             5505 Mhz
  Memory Bus Width:                              352-bit
  L2 Cache Size:                                 2883584 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
  Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 2 copy engine(s)
  Run time limit on kernels:                     No
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  Device supports Unified Addressing (UVA):      Yes
  Supports Cooperative Kernel Launch:            Yes
  Supports MultiDevice Co-op Kernel Launch:      Yes
  Device PCI Domain ID / Bus ID / location ID:   0 / 2 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 9.0, CUDA Runtime Version = 9.0, NumDevs = 1
Result = PASS
- It is best to remove the graphics driver PPA so that a future apt update won't update the driver to a newer but incompatible version:
sudo add-apt-repository --remove ppa:graphics-drivers/ppa
This should be done both on the host and in the container.
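An alternative (or complement) that I did not try: keep the PPA but put the driver package on hold so apt leaves it alone (apt-mark is a standard apt tool):

```shell
sudo apt-mark hold nvidia-384      # on the host and in the container
# sudo apt-mark unhold nvidia-384  # later, when a coordinated upgrade is wanted
```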
To be added: run more tests using the GPU
Here is the page where one can download the CUDA- and OpenMP-enabled versions of FSL, in case I forget:
Also, here are nice instructions on how to install CUDA 7.5 on Ubuntu 16.04: http://www.xgerrmann.com/uncategorized/building-cuda-7-5-on-ubuntu-16-04/
In the end, eddy_cuda7.5 still doesn't run properly inside the container. I wonder if it's because I installed CUDA 9.0 before installing CUDA 7.5, even though I created an environment module file for CUDA 7.5 and loaded it before running (and eddy_cuda7.5 seems to be able to find the correct libraries). I'll need to experiment more with this later.