
Installing NVIDIA CUDA on Azure NC with Tesla K80 and Ubuntu 16.04

State as of 2017-07-31.

You can also check a guide to upgrade CUDA on a [PC with GTX 980 Ti and Ubuntu 16.04](https://gist.github.com/bzamecnik/61b293a3891e166797491f38d579d060).

Target versions

  • NVIDIA driver 375.66
    • latest is 384.59 (2017-07-28) - I haven't tried it yet
  • CUDA Toolkit 8.0
  • cuDNN 5.1
    • latest is 6.0, but not supported by TensorFlow 1.2.1

We'll see how to install the individual components, and also how to install everything with just one reboot. In total it takes around 3 GB of disk space.

Some docs

What GPU do we have?

Tested on Azure NC6 with 1x Tesla K80.

$ lspci | grep -i NVIDIA
a450:00:00.0 3D controller [0302]: NVIDIA Corporation GK210GL [Tesla K80] [10de:102d] (rev a1)
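
Optionally, we can also check which kernel driver is currently bound to the card (nouveau before the installation, nvidia afterwards). A small sketch using standard lspci flags:

# show the GPU together with the kernel driver currently in use
lspci -nnk | grep -iA3 nvidia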

NVIDIA drivers

NOTE: Removing the nouveau driver is not necessary; the installation of cuda-drivers does that automatically:

Setting up nvidia-375 (375.66-0ubuntu1) ...
update-alternatives: using /usr/lib/nvidia-375/ld.so.conf to provide /etc/ld.so.conf.d/x86_64-linux-gnu_GL.conf (x86_64-linux-gnu_gl_conf) in auto mode
update-alternatives: using /usr/lib/nvidia-375/ld.so.conf to provide /etc/ld.so.conf.d/x86_64-linux-gnu_EGL.conf (x86_64-linux-gnu_egl_conf) in auto mode
update-alternatives: using /usr/lib/nvidia-375/alt_ld.so.conf to provide /etc/ld.so.conf.d/i386-linux-gnu_GL.conf (i386-linux-gnu_gl_conf) in auto mode
update-alternatives: using /usr/lib/nvidia-375/alt_ld.so.conf to provide /etc/ld.so.conf.d/i386-linux-gnu_EGL.conf (i386-linux-gnu_egl_conf) in auto mode
update-alternatives: using /usr/share/nvidia-375/glamor.conf to provide /usr/share/X11/xorg.conf.d/glamoregl.conf (glamor_conf) in auto mode
update-initramfs: deferring update (trigger activated)

A modprobe blacklist file has been created at /etc/modprobe.d to prevent Nouveau from loading. This can be reverted by deleting /etc/modprobe.d/nvidia-graphics-drivers.conf.
A new initrd image has also been created. To revert, please replace /boot/initrd-4.4.0-87-generic with /boot/initrd-$(uname -r)-backup.

*****************************************************************************
*** Reboot your computer and verify that the NVIDIA graphics driver can   ***
*** be loaded.                                                            ***
*****************************************************************************

We will install the NVIDIA Tesla driver via a deb package.

wget http://us.download.nvidia.com/tesla/375.66/nvidia-diag-driver-local-repo-ubuntu1604_375.66-1_amd64.deb
sudo dpkg -i nvidia-diag-driver-local-repo-ubuntu1604_375.66-1_amd64.deb
sudo apt-get update
sudo apt-get install cuda-drivers
sudo reboot
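
After the reboot we can verify that the kernel module is loaded and that the expected driver version is reported (a quick sanity check, using the standard paths of the packaged driver):

# the nvidia kernel module should be loaded
lsmod | grep nvidia
# should report driver version 375.66
cat /proc/driver/nvidia/version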

CUDA toolkit

https://developer.nvidia.com/cuda-downloads

wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1604/x86_64/cuda-repo-ubuntu1604_8.0.61-1_amd64.deb
sudo dpkg -i cuda-repo-ubuntu1604_8.0.61-1_amd64.deb
sudo apt-get update
sudo apt-get install cuda
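
The toolkit lands in /usr/local/cuda (a symlink to /usr/local/cuda-8.0). To call nvcc from the shell we also need it on PATH; a minimal sketch, assuming the default install location:

# add the CUDA binaries to PATH (e.g. append this line to ~/.profile)
export PATH=/usr/local/cuda/bin:$PATH
# should report "Cuda compilation tools, release 8.0"
nvcc --version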

cuDNN

TensorFlow 1.2.1 needs cuDNN 5.1 (not 6.0).

It needs to be downloaded with a registered NVIDIA account: https://developer.nvidia.com/rdp/cudnn-download

This can be downloaded from a browser and then copied to the target machine via SCP:

https://developer.nvidia.com/compute/machine-learning/cudnn/secure/v5.1/prod_20161129/8.0/libcudnn5_5.1.10-1+cuda8.0_amd64-deb

sudo dpkg -i libcudnn5_5.1.10-1+cuda8.0_amd64-deb

Add to ~/.profile:

export LD_LIBRARY_PATH=/usr/local/cuda/lib64/:/usr/lib/x86_64-linux-gnu/:$LD_LIBRARY_PATH

Then reload it:

. ~/.profile
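
To double-check that cuDNN is in place (the deb installs the shared library into /usr/lib/x86_64-linux-gnu):

# the package should show up as installed
dpkg -l | grep cudnn
# and the shared library should be present
ls -l /usr/lib/x86_64-linux-gnu/libcudnn*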

All can be installed together

Note that cuda-drivers installs a lot of unnecessary X11 stuff (3.5 GB in total!). We can drop the dependency on lightdm to save some space if we don't use a GUI.

sudo dpkg -i nvidia-diag-driver-local-repo-ubuntu1604_375.66-1_amd64.deb
sudo dpkg -i cuda-repo-ubuntu1604_8.0.61-1_amd64.deb
sudo dpkg -i libcudnn5_5.1.10-1+cuda8.0_amd64-deb
sudo apt-get update
# this installs around 3.5 GB of dependencies
sudo apt-get install cuda-drivers cuda
# alternatively, exclude lightdm to save about 0.5 GB
sudo apt-get install cuda-drivers cuda lightdm-
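
Optionally, if we want to stay on exactly this driver/toolkit combination, we can put the packages on hold so a later apt-get upgrade doesn't pull in a newer driver (just a suggestion, not part of the original steps):

# pin the driver and toolkit packages at their current versions
sudo apt-mark hold nvidia-375 cuda-drivers cuda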

Reboot

sudo reboot

Test that it's working

We should see the GPU information:

nvidia-smi
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 375.66                 Driver Version: 375.66                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           Off  | A450:00:00.0     Off |                    0 |
| N/A   40C    P0    70W / 149W |      0MiB / 11439MiB |      2%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
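
We can also compile and run the deviceQuery sample shipped with the toolkit (assuming the samples were installed to the default /usr/local/cuda/samples location):

# build and run deviceQuery; the output should end with "Result = PASS"
cd /usr/local/cuda/samples/1_Utilities/deviceQuery
sudo make
./deviceQuery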

Let's run a simple "hello world" MNIST MLP in Keras/TensorFlow:

pip install tensorflow-gpu==1.2.1 keras==2.0.6
wget https://raw.githubusercontent.com/fchollet/keras/master/examples/mnist_mlp.py
python mnist_mlp.py

We should see that it uses the GPU and trains properly:

Creating TensorFlow device (/gpu:0) -> (device: 0, name: Tesla K80, pci bus id: a450:00:00.0)
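
Another quick check, without training anything, is to ask TensorFlow which devices it sees (the device_lib helper is available in TensorFlow 1.2):

# a Tesla K80 should be listed as /gpu:0
python -c "from tensorflow.python.client import device_lib; print(device_lib.list_local_devices())"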

That's it. Happy training!
