bzamecnik/installing_cuda_gtx980ti_ubuntu_1604.md

# Installing CUDA 8.0 on a PC with GTX 980 Ti and Ubuntu 16.04

State as of 2017-07-31.

Recently I was installing CUDA on an Azure NC6 VM with a Tesla K80, and later the same day I also upgraded my personal machine with a GTX 980 Ti from Ubuntu 15.10 to 16.04 and completely reinstalled CUDA. I had NVIDIA driver 361, CUDA 7.5 and cuDNN 4 or 5, and wanted CUDA 8.0 for the new TensorFlow 1.2.1. So I had to upgrade Ubuntu as well.
## Preparing the plan

1. Remove old NVIDIA drivers, CUDA toolkit, cuDNN and related packages.
2. Install NVIDIA drivers, CUDA toolkit and cuDNN.
3. Verify that it works.

Target versions:

- NVIDIA driver: 375.82 (long-lived stable branch)
  - latest: 384.59 (possibly OK, but I took the stable version to be on the safe side)
- CUDA toolkit: 8.0
- cuDNN: 5.1
  - latest: 6.0 (not supported by the latest TensorFlow 1.2.1, only by 1.3.0-rc1)
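
To check an installed driver against the 375.82 target from a script, a version comparison is needed, since plain string comparison gets cases like `375.100` vs `375.82` wrong. A minimal sketch, assuming GNU coreutils' `sort -V` (version sort); the name `version_ge` is hypothetical. For cuDNN a "newer is fine" check is not enough, because TensorFlow 1.2.1 needs 5.1 rather than 6.0.

```bash
# Hypothetical helper: succeeds (exit 0) if version $1 >= version $2.
# Uses GNU `sort -V`, which compares dotted version numbers numerically.
version_ge() {
    # After version-sorting, the lower version comes first; if that is $2,
    # then $1 >= $2.
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

TARGET_DRIVER=375.82

for installed in 384.59 375.82 361.42; do
    if version_ge "$installed" "$TARGET_DRIVER"; then
        echo "$installed: OK (>= $TARGET_DRIVER)"
    else
        echo "$installed: too old (< $TARGET_DRIVER)"
    fi
done
```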


## Secure Boot

One change after upgrading from Ubuntu 15.10 is that Ubuntu 16.04 uses the Secure Boot mechanism, which requires kernel drivers to be digitally signed (Secure Boot was originally used to protect Windows from viruses). Unfortunately, this doesn't work well with third-party binary drivers such as NVIDIA's. The working option for me was to disable Secure Boot entirely; otherwise Ubuntu is not able to load the drivers. The complication is that this cannot be done over SSH, only at boot time in the UEFI BIOS settings! Since my computer is on the other side of the world, I had to call somebody to modify the BIOS settings for me.
Example procedure for my ASUS Z170-E motherboard (video):

1. Restart and press F8 during boot to enter the BIOS.
2. Boot -> Secure Boot (down) -> Key Management -> Clear Secure Boot Keys
3. Boot -> Secure Boot -> Secure Boot state: should be Disabled
4. Exit -> Save Changes & Reset
5. Restart.

I recommend doing this even before upgrading Ubuntu.
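
Once the machine is back up, the Secure Boot state can be confirmed remotely: `mokutil --sb-state` prints `SecureBoot enabled` or `SecureBoot disabled`. The sketch below assumes the `mokutil` package is installed; `secure_boot_enabled` is a hypothetical helper name.

```bash
# Hypothetical helper: reads `mokutil --sb-state` output on stdin and succeeds
# (exit 0) only if Secure Boot is reported as enabled.
secure_boot_enabled() {
    grep -q 'SecureBoot enabled'
}

# Only query the real system if mokutil is available (assumption: it is
# provided by the `mokutil` package).
if command -v mokutil >/dev/null 2>&1; then
    if mokutil --sb-state 2>/dev/null | secure_boot_enabled; then
        echo "Secure Boot is ON: the unsigned NVIDIA kernel module may fail to load"
    else
        echo "Secure Boot is off (or not supported on this machine)"
    fi
fi
```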
## What GPU do we have?

A PC with an ASUS Z170-E motherboard and a GTX 980 Ti GPU:

```
$ lspci | grep -i NVIDIA
01:00.0 VGA compatible controller: NVIDIA Corporation GM200 [GeForce GTX 980 Ti] (rev a1)
```
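
To pull just the model name out of that line (e.g. in a provisioning script), the controller lines can be filtered and the text in square brackets extracted. This is a sketch that assumes the plain `lspci` line format shown above; `nvidia_gpu_names` is a hypothetical name. Data-center cards such as the Tesla K80 typically appear as `3D controller` rather than `VGA compatible controller`, so both are matched.

```bash
# Hypothetical helper: reads `lspci` output on stdin and prints the model name
# in square brackets for each NVIDIA display controller. Other functions of
# the same card (e.g. "Audio device: NVIDIA ...") are filtered out.
nvidia_gpu_names() {
    grep -iE '(VGA compatible|3D) controller: NVIDIA' |
        sed -n 's/.*\[\(.*\)\].*/\1/p'
}

if command -v lspci >/dev/null 2>&1; then
    lspci | nvidia_gpu_names
fi
```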

## Upgrading Ubuntu to 16.04

It was quite easy:

```bash
sudo do-release-upgrade
```

Then resolve a few conflicts in `/etc` files. It would have been easier if I had disabled Secure Boot first. It took about an hour.
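
Before adding the `ubuntu1604` CUDA repository, it is worth confirming which release is actually running. `lsb_release -rs` does this too; the sketch below reads the standard `/etc/os-release` file so that it can also be pointed at a test file. `os_version_id` is a hypothetical helper name.

```bash
# Hypothetical helper: prints VERSION_ID from an os-release file
# (default: /etc/os-release), with the surrounding quotes removed.
os_version_id() {
    # The relevant line looks like: VERSION_ID="16.04"
    sed -n 's/^VERSION_ID=//p' "${1:-/etc/os-release}" | tr -d '"'
}

if [ -r /etc/os-release ]; then
    echo "Running release: $(os_version_id)"
fi
```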
## Removing old drivers

If the NVIDIA driver was installed via a deb package:

```bash
sudo apt-get remove --purge 'nvidia*' 'cuda*' 'libcuda1*' 'libcudnn5*' libxnvctrl0
```

If the NVIDIA driver was installed via the `*.run` file:

```bash
sudo nvidia-uninstall
```
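
To see beforehand which installed packages the purge patterns above would catch, the `dpkg -l` listing can be filtered. A sketch: the prefix list mirrors the patterns in the `apt-get` command, and `nvidia_cuda_packages` is a hypothetical name.

```bash
# Hypothetical helper: reads `dpkg -l` output on stdin and prints installed
# packages (state "ii") whose names start with one of the purged prefixes.
nvidia_cuda_packages() {
    awk '$1 == "ii" && $2 ~ /^(nvidia|cuda|libcuda1|libcudnn5|libxnvctrl0)/ { print $2 }'
}

dpkg -l 2>/dev/null | nvidia_cuda_packages
```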

## Installing the NVIDIA driver

There are two options:

- official installer: `NVIDIA-Linux-x86_64-<version>.run`
  - http://www.nvidia.com/Download/index.aspx
  - http://www.nvidia.com/download/driverResults.aspx/120917/en-us
  - the only version offered there is 384.59, but 375.82 can still be downloaded directly
- unofficial Debian package (PPA): `nvidia-375`

Although NVIDIA recommends only the official installer, it is a hassle to work with and it didn't work well for me. The Debian package (not recommended by NVIDIA) worked like a charm and without tons of questions. So my recommendation is to use the Debian package.
### Debian package from PPA

```bash
sudo add-apt-repository ppa:graphics-drivers/ppa
sudo apt-get update
sudo apt-get install nvidia-375
```

### NVIDIA installer

Although I don't recommend it, here are the steps for the NVIDIA installer:

```bash
NV_VERSION=375.82
wget http://us.download.nvidia.com/XFree86/Linux-x86_64/${NV_VERSION}/NVIDIA-Linux-x86_64-${NV_VERSION}.run
chmod +x NVIDIA-Linux-x86_64-${NV_VERSION}.run
```
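
Since the download URL above varies only in the version string, it can be wrapped in a function so that switching to, say, 384.59 is a one-argument change. This is a sketch: `nvidia_run_url` and `fetch_nvidia_run` are hypothetical names, and the URL pattern is simply the one used above, which NVIDIA may change.

```bash
# Hypothetical helper: prints the .run installer URL for a driver version,
# following the URL pattern used above.
nvidia_run_url() {
    printf 'http://us.download.nvidia.com/XFree86/Linux-x86_64/%s/NVIDIA-Linux-x86_64-%s.run\n' "$1" "$1"
}

# Hypothetical helper: downloads the installer unless it is already present,
# then makes it executable.
fetch_nvidia_run() {
    local file="NVIDIA-Linux-x86_64-$1.run"
    [ -f "$file" ] || wget "$(nvidia_run_url "$1")" || return 1
    chmod +x "$file"
}

# Usage: fetch_nvidia_run 375.82
```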

If you have an X11 server (GUI) running, you need to stop it during the installation; otherwise the installer refuses to run:

```bash
## ERROR: You appear to be running an X server; please exit X before installing.
sudo service lightdm stop # display manager
sudo systemctl stop vncserver@1 # also stop VNC server if present
sudo init 3 # run level without GUI

sudo ./NVIDIA-Linux-x86_64-${NV_VERSION}.run

sudo reboot
```

## CUDA toolkit

https://developer.nvidia.com/cuda-downloads

```bash
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1604/x86_64/cuda-repo-ubuntu1604_8.0.61-1_amd64.deb
sudo dpkg -i cuda-repo-ubuntu1604_8.0.61-1_amd64.deb
sudo apt-get update
sudo apt-get install cuda
```
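
The `cuda` package installs the toolkit under `/usr/local/cuda-8.0` with a `/usr/local/cuda` symlink, and the steps in this guide do not put its `bin/` directory (which holds `nvcc`) on `PATH`. A sketch for checking the installed release by calling `nvcc` through its full path; `cuda_release` is a hypothetical helper that parses the `release X.Y,` part of `nvcc --version`.

```bash
# Hypothetical helper: reads `nvcc --version` output on stdin and prints the
# release number, e.g. "8.0" from "Cuda compilation tools, release 8.0, V8.0.61".
cuda_release() {
    sed -n 's/.*release \([0-9][0-9.]*\),.*/\1/p'
}

NVCC=/usr/local/cuda/bin/nvcc
if [ -x "$NVCC" ]; then
    echo "CUDA release: $("$NVCC" --version | cuda_release)"
fi
```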

## cuDNN

TensorFlow 1.2.1 needs cuDNN 5.1 (not 6.0). It has to be downloaded with a registered NVIDIA account:
https://developer.nvidia.com/rdp/cudnn-download

The package can be downloaded in a browser and then copied to the target machine via SCP:
https://developer.nvidia.com/compute/machine-learning/cudnn/secure/v5.1/prod_20161129/8.0/libcudnn5_5.1.10-1+cuda8.0_amd64-deb

```bash
sudo dpkg -i libcudnn5_5.1.10-1+cuda8.0_amd64-deb
```
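
A quick way to confirm which cuDNN major version the dynamic linker can see is to look for `libcudnn.so.N` in `ldconfig -p`; for cuDNN 5.1 this should be `5`. This sketch assumes the package registered the library in the linker cache; `cudnn_major_versions` is a hypothetical name.

```bash
# Hypothetical helper: reads `ldconfig -p` output on stdin and prints the
# major version(s) of libcudnn.so.N known to the dynamic linker.
cudnn_major_versions() {
    sed -n 's/^[[:space:]]*libcudnn\.so\.\([0-9][0-9]*\) .*/\1/p' | sort -u
}

# Expected to print 5 after installing the libcudnn5 package.
ldconfig -p 2>/dev/null | cudnn_major_versions
```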

Add to `~/.profile`:

```bash
export LD_LIBRARY_PATH=/usr/local/cuda/lib64/:/usr/lib/x86_64-linux-gnu/:$LD_LIBRARY_PATH
```

and reload it in the current shell:

```bash
. ~/.profile
```
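
Each time `~/.profile` is sourced, the `export` line prepends the same directories again. If that bothers you, a guarded prepend avoids duplicates and also avoids a trailing `:` when `LD_LIBRARY_PATH` starts out empty. A sketch: `path_prepend` is a hypothetical helper, and it uses the directories without the trailing slash.

```bash
# Hypothetical helper: prints the colon-separated list $2 with directory $1
# prepended, unless $1 is already in the list.
path_prepend() {
    case ":$2:" in
        *":$1:"*) printf '%s\n' "$2" ;;    # already present: unchanged
        "::")     printf '%s\n' "$1" ;;    # empty list: just the new directory
        *)        printf '%s\n' "$1:$2" ;;
    esac
}

# Possible ~/.profile usage, giving the same order as the export line above.
LD_LIBRARY_PATH=$(path_prepend /usr/lib/x86_64-linux-gnu "${LD_LIBRARY_PATH:-}")
LD_LIBRARY_PATH=$(path_prepend /usr/local/cuda/lib64 "$LD_LIBRARY_PATH")
export LD_LIBRARY_PATH
```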

## Installing everything together

All of it can also be installed in one go (assuming both `.deb` files have already been downloaded):

```bash
sudo apt-get remove --purge 'nvidia*' 'cuda*' 'libcuda1*' 'libcudnn5*' libxnvctrl0
sudo add-apt-repository ppa:graphics-drivers/ppa
sudo dpkg -i cuda-repo-ubuntu1604_8.0.61-1_amd64.deb
sudo dpkg -i libcudnn5_5.1.10-1+cuda8.0_amd64-deb
sudo apt-get update
sudo apt-get install nvidia-375 cuda
```
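
That sequence purges the old packages first, so it is worth making sure both `.deb` files are actually in the current directory before starting (the cuDNN one has to be fetched manually through a browser). A sketch of such a pre-flight check; `require_files` is a hypothetical name.

```bash
# Hypothetical pre-flight check: fails (exit 1) and lists any of the given
# files that do not exist.
require_files() {
    local missing=0 f
    for f in "$@"; do
        if [ ! -f "$f" ]; then
            echo "missing: $f" >&2
            missing=1
        fi
    done
    return "$missing"
}

if require_files cuda-repo-ubuntu1604_8.0.61-1_amd64.deb libcudnn5_5.1.10-1+cuda8.0_amd64-deb; then
    echo "both packages present, OK to run the combined install"
fi
```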

## Reboot

```bash
sudo reboot
```
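
After the reboot, `/proc/driver/nvidia/version` exists only if the `nvidia` kernel module actually loaded, which makes it a quick check that Secure Boot did not block the driver. A sketch that prints the module version from that file; `nvidia_module_version` is a hypothetical name, and the parsing assumes the usual `... Kernel Module  <version>  ...` first line.

```bash
# Hypothetical helper: prints the NVIDIA kernel module version from the given
# file (default: /proc/driver/nvidia/version).
nvidia_module_version() {
    sed -n 's/.*Kernel Module *\([0-9][0-9.]*\) .*/\1/p' "${1:-/proc/driver/nvidia/version}"
}

if [ -r /proc/driver/nvidia/version ]; then
    echo "Loaded NVIDIA kernel module: $(nvidia_module_version)"
else
    echo "NVIDIA kernel module not loaded (check the Secure Boot state)"
fi
```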

## Test that it's working

We should see the GPU information:

```bash
nvidia-smi
```

```
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 375.82                 Driver Version: 375.82                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 980 Ti  Off  | 0000:01:00.0      On |                  N/A |
| 22%   39C    P8    19W / 250W |     92MiB /  6075MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0      1096    G   /usr/lib/xorg/Xorg                              53MiB |
|    0      2388    G   /usr/bin/gnome-shell                            37MiB |
+-----------------------------------------------------------------------------+
```
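
For scripted checks, `nvidia-smi` can print selected fields as CSV instead of the table (`--query-gpu=name,driver_version --format=csv,noheader`, giving lines like `GeForce GTX 980 Ti, 375.82`). The sketch below compares each reported driver against the 375.82 target using GNU `sort -V`; `driver_at_least` is a hypothetical name.

```bash
# Hypothetical helper: reads CSV lines "name, driver_version" on stdin and
# succeeds only if at least one GPU is listed and every driver is >= $1.
driver_at_least() {
    local min=$1 status=0 count=0 name version lowest
    while IFS=, read -r name version; do
        count=$((count + 1))
        version=$(printf '%s' "$version" | tr -d ' ')
        # The lower of the two versions sorts first.
        lowest=$(printf '%s\n%s\n' "$version" "$min" | sort -V | head -n 1)
        if [ "$lowest" != "$min" ]; then
            echo "$name: driver $version is older than $min" >&2
            status=1
        fi
    done
    [ "$count" -gt 0 ] || status=1   # no GPU listed at all
    return "$status"
}

if command -v nvidia-smi >/dev/null 2>&1; then
    if nvidia-smi --query-gpu=name,driver_version --format=csv,noheader | driver_at_least 375.82; then
        echo "driver OK"
    fi
fi
```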

Let's run a simple "hello world" MNIST MLP in Keras/TensorFlow:

```bash
pip install tensorflow-gpu==1.2.1 keras==2.0.6
wget https://raw.githubusercontent.com/fchollet/keras/master/examples/mnist_mlp.py
python mnist_mlp.py
```

We should see that it uses the GPU and trains properly:

```
Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX 980 Ti, pci bus id: 0000:01:00.0)
```
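
To check for that line automatically, the run can be logged and searched afterwards. TensorFlow writes its device messages to stderr, so both streams are captured. The log is searched only after training finishes, because `grep -q` exiting early inside the pipeline could terminate the training run. A sketch; `tf_log_uses_gpu` and the `mnist_mlp.log` file name are hypothetical.

```bash
# Hypothetical helper: succeeds if a TensorFlow 1.x log read on stdin contains
# a "Creating TensorFlow device (/gpu:N)" line like the one shown above.
tf_log_uses_gpu() {
    grep -q 'Creating TensorFlow device (/gpu:[0-9]*)'
}

if [ -f mnist_mlp.py ]; then
    python mnist_mlp.py 2>&1 | tee mnist_mlp.log
    if tf_log_uses_gpu < mnist_mlp.log; then
        echo "TensorFlow created a GPU device"
    else
        echo "no GPU device line found, training probably ran on the CPU"
    fi
fi
```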

That's it. Happy training!