This is a collection of notes that explains how to configure linux for hardware-accelerated deep learning in R. The goal is to have linux drive the main display from an Intel integrated GPU, while reserving an Nvidia GPU strictly for computation using CUDA. It skips some of the more straightforward steps while detailing any "gotchas" I ran into.
- ElementaryOS Loki 0.4.1 (an Ubuntu 16.04 LTS derivative; so these notes apply to Ubuntu as well)
- Latest Nvidia drivers
- CUDA Toolkit
- cuDNN
- R/Rstudio
- R Keras (which includes tensorflow)
Before starting this whole process, check which CUDA versions are compatible with tensorflow and vice versa. Similarly, check which versions of cuDNN are compatible with the CUDA version that you're going to install. Here are all the versions that played nicely together for me:
- Nvidia Drivers 390.25
- CUDA 9.0
- cuDNN v7.0.5 (Dec 5, 2017), for CUDA 9.0
- tensorflow-gpu 1.6.0

Note that the latest version of CUDA (9.1) is NOT compatible with the latest version of tensorflow-gpu (1.6.0).
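Once everything below is installed, a quick sanity check like the following can confirm that the versions actually match. This is just a sketch; each command falls back to a message if the component isn't installed yet.

```shell
# Report each component's version; fall back to a message if it is missing.
nvidia-smi --query-gpu=driver_version --format=csv,noheader 2>/dev/null \
  || echo "nvidia driver not found"
nvcc --version 2>/dev/null | grep release || echo "CUDA toolkit not found"
python -c "import tensorflow as tf; print(tf.__version__)" 2>/dev/null \
  || echo "tensorflow not found"
```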
- Before installation, configure your BIOS to use the integrated GPU as the default display driver
- Connect display to integrated GPU HDMI (typically the motherboard HDMI, not external GPU HDMI)
- Install ElementaryOS
- Optional: Install Intel microcode drivers via apt-get (do not install any Nvidia drivers!)
We need the latest proprietary Nvidia drivers to use the CUDA libraries. The CUDA installer will offer to install drivers as its first step, but the bundled drivers are out of date and failed to install for me. We'll install the latest drivers separately first, after a few setup steps.
Download the latest Nvidia drivers.
By default, when linux sees an Nvidia GPU, it will load nouveau, an open source Nvidia driver maintained by the linux community. Nouveau will conflict with the proprietary Nvidia drivers and we must block linux from loading nouveau. Create the following file:
/etc/modprobe.d/blacklist-nouveau.conf
# inside blacklist-nouveau.conf
blacklist nouveau
options nouveau modeset=0
Load the new blacklist and reboot:
sudo update-initramfs -u
sudo reboot
Launch into text-only mode by pressing Ctrl-Alt-F1 at the login screen. The Nvidia driver installer will complain if we attempt to install the driver while the desktop environment is active. Disable your desktop environment:
sudo service lightdm stop
If you manually install drivers and the linux kernel gets updated, those drivers will need to be recompiled. We can tell linux to recompile the drivers automatically with dkms. This is generally a good idea and will keep the drivers from breaking when minor updates are made to the kernel. Install dkms:
sudo apt-get install dkms
Lastly, install the driver:
cd ~/Downloads/
chmod +x NVIDIA-Linux-x86_64-390.25.run
sudo ./NVIDIA-Linux-x86_64-390.25.run --dkms --no-opengl-files
The --no-opengl-files flag skips installation of Nvidia's OpenGL libraries, so X-server will never run on the Nvidia GPU. The --dkms flag enables automatic driver recompilation as described above. The installer will ask several questions. Ignore any warnings about an unsupported system (the installer won't recognize ElementaryOS, for example). Ignore the question about 32-bit compatibility libraries (everything is 64-bit these days). Say NO when the installer asks to run nvidia-xconfig. We do not want the installer attempting to configure X-server to run on the GPU.
If the installation was successful, reboot, open a terminal, and run nvidia-smi. You should see something like this:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 390.25 Driver Version: 390.25 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 105... Off | 00000000:01:00.0 Off | N/A |
| 30% 27C P0 N/A / 75W | 0MiB / 4040MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
If you see No running processes found, you've set things up correctly. Again, we don't want to see X-server running (or consuming memory) here.
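If you want to check this from a script rather than by eyeballing the table, nvidia-smi's query flags can list compute processes directly. A minimal sketch (the `--query-compute-apps` flags are standard nvidia-smi options):

```shell
# Query the list of compute processes directly; empty output means the GPU
# is idle (no X-server, no stray compute jobs).
apps=$(nvidia-smi --query-compute-apps=pid --format=csv,noheader 2>/dev/null)
if [ -z "$apps" ]; then
  echo "GPU is idle"
else
  echo "GPU in use by PIDs: $apps"
fi
```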
The CUDA toolkit is really just a collection of libraries. There's no reason to install these via apt-get or a deb package. Just download the runfile from Nvidia's website.
Execute the CUDA toolkit runfile installer. When the installer asks you to install drivers say NO, otherwise you'll overwrite the drivers we just installed above.
cd ~/Downloads/
chmod +x cuda_9.0.176_384.81_linux.run
sudo ./cuda_9.0.176_384.81_linux.run
On Ubuntu, library paths must be specified at the system level using ld.so.conf.d. Attempting to set these paths in .profile or .bashrc will work for the command line, but will fail in IDEs like Rstudio. Create the following file and add two lines specifying the library paths:
/etc/ld.so.conf.d/cuda.conf
/usr/local/cuda/lib64
/usr/local/cuda/extras/CUPTI/lib64
Then re-link the libraries:
sudo ldconfig
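You can verify that the loader picked up the new paths by searching its cache. If the grep below prints nothing, the paths in cuda.conf are wrong:

```shell
# The loader cache should now list the CUDA runtime and CUPTI libraries;
# empty output means the paths in cuda.conf are wrong.
ldconfig -p 2>/dev/null | grep -E 'libcudart|libcupti' \
  || echo "CUDA libraries not in the loader cache"
```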
Now update your PATH and add some CUDA-specific environment variables to .profile. These definitions must go in .profile (not .bashrc) for an R session in Rstudio to load them correctly.
~/.profile
export PATH="/usr/local/cuda/bin:$PATH"
export CUDA_HOME=/usr/local/cuda
export CUDA_VISIBLE_DEVICES=0 # Only use GPU 0, this will differ for multi-GPU setups
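Because .profile is only read by login shells, a new terminal tab won't necessarily pick these up. One way to test without logging out is to start a login shell explicitly:

```shell
# .profile is read by login shells only, so test with `bash -l` rather than
# a plain subshell. After logging out and back in, both should resolve:
bash -lc 'echo "CUDA_HOME=$CUDA_HOME"; command -v nvcc || echo "nvcc not on PATH"'
```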
Run the following to compile a simple CUDA utility:
cd ~/NVIDIA_CUDA-9.0_Samples/1_Utilities/bandwidthTest/
make
./bandwidthTest
If you see "PASS", you've successfully configured and installed CUDA.
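The deviceQuery sample is also worth running: it prints the GPU's compute capability and confirms the runtime can talk to the driver. The path below assumes the default samples location chosen by the CUDA installer:

```shell
# deviceQuery reports the GPU's compute capability; "Result = PASS" at the
# end means the CUDA runtime can talk to the driver.
cd ~/NVIDIA_CUDA-9.0_Samples/1_Utilities/deviceQuery/ 2>/dev/null \
  && make && ./deviceQuery \
  || echo "CUDA samples not found"
```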
cuDNN is an additional set of CUDA libraries specifically for deep neural networks. These libraries are a dependency for tensorflow, so we'll have to install them next.
tar xzvf cudnn-9.0-linux-x64-v7.tgz
sudo cp -P cuda/include/cudnn.h /usr/local/cuda/include
sudo cp -P cuda/lib64/libcudnn* /usr/local/cuda/lib64
sudo chmod a+r /usr/local/cuda/include/cudnn.h /usr/local/cuda/lib64/libcudnn*
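To confirm the copied header reports the cuDNN release you expect, check the version macros defined near the top of cudnn.h (v7.0.5 would show MAJOR 7, MINOR 0, PATCHLEVEL 5):

```shell
# The CUDNN_MAJOR/MINOR/PATCHLEVEL macros in cudnn.h record the installed
# cuDNN version.
grep -A 2 'define CUDNN_MAJOR' /usr/local/cuda/include/cudnn.h 2>/dev/null \
  || echo "cudnn.h not found"
```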
The version of R in the main Ubuntu repositories is ancient. Add the CRAN repository and its signing key, then install the latest version of R:
sudo add-apt-repository "deb https://cloud.r-project.org/bin/linux/ubuntu xenial/"
sudo apt-key adv --keyserver keyserver.ubuntu.com --recv-keys E084DAB9
sudo apt-get update
sudo apt-get install r-base
Download the latest Rstudio *.deb package and install it:
sudo apt install gdebi-core
sudo gdebi rstudio-xenial-1.1.423-amd64.deb
By default, Keras will attempt to install tensorflow in a python virtual environment. Make sure that you have pip and virtualenv installed:
sudo apt-get install python-pip python-dev python-virtualenv
Next, open up Rstudio, and in the R console run the following:
install.packages("keras")
library(keras)
install_keras(tensorflow = "gpu")
Assuming you've configured your paths correctly, you should now have Keras with the GPU-accelerated tensorflow backend installed.
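As a final check, you can ask tensorflow whether it can see the GPU. `tf$test$is_gpu_available()` mirrors the `tf.test.is_gpu_available()` call in tensorflow's 1.x Python API; running it from a terminal (rather than Rstudio) surfaces any CUDA library loading errors in full:

```shell
# Prints TRUE when the tensorflow backend can see the GPU; run from a
# terminal so CUDA loading errors are printed rather than swallowed.
Rscript -e 'library(tensorflow); print(tf$test$is_gpu_available())' 2>/dev/null \
  || echo "R tensorflow not installed or GPU not visible"
```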
Problems I encountered and some workarounds.
If your R session in Rstudio unexpectedly crashes when running keras/tensorflow, open up a terminal and run the same commands in R from the terminal. R tensorflow should give you an informative error message when it crashes. In my case, I had installed a version of the cuDNN libraries not compatible with the version of tensorflow.
I initially tried to install the (somewhat older) drivers bundled with the CUDA installer. This failed no matter what I tried. Instead, I installed the latest drivers from Nvidia separately. This worked fine. I still have no idea why.
These blog posts were really helpful: