@jeremybower
Last active May 8, 2018 19:14

April 30, 2018

Setup a Server with GPU for Deep Learning

Setup steps for Keras backed by TensorFlow and Jupyter Notebooks on a server with GPU.

Hardware and OS

I'm using Ubuntu 16.04 on a server with 2 CPUs and 1 GPU. On GCP, an n1-standard-2 instance with an NVIDIA K80 GPU fits the bill. With only 1 CPU, the CPU gets maxed out and starves the GPU; a second CPU handles the additional load and keeps the GPU fully utilized.

If you're using GCP, be sure to check the box not to delete the boot disk when you create the instance. That way, you can shut down the instance (so you're not paying for it) while the setup persists, and restart it later. A boot disk of at least 20GB is required; add to that minimum depending on how much data you need locally. On GCP, you can increase the size of the disk at any point, so there's no need to over-provision it.
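If you do grow the disk later from the GCP console, the new space can be claimed on the running instance without a reboot. A minimal sketch, assuming the root filesystem is on /dev/sda1 (matching the df output below) and that growpart from cloud-guest-utils is available:

```shell
# Grow the root partition and filesystem after resizing the disk in GCP.
# Skipped safely when the expected device or tool is absent.
if [ -b /dev/sda1 ] && command -v growpart >/dev/null 2>&1; then
  sudo growpart /dev/sda 1   # extend partition 1 to fill the disk
  sudo resize2fs /dev/sda1   # grow the ext4 filesystem to match
fi
df -h /                      # confirm the size either way
```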

Here's how disk space looks after this setup:

$ df -h
Filesystem      Size  Used Avail Use% Mounted on
udev            3.7G     0  3.7G   0% /dev
tmpfs           748M  9.0M  739M   2% /run
/dev/sda1        20G   18G  2.0G  90% /
tmpfs           3.7G  108K  3.7G   1% /dev/shm
tmpfs           5.0M     0  5.0M   0% /run/lock
tmpfs           3.7G     0  3.7G   0% /sys/fs/cgroup
tmpfs           748M   28K  748M   1% /run/user/113
tmpfs           748M     0  748M   0% /run/user/1001

Setup CUDA

Install CUDA driver and toolkit:

$ wget http://developer.download.nvidia.com/compute/cuda/repos/ubuntu1604/x86_64/cuda-repo-ubuntu1604_9.1.85-1_amd64.deb
$ sudo dpkg -i cuda-repo-ubuntu1604_9.1.85-1_amd64.deb
$ sudo apt-get update
$ sudo apt-get upgrade
$ sudo apt-get install cuda -y
$ rm cuda-repo-ubuntu1604_9.1.85-1_amd64.deb

Reboot the server.

$ sudo reboot

Verify that the NVIDIA graphics driver can be loaded:

$ nvidia-smi

You should see something like this:

Wed Apr 25 21:22:51 2018       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 390.30                 Driver Version: 390.30                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           Off  | 00000000:00:04.0 Off |                    0 |
| N/A   31C    P8    30W / 149W |     16MiB / 11441MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      1432      G   /usr/lib/xorg/Xorg                            15MiB |
+-----------------------------------------------------------------------------+

Set environment variables:

$ echo 'export CUDA_HOME=/usr/local/cuda' >> ~/.bashrc
$ echo 'export PATH=$PATH:$CUDA_HOME/bin' >> ~/.bashrc
$ echo 'export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$CUDA_HOME/lib64' >> ~/.bashrc
$ source ~/.bashrc
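To confirm the toolkit is actually on the PATH after reloading the shell, nvcc should report the CUDA release. A quick check (the fallback message is just a sketch):

```shell
# Verify the CUDA compiler is reachable via the exports above.
if command -v nvcc >/dev/null 2>&1; then
  nvcc --version
else
  echo "nvcc not found; check CUDA_HOME and PATH"
fi
```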

Install cuDNN

You need to download the cuDNN installer directly from NVIDIA. First, register as a developer: https://developer.nvidia.com/cudnn

Then, download cudnn-9.0-linux-x64-v7.1.tgz (or the build matching your installed CUDA version) and upload it to your server:

$ scp -i ~/.ssh/your_identity cudnn-9.0-linux-x64-v7.1.tgz <external-IP-of-GPU-instance>:

Install:

$ tar xzvf cudnn-9.0-linux-x64-v7.1.tgz
$ sudo cp -P cuda/lib64/* /usr/local/cuda/lib64/
$ sudo cp cuda/include/cudnn.h /usr/local/cuda/include/
$ sudo chmod a+r /usr/local/cuda/include/cudnn.h /usr/local/cuda/lib64/libcudnn*
$ rm -rf ~/cuda
$ rm cudnn-9.0-linux-x64-v7.1.tgz
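Before moving on, it's worth a quick sanity check that the header landed where TensorFlow expects it. This sketch assumes the /usr/local/cuda path used in the copy step above:

```shell
# Confirm the copied header is in place and report its version macros.
HDR=/usr/local/cuda/include/cudnn.h
if [ -f "$HDR" ]; then
  grep -m1 -A2 '#define CUDNN_MAJOR' "$HDR"
else
  echo "cudnn.h not found at $HDR"
fi
```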

At this point, all the NVIDIA/CUDA setup is complete.

Install Anaconda

$ wget https://repo.anaconda.com/archive/Anaconda3-5.1.0-Linux-x86_64.sh
$ chmod u+x Anaconda3-5.1.0-Linux-x86_64.sh
$ ./Anaconda3-5.1.0-Linux-x86_64.sh

Follow the prompts to accept the license and add Anaconda to your path. Then reload your shell and update conda:

$ source ~/.bashrc
$ conda update -n base conda

Now, create an environment in which to install Keras and TensorFlow (replace <myenv> below with a name of your choice):

$ conda create --name <myenv> python=3.6
$ source activate <myenv>
$ conda install jupyter
$ conda install keras-gpu tensorflow-gpu
$ conda install -c pytorch pytorch torchvision cuda91
$ conda install matplotlib pillow pandas scipy scikit-learn opencv
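Since the PyTorch packages were installed alongside Keras, you can optionally confirm that PyTorch sees the GPU too. A minimal sketch (the helper name is just for illustration; on a CPU-only machine it reports False rather than failing):

```python
import importlib.util

def gpu_report() -> str:
    """Return a one-line report of PyTorch's view of the GPU."""
    if importlib.util.find_spec("torch") is None:
        return "PyTorch not installed in this environment"
    import torch  # imported lazily so the check degrades gracefully
    return f"PyTorch CUDA available: {torch.cuda.is_available()}"

print(gpu_report())
```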

Check that TensorFlow is properly installed and using the GPU:

$ python

Enter this Python code:

from keras import backend as K
K.tensorflow_backend._get_available_gpus()

You should see this as the last line:

['/job:localhost/replica:0/task:0/device:GPU:0']

You will probably see some warnings and informational log messages. Here is the full output:

$ python
Python 3.6.5 |Anaconda, Inc.| (default, Mar 29 2018, 18:21:58) 
[GCC 7.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from keras import backend as K
/home/jeremy/anaconda3/envs/dlnd/lib/python3.6/site-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
  from ._conv import register_converters as _register_converters
Using TensorFlow backend.
>>> K.tensorflow_backend._get_available_gpus()
2018-04-25 23:20:26.784621: I tensorflow/core/platform/cpu_feature_guard.cc:140] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
2018-04-25 23:20:27.283329: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:898] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2018-04-25 23:20:27.283676: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1344] Found device 0 with properties: 
name: Tesla K80 major: 3 minor: 7 memoryClockRate(GHz): 0.8235
pciBusID: 0000:00:04.0
totalMemory: 11.17GiB freeMemory: 11.09GiB
2018-04-25 23:20:27.283705: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1423] Adding visible gpu devices: 0
2018-04-25 23:20:27.564429: I tensorflow/core/common_runtime/gpu/gpu_device.cc:911] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-04-25 23:20:27.564514: I tensorflow/core/common_runtime/gpu/gpu_device.cc:917]      0 
2018-04-25 23:20:27.564525: I tensorflow/core/common_runtime/gpu/gpu_device.cc:930] 0:   N 
2018-04-25 23:20:27.564800: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1041] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 10749 MB memory) -> physical GPU (device: 0, name: Tesla K80, pci bus id: 0000:00:04.0, compute capability: 3.7)
['/job:localhost/replica:0/task:0/device:GPU:0']

Setup Jupyter Notebooks

Generate the config:

$ jupyter notebook --generate-config

It should print out the location of the config file. Use your favorite editor to add the following to the top of the file:

c = get_config()  # get the config object
c.NotebookApp.open_browser = False  # do not open a browser window
c.NotebookApp.token = '' # we always use jupyter over an ssh tunnel
c.NotebookApp.port = 8888 # default port
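Since the token is disabled, it doesn't hurt to be explicit that the server only listens locally. `ip` is a standard NotebookApp option that already defaults to localhost, so this line just documents the intent:

```
c.NotebookApp.ip = 'localhost'  # only reachable through the SSH tunnel
```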

Start the server

Disconnect from your server and then start a new SSH session with tunnelling for the port:

$ ssh -i ~/.ssh/<identity> -L 8888:localhost:8888 <public server IP>
$ source activate <myenv>
$ jupyter notebook

Now load http://localhost:8888 in your local browser and you're done! 🙌
