Clean Python Deep Learning GPU setup with TensorFlow 2.X.X & PyTorch 1.X and GPU installation instructions for Ubuntu 20.04 - CUDA 11.X

Instructions

I. Clean Python setup from scratch. (~1h) Skip if you already have a python environment setup or want to use your own python virtualenv setup

0. Pre-install (skip if already done)

sudo apt-get update;
sudo apt-get install python3-pip python3-dev
sudo apt-get install make build-essential libssl-dev zlib1g-dev \
libbz2-dev libreadline-dev libsqlite3-dev wget curl llvm \
libncursesw5-dev xz-utils tk-dev libxml2-dev libxmlsec1-dev libffi-dev liblzma-dev

pip install --upgrade pip

1. pyenv to manage python version and virtualenv easily

curl https://pyenv.run | bash

Add it to ~/.bashrc

#pyenv
export PYENV_ROOT="$HOME/.pyenv"
export PATH="/home/$USER/.pyenv/bin:$PATH"
eval "$(pyenv init -)"
eval "$(pyenv virtualenv-init -)"

And add this to ~/.profile

export PYENV_ROOT="$HOME/.pyenv"
export PATH="$PYENV_ROOT/bin:$PATH"
eval "$(pyenv init --path)"

Reload session: source ~/.bashrc # reload bash session

Install latest python and make it default:

  • pyenv install 3.9.5
  • pyenv global 3.9.5

python -V && which python should return:

Python 3.9.5
/home/$USER/.pyenv/shims/python
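
The same check can be done from inside Python. This is only a minimal sketch: is_under_pyenv is a hypothetical helper name, and it assumes pyenv lives in $PYENV_ROOT (falling back to ~/.pyenv).

```python
import os
import sys
from pathlib import Path


def is_under_pyenv(executable, pyenv_root):
    """Return True if `executable` lives inside the `pyenv_root` directory tree."""
    exe = Path(executable).resolve()
    root = Path(pyenv_root).resolve()
    return root == exe or root in exe.parents


if __name__ == "__main__":
    # Assumption: pyenv is installed in $PYENV_ROOT, falling back to ~/.pyenv.
    root = os.environ.get("PYENV_ROOT", str(Path.home() / ".pyenv"))
    print("Python", sys.version.split()[0], "at", sys.executable)
    print("Managed by pyenv:", is_under_pyenv(sys.executable, root))
```

If the last line prints False, the shell is probably still picking up the system interpreter, which usually means the ~/.bashrc changes have not been reloaded.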

2. pipx: Install and Run Python Applications in Isolated Environments without ruining your global environment

python -m pip install --user pipx

If pipx is not found (not in $PATH) then run:

python -m pipx ensurepath

Now use pipx instead of pip to install/run standalone Python applications and git repos (as opposed to library packages that you import).

-> Avoid installing packages globally: there is a high chance of breaking everything on updates/installs.

Install jupyter notebook system-wide

  • pipx install notebook
  • pipx install jupyter --include-deps
  • pipx install jupyterlab

To make your future pyenv-virtualenv available with jupyter, use pyenv-jupyter-kernel plugin:

git clone https://github.com/aiguofer/pyenv-jupyter-kernel $(pyenv root)/plugins/pyenv-jupyter-kernel

3. Poetry system-wide for package management/update dependencies

pipx install poetry

Verify install: pipx list and which jupyter-lab
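
To check all the tools from part I in one go, something like the following could be used. It is a sketch only: missing_tools is a hypothetical helper, and the tool list is an assumption based on what was installed above.

```python
import shutil


def missing_tools(names):
    """Return the subset of `names` that cannot be found on PATH."""
    return [name for name in names if shutil.which(name) is None]


if __name__ == "__main__":
    # Assumed executable names for the tools installed in part I.
    tools = ["pyenv", "pipx", "poetry", "jupyter-lab"]
    missing = missing_tools(tools)
    print("All tools found" if not missing else "Missing: " + ", ".join(missing))
```

A tool reported as missing right after installation is often just a $PATH problem, in which case python -m pipx ensurepath and a new shell session may be enough.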

II. Install CUDA, NVIDIA drivers, libcudnn (/!\ Updated installation instructions are always at https://www.tensorflow.org/install/gpu )

0. Verify install (Skip to TensorFlow/PyTorch install if ok)

  1. Officially tested and compatible GPU configurations for each TensorFlow and CUDA/cuDNN version can be found at this table. Please adapt the following instructions to this table, as it contains the latest working configurations.

  2. Check NVIDIA driver installation (>450.80.02 or your current version)

nvidia-smi should print GPU info (the CUDA version it prints is the highest version the driver supports, not the version of the installed toolkit)

  3. Check CUDA install: nvcc -V

should print:

Cuda compilation tools, release 11.2, V11.2.152
Build cuda_11.2.r11.2/compiler.29618528_0

or log about your current cuda version

NOW: Install EACH individual (eventual) missing packages from this step 0. skip otherwise
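
The nvcc check can also be scripted, for example to compare the installed release with the one your TensorFlow version expects. A minimal sketch, with hypothetical function names; it only assumes that nvcc -V prints a "release X.Y" line like the one shown above.

```python
import re
import shutil
import subprocess


def parse_nvcc_release(output):
    """Extract the 'release X.Y' version from `nvcc -V` output, or None."""
    match = re.search(r"release (\d+\.\d+)", output)
    return match.group(1) if match else None


def installed_cuda_release():
    """Run `nvcc -V` if nvcc is on PATH and return its release string, else None."""
    if shutil.which("nvcc") is None:
        return None
    result = subprocess.run(["nvcc", "-V"], capture_output=True, text=True)
    return parse_nvcc_release(result.stdout)


if __name__ == "__main__":
    print("nvidia-smi on PATH:", shutil.which("nvidia-smi") is not None)
    print("CUDA toolkit release:", installed_cuda_release())
```

A result of None for the toolkit release means nvcc is not on $PATH, so the CUDA install and ~/.bashrc steps below still apply.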

1. Install NVIDIA package repositories for Ubuntu 20.04 and CUDA 11.2

wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/cuda-ubuntu2004.pin &&
sudo mv cuda-ubuntu2004.pin /etc/apt/preferences.d/cuda-repository-pin-600 &&
sudo apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/7fa2af80.pub &&
sudo add-apt-repository "deb http://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/ /" ;
sudo apt-get update;

If you notice problems with GPG keys when running above commands, try this: (from https://github.com/NVIDIA/nvidia-docker/issues/1632#issuecomment-1112770026 and https://github.com/NVIDIA/nvidia-docker/issues/1632#issuecomment-1125739652)

sudo apt-key del 7fa2af80
sudo apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/3bf863cc.pub
sudo apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu2004/x86_64/7fa2af80.pub
sudo add-apt-repository "deb http://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/ /" ;
sudo apt-get update;
wget http://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu2004/x86_64/nvidia-machine-learning-repo-ubuntu2004_1.0.0-1_amd64.deb ;
sudo dpkg -i nvidia-machine-learning-repo-ubuntu2004_1.0.0-1_amd64.deb ;
sudo apt-get update;

Note: Latest links/packages can be found in the official NVIDIA repos using ctrl+F at Cuda Ubuntu 20.04 repos and Nvidia ML repo Ubuntu 20.04

2. Install NVIDIA drivers: /!\ Skip if you already have NVIDIA drivers installed

  1. sudo ubuntu-drivers devices should return a list of compatible/recommended drivers (e.g. driver : nvidia-driver-510 - third-party free recommended)
    • If a driver version ends with -open (e.g. nvidia-driver-525-open), DO NOT install it (it seems to cause issues matching CUDA version dependencies). Pick another driver version XXX for which no nvidia-driver-XXX-open is listed.
    • Otherwise, pick the version marked as recommended.
  2. sudo apt-get install nvidia-driver-{#RECOMMENDED-VERSION-NUMBER}

    • If you encounter package issues/conflicts then try to resolve them with aptitude instead of apt-get:
      1. sudo apt-get install aptitude
      2. sudo aptitude install -f nvidia-driver-{#RECOMMENDED-VERSION-NUMBER}
      3. Try to figure out which solution would resolve the conflicts/dependencies (could be old driver versions, a previous cuda install, ...)
      4. sudo apt-get install nvidia-driver-{#RECOMMENDED-VERSION-NUMBER}
  3. sudo reboot

Continue if nvidia-smi returns a valid output
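
To check the driver version against the minimum mentioned in step 0, a sketch like this may help. The function names are hypothetical, and treating 450.80.02 itself as acceptable is an assumption; the script relies on nvidia-smi's --query-gpu=driver_version option.

```python
import shutil
import subprocess

MIN_DRIVER = (450, 80, 2)  # minimum driver version quoted in step 0


def version_tuple(text):
    """Turn '510.47.03' into (510, 47, 3) so versions compare numerically."""
    return tuple(int(part) for part in text.strip().split("."))


def driver_ok(version_text, minimum=MIN_DRIVER):
    """Assumption: the minimum version itself counts as acceptable."""
    return version_tuple(version_text) >= minimum


def query_driver_version():
    """Ask nvidia-smi for the driver version of the first GPU, or return None."""
    if shutil.which("nvidia-smi") is None:
        return None
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
        capture_output=True, text=True,
    )
    lines = result.stdout.strip().splitlines()
    return lines[0].strip() if result.returncode == 0 and lines else None


if __name__ == "__main__":
    version = query_driver_version()
    if version is None:
        print("nvidia-smi is missing or failed; the driver is probably not installed")
    else:
        print(version, "OK" if driver_ok(version) else "too old")
```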

3. Install CUDA 11.2 and libcudnn 8.1.0 for CUDA 11.2

To get the latest/appropriate cuDNN version, browse the Cuda Ubuntu 20.04 repos, search (ctrl+F) for libcudnn8_*.deb and libcudnn8-dev_*.deb, then copy the URLs of these two .deb files, download them and install them.

sudo apt-get install --no-install-recommends \
    cuda-11-2;
sudo apt-get autoremove;
cd ~/Downloads &&
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/libcudnn8_8.1.0.77-1+cuda11.2_amd64.deb ;
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/libcudnn8-dev_8.1.0.77-1+cuda11.2_amd64.deb ;
sudo dpkg -i libcudnn8_8.1.0.77-1+cuda11.2_amd64.deb ;
sudo dpkg -i libcudnn8-dev_8.1.0.77-1+cuda11.2_amd64.deb

Note: If apt-get install cuda-11-2 fails then try either:

  • sudo aptitude install cuda-11-2 then try to solve dependency issues.
  • sudo apt-get install cuda-toolkit-11-2, which installs cuda in /usr/local/cuda/bin/, then install the 2 other required packages (the libcudnn8 .deb files above).

Add this to your ~/.bashrc: From docs.nvidia.com

# NVIDIA CUDA 11.x
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda-11/lib64
export CUDA_HOME=/usr/local/cuda-11/
export PATH="/usr/local/cuda-11/bin:$PATH"

source ~/.bashrc # Reload session

Continue if nvcc -V returns a valid output.

sudo reboot

Prevent NVIDIA/CUDA from upgrading:

Source: https://chrisalbon.com/code/deep_learning/setup/prevent_nvidia_drivers_from_upgrading/

Since TensorFlow/PyTorch must match one specific version of CUDA (e.g. 11.0 != 11.1), we must freeze the NVIDIA driver and cuDNN packages using apt so that they are not upgraded:

sudo apt-mark hold libcudnn8 libcudnn8-dev # Prevent package updates / freeze versions
dpkg-query -W --showformat='${Package} ${Status}\n' | grep -v deinstall | awk '{ print $1 }' | \
    grep -E 'nvidia.*-[0-9]+$' | \
    xargs -r -L 1 sudo apt-mark hold

To unfreeze:

sudo apt-mark unhold <package-name>
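
To see which packages the grep -E 'nvidia.*-[0-9]+$' filter above selects, here is the same pattern applied in Python. This is an illustrative sketch: the package names in the example are assumptions, not the output of a real system.

```python
import re

# Same pattern as the grep -E call above: "nvidia", anything, then "-<digits>" at the end.
DRIVER_PACKAGE = re.compile(r"nvidia.*-[0-9]+$")


def packages_to_hold(installed):
    """Return the package names that the hold pipeline would pass to apt-mark."""
    return [name for name in installed if DRIVER_PACKAGE.search(name)]


if __name__ == "__main__":
    # Hypothetical package list, for illustration only.
    example = [
        "nvidia-driver-510",
        "libnvidia-gl-510",
        "nvidia-settings",
        "cuda-11-2",
        "nvidia-driver-525-open",
    ]
    print(packages_to_hold(example))
```

Only names ending in a version number are matched, so version-less packages such as nvidia-settings are left free to update, and cuda-* packages are not covered by this pattern at all.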

III. Install TensorFlow and PyTorch

1. Create/activate virtual env

/!\ Please don't install tensorflow globally with pip/pipx...

If you use pyenv & jupyter and already have created virtualenv, you can register all of your pyenv-virtualenv in jupyter with: pyenv versions --bare | grep -v "/" | xargs -L 1 pyenv register-kernel

Create a virtualenv from version 3.9.5: pyenv virtualenv 3.9.5 mygputest (or just pyenv virtualenv mygputest if 3.9.5 is your global python version)

  • pyenv virtualenvs # list all virtualenvs
  • pyenv activate mygputest

Deactivating: pyenv deactivate

2. Install TensorFlow

With your virtualenv activated: python -m pip install tensorflow (should install 2.5.X or the current version)

3. Install PyTorch 1.10.2, PyTorch Lightning and Lightning Flash. /!\ Latest installation instructions are always at https://pytorch.org/get-started/locally/ and the pytorch repo list is at https://download.pytorch.org/whl/torch_stable.html

python -m pip install torch==1.10.2+cu111 torchaudio==0.10.2+cu111 torchvision==0.11.3+cu111 -f https://download.pytorch.org/whl/torch_stable.html
python -m pip install pytorch-lightning lightning-flash
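
The +cu111 suffix in the pins above encodes the CUDA version the wheels were built for, and all three packages should carry the same one. A small sketch for reading it; cuda_build_tag is a hypothetical helper, and it assumes the last digit of the tag is the minor version.

```python
import re


def cuda_build_tag(version):
    """Return the CUDA version encoded in a wheel's local tag, e.g. '+cu111' -> '11.1'."""
    match = re.search(r"\+cu(\d+)$", version)
    if match is None:
        return None  # CPU-only build, or no local version tag at all
    digits = match.group(1)
    # Assumption: the last digit is the minor version, the rest is the major version.
    return f"{digits[:-1]}.{digits[-1]}"


if __name__ == "__main__":
    pins = {"torch": "1.10.2+cu111", "torchaudio": "0.10.2+cu111", "torchvision": "0.11.3+cu111"}
    tags = {name: cuda_build_tag(version) for name, version in pins.items()}
    print(tags)
    print("Consistent:", len(set(tags.values())) == 1)
```

After installation, the same function can be applied to torch.__version__ to confirm that a CUDA build (and not a CPU-only build) was installed.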

4. Useful ML/DL packages:

  • Common:

    • python -m pip install scikit-learn pandas matplotlib seaborn bokeh
    • python -m pip install botorch # bayesian optimization on pytorch
    • python -m pip install opencv-python
  • Audio:

    • pyaudio: sudo apt-get install libjack-jackd2-dev portaudio19-dev then python -m pip install pyaudio
  • Meta-opt:

    • python -m pip install keras-tuner

IV. Verify deep learning setup on GPU:

0. Monitor GPU usage:

You may keep this running in a side terminal: watch -d -n 2 nvidia-smi # highlights changes, refreshes every 2 s
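
If you would rather log GPU usage from a script than watch it in a terminal, nvidia-smi can print selected fields as CSV. A minimal sketch with hypothetical function names; it assumes every queried field comes back as a plain number and skips lines where that is not the case.

```python
import shutil
import subprocess

QUERY = "utilization.gpu,memory.used,memory.total"


def parse_gpu_line(line):
    """Parse one CSV line such as '35, 1024, 8192' (percent, MiB, MiB)."""
    util, used, total = (int(field.strip()) for field in line.split(","))
    return {"util_percent": util, "mem_used_mib": used, "mem_total_mib": total}


def sample_gpus():
    """Return one dict per GPU, or an empty list if nvidia-smi is unavailable."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
        capture_output=True, text=True,
    )
    if result.returncode != 0:
        return []
    samples = []
    for line in result.stdout.splitlines():
        try:
            samples.append(parse_gpu_line(line))
        except ValueError:
            continue  # blank line or a field reported as not available
    return samples


if __name__ == "__main__":
    for index, gpu in enumerate(sample_gpus()):
        print(f"GPU {index}: {gpu}")
```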

1. TensorFlow:

With your virtualenv activated: python -c "import tensorflow as tf;print(tf.__version__); print(tf.config.list_physical_devices('GPU'))"

Should return:

  • current tensorflow version
  • last line should be: [PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]

2. PyTorch

With your virtualenv activated: python -c 'import torch; print(torch.rand(2,3).cuda())'

It should return a random tensor with device cuda:0 such as: tensor([[0.2551, 0.1373, 0.3072],[0.9524, 0.2616, 0.5635]], device='cuda:0')

V. Train your first deep learning model on GPU:

TensorFlow 2.X:

Official TensorFlow Keras MNIST tutorial | Official TensorFlow advanced MNIST tutorial

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
import numpy as np

# prepare data
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0  # scale uint8 pixels to [0, 1]
x_train = np.expand_dims(x_train, -1)
x_test = np.expand_dims(x_test, -1)

# create model

model = keras.Sequential(
    [
        keras.Input(shape=(28, 28, 1)),
        layers.Conv2D(32, kernel_size=(3, 3), activation="relu"),
        layers.MaxPooling2D(pool_size=(2, 2)),
        layers.Conv2D(64, kernel_size=(3, 3), activation="relu"),
        layers.MaxPooling2D(pool_size=(2, 2)),
        layers.Flatten(),
        layers.Dropout(0.5),
        layers.Dense(10, activation="softmax"),
    ]
)

# train and test model
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
model.fit(x_train, y_train, epochs=10, validation_data=(x_test, y_test))
model.evaluate(x_test, y_test)

You should expect ~98% test accuracy.

PyTorch (Lightning Flash):

import flash
from torch import nn, optim
from torch.utils.data import DataLoader, random_split, Subset
from torchvision import transforms, datasets


# model
model = nn.Sequential(
    nn.Conv2d(1, 32, kernel_size=3, stride=1),
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=2, stride=2),
    nn.Conv2d(32, 64, kernel_size=3, stride=1),
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=2, stride=2),

    nn.Flatten(),
    nn.Dropout(0.5),
    nn.Linear(5 * 5 * 64, 10)
)

# data
#dataset = datasets.MNIST('./data_folder', download=True, transform=transforms.ToTensor())
tr = datasets.MNIST('./data_folder', train=True, download=True, transform=transforms.ToTensor())
te = datasets.MNIST('./data_folder', train=False, transform=transforms.ToTensor())

part_tr = random_split(tr, [1875, len(tr)-1875])[0]
part_te = random_split(te, [313, len(te)-313])[0]

# task
classifier = flash.Task(model, loss_fn=nn.functional.cross_entropy, optimizer=optim.Adam)

# train
flash.Trainer(max_epochs=10, accelerator='gpu', devices=1).fit(classifier, DataLoader(part_tr, num_workers=32), DataLoader(part_te, num_workers=32))
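
The 5 * 5 * 64 input size of the final nn.Linear layer follows from how each layer shrinks the 28x28 MNIST image. A short sketch of that arithmetic, using hypothetical helper names and the standard output-size formula for convolutions and pooling without padding:

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Output size along one spatial dimension of a conv or pooling layer."""
    return (size + 2 * padding - kernel) // stride + 1


def flattened_features(size=28, channels=64):
    """Follow the layer stack of the model above and return the Flatten() output size."""
    size = conv_out(size, kernel=3)            # Conv2d 3x3, stride 1: 28 -> 26
    size = conv_out(size, kernel=2, stride=2)  # MaxPool2d 2x2:        26 -> 13
    size = conv_out(size, kernel=3)            # Conv2d 3x3, stride 1: 13 -> 11
    size = conv_out(size, kernel=2, stride=2)  # MaxPool2d 2x2:        11 -> 5
    return size * size * channels


if __name__ == "__main__":
    print(flattened_features())  # 5 * 5 * 64 = 1600
```

If you change a kernel size, stride or the input resolution, recompute this value and update nn.Linear accordingly, otherwise the model fails with a shape mismatch.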

Optional: Run remote Jupyter server in local browser via SSH Tunneling

Suppose you will run jupyter in port 8888 (server) and forward it to your own (local) port 8888 (Reference command is: ssh -L $client_port:localhost:$server_port login@remote_server)

  1. Connect to your server via ssh: ssh -L 8888:localhost:8888 your_login@remote_server

  2. Start the jupyter server on remote server: jupyter-lab # (by default on port 8888)

  3. Then just copy paste the prompted url in your local browser (e.g.: http://localhost:8888/?token=2b58c8deb1cb467c6b0491504c0e0a1593cd7923af077606).

  4. Finally, in the jupyter lab browser window, create a new notebook with your selected virtualenv kernel.
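
If you use different ports often, the reference command can be generated instead of typed. A minimal sketch; tunnel_command is a hypothetical helper, and the login and host names are the same placeholders used above.

```python
import shlex


def tunnel_command(login, host, client_port=8888, server_port=8888):
    """Build the ssh -L command as an argument list (suitable for subprocess.run)."""
    return ["ssh", "-L", f"{client_port}:localhost:{server_port}", f"{login}@{host}"]


if __name__ == "__main__":
    # Placeholder login/host: replace with your own before running the command.
    print(shlex.join(tunnel_command("your_login", "remote_server")))
    print(shlex.join(tunnel_command("your_login", "remote_server", client_port=9999)))
```

With client_port=9999 the notebook is reachable at http://localhost:9999 on your machine while jupyter-lab keeps running on port 8888 on the server.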

For SSH Tunneling with Putty, you can find quick instructions here

Source: DigitalOcean
