leminhtr/python-tensorflow-pytorch-GPU-nvidia-cuda-linux-clean-install-instructions.md

## python-tensorflow-pytorch-GPU-nvidia-cuda-linux-clean-install-instructions.md

      
            
          
    Instructions

I. Clean Python setup from scratch (~1h). Skip this part if you already have a Python environment set up or want to use your own virtualenv setup.

0. Pre-install (skip if already done)

sudo apt-get update
sudo apt-get install python3-pip python3-dev
sudo apt-get install make build-essential libssl-dev zlib1g-dev \
libbz2-dev libreadline-dev libsqlite3-dev wget curl llvm \
libncursesw5-dev xz-utils tk-dev libxml2-dev libxmlsec1-dev libffi-dev liblzma-dev
python3 -m pip install --upgrade pip
1. pyenv to manage python version and virtualenv easily

curl https://pyenv.run | bash
Add this to ~/.bashrc:
# pyenv
export PYENV_ROOT="$HOME/.pyenv"
export PATH="$PYENV_ROOT/bin:$PATH"
eval "$(pyenv init -)"
eval "$(pyenv virtualenv-init -)"
And add this to ~/.profile (PYENV_ROOT must be exported before it is used in PATH):
export PYENV_ROOT="$HOME/.pyenv"
export PATH="$PYENV_ROOT/bin:$PATH"
eval "$(pyenv init --path)"
Reload session:
source ~/.bashrc # reload bash session
Install Python 3.9.5 (or the version you want to use) and make it the default:

pyenv install 3.9.5
pyenv global 3.9.5

python -V && which python
should return:
Python 3.9.5
/home/$USER/.pyenv/shims/python
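The same check can be done from Python by testing whether the resolved interpreter lives under the pyenv root. This is only a minimal sketch: the helper name `is_under_pyenv` is hypothetical, and it assumes the default `~/.pyenv` location used in the exports above.

```python
import shutil
from pathlib import Path


def is_under_pyenv(executable, pyenv_root):
    """Return True if `executable` is located somewhere inside `pyenv_root`.

    Hypothetical helper, not part of pyenv itself.
    """
    exe = Path(executable).expanduser()
    root = Path(pyenv_root).expanduser()
    return root in exe.parents


if __name__ == "__main__":
    found = shutil.which("python")   # same lookup as `which python`
    root = Path.home() / ".pyenv"    # assumption: default PYENV_ROOT
    if found and is_under_pyenv(found, root):
        print(f"{found} is managed by pyenv")
    else:
        print(f"{found} is NOT under {root}; check ~/.bashrc and ~/.profile")
```

If the second message is printed, the shims directory is most likely missing from PATH in the current shell.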

2. pipx: Install and Run Python Applications in Isolated Environments without ruining your global environment

python -m pip install --user pipx
If pipx is not found (not in $PATH) then run:
python -m pipx ensurepath
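Now use pipx instead of pip to install and run standalone Python applications or git repos (as opposed to libraries that you import in your own code).
-> Avoid installing packages globally: there is a high chance of breaking everything on updates/installs.
Install Jupyter once for your user with pipx, so it is available from every environment: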


pipx install notebook
pipx install jupyter --include-deps
pipx install jupyterlab

To make your future pyenv-virtualenv available with jupyter, use pyenv-jupyter-kernel plugin:
git clone https://github.com/aiguofer/pyenv-jupyter-kernel $(pyenv root)/plugins/pyenv-jupyter-kernel
3. Poetry system-wide for package management/update dependencies

pipx install poetry
Verify install:
pipx list
which jupyter-lab
II. Install CUDA, NVIDIA drivers, libcudnn (/!\ Updated installation instructions are always at  https://www.tensorflow.org/install/gpu )

0. Verify install (Skip to TensorFlow/PyTorch install if ok)


OFFICIALLY TESTED AND COMPATIBLE GPU CONFIGURATIONS FOR EACH TENSORFLOW AND CUDA/CUDNN VERSION CAN BE FOUND IN THE "Tested build configurations" TABLE AT https://www.tensorflow.org/install/source#gpu
PLEASE adapt the following instructions to this table, as it contains the latest working configurations.


Check NVIDIA driver installation (>= 450.80.02, or the version required by your CUDA release)


nvidia-smi should print GPU info. (The CUDA version it prints is the highest version the driver supports, not the version of the installed CUDA toolkit.)
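Driver versions should be compared numerically, not as strings. The sketch below shows one way to check the 450.80.02 minimum mentioned above. The helper names are hypothetical, and the script assumes `nvidia-smi` is on PATH when run directly.

```python
import shutil
import subprocess


def parse_version(text):
    """Turn '450.80.02' into (450, 80, 2) so versions compare numerically."""
    return tuple(int(part) for part in text.strip().split("."))


def driver_is_recent_enough(installed, minimum="450.80.02"):
    """Hypothetical helper: True if `installed` is at least `minimum`."""
    return parse_version(installed) >= parse_version(minimum)


if __name__ == "__main__":
    if shutil.which("nvidia-smi") is None:
        print("nvidia-smi not found: the driver is probably not installed")
    else:
        result = subprocess.run(
            ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
            capture_output=True, text=True, check=False,
        )
        if result.returncode != 0:
            print("nvidia-smi failed:", result.stderr.strip())
        for line in result.stdout.splitlines():
            if line.strip():
                ok = driver_is_recent_enough(line)
                print(line.strip(), "OK" if ok else "too old")
```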

Check CUDA install:
nvcc -V

should print:
Cuda compilation tools, release 11.2, V11.2.152
Build cuda_11.2.r11.2/compiler.29618528_0
or log about your current cuda version
NOW: Install EACH individual (eventual) missing packages from this step 0. skip otherwise
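To compare the installed toolkit against the compatibility table programmatically, the release number can be pulled out of the `nvcc -V` text. A minimal sketch, using the sample output shown above as input; `cuda_release` is a hypothetical helper name.

```python
import re


def cuda_release(nvcc_output):
    """Extract (major, minor) from `nvcc -V` output, or None if not found."""
    match = re.search(r"release (\d+)\.(\d+)", nvcc_output)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


sample = """Cuda compilation tools, release 11.2, V11.2.152
Build cuda_11.2.r11.2/compiler.29618528_0"""

print(cuda_release(sample))  # (11, 2)
```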
1. Install NVIDIA package repositories for Ubuntu 20.04 and CUDA 11.2

wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/cuda-ubuntu2004.pin &&
sudo mv cuda-ubuntu2004.pin /etc/apt/preferences.d/cuda-repository-pin-600 &&
sudo apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/7fa2af80.pub &&
sudo add-apt-repository "deb http://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/ /" ;
sudo apt-get update;
If you notice problems with GPG keys when running above commands, try this: (from https://github.com/NVIDIA/nvidia-docker/issues/1632#issuecomment-1112770026 and https://github.com/NVIDIA/nvidia-docker/issues/1632#issuecomment-1125739652)
sudo apt-key del 7fa2af80
sudo apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/3bf863cc.pub
sudo apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu2004/x86_64/7fa2af80.pub
sudo add-apt-repository "deb http://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/ /" ;
sudo apt-get update;
wget http://developer.download.nvidia.com/compute/machine-learning/repos/ubuntu2004/x86_64/nvidia-machine-learning-repo-ubuntu2004_1.0.0-1_amd64.deb ;
sudo dpkg -i nvidia-machine-learning-repo-ubuntu2004_1.0.0-1_amd64.deb ;
sudo apt-get update;
Note: Latest links/packages can be found in the official NVIDIA repos using ctrl+F at Cuda Ubuntu 20.04 repos and Nvidia ML repo Ubuntu 20.04
2. Install NVIDIA drivers: /!\ Skip if you already have NVIDIA drivers installed


sudo ubuntu-drivers devices
should return a list of compatible/recommended drivers (e.g. driver   : nvidia-driver-510 - third-party free recommended)


If a driver name ends with -open (e.g. nvidia-driver-525-open), DO NOT install it (it seems to cause issues when matching CUDA version dependencies). Pick another driver version XXX for which no nvidia-driver-XXX-open is listed.
Otherwise, pick the version marked as recommended.


sudo apt-get install nvidia-driver-{#RECOMMENDED-VERSION-NUMBER}
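The selection rule above can be sketched in Python. This is a hedged illustration only: the `listing` text is a hand-written sample in the format of the example line above (not a real log), `choose_driver` is a hypothetical helper, and falling back to the highest remaining version when the recommended driver is excluded is an assumption of this sketch.

```python
import re

# Matches lines such as "driver   : nvidia-driver-510 - third-party free recommended".
# Other variants (for example names with extra suffixes) are ignored by this sketch.
DRIVER_LINE = re.compile(r"^driver\s*:\s*nvidia-driver-(\d+)(-open)?\s+-\s+(.*)$")


def choose_driver(listing):
    """Pick a non-open driver whose version has no -open twin in the listing."""
    recommended_flag = {}   # version -> True if the line says "recommended"
    has_open_variant = set()
    for raw in listing.splitlines():
        match = DRIVER_LINE.match(raw.strip())
        if match is None:
            continue
        version = int(match.group(1))
        if match.group(2):
            has_open_variant.add(version)
        else:
            recommended_flag[version] = "recommended" in match.group(3)
    candidates = {v: r for v, r in recommended_flag.items() if v not in has_open_variant}
    if not candidates:
        return None
    recommended = [v for v, r in candidates.items() if r]
    best = max(recommended) if recommended else max(candidates)  # assumption: newest wins
    return f"nvidia-driver-{best}"


# Illustrative sample input, NOT real command output.
listing = """
driver   : nvidia-driver-525-open - distro non-free
driver   : nvidia-driver-525 - distro non-free
driver   : nvidia-driver-510 - third-party free recommended
driver   : xserver-xorg-video-nouveau - distro free builtin
"""

print(choose_driver(listing))  # nvidia-driver-510
```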

If you encounter package issues/conflicts then try to resolve them with aptitude instead of apt-get:

sudo apt-get install aptitude
sudo aptitude install -f nvidia-driver-{#RECOMMENDED-VERSION-NUMBER}
Try to figure which solution would resolve the conflicts/dependencies (could be old driver versions, previous cuda install, ...)
sudo apt-get install nvidia-driver-{#RECOMMENDED-VERSION-NUMBER}


sudo reboot


Continue if nvidia-smi returns a valid output
3. Install CUDA 11.2 and libcudnn 8.1.0 for CUDA 11.2

To get the latest/appropriate libcudnn version, browse the Cuda Ubuntu 20.04 repos, search (ctrl+F) for libcudnn8_*.deb and libcudnn8-dev_*.deb, copy the URLs of the two .deb files that match your CUDA version, then download and install them as below.
sudo apt-get install --no-install-recommends \
    cuda-11-2;
sudo apt-get autoremove ;
cd ~/Downloads &&
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/libcudnn8_8.1.0.77-1+cuda11.2_amd64.deb ;
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/libcudnn8-dev_8.1.0.77-1+cuda11.2_amd64.deb ;
sudo dpkg -i libcudnn8_8.1.0.77-1+cuda11.2_amd64.deb ;
sudo dpkg -i libcudnn8-dev_8.1.0.77-1+cuda11.2_amd64.deb
Note: If apt-get install cuda-11-2 fails then try either:
- sudo aptitude install cuda-11-2 then try to solve dependency issues.
- sudo apt-get install cuda-toolkit-11-2
which installs cuda in: /usr/local/cuda/bin/
then install the 2 other required packages (libcudnn8 and libcudnn8-dev, as shown above).
Add this to your ~/.bashrc: From docs.nvidia.com
# NVIDIA CUDA 11.x
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda-11/lib64
export CUDA_HOME=/usr/local/cuda-11/
export PATH="/usr/local/cuda-11/bin:$PATH"
source ~/.bashrc # Reload session
Continue if nvcc -V returns a valid output.
sudo reboot
Prevent NVIDIA/CUDA from upgrading:

Source: https://chrisalbon.com/code/deep_learning/setup/prevent_nvidia_drivers_from_upgrading/
Since TensorFlow/PyTorch must match one specific version of CUDA (e.g. 11.0 != 11.1), we freeze the NVIDIA and cuDNN packages with apt so they are not upgraded:
sudo apt-mark hold libcudnn8 libcudnn8-dev # Prevent package updates / freeze versions
dpkg-query -W --showformat='${Package} ${Status}\n' | grep -v deinstall | awk '{ print $1 }' | \
    grep -E 'nvidia.*-[0-9]+$' | \
    xargs -r -L 1 sudo apt-mark hold
To unfreeze:
sudo apt-mark unhold <package-name>
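To see which package names the `grep -E 'nvidia.*-[0-9]+$'` filter above selects, the same regular expression can be tried in Python. A small sketch; `is_held_by_filter` is a hypothetical helper and the package names are illustrative examples, not a listing from a real system.

```python
import re

# Same pattern as the grep -E filter; grep matches anywhere in the line,
# so re.search (not re.match) is the equivalent call.
HOLD_PATTERN = re.compile(r"nvidia.*-[0-9]+$")


def is_held_by_filter(package_name):
    """True if the name contains 'nvidia' and ends with '-<digits>'."""
    return HOLD_PATTERN.search(package_name) is not None


for name in ["nvidia-driver-510", "libnvidia-compute-510", "nvidia-settings", "cuda-11-2"]:
    print(f"{name:24} {'hold' if is_held_by_filter(name) else 'skip'}")
```

Names without "nvidia" in them, such as cuda-11-2, are skipped by this filter, so hold them explicitly with apt-mark if you also want them frozen.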
III. Install TensorFlow and PyTorch

1. Create/activate virtual env

/!\ Please don't install tensorflow globally with pip/pipx...
If you use pyenv & jupyter and already have created virtualenv, you can register all of your pyenv-virtualenv in jupyter with:
pyenv versions --bare | grep -v "/" | xargs -L 1 pyenv register-kernel
Create a virtualenv from version 3.9.5:
pyenv virtualenv 3.9.5 mygputest
or pyenv virtualenv mygputest if 3.9.5 is python global version

pyenv virtualenvs # list all virtualenvs
pyenv activate mygputest

Deactivating:
pyenv deactivate
2. Install TensorFlow

With your virtualenv activated:
python -m pip install tensorflow
Should be 2.5.X or current
3. Install PyTorch 1.10.2 & PyTorch Lightning & Lightning Flash .  /!\ Latest installation instructions are  always at  https://pytorch.org/get-started/locally/ and pytorch repo list is at  https://download.pytorch.org/whl/torch_stable.html

python -m pip install torch==1.10.2+cu111 torchaudio==0.10.2+cu111 torchvision==0.11.3+cu111 -f https://download.pytorch.org/whl/torch_stable.html
python -m pip install pytorch-lightning lightning-flash
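The `+cu111` suffix in the pins above is a local version label that marks the wheel as a CUDA 11.1 build, and all three packages should carry the same label. The sketch below checks that for a list of pins; `split_pin` and `shared_cuda_tag` are hypothetical helpers written for this illustration.

```python
def split_pin(pin):
    """Split 'torch==1.10.2+cu111' into ('torch', '1.10.2', 'cu111')."""
    name, _, version = pin.partition("==")
    public, _, local = version.partition("+")
    return name, public, local or None


def shared_cuda_tag(pins):
    """Return the common build tag if every pin uses the same one, else None."""
    tags = {split_pin(pin)[2] for pin in pins}
    return tags.pop() if len(tags) == 1 else None


pins = ["torch==1.10.2+cu111", "torchaudio==0.10.2+cu111", "torchvision==0.11.3+cu111"]
print(shared_cuda_tag(pins))  # cu111
```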
4. Useful ML/DL packages:


Common:

python -m pip install scikit-learn pandas matplotlib seaborn bokeh
python -m pip install botorch # bayesian optimization on pytorch
python -m pip install opencv-python


Audio:

pyaudio: sudo apt-get install libjack-jackd2-dev portaudio19-dev then python -m pip install pyaudio


Meta-opt:

python -m pip install keras-tuner


IV. Verify deep learning setup on GPU:

0. Monitor GPU usage:

You may keep this running in a side terminal
watch -d -n 2 nvidia-smi # GPU usage cuda nvidia task manager taskmgr memory
1. TensorFlow:

With your virtualenv activated:
python -c "import tensorflow as tf;print(tf.__version__); print(tf.config.list_physical_devices('GPU'))"
Should return:

current tensorflow version
last line should be:
[PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]

2. PyTorch

With your virtualenv activated:
python -c 'import torch; print(torch.rand(2,3).cuda())'
It should return a random tensor with device cuda:0 such as:
tensor([[0.2551, 0.1373, 0.3072],[0.9524, 0.2616, 0.5635]], device='cuda:0')
V. Train your first deep learning model on GPU:

TensorFlow 2.X:

Official TensorFlow Keras MNIST tutorial | Official TensorFlow advanced MNIST tutorial
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import datasets, layers, models
import numpy as np

# prepare data
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0  # scale pixel values from [0, 255] to [0, 1]
x_train = np.expand_dims(x_train, -1)
x_test = np.expand_dims(x_test, -1)

# create model

model = keras.Sequential(
    [
        keras.Input(shape=(28, 28, 1)),
        layers.Conv2D(32, kernel_size=(3, 3), activation="relu"),
        layers.MaxPooling2D(pool_size=(2, 2)),
        layers.Conv2D(64, kernel_size=(3, 3), activation="relu"),
        layers.MaxPooling2D(pool_size=(2, 2)),
        layers.Flatten(),
        layers.Dropout(0.5),
        layers.Dense(10, activation="softmax"),
    ]
)

# train and test model
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
model.fit(x_train, y_train, epochs=10, validation_data=(x_test, y_test))
model.evaluate(x_test, y_test)
You should expect ~98% in test accuracy.
PyTorch 1.X.X & PyTorch Lightning & Lightning-Flash

import flash
from torch import nn, optim
from torch.utils.data import DataLoader, random_split, Subset
from torchvision import transforms, datasets


# model
model = nn.Sequential(
    nn.Conv2d(1, 32, kernel_size=3, stride=1),
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=2, stride=2),
    nn.Conv2d(32, 64, kernel_size=3, stride=1),
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=2, stride=2),

    nn.Flatten(),
    nn.Dropout(0.5),
    nn.Linear(5 * 5 * 64, 10)
)

# data
#dataset = datasets.MNIST('./data_folder', download=True, transform=transforms.ToTensor())
tr = datasets.MNIST('./data_folder', train=True, download=True, transform=transforms.ToTensor())
te = datasets.MNIST('./data_folder', train=False, transform=transforms.ToTensor())

part_tr = random_split(tr, [1875, len(tr)-1875])[0]
part_te = random_split(te, [313, len(te)-313])[0]

# task
classifier = flash.Task(model, loss_fn=nn.functional.cross_entropy, optimizer=optim.Adam)

# train
flash.Trainer(max_epochs=10, accelerator='gpu', devices=1).fit(classifier, DataLoader(part_tr, num_workers=32), DataLoader(part_te, num_workers=32))
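The input size `5 * 5 * 64` of the final `nn.Linear` layer follows from how each layer shrinks the 28x28 MNIST image. The sketch below reproduces that arithmetic with the standard output-size formula for convolutions and pooling (no padding, floor division); the helper names are hypothetical.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Spatial output size of a conv or pooling layer along one dimension."""
    return (size + 2 * padding - kernel) // stride + 1


def flattened_features(size=28, channels=64):
    size = conv_out(size, kernel=3)            # Conv2d(1, 32, 3):  28 -> 26
    size = conv_out(size, kernel=2, stride=2)  # MaxPool2d(2, 2):   26 -> 13
    size = conv_out(size, kernel=3)            # Conv2d(32, 64, 3): 13 -> 11
    size = conv_out(size, kernel=2, stride=2)  # MaxPool2d(2, 2):   11 -> 5
    return size * size * channels


print(flattened_features())  # 1600, i.e. 5 * 5 * 64
```

If you change a kernel size, the stride, or the input resolution, recompute this value and update the `nn.Linear` input size accordingly.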
Optional: Run remote Jupyter server in local browser via SSH Tunneling

Suppose you will run jupyter in port 8888 (server) and forward it to your own (local) port 8888 (Reference command is: ssh -L $client_port:localhost:$server_port login@remote_server)


Connect to your server via ssh:
ssh -L 8888:localhost:8888 your_login@remote_server


Start the jupyter server on remote server: jupyter-lab # (by default on port 8888)


Then just copy paste the prompted url in your local browser (e.g.: http://localhost:8888/?token=2b58c8deb1cb467c6b0491504c0e0a1593cd7923af077606).


Finally, in the jupyter lab browser window, create a new notebook with your selected virtualenv kernel.
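If you use different ports for different servers, the tunnel command can be generated from the reference form `ssh -L $client_port:localhost:$server_port login@remote_server`. A small sketch; `tunnel_command` is a hypothetical helper, and the login and host values are the placeholders used above.

```python
import shlex


def tunnel_command(login, host, client_port=8888, server_port=8888):
    """Build the ssh argument list that forwards client_port to server_port."""
    return ["ssh", "-L", f"{client_port}:localhost:{server_port}", f"{login}@{host}"]


cmd = tunnel_command("your_login", "remote_server")
print(shlex.join(cmd))  # ssh -L 8888:localhost:8888 your_login@remote_server
```

The list form can be passed directly to `subprocess.run` without going through a shell.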


For SSH Tunneling with Putty, you can find quick instructions here
Source: DigitalOcean