sub-mod/cuda_10_rhel_8.md

## cuda_10_rhel_8.md

      
Using TensorFlow 1.15 with CUDA 10.0 on RHEL 8.3 is not easy.
If you upgrade the NVIDIA driver, an existing TensorFlow setup with CUDA may stop working.
This guide sets up RHEL-8.3.1 with NVIDIA drivers for both TensorFlow-2.x and TensorFlow 1.15.x.
These steps work on RHEL-8.3.1.

Uninstall current driver and cuda
Install latest NVIDIA Driver
Setup CUDA for TensorFlow 2.4.0
Setup CUDA for TensorFlow 1.15.4

Uninstall-old-driver

My current driver version on RHEL-8 is 440.31, and I wanted to update to the latest version, 460.32.03.

check if you already have cuda and nvidia drivers

# nvidia-smi
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.31    Driver Version: 440.31    CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce RTX 208...  Off  | 00000000:65:00.0 Off |                  N/A |
| 24%   35C    P0    28W / 257W |      0MiB / 11016MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
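
The header line of the nvidia-smi output carries the two values this guide keeps returning to: the installed driver version and the CUDA version that driver reports. Below is a minimal Python sketch for pulling them out of captured output. The function name `parse_smi_header` is hypothetical, and the regular expression assumes the header layout shown above.

```python
import re

# Matches the header line printed by nvidia-smi, for example:
# | NVIDIA-SMI 440.31    Driver Version: 440.31    CUDA Version: 10.2     |
HEADER_RE = re.compile(
    r"Driver Version:\s*(?P<driver>[\d.]+)\s+CUDA Version:\s*(?P<cuda>[\d.]+)"
)


def parse_smi_header(smi_output):
    """Return (driver_version, cuda_version), or None if no header is found."""
    match = HEADER_RE.search(smi_output)
    if match is None:
        return None
    return match.group("driver"), match.group("cuda")


# Header line copied from the output above.
sample = "| NVIDIA-SMI 440.31    Driver Version: 440.31    CUDA Version: 10.2     |"
print(parse_smi_header(sample))
```

On a live system the text could come from `subprocess.run(["nvidia-smi"], capture_output=True, text=True).stdout` instead of the pasted sample.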

uninstall the nvidia-driver

# ./NVIDIA-Linux-x86_64-440.31.run --uninstall
Verifying archive integrity... OK
Uncompressing NVIDIA Accelerated Graphics Driver for Linux-x86_64 440.31................

Note: This step will most likely break your previous TensorFlow setup.

delete the cuda & nvidia rpms

# rpm -qa | grep cuda
cuda-repo-rhel8-10.2.89-1.x86_64
cuda-nsight-compute-11-1-11.1.1-1.x86_64
cuda-repo-rhel8-10-2-local-10.2.89-440.33.01-1.0-1.x86_64
cuda-repo-rhel7-10-0-local-10.0.130-410.48-1.0-1.x86_64
cuda-repo-rhel8-10-1-local-10.1.243-418.87.00-1.0-1.x86_64

# rpm -qa | grep nvidia

# rpm -e <package-name>
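
The `rpm -e` step has to be repeated for every package found by the `rpm -qa` listings. Here is a small Python sketch that collects the candidates from a captured listing and prints the commands instead of running them, so the list can be reviewed first. The function name `removal_candidates` and the unrelated sample package line are made up for illustration.

```python
def removal_candidates(rpm_qa_output, keywords=("cuda", "nvidia")):
    """Return sorted, de-duplicated package names containing any keyword."""
    names = set()
    for line in rpm_qa_output.splitlines():
        name = line.strip()
        if name and any(word in name for word in keywords):
            names.add(name)
    return sorted(names)


# Two package names from the listing above, one repeated, plus a
# made-up unrelated package that should be ignored.
listing = """\
cuda-repo-rhel8-10.2.89-1.x86_64
cuda-nsight-compute-11-1-11.1.1-1.x86_64
example-unrelated-package-1.0-1.x86_64
cuda-repo-rhel8-10.2.89-1.x86_64
"""

for package in removal_candidates(listing):
    # Printed, not executed: review the list before removing anything.
    print("rpm -e", package)
```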

Install-latest-nvidia-driver

download driver

Nvidia drivers are available from https://www.nvidia.com/download/index.aspx?lang=en-us
As of Jan-20-2021 the latest driver for my GPU was NVIDIA-Linux-x86_64-460.32.03.run
# chmod +x NVIDIA-Linux-x86_64-460.32.03.run

prerequisites

Check the system for GPU and gcc.
# uname -r
4.18.0-240.10.1.el8_3.x86_64

# lspci | grep -i nvidia
65:00.0 VGA compatible controller: NVIDIA Corporation TU102 [GeForce RTX 2080 Ti] (rev a1)
65:00.1 Audio device: NVIDIA Corporation TU102 High Definition Audio Controller (rev a1)
65:00.2 USB controller: NVIDIA Corporation TU102 USB 3.1 Host Controller (rev a1)
65:00.3 Serial bus controller [0c80]: NVIDIA Corporation TU102 USB Type-C UCSI Controller (rev a1)

# gcc --version
gcc (GCC) 8.3.1 20191121 (Red Hat 8.3.1-5)

Let's update the RHEL-8 system.
# sudo yum update -y

# sudo dnf install kernel-devel-$(uname -r) kernel-headers-$(uname -r)
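
The `$(uname -r)` substitution matters because the driver installer needs kernel-devel and kernel-headers packages that match the running kernel. The Python sketch below builds the same two package names; `kernel_build_packages` is a hypothetical helper, and `platform.release()` is assumed to return the same string as `uname -r`.

```python
import platform


def kernel_build_packages(release=None):
    """Return the package names that match a kernel release string."""
    # Fall back to the running kernel when no release is given.
    release = release or platform.release()
    return ["kernel-devel-" + release, "kernel-headers-" + release]


# Kernel release taken from the uname -r output above.
print(kernel_build_packages("4.18.0-240.10.1.el8_3.x86_64"))
```

If `yum update` installs a newer kernel, the running kernel stays the old one until the next reboot, so it may be worth rebooting before installing these packages.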

install driver

# sudo ./NVIDIA-Linux-x86_64-460.32.03.run
Verifying archive integrity... OK
Uncompressing NVIDIA Accelerated Graphics Driver for Linux-x86_64 460.32.03..  ..

check if nvidia-smi works

# nvidia-smi
Wed Jan 20 16:51:17 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce RTX 208...  Off  | 00000000:65:00.0 Off |                  N/A |
| 24%   35C    P0    28W / 257W |      0MiB / 11016MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

IMPORTANT NOTE: The nvidia-smi output shows that driver version 460.32.03 reports CUDA version 11.2, but TensorFlow 2.4.0 only supports CUDA 11.0. So we are not going to install CUDA 11.2. Instead we will install CUDA-11.0.

TensorFlow-2.4.x-on-RHEL-8.3

TensorFlow 2.4.0 requires CUDA-11.0.
Download the cuda-11.0 rpm for RHEL-8 from https://developer.nvidia.com/cuda-toolkit-archive
# wget https://developer.download.nvidia.com/compute/cuda/11.0.3/local_installers/cuda-repo-rhel8-11-0-local-11.0.3_450.51.06-1.x86_64.rpm

IMPORTANT NOTE: The driver version named in the rhel8-11-0 rpm is 450.51.06, not 460.32.03.
# sudo rpm -i cuda-repo-rhel8-11-0-local-11.0.3_450.51.06-1.x86_64.rpm
# sudo dnf clean all
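
The local repo rpm name encodes both the CUDA toolkit version and the driver version it ships with, so it is possible to see before installing which driver it brings along. The sketch below parses those fields in Python. The pattern is an assumption inferred only from the two file names used in this guide, and `parse_cuda_repo_rpm` is a hypothetical helper.

```python
import re

# Pattern inferred from the two file names in this guide. Note the
# separator before the driver version is "_" in one and "-" in the other.
REPO_RE = re.compile(
    r"cuda-repo-(?P<distro>rhel\d+)-\d+-\d+-local-"
    r"(?P<cuda>\d+\.\d+\.\d+)[_-](?P<driver>\d+(?:\.\d+)+)-"
)


def parse_cuda_repo_rpm(filename):
    """Return distro, CUDA version and driver version, or None if no match."""
    match = REPO_RE.search(filename)
    return match.groupdict() if match else None


for name in (
    "cuda-repo-rhel8-11-0-local-11.0.3_450.51.06-1.x86_64.rpm",
    "cuda-repo-rhel7-10-0-local-10.0.130-410.48-1.0-1.x86_64.rpm",
):
    print(parse_cuda_repo_rpm(name))
```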

# dnf install cuda
Updating Subscription Management repositories.
cuda-rhel8-11-0-local                                                                                                                                                             103 MB/s | 105 kB     00:00
Extra Packages for Enterprise Linux Modular 8 - x86_64                                                                                                                            256 kB/s | 537 kB     00:02
Extra Packages for Enterprise Linux 8 - x86_64                                                                                                                                    2.9 MB/s | 8.8 MB     00:02
Red Hat Enterprise Linux 8 for x86_64 - BaseOS (RPMs)                                                                                                                              14 MB/s |  27 MB     00:01
Red Hat Enterprise Linux 8 for x86_64 - AppStream (RPMs)                                                                                                                           10 MB/s |  25 MB     00:02
Dependencies resolved.
==================================================================================================================================================================================================================
 Package                                                 Architecture                      Version                                              Repository                                                   Size
==================================================================================================================================================================================================================
Installing:
 cuda                                                    x86_64                            11.0.3-1                                             cuda-rhel8-11-0-local                                       2.7 k
Installing dependencies:
 cuda-11-0                                               x86_64                            11.0.3-1                                             cuda-rhel8-11-0-local                                       2.8 k
 cuda-drivers                                            x86_64                            450.51.06-1

check if nvidia-smi works

# nvidia-smi
Wed Jan 20 18:07:38 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.51.06    Driver Version: 450.51.06    CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce RTX 208...  Off  | 00000000:65:00.0 Off |                  N/A |
| 24%   36C    P0    27W / 257W |      0MiB / 11016MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

IMPORTANT NOTE: The nvidia-smi output now shows driver version 450.51.06, not 460.32.03, because installing the cuda package pulled in cuda-drivers 450.51.06.

Install cudnn-8.0.4 for CUDA-11.0

If you don't install cudnn, TensorFlow reports that it can't find the GPU device because it can't load the dynamic library libcudnn.so.8.
>>> print("TF version: ", tf.__version__)
TF version:  2.4.0
>>> print("Num GPUs Available: ", len(tf.config.experimental.list_physical_devices('GPU')))
2021-01-20 17:57:11.951188: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
2021-01-20 17:57:11.951205: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublas.so.11
2021-01-20 17:57:11.951213: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublasLt.so.11
2021-01-20 17:57:11.951221: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcufft.so.10
2021-01-20 17:57:11.951228: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcurand.so.10
2021-01-20 17:57:11.951236: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusolver.so.10
2021-01-20 17:57:11.951243: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusparse.so.11
2021-01-20 18:04:48.881243: W tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcudnn.so.8'; dlerror: libcudnn.so.8: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda-11.0/lib64
2021-01-20 18:04:48.881249: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1757] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.
Skipping registering GPU devices...
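
When the log is long, the missing libraries are easy to overlook among the "Successfully opened" lines. Here is a minimal Python sketch that extracts only the library names TensorFlow could not load. It assumes the "Could not load dynamic library '...'" wording shown in the log above; `missing_libraries` is a hypothetical helper and the sample lines are trimmed versions of that log.

```python
import re

# Wording taken from the dso_loader warning in the log above.
MISSING_RE = re.compile(r"Could not load dynamic library '([^']+)'")


def missing_libraries(log_text):
    """Return the library names reported as not loadable, in first-seen order."""
    seen = []
    for name in MISSING_RE.findall(log_text):
        if name not in seen:
            seen.append(name)
    return seen


# Trimmed copies of two log lines from the session above.
sample_log = (
    "I dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0\n"
    "W dso_loader.cc:60] Could not load dynamic library 'libcudnn.so.8'; "
    "dlerror: libcudnn.so.8: cannot open shared object file\n"
)
print(missing_libraries(sample_log))
```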

prerequisites

# sudo yum -y install kernel-devel-`uname -r` kernel-headers-`uname -r`
# sudo yum -y install https://dl.fedoraproject.org/pub/epel/epel-release-latest-8.noarch.rpm
# sudo yum -y install dkms


Download cudnn

cudnn-8.0.4 is available from https://developer.nvidia.com/rdp/cudnn-archive.
# tar -xvf cudnn-11.0-linux-x64-v8.0.4.30.tgz
cuda/include/cudnn.h
cuda/include/cudnn_adv_infer.h
cuda/include/cudnn_adv_train.h
cuda/include/cudnn_backend.h
cuda/include/cudnn_cnn_infer.h
cuda/include/cudnn_cnn_train.h
cuda/include/cudnn_ops_infer.h
cuda/include/cudnn_ops_train.h
cuda/include/cudnn_version.h
cuda/NVIDIA_SLA_cuDNN_Support.txt
cuda/lib64/libcudnn.so
cuda/lib64/libcudnn.so.8
cuda/lib64/libcudnn.so.8.0.4
cuda/lib64/libcudnn_adv_infer.so
cuda/lib64/libcudnn_adv_infer.so.8
cuda/lib64/libcudnn_adv_infer.so.8.0.4
cuda/lib64/libcudnn_adv_train.so
cuda/lib64/libcudnn_adv_train.so.8
cuda/lib64/libcudnn_adv_train.so.8.0.4
cuda/lib64/libcudnn_cnn_infer.so
cuda/lib64/libcudnn_cnn_infer.so.8
cuda/lib64/libcudnn_cnn_infer.so.8.0.4
cuda/lib64/libcudnn_cnn_train.so
cuda/lib64/libcudnn_cnn_train.so.8
cuda/lib64/libcudnn_cnn_train.so.8.0.4
cuda/lib64/libcudnn_ops_infer.so
cuda/lib64/libcudnn_ops_infer.so.8
cuda/lib64/libcudnn_ops_infer.so.8.0.4
cuda/lib64/libcudnn_ops_train.so
cuda/lib64/libcudnn_ops_train.so.8
cuda/lib64/libcudnn_ops_train.so.8.0.4
cuda/lib64/libcudnn_static.a

copy cudnn files to /usr/local/cuda-11.0

# sudo cp cuda/include/cudnn*.h /usr/local/cuda-11.0/include
# sudo cp cuda/lib64/libcudnn* /usr/local/cuda-11.0/lib64
# sudo chmod a+r /usr/local/cuda-11.0/include/cudnn*.h /usr/local/cuda-11.0/lib64/libcudnn*
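
After the copy, a quick check that the header and the shared library landed where TensorFlow will look for them can save a round trip through the Python session. The sketch below does that in Python. `missing_cudnn_files` is a hypothetical helper, and the demonstration runs against a throwaway directory rather than /usr/local/cuda-11.0 so that it works on any machine.

```python
import tempfile
from pathlib import Path


def missing_cudnn_files(cuda_home, soname="libcudnn.so.8", header="cudnn.h"):
    """Return the expected cuDNN paths that are absent under cuda_home."""
    root = Path(cuda_home)
    expected = [root / "include" / header, root / "lib64" / soname]
    return [str(path) for path in expected if not path.exists()]


# Demonstration: a temporary tree that has the header but not the library.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "include").mkdir()
    (Path(tmp) / "lib64").mkdir()
    (Path(tmp) / "include" / "cudnn.h").touch()
    print(missing_cudnn_files(tmp))  # only the lib64 entry is reported
```

On the real system the call would be `missing_cudnn_files("/usr/local/cuda-11.0")`, and for the CUDA 10.0 install `missing_cudnn_files("/usr/local/cuda-10.0", soname="libcudnn.so.7")`.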

check if GPU is detected with TensorFlow 2.4.0

>>> print("TF version: ", tf.__version__)
TF version:  2.4.0
>>> print("Num GPUs Available: ", len(tf.config.experimental.list_physical_devices('GPU')))
2021-01-20 18:12:34.375773: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
2021-01-20 18:12:34.377637: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublas.so.11
2021-01-20 18:12:34.377684: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcublasLt.so.11
2021-01-20 18:12:34.378409: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcufft.so.10
2021-01-20 18:12:34.378572: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcurand.so.10
2021-01-20 18:12:34.380425: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusolver.so.10
2021-01-20 18:12:34.380856: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcusparse.so.11
2021-01-20 18:12:34.380928: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudnn.so.8
2021-01-20 18:12:34.381675: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1862] Adding visible gpu devices: 0
Num GPUs Available:  1

TensorFlow-1.15.x-on-RHEL-8.3

check if GPU is detected with TensorFlow 1.15.x

>>> print("TF version: ", tf.__version__)
TF version:  1.15.4
>>> print("Num GPUs Available: ", len(tf.config.experimental.list_physical_devices('GPU')))
2021-01-20 15:30:28.794748: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcudart.so.10.0'; dlerror: libcudart.so.10.0: cannot open shared object file: No such file or directory
2021-01-20 15:30:28.794783: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcublas.so.10.0'; dlerror: libcublas.so.10.0: cannot open shared object file: No such file or directory
2021-01-20 15:30:28.794812: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcufft.so.10.0'; dlerror: libcufft.so.10.0: cannot open shared object file: No such file or directory
2021-01-20 15:30:28.794841: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcurand.so.10.0'; dlerror: libcurand.so.10.0: cannot open shared object file: No such file or directory
2021-01-20 15:30:28.794869: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcusolver.so.10.0'; dlerror: libcusolver.so.10.0: cannot open shared object file: No such file or directory
2021-01-20 15:30:28.794899: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcusparse.so.10.0'; dlerror: libcusparse.so.10.0: cannot open shared object file: No such file or directory
2021-01-20 15:30:28.794927: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcudnn.so.7'; dlerror: libcudnn.so.7: cannot open shared object file: No such file or directory
2021-01-20 15:30:28.794933: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1662] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.
Skipping registering GPU devices...

As you can see, TensorFlow 1.15.4 cannot use the GPU without CUDA-10.0.
IMPORTANT NOTE: TensorFlow 1.15.x only supports CUDA 10.0, so we need to install CUDA 10.0 on RHEL-8. However, https://developer.nvidia.com/cuda-toolkit-archive doesn't list a rhel-8 repo for CUDA 10.0 as of now, so we will download the cuda-repo-rhel7-10-0 rpm to install CUDA-10.0.

Install CUDA-10.0 rpm

download the cuda rpm from https://developer.nvidia.com/cuda-toolkit-archive
# sudo rpm -ivh cuda-repo-rhel7-10-0-local-10.0.130-410.48-1.0-1.x86_64.rpm
Verifying...                          ################################# [100%]
Preparing...                          ################################# [100%]
Updating / installing...
   1:cuda-repo-rhel7-10-0-local-10.0.1################################# [100%]

We are still using gcc-8.3.1.
# yum install cuda-10-0
Updating Subscription Management repositories.
Last metadata expiration check: 0:00:17 ago on Wed 20 Jan 2021 06:09:26 PM EST.
Dependencies resolved.
==================================================================================================================================================================================================================
 Package                                                     Architecture                         Version                                     Repository                                                     Size
==================================================================================================================================================================================================================
Installing:
 cuda-10-0                                                   x86_64                               10.0.130-1                                  cuda-10-0-local-10.0.130-410.48                               6.1 k
Installing dependencies:
 cuda-driver-dev-10-0                                        x86_64                               10.0.130-1                                  cuda-10-0-local-10.0.130-410.48                                20 k

check if nvidia-smi works

# nvidia-smi
Wed Jan 20 19:28:59 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.51.06    Driver Version: 450.51.06    CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce RTX 208...  Off  | 00000000:65:00.0 Off |                  N/A |
| 24%   32C    P8    27W / 257W |    158MiB / 11016MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A    182307      C   ...-battship/env/bin/python3      155MiB |
+-----------------------------------------------------------------------------+

Note: The driver version hasn't changed after installing cuda-10.0, which is why we can install and use both cuda-11.0 and cuda-10.0.

Install cudnn-7.6.5 for CUDA-10.0

Download cudnn

cudnn-7.6.5 is available from https://developer.nvidia.com/rdp/cudnn-archive.
# tar -xvf cudnn-10.0-linux-x64-v7.6.5.32.tgz
cuda/include/cudnn.h
cuda/NVIDIA_SLA_cuDNN_Support.txt
cuda/lib64/libcudnn.so
cuda/lib64/libcudnn.so.7
cuda/lib64/libcudnn.so.7.6.5
cuda/lib64/libcudnn_static.a

copy cudnn files to /usr/local/cuda-10.0

# sudo cp cuda/include/cudnn*.h /usr/local/cuda-10.0/include
# sudo cp cuda/lib64/libcudnn* /usr/local/cuda-10.0/lib64
# sudo chmod a+r /usr/local/cuda-10.0/include/cudnn*.h /usr/local/cuda-10.0/lib64/libcudnn*

check if GPU is detected with TensorFlow 1.15.x

>>> print("TF version: ", tf.__version__)
TF version:  1.15.4
>>> print("Num GPUs Available: ", len(tf.config.experimental.list_physical_devices('GPU')))
2021-01-20 18:20:01.798000: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2021-01-20 18:20:01.798009: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2021-01-20 18:20:01.798016: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10.0
2021-01-20 18:20:01.798024: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10.0
2021-01-20 18:20:01.798032: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10.0
2021-01-20 18:20:01.798039: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10.0
2021-01-20 18:20:01.798047: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2021-01-20 18:20:01.798729: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1767] Adding visible gpu devices: 0
2021-01-20 18:20:01.798749: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1180] Device interconnect StreamExecutor with strength 1 edge matrix:
2021-01-20 18:20:01.798755: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1186]      0
2021-01-20 18:20:01.798759: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1199] 0:   N
2021-01-20 18:20:01.799487: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1325] Created TensorFlow device (/device:GPU:0 with  MB memory) -> physical GPU (device: 0, name: GeForce RTX 2080 , pci bus id: 0000:65:00.0, compute capability: 7.5)
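
With both toolkits installed side by side under /usr/local/cuda-11.0 and /usr/local/cuda-10.0, each TensorFlow version has to find its own libraries. The Python sketch below shows one possible way to do that. It assumes the toolkit is selected through LD_LIBRARY_PATH (the variable that appears in the TensorFlow 2.4.0 log earlier), and `cuda_env` is a hypothetical helper. The version pairings in the table are the ones stated in this guide.

```python
import os

# TensorFlow release series -> CUDA version, as stated in this guide.
CUDA_FOR_TF = {"2.4": "11.0", "1.15": "10.0"}


def cuda_env(tf_version, base_env=None):
    """Return a copy of the environment with the matching lib64 dir first."""
    series = ".".join(tf_version.split(".")[:2])
    cuda = CUDA_FOR_TF[series]  # KeyError for versions this guide does not cover
    env = dict(os.environ if base_env is None else base_env)
    lib_dir = "/usr/local/cuda-" + cuda + "/lib64"
    existing = env.get("LD_LIBRARY_PATH", "")
    # Prepend so the chosen toolkit wins over anything already on the path.
    env["LD_LIBRARY_PATH"] = lib_dir + (":" + existing if existing else "")
    return env


print(cuda_env("1.15.4", base_env={})["LD_LIBRARY_PATH"])
```

The returned mapping can be passed as `env=` to `subprocess.run` when starting the Python interpreter of the matching virtual environment. The variable has to be set before that interpreter starts, not after TensorFlow has been imported.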

from __future__ import absolute_import, division, print_function, unicode_literals
import tensorflow as tf
from tensorflow.python.client import device_lib


def get_available_devices():
    """Return the names of all local devices (CPU and GPU) that TensorFlow can see."""
    local_device_protos = device_lib.list_local_devices()
    return [x.name for x in local_device_protos]


# Report the TensorFlow version, the number of GPUs, and every visible device.
print("TF version: ", tf.__version__)
print("Num GPUs Available: ", len(tf.config.experimental.list_physical_devices('GPU')))
print("Devices Available: ", get_available_devices())