@Glavin001
Last active April 28, 2024 16:48
How to update CUDA version for TensorDock

Problem

TensorDock instances come with CUDA 10.1 pre-installed, which is old.

Many use cases, such as Flash Attention 2, require a newer version of CUDA.

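One symptom is that nvcc and nvidia-smi report different CUDA versions. nvidia-smi shows the highest CUDA version the installed driver supports, while nvcc -V shows the version of the installed toolkit: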

nvidia-smi    nvcc -V
✅ 12.2       ❌ 10.1

$ nvidia-smi
Tue Sep  5 00:50:23 2023       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.54.03              Driver Version: 535.54.03    CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 3090        Off | 00000000:05:00.0 Off |                  N/A |
| 42%   38C    P0             103W / 390W |      2MiB / 24576MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+
$ nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2019 NVIDIA Corporation
Built on Sun_Jul_28_19:07:16_PDT_2019
Cuda compilation tools, release 10.1, V10.1.243
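
The comparison above can also be scripted. The following is a minimal sketch, assuming the output formats shown in the two listings above; the helper names driver_cuda_version and toolkit_cuda_version are made up for illustration and are not part of any NVIDIA tool.

```shell
# Pull "CUDA Version: X.Y" out of nvidia-smi output supplied on stdin.
driver_cuda_version() {
  sed -n 's/.*CUDA Version: *\([0-9][0-9]*\.[0-9][0-9]*\).*/\1/p' | head -n 1
}

# Pull "release X.Y" out of `nvcc -V` output supplied on stdin.
toolkit_cuda_version() {
  sed -n 's/.*release \([0-9][0-9]*\.[0-9][0-9]*\).*/\1/p' | head -n 1
}

# Only compare when both tools are present on this machine.
if command -v nvidia-smi >/dev/null 2>&1 && command -v nvcc >/dev/null 2>&1; then
  driver="$(nvidia-smi | driver_cuda_version)"
  toolkit="$(nvcc -V | toolkit_cuda_version)"
  if [ "$driver" = "$toolkit" ]; then
    echo "in sync: CUDA $driver"
  else
    echo "mismatch: driver supports up to $driver, nvcc is $toolkit"
  fi
fi
```

A toolkit older than the driver's maximum is not an error in itself; it only matters here because 10.1 is too old for the software being built.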

Solution/Fixes

1. Unhold NVIDIA libraries

By default, the NVIDIA driver packages are held to prevent them from updating automatically:

sudo apt-mark unhold nvidia* libnvidia*
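
To see which packages are held before or after running the command, apt-mark showhold lists them. A small sketch, assuming the held packages follow the nvidia* and libnvidia* naming used above; the filter function held_nvidia_packages is a hypothetical helper, not an apt command.

```shell
# Keep only package names that start with "nvidia" or "libnvidia".
# `|| true` keeps the exit status clean when nothing matches.
held_nvidia_packages() {
  grep -E '^(lib)?nvidia' || true
}

# apt-mark showhold needs no root; skip on systems without apt.
if command -v apt-mark >/dev/null 2>&1; then
  apt-mark showhold | held_nvidia_packages
fi
```

If this prints nothing after the unhold, no NVIDIA packages remain held.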

2. Install latest NVIDIA CUDA

Go to the CUDA downloads page and select your target platform. You'll be given commands like:

wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/cuda-keyring_1.0-1_all.deb
sudo dpkg -i cuda-keyring_1.0-1_all.deb
sudo apt-get update
sudo apt-get -y install cuda

Running these registers NVIDIA's package repository so that apt-get can find the newer CUDA packages.
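
The ubuntu2004 segment in the URL above corresponds to Ubuntu 20.04. As a rough sketch, assuming the repository path is simply the distribution ID followed by the version without the dot (an inference from that URL, so still confirm against the downloads page), the segment for the current machine can be derived from /etc/os-release. The function name cuda_repo_slug is hypothetical.

```shell
# Build a repo path segment such as "ubuntu2004" from an ID and a version.
cuda_repo_slug() {
  printf '%s%s\n' "$1" "$(printf '%s' "$2" | tr -d '.')"
}

# Read ID and VERSION_ID in a subshell so they do not leak into this shell.
if [ -r /etc/os-release ]; then
  ( . /etc/os-release; cuda_repo_slug "$ID" "$VERSION_ID" )
fi
```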

3. Install newer CUDA toolkit

For example:

sudo apt-get install cuda-toolkit-12-2

By now you should see /usr/local/cuda-12.2 (or your version) installed:

$ ls -l /usr/local/
total 36
drwxr-xr-x  2 root root 4096 Sep  5 00:48 bin
lrwxrwxrwx  1 root root   22 Sep  5 00:48 cuda -> /etc/alternatives/cuda
lrwxrwxrwx  1 root root   25 Sep  5 00:48 cuda-12 -> /etc/alternatives/cuda-12
drwxr-xr-x 15 root root 4096 Sep  5 00:48 cuda-12.2
drwxr-xr-x  2 root root 4096 Jun 19 21:39 etc
drwxr-xr-x  2 root root 4096 Jun 19 21:39 games
drwxr-xr-x  2 root root 4096 Jun 19 21:39 include
drwxr-xr-x  3 root root 4096 Jun 19 21:39 lib
lrwxrwxrwx  1 root root    9 Jun 19 21:39 man -> share/man
drwxr-xr-x  2 root root 4096 Jun 19 21:39 sbin
drwxr-xr-x  5 root root 4096 Jul  5 03:43 share
drwxr-xr-x  2 root root 4096 Jun 19 21:39 src

Unfortunately, nvcc still resolves to the old toolkit, so the reported version is still outdated.
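
To see why, it helps to compare the nvcc that the shell finds on PATH with the toolkits actually present on disk. A minimal sketch, assuming toolkits live under /usr/local as in the listing above; list_cuda_toolkits is a made-up helper name.

```shell
# Print every PREFIX/cuda-* entry (directories and symlinks such as cuda-12)
# that contains an executable bin/nvcc. PREFIX defaults to /usr/local.
list_cuda_toolkits() {
  prefix="${1:-/usr/local}"
  for dir in "$prefix"/cuda-*/; do
    if [ -x "${dir}bin/nvcc" ]; then
      printf '%s\n' "${dir%/}"
    fi
  done
}

# Show which nvcc the shell currently resolves, then what is installed.
echo "nvcc on PATH: $(command -v nvcc || echo 'not found')"
list_cuda_toolkits
```

If the first line points somewhere other than one of the listed directories, PATH still needs updating.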

4. Update environment variables for CUDA

Append to ~/.bashrc the following:

CUDA_VERSION="12.2"
export PATH=/usr/local/cuda-${CUDA_VERSION}/bin${PATH:+:${PATH}}
export LD_LIBRARY_PATH=/usr/local/cuda-${CUDA_VERSION}/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}

Then either open a new shell or update the current one with:

$ source ~/.bashrc 
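
The ${PATH:+:${PATH}} form in those exports appends a colon and the old value only when the variable is already set and non-empty. The sketch below isolates that behavior in a small illustrative function (prepend_path is a name invented here) so it can be tried without touching the real PATH.

```shell
# Print DIR followed by ":EXISTING", or DIR alone when EXISTING is empty.
# Mirrors the ${VAR:+:${VAR}} idiom used in the exports above.
prepend_path() {
  printf '%s%s\n' "$1" "${2:+:$2}"
}

# Empty existing value: no trailing colon is added.
prepend_path /usr/local/cuda-12.2/lib64 ""
# prints /usr/local/cuda-12.2/lib64

# Non-empty existing value: joined with a single colon.
prepend_path /usr/local/cuda-12.2/bin "/usr/bin:/bin"
# prints /usr/local/cuda-12.2/bin:/usr/bin:/bin
```

Avoiding the trailing colon matters mostly for LD_LIBRARY_PATH, which is often unset: an empty entry in that list is treated as the current directory.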

5. Profit

You're good now!

✅ Release 12.2

(tensorml) user@7ff6481e-fbd6-4dda-b12b-ac7b7c1ca4b2:~/axolotl$ nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Tue_Aug_15_22:02:13_PDT_2023
Cuda compilation tools, release 12.2, V12.2.140
Build cuda_12.2.r12.2/compiler.33191640_0
@Glavin001

First step: upgrade CUDA itself

Running

wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/cuda-keyring_1.0-1_all.deb
sudo dpkg -i cuda-keyring_1.0-1_all.deb
sudo apt-get update
sudo apt-get -y install cuda

fails with:

$ sudo apt-get -y install cuda
Reading package lists... Done
Building dependency tree       
Reading state information... Done
Some packages could not be installed. This may mean that you have
requested an impossible situation or if you are using the unstable
distribution that some required packages have not yet been created
or been moved out of Incoming.
The following information may help to resolve the situation:

The following packages have unmet dependencies:
 cuda : Depends: cuda-12-2 (>= 12.2.2) but it is not going to be installed
E: Unable to correct problems, you have held broken packages.

@Glavin001

Glavin001 commented Sep 5, 2023

Unlocking:

sudo apt-mark unhold nvidia* libnvidia*

failed with:

dpkg: error: dpkg frontend lock is locked by another process
E: Sub-process dpkg --set-selections returned an error code (2)
E: Executing dpkg failed. Are you root?

so then thanks to GPT-4:

(tensorml) user@7ff6481e-fbd6-4dda-b12b-ac7b7c1ca4b2:~/axolotl$ sudo lsof /var/lib/dpkg/lock-frontend
COMMAND    PID USER   FD   TYPE DEVICE SIZE/OFF  NODE NAME
unattende 7034 root    8uW  REG  252,1        0 71474 /var/lib/dpkg/lock-frontend
(tensorml) user@7ff6481e-fbd6-4dda-b12b-ac7b7c1ca4b2:~/axolotl$ sudo kill -i 7034
kill: invalid argument i

Usage:
 kill [options] <pid> [...]

Options:
 <pid> [...]            send signal to every <pid> listed
 -<signal>, -s, --signal <signal>
                        specify the <signal> to be sent
 -l, --list=[<signal>]  list all signal names, or convert one to a name
 -L, --table            list all signal names in a nice table

 -h, --help     display this help and exit
 -V, --version  output version information and exit

For more details see kill(1).
(tensorml) user@7ff6481e-fbd6-4dda-b12b-ac7b7c1ca4b2:~/axolotl$ sudo kill -9 7034
(tensorml) user@7ff6481e-fbd6-4dda-b12b-ac7b7c1ca4b2:~/axolotl$ sudo lsof /var/lib/dpkg/lock-frontend
(tensorml) user@7ff6481e-fbd6-4dda-b12b-ac7b7c1ca4b2:~/axolotl$

back on track.
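
A gentler alternative to kill -9 is to wait for the lock holder (here unattended-upgrades) to finish, since killing a package operation midway can leave dpkg in a half-configured state. The following is only a sketch: the function wait_for_dpkg_lock and the LSOF_CMD override are hypothetical names, and the 5-second polling interval is an arbitrary choice.

```shell
# Wait until nothing holds the dpkg frontend lock.
# $1 = lock file (default /var/lib/dpkg/lock-frontend)
# $2 = maximum number of checks before giving up (default 60)
# LSOF_CMD (hypothetical override) is the command used to probe the file.
wait_for_dpkg_lock() {
  lock_file="${1:-/var/lib/dpkg/lock-frontend}"
  max_tries="${2:-60}"
  tries=0
  # lsof exits 0 while some process has the file open.
  while ${LSOF_CMD:-sudo lsof} "$lock_file" >/dev/null 2>&1; do
    tries=$((tries + 1))
    if [ "$tries" -ge "$max_tries" ]; then
      return 1
    fi
    sleep 5
  done
  return 0
}

# Example (not run here):
#   wait_for_dpkg_lock && sudo apt-mark unhold nvidia* libnvidia*
```

The function returns non-zero if the lock is still held after the last check, so the caller can decide whether to stop the holder manually.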

@Glavin001

$ sudo apt install nvidia-cuda-toolkit
Reading package lists... Done
Building dependency tree       
Reading state information... Done
nvidia-cuda-toolkit is already the newest version (10.1.243-3).
The following packages were automatically installed and are no longer required:
  linux-headers-5.4.0-152 linux-headers-5.4.0-152-generic linux-image-5.4.0-152-generic linux-modules-5.4.0-152-generic
Use 'sudo apt autoremove' to remove them.
0 upgraded, 0 newly installed, 0 to remove and 37 not upgraded.
$ sudo apt-get install cuda-toolkit-12-2

this worked!

$ ls -l /usr/local/
total 36
drwxr-xr-x  2 root root 4096 Sep  5 00:48 bin
lrwxrwxrwx  1 root root   22 Sep  5 00:48 cuda -> /etc/alternatives/cuda
lrwxrwxrwx  1 root root   25 Sep  5 00:48 cuda-12 -> /etc/alternatives/cuda-12
drwxr-xr-x 15 root root 4096 Sep  5 00:48 cuda-12.2
drwxr-xr-x  2 root root 4096 Jun 19 21:39 etc
drwxr-xr-x  2 root root 4096 Jun 19 21:39 games
drwxr-xr-x  2 root root 4096 Jun 19 21:39 include
drwxr-xr-x  3 root root 4096 Jun 19 21:39 lib
lrwxrwxrwx  1 root root    9 Jun 19 21:39 man -> share/man
drwxr-xr-x  2 root root 4096 Jun 19 21:39 sbin
drwxr-xr-x  5 root root 4096 Jul  5 03:43 share
drwxr-xr-x  2 root root 4096 Jun 19 21:39 src

Now cuda-12.2 is installed in /usr/local/!

@Glavin001

Still out of sync:

$ nvidia-smi
Tue Sep  5 00:50:23 2023       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.54.03              Driver Version: 535.54.03    CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 3090        Off | 00000000:05:00.0 Off |                  N/A |
| 42%   38C    P0             103W / 390W |      2MiB / 24576MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+
$ nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2019 NVIDIA Corporation
Built on Sun_Jul_28_19:07:16_PDT_2019
Cuda compilation tools, release 10.1, V10.1.243

@Glavin001

Append to ~/.bashrc the following:

CUDA_VERSION="12.2"
export PATH=/usr/local/cuda-${CUDA_VERSION}/bin${PATH:+:${PATH}}
export LD_LIBRARY_PATH=/usr/local/cuda-${CUDA_VERSION}/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}

And now it works!

$ source ~/.bashrc 
(tensorml) user@7ff6481e-fbd6-4dda-b12b-ac7b7c1ca4b2:~/axolotl$ nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Tue_Aug_15_22:02:13_PDT_2023
Cuda compilation tools, release 12.2, V12.2.140
Build cuda_12.2.r12.2/compiler.33191640_0