How to install multiple Tensorflow - CUDA versions on the same machine

As Tensorflow evolves continuously, it is common to end up in a situation where you need multiple versions of Tensorflow coexisting on the same machine. Those versions can differ enough to have different CUDA library dependencies. You may be tempted to upgrade everything to the latest release, but some of your solutions may still be in production, or there may simply be no room left in your calendar.

In this gist I will cover how to install several CUDA versions to support different Tensorflow versions. However, there are some red lines you have to respect: the GCC version must be the same for all of them, and the NVIDIA driver must support every target CUDA version. You can check that information on the Tensorflow website.

The basic idea is to install all the CUDA toolkits side by side and rely on the Linux dynamic linker to find the correct libraries when executing each Tensorflow version. As you probably know, LD_LIBRARY_PATH is an environment variable that lists the directories where the dynamic linker looks for shared libraries when running an executable. The order of the directories in this variable matters: if the target library is not found in the first entry, the linker checks the next one, and so on until the library is found or the whole list is exhausted.
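
If you want to see which directories the dynamic linker actually searches and which CUDA runtime it finally picks, you can ask the glibc loader for a trace. A minimal sketch, assuming a Python environment with Tensorflow installed (the import and the grep pattern are only illustrative):

# Trace the loader's library lookups while importing Tensorflow; the lines
# mentioning libcudart reveal which CUDA installation is being used.
LD_DEBUG=libs python -c "import tensorflow" 2>&1 | grep -i cudart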

Install the target CUDA toolkit versions

Go to the NVIDIA developer website and download the target CUDA Toolkits. Be careful during the installations, as each installer will try to upgrade your NVIDIA driver. I recommend installing the oldest version first and leaving the most recent one for the end.
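
If you install from the runfile packages, you can skip the bundled driver and install only the toolkit into its versioned directory. A possible invocation (the file name below is just an example, and the exact flags vary slightly between installer generations, so check your installer's --help):

# Install only the toolkit (no driver) into its own versioned directory
sudo sh cuda_10.0.130_410.48_linux.run --silent --toolkit --toolkitpath=/usr/local/cuda-10.0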

By default, the CUDA toolkits are installed under /usr/local on Linux. So, once you have finished installing all the CUDA versions, you will find one folder per version, together with a symbolic link named cuda pointing to the most recently installed toolkit. In my case, I have installed versions 10.0 and 10.1, so my /usr/local lists the following:

...
cuda -> /usr/local/cuda-10.1
cuda-10.0
cuda-10.1
...
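
To double-check that each toolkit is usable and matches its folder, you can query each compiler directly:

# Every versioned toolkit ships its own nvcc; the reported release must match the folder name
/usr/local/cuda-10.0/bin/nvcc --version
/usr/local/cuda-10.1/bin/nvcc --version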

Install the respective cuDNN libraries

Each CUDA installation must be self-contained, so you will need to install the corresponding cuDNN libraries for every toolkit, even if the cuDNN versions are the same. Download the cuDNN archives for Linux from the NVIDIA developer website (login is compulsory). The installation simply consists of copying the libraries into the versioned CUDA folders (for example /usr/local/cuda-10.0, not the cuda symlink, so each toolkit keeps its own copy). For instance:

$ tar xvf cudnn-10.0-linux-x64-v7.6.5.32.tgz
$ sudo cp -P cuda/lib64/* /usr/local/cuda-10.0/lib64/
$ sudo cp cuda/include/* /usr/local/cuda-10.0/include/
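
To confirm which cuDNN version ended up in each toolkit, you can inspect the version macros of the installed header (in recent cuDNN releases they live in cudnn_version.h instead of cudnn.h):

# Print the cuDNN version macros of the copy installed for CUDA 10.0
grep -A 2 CUDNN_MAJOR /usr/local/cuda-10.0/include/cudnn.h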

Configure LD_LIBRARY_PATH

Finally, you must configure LD_LIBRARY_PATH to include all the CUDA directories so the respective libraries can be found. I usually define that environment variable in the shell startup script ~/.bashrc, so I recommend editing it and adding the following lines. For instance:

  #nvidia
  export PATH=/usr/local/cuda-10.0/bin${PATH:+:${PATH}}
  export LD_LIBRARY_PATH=/usr/local/cuda-10.0/lib64:/usr/local/cuda-10.1/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}
  export LD_LIBRARY_PATH=/usr/local/cuda-10.0/extras/CUPTI/lib64:/usr/local/cuda-10.1/extras/CUPTI/lib64:$LD_LIBRARY_PATH
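
After re-sourcing ~/.bashrc you can quickly verify that both toolkits are visible to the dynamic linker and in the intended order:

# Reload the shell configuration and list the CUDA entries of LD_LIBRARY_PATH in search order
source ~/.bashrc
echo "$LD_LIBRARY_PATH" | tr ':' '\n' | grep cuda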

Remember to refresh the dynamic linker's cache with the following command:

sudo ldconfig

Possible errors

It is likely that you will run into a version mismatch between the NVIDIA kernel modules currently loaded and the newly installed driver components, so the modules need to be reloaded. You can simply reboot the machine, but maybe you are working on a server and that option is not on the table. If you cannot restart the machine, you must unload and reload the modules by hand. This problem is solved on Stack Overflow, and I include the solution here to keep the gist more self-contained.

List the NVIDIA driver modules currently loaded by the system:

lsmod | grep nvidia

Typical modules you may get:

nvidia_uvm            634880  8
nvidia_drm             53248  0
nvidia_modeset        790528  1 nvidia_drm
nvidia              12312576  86 nvidia_modeset,nvidia_uvm

Now, unload the modules. The goal is to have all of them unloaded, especially nvidia:

sudo rmmod nvidia_drm
sudo rmmod nvidia_modeset
sudo rmmod nvidia_uvm
sudo rmmod nvidia

If you have problems unloading some of the modules because they are in use ("rmmod: ERROR: Module nvidia is in use"), you must kill the processes holding them. First, list those processes with:

sudo lsof /dev/nvidia*

Then kill all the processes found (a one-liner sketch follows after the check below). If you succeed, the output must be empty when you type:

lsmod | grep nvidia
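
If many processes are still holding the device nodes, a possible one-liner to terminate them all in one pass (a sketch assuming lsof and xargs are available; -t makes lsof print only the PIDs):

# Kill every process that still has an NVIDIA device node open
sudo lsof -t /dev/nvidia* | xargs -r sudo kill -9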

Now, if you run nvidia-smi, it should work flawlessly.
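
If you prefer to bring the kernel modules back explicitly instead of relying on them being loaded on demand, a possible sequence (module names taken from the lsmod output above) is:

# Reload the NVIDIA kernel modules in dependency order
sudo modprobe nvidia
sudo modprobe nvidia_uvm
sudo modprobe nvidia_modeset
sudo modprobe nvidia_drm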

Source
