@qin-yu
Last active April 28, 2024 19:39
Use TensorFlow with GPU support on Ubuntu with Docker

Docker is the easiest way to run TensorFlow on a GPU since the host machine only requires the NVIDIA® driver (the NVIDIA® CUDA® Toolkit is not required).

System tested

  • Ubuntu 18.04.1
  • NVRM 435.21
  • GCC 7.5.0
  • Docker 19.03.8

Install Docker Engine - Community (using the repository)

Set up the repository

  1. Update the apt package index and install packages that allow apt to use a repository over HTTPS:
    $ sudo apt-get update
    $ sudo apt-get install \
        apt-transport-https \
        ca-certificates \
        curl \
        gnupg-agent \
        software-properties-common
  2. Add Docker’s official GPG key:
    $ curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo apt-key add -
    $ sudo apt-key fingerprint 0EBFCD88
  3. Set up the stable repository:
    $ sudo add-apt-repository \
       "deb [arch=amd64] https://download.docker.com/linux/ubuntu \
       $(lsb_release -cs) \
       stable"

Install Docker Engine - Community

  1. Install the latest version of Docker Engine - Community and containerd:
    $ sudo apt-get update
    $ sudo apt-get install docker-ce docker-ce-cli containerd.io
  2. Verify that Docker Engine - Community is installed correctly:
    $ sudo docker run hello-world

The Docker daemon binds to a Unix socket instead of a TCP port. By default that Unix socket is owned by the user root and other users can only access it using sudo. The Docker daemon always runs as the root user. If you don’t want to preface the docker command with sudo, create a Unix group called docker and add users to it. When the Docker daemon starts, it creates a Unix socket accessible by members of the docker group.

  1. Create the docker group:
    $ sudo groupadd docker
  2. Add your user to the docker group:
    $ sudo usermod -aG docker $USER
  3. Log out and log back in so that your group membership is re-evaluated. On Linux, you can also run the following command to activate the changes to groups:
    $ newgrp docker
  4. Verify that you can run docker commands without sudo:
    $ docker run hello-world
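After logging back in, you can confirm the group change took effect before trying docker itself; this small sketch only inspects your current group list (`id -nG` prints the groups of the current user):

```shell
# Report whether the current user's groups include "docker" (works without sudo)
if id -nG | grep -qw docker; then
  echo "docker group: yes"
else
  echo "docker group: no"
fi
```

If it prints "docker group: no" after a fresh login, the usermod step above did not apply.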

For GPU support on Linux, install NVIDIA Docker support

Make sure you have installed the NVIDIA driver and Docker 19.03 for your Linux distribution. Note that you do not need to install the CUDA Toolkit on the host, but the driver does need to be installed.

  1. Verify driver version:
    $ cat /proc/driver/nvidia/version
    The output could be:
    NVRM version: NVIDIA UNIX x86_64 Kernel Module  435.21  Sun Aug 25 08:17:57 CDT 2019
    GCC version:  gcc version 7.5.0 (Ubuntu 7.5.0-3ubuntu1~18.04)
  2. Verify the CUDA Toolkit version:
    $ nvcc -V
    This may print Command 'nvcc' not found, which is expected: the CUDA Toolkit is not needed on the host.

Install NVIDIA Container Toolkit

$ distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
$ curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
$ curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list

$ sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit
$ sudo systemctl restart docker
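The `distribution` variable above is built by sourcing `/etc/os-release`; you can inspect what it resolves to on your machine before adding the repository (on the system tested it would be `ubuntu18.04`):

```shell
# Resolve the distribution string used in the nvidia-docker repository URL
distribution=$(. /etc/os-release; echo "$ID$VERSION_ID")
echo "$distribution"
```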

Usage

Prefix the commands below with sudo if you get a permission-denied error:

  • Test nvidia-smi with the latest official CUDA image
    $ docker run --gpus all nvidia/cuda:10.0-base nvidia-smi
  • Start a GPU enabled container on two GPUs
    $ docker run --gpus 2 nvidia/cuda:10.0-base nvidia-smi
  • Start a GPU enabled container on specific GPUs
    $ docker run --gpus '"device=1,2"' nvidia/cuda:10.0-base nvidia-smi
    $ docker run --gpus '"device=UUID-ABCDEF,1"' nvidia/cuda:10.0-base nvidia-smi
    
  • Specify a capability (graphics, compute, ...) for the container
    $ docker run --gpus all,capabilities=utility nvidia/cuda:10.0-base nvidia-smi
    Note that capabilities are rarely, if ever, specified this way.
Download a GPU-enabled TensorFlow image with Jupyter support:

$ docker pull tensorflow/tensorflow:latest-gpu-py3-jupyter

To check what images are on the machine:

$ docker image ls

If you have followed this gist, you should see output like the following:

REPOSITORY              TAG                      IMAGE ID            CREATED             SIZE
tensorflow/tensorflow   latest-gpu-py3-jupyter   ce8f7398433c        2 months ago        4.26GB
nvidia/cuda             10.0-base                841d44dd4b3c        4 months ago        110MB
hello-world             latest                   fce289e99eb9        15 months ago       1.84kB

Note that if you run:

$ docker run -it --rm tensorflow/tensorflow \
   python -c "import tensorflow as tf; print(tf.reduce_sum(tf.random.normal([1000, 1000])))"

a new image tensorflow/tensorflow:latest will be downloaded:

REPOSITORY              TAG                      IMAGE ID            CREATED             SIZE
tensorflow/tensorflow   latest-gpu-py3-jupyter   ce8f7398433c        2 months ago        4.26GB
tensorflow/tensorflow   latest                   9bf93bf90865        2 months ago        2.47GB
nvidia/cuda             10.0-base                841d44dd4b3c        4 months ago        110MB
hello-world             latest                   fce289e99eb9        15 months ago       1.84kB

In general:

$ docker run [-it] [--rm] [-p hostPort:containerPort] tensorflow/tensorflow[:tag] [command]
  1. Start a bash shell session within a TensorFlow-configured container:
    $ docker run -it tensorflow/tensorflow bash
  2. To run a TensorFlow program developed on the host machine within a container, mount the host directory and change the container's working directory (-v hostDir:containerDir -w workDir):
    $ docker run -it --rm -v $PWD:/tmp -w /tmp tensorflow/tensorflow python ./script.py
  3. Start a Jupyter Notebook server using TensorFlow's nightly build with Python 3 support:
    $ docker run -it -p 8888:8888 tensorflow/tensorflow:nightly-py3-jupyter
For a GPU-enabled setup:

  1. Check if a GPU is available:
    $ lspci | grep -i nvidia
  2. Verify your nvidia-docker installation:
    $ docker run --gpus all --rm nvidia/cuda nvidia-smi
  3. Download and run a GPU-enabled TensorFlow image:
    $ docker run --gpus all -it --rm tensorflow/tensorflow:latest-gpu \
        python -c "import tensorflow as tf; print(tf.reduce_sum(tf.random.normal([1000, 1000])))"
    The output is shown in the comments at the end of this gist.
  4. Use the latest TensorFlow GPU image to start a bash shell session in the container:
    $ docker run --gpus all -it tensorflow/tensorflow:latest-gpu bash
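Before running the steps above, it can help to confirm that the needed tools are on your PATH; this minimal pre-flight sketch only reports what it finds, it installs nothing:

```shell
# Report whether docker and nvidia-smi are available on this host
for tool in docker nvidia-smi; do
  if command -v "$tool" >/dev/null 2>&1; then
    echo "$tool: found"
  else
    echo "$tool: missing"
  fi
done
```

If nvidia-smi is missing, revisit the NVIDIA driver installation; if docker is missing, revisit the Docker Engine section.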

Consider a case where you have a directory source and, when you build the source code, the artifacts are saved into another directory, source/target/. You want the artifacts to be available to the container at /app/, and you want the container to see a new build each time you build the source on your development host. In that case you would bind-mount the target/ directory into your container at /app/, running the command from within the source directory; the $(pwd) sub-command expands to the current working directory on Linux or macOS hosts. The command below adapts this pattern to mount the current directory at /mounteddir instead:

$ docker run -d \
    -it \
    --name msc2 \
    --mount type=bind,source="$(pwd)"/.,target=/mounteddir \
    tensorflow/tensorflow:latest-gpu-py3
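The --mount argument above can be assembled and inspected on its own, which shows how the $(pwd) sub-command expands before Docker ever sees it (a small sketch, no container is started):

```shell
# Assemble the --mount spec from the command above and print its expanded form
mountspec="type=bind,source=$(pwd)/.,target=/mounteddir"
echo "$mountspec"
```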

The -d flag runs the container detached; bring it back to the foreground with docker attach CONTAINER, where CONTAINER is the container name. Note that the command above is more complex than necessary, to illustrate fuller usage. The simplified command I use, with the port published for Jupyter Notebook, is:

$ docker run -it \
    --name msc2 \
    -p 8888:8888 \
    --mount type=bind,source="$(pwd)"/.,target=/mounteddir \
    tensorflow/tensorflow:latest-gpu-py3

Run docker rename CONTAINER NEW_NAME to rename the container, and use docker rm CONTAINER to delete the unwanted ones.

The docker run command first creates a writeable container layer over the specified image, and then starts it using the specified command. That is, docker run is equivalent to the API /containers/create then /containers/(id)/start. A stopped container can be restarted with all its previous changes intact using docker start. See docker ps -a to view a list of all containers.[*]

Next time you want to use it, with its state the same as when it exited [*]:

$ docker restart msc2
$ docker attach msc2

Now, inside the container msc2, you can run pip install notebook and launch Jupyter Notebook with [*]:

$ jupyter notebook --ip 0.0.0.0 --port 8888 --no-browser --allow-root

Jupyter Notebook will print a URL that can be opened in a browser outside the Docker container, since we published the port with -p.
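The -p 8888:8888 flag publishes the container's port on the host; the two halves of the spec are hostPort:containerPort. A sketch of how the mapping reads:

```shell
# Split a docker -p port spec into its host and container halves
spec="8888:8888"
host_port="${spec%%:*}"
container_port="${spec##*:}"
echo "host $host_port -> container $container_port"
# prints: host 8888 -> container 8888
```

Changing only the first half (e.g. -p 9999:8888) lets you reach Jupyter on host port 9999 while the server inside the container still listens on 8888.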


qin-yu commented Apr 3, 2020

$ sudo docker run --gpus all -it --rm tensorflow/tensorflow:latest-gpu \
    python -c "import tensorflow as tf; print(tf.reduce_sum(tf.random.normal([1000, 1000])))"

gives

Unable to find image 'tensorflow/tensorflow:latest-gpu' locally
latest-gpu: Pulling from tensorflow/tensorflow
7ddbc47eeb70: Already exists 
c1bbdc448b72: Already exists 
8c3b70e39044: Already exists 
45d437916d57: Already exists 
d8f1569ddae6: Already exists 
85386706b020: Already exists 
ee9b457b77d0: Already exists 
bebfcc1316f7: Already exists 
644140fd95a9: Already exists 
d6c0f989e873: Already exists 
e0c8121d4dcf: Pull complete 
3b08fd71d6c2: Pull complete 
a2cdbf2e693e: Pull complete 
ba62da0ce990: Pull complete 
fda5c033c1ca: Pull complete 
7aa2c0b26596: Pull complete 
Digest: sha256:f2fac496eee2170722aae5e5cf4254b193b811c190da7a965c78de88ec0329f6
Status: Downloaded newer image for tensorflow/tensorflow:latest-gpu
2020-04-03 03:52:30.183849: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libnvinfer.so.6
2020-04-03 03:52:30.185514: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libnvinfer_plugin.so.6
2020-04-03 03:52:30.627318: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2020-04-03 03:52:30.641588: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-04-03 03:52:30.642050: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1555] Found device 0 with properties: 
pciBusID: 0000:02:00.0 name: GeForce GTX 1080 computeCapability: 6.1
coreClock: 1.7335GHz coreCount: 20 deviceMemorySize: 7.93GiB deviceMemoryBandwidth: 298.32GiB/s
2020-04-03 03:52:30.642080: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
2020-04-03 03:52:30.642103: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2020-04-03 03:52:30.753780: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10
2020-04-03 03:52:30.790095: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10
2020-04-03 03:52:30.987332: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10
2020-04-03 03:52:31.015090: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10
2020-04-03 03:52:31.015273: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-04-03 03:52:31.015569: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-04-03 03:52:31.017409: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-04-03 03:52:31.019232: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1697] Adding visible gpu devices: 0
2020-04-03 03:52:31.020124: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2020-04-03 03:52:31.056321: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 3199980000 Hz
2020-04-03 03:52:31.057202: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x56327c7af800 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-04-03 03:52:31.057236: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
2020-04-03 03:52:31.253642: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-04-03 03:52:31.255142: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x56327c7e22f0 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2020-04-03 03:52:31.255204: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): GeForce GTX 1080, Compute Capability 6.1
2020-04-03 03:52:31.256161: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-04-03 03:52:31.257571: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1555] Found device 0 with properties: 
pciBusID: 0000:02:00.0 name: GeForce GTX 1080 computeCapability: 6.1
coreClock: 1.7335GHz coreCount: 20 deviceMemorySize: 7.93GiB deviceMemoryBandwidth: 298.32GiB/s
2020-04-03 03:52:31.257668: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
2020-04-03 03:52:31.257705: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2020-04-03 03:52:31.257748: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10
2020-04-03 03:52:31.257786: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10
2020-04-03 03:52:31.257824: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10
2020-04-03 03:52:31.257860: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10
2020-04-03 03:52:31.257891: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-04-03 03:52:31.258102: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-04-03 03:52:31.259418: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-04-03 03:52:31.260595: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1697] Adding visible gpu devices: 0
2020-04-03 03:52:31.260694: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
2020-04-03 03:52:33.656769: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1096] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-04-03 03:52:33.656848: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1102]      0 
2020-04-03 03:52:33.656873: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] 0:   N 
2020-04-03 03:52:33.657338: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-04-03 03:52:33.659115: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:981] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-04-03 03:52:33.660441: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1241] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 6890 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080, pci bus id: 0000:02:00.0, compute capability: 6.1)
tf.Tensor(120.103516, shape=(), dtype=float32)


qin-yu commented Apr 3, 2020

$ sudo docker run --gpus all -it tensorflow/tensorflow:latest-gpu bash

gives

________                               _______________                
___  __/__________________________________  ____/__  /________      __
__  /  _  _ \_  __ \_  ___/  __ \_  ___/_  /_   __  /_  __ \_ | /| / /
_  /   /  __/  / / /(__  )/ /_/ /  /   _  __/   _  / / /_/ /_ |/ |/ / 
/_/    \___//_/ /_//____/ \____//_/    /_/      /_/  \____/____/|__/


WARNING: You are running this container as root, which can cause new files in
mounted volumes to be created as the root user on your host machine.

To avoid this, run the container by specifying your user's userid:

$ docker run -u $(id -u):$(id -g) args...

root@efd3b1973ad3:/# exit
exit
