Instructions for Docker swarm with GPUs

Setting up a Docker Swarm with GPUs

Installing Docker

Official instructions.

Add yourself to the docker group to be able to run containers as non-root (see Post-install steps for Linux).

sudo groupadd docker
sudo usermod -aG docker $USER

Verify with docker run hello-world.

Installing the NVIDIA Container Runtime

Official instructions.

Start by installing the appropriate NVIDIA drivers. Then install NVIDIA Docker.

Verify with docker run --gpus all,capabilities=utility nvidia/cuda:10.0-base nvidia-smi.

Configuring Docker to work with your GPU(s)

The first step is to identify the GPU(s) available on your system. Docker will expose these as 'resources' to the swarm. This allows other nodes to place services (swarm-managed container deployments) on your machine.

These steps currently apply to NVIDIA GPUs only.

Docker identifies your GPU by its Universally Unique IDentifier (UUID). Find the GPU UUID for the GPU(s) in your machine.

nvidia-smi -a

A typical UUID looks like GPU-45cbf7b3-f919-7228-7a26-b06628ebefa1. Now, only take the first two dash-separated parts, e.g.: GPU-45cbf7b3.
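
As a rough sketch of that shortening step, the helper below reads nvidia-smi -a output on stdin and prints the first two dash-separated parts of every UUID it finds. The function name short_gpu_ids is made up for this example, and the pattern assumes UUIDs in the GPU-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx form shown above.

```shell
# Hypothetical helper: print the shortened ID (e.g. GPU-45cbf7b3) for every
# GPU UUID found on stdin.
short_gpu_ids() {
  # grep -o prints only the matching UUID; cut keeps the first two
  # dash-separated fields.
  grep -oE 'GPU-[0-9a-f]{8}(-[0-9a-f]{4}){3}-[0-9a-f]{12}' | cut -d- -f1,2
}

# Usage on a machine with the NVIDIA driver installed:
#   nvidia-smi -a | short_gpu_ids
```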

Open up the Docker engine configuration file, typically at /etc/docker/daemon.json.

Add the GPU ID to node-generic-resources. Make sure that the nvidia runtime is present and set default-runtime to it. Keep any other configuration options that are already in the file. Mind the JSON syntax, which does not allow single quotes or trailing commas.

{
  "runtimes": {
    "nvidia": {
      "path": "/usr/bin/nvidia-container-runtime",
      "runtimeArgs": []
    }
  },
  "default-runtime": "nvidia",
  "node-generic-resources": [
    "gpu=GPU-45cbf7b3"
  ]
}
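
With several GPUs, each shortened ID gets its own entry in the list. The sketch below prints such a fragment without a trailing comma. generic_resources_fragment is a hypothetical helper, the second ID in the usage line is a placeholder, and the output still has to be merged into daemon.json by hand.

```shell
# Hypothetical helper: print a node-generic-resources fragment.
# $1 is the resource kind (e.g. gpu), the remaining arguments are GPU IDs.
generic_resources_fragment() {
  kind=$1
  shift
  printf '"node-generic-resources": ['
  sep=''
  for id in "$@"; do
    # The separator is empty for the first entry, so no comma follows the last.
    printf '%s\n    "%s=%s"' "$sep" "$kind" "$id"
    sep=','
  done
  printf '\n  ]\n'
}

# Usage (the second ID is a placeholder):
#   generic_resources_fragment gpu GPU-45cbf7b3 GPU-0a1b2c3d
```

One way to check the merged file for syntax errors before restarting Docker is python3 -m json.tool /etc/docker/daemon.json, assuming Python 3 is installed.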

Now enable GPU resource advertising by adding or uncommenting the following line in /etc/nvidia-container-runtime/config.toml:

swarm-resource = "DOCKER_RESOURCE_GPU"
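
If the line is present but commented out, something like the following could uncomment it. This is only a sketch: enable_swarm_resource is a hypothetical helper, it assumes the line starts with #swarm-resource, and it keeps a .bak copy of the file. A commenter below reports that the option is deprecated in newer nvidia-container-runtime packages, so check your version before applying it.

```shell
# Hypothetical helper: uncomment the swarm-resource line in a config.toml.
# $1 is the path to the file; the original is kept as "$1.bak".
enable_swarm_resource() {
  # Strip a leading '#' (and any spaces after it) from the swarm-resource line.
  sed -i.bak 's/^#[[:space:]]*\(swarm-resource[[:space:]]*=\)/\1/' "$1"
}

# Usage (run as root for the system-wide file):
#   enable_swarm_resource /etc/nvidia-container-runtime/config.toml
```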

Restart the service.

sudo systemctl restart docker.service

Initializing the Docker Swarm

Initialize a new swarm on a manager-to-be.

docker swarm init

Add new worker nodes or additional manager nodes. Run the following command on a manager node that is already part of the swarm:

docker swarm join-token (worker|manager)

Then, run the resulting command on a member-to-be.

Show who's in the swarm:

docker node ls
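
To check which generic resources a node actually advertises to the swarm, its description can be inspected from a manager. A small sketch, assuming the Description.Resources.GenericResources field of docker node inspect. node_generic_resources is a hypothetical wrapper, and the DOCKER variable exists only so the command can be swapped out.

```shell
# Hypothetical wrapper: print the generic resources a node advertises, as JSON.
# $1 is a node name or ID; it defaults to the current node ("self").
node_generic_resources() {
  "${DOCKER:-docker}" node inspect "${1:-self}" \
    --format '{{json .Description.Resources.GenericResources}}'
}

# Usage on a manager node:
#   node_generic_resources          # this node
#   node_generic_resources tlws     # a hostname taken from docker node ls
```

If the output is empty or null after a restart, the daemon.json entry was probably not picked up.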

A first deployment

docker service create --replicas 1 \
  --name tensor-qs \
  --generic-resource "gpu=1" \
  tomlankhorst/tensorflow-quickstart

This deploys an image that runs the TensorFlow quick start.

Show active services:

docker service ls

Inspect the service:

$ docker service inspect --pretty tensor-qs
ID:             vtjcl47xc630o6vndbup64c1i
Name:           tensor-qs
Service Mode:   Replicated
 Replicas:      1
Placement:
UpdateConfig:
 Parallelism:   1
 On failure:    pause
 Monitoring Period: 5s
 Max failure ratio: 0
 Update order:      stop-first
RollbackConfig:
 Parallelism:   1
 On failure:    pause
 Monitoring Period: 5s
 Max failure ratio: 0
 Rollback order:    stop-first
ContainerSpec:
 Image:         tomlankhorst/tensorflow-quickstart:latest@sha256:1f793df87f00478d0c41ccc7e6177f9a214a5d3508009995447f3f25b45496fb
 Init:          false
Resources:
Endpoint Mode:  vip

Show the logs:

$ docker service logs tensor-qs
...
tensor-qs.1.3f9jl1emwe9l@tlws    | 2020-03-16 08:45:15.495159: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
tensor-qs.1.3f9jl1emwe9l@tlws    | 2020-03-16 08:45:15.621767: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
tensor-qs.1.3f9jl1emwe9l@tlws    | Epoch 1, Loss: 0.132665216923, Accuracy: 95.9766693115, Test Loss: 0.0573637597263, Test Accuracy: 98.1399993896
tensor-qs.1.3f9jl1emwe9l@tlws    | Epoch 2, Loss: 0.0415383689106, Accuracy: 98.6949996948, Test Loss: 0.0489368513227, Test Accuracy: 98.3499984741
tensor-qs.1.3f9jl1emwe9l@tlws    | Epoch 3, Loss: 0.0211332384497, Accuracy: 99.3150024414, Test Loss: 0.0521399155259, Test Accuracy: 98.2900009155
tensor-qs.1.3f9jl1emwe9l@tlws    | Epoch 4, Loss: 0.0140329506248, Accuracy: 99.5716705322, Test Loss: 0.053688980639, Test Accuracy: 98.4700012207
tensor-qs.1.3f9jl1emwe9l@tlws    | Epoch 5, Loss: 0.00931495986879, Accuracy: 99.7116699219, Test Loss: 0.0681483447552, Test Accuracy: 98.1500015259

@mdailey commented May 31, 2020

It seems that the swarm-resource option is now deprecated in the latest nvidia-container-runtime package. On Ubuntu 16.04 with nvidia-container-runtime package version 3.2.0-1 and nvidia-container-toolkit package version 1.1.1-1, uncommenting the swarm-resource line in /etc/nvidia-container-runtime/config.toml breaks my swarm services.

@RafaelWO commented Aug 11, 2020

I got this solution to work with some changes:

  1. Change "gpu=GPU-45cbf7b" to "NVIDIA-GPU=GPU-45cbf7b" in the file /etc/docker/daemon.json
  2. Start the service with the arg --generic-resource "NVIDIA-GPU=0"

References:
https://docs.docker.com/engine/reference/commandline/dockerd/#miscellaneous-options
https://docs.docker.com/engine/reference/commandline/service_create/#create-services-requesting-generic-resources

@julienschuermans commented Feb 22, 2021

Thanks @RafaelWO, this worked for me!

@maaft commented Mar 31, 2021

Hi! I have multiple GPUs on my server and added 2 out of 8 to node-generic-resources in /etc/docker/daemon.json.

When I deploy my image with docker service create --replicas 2 --name swarm-test --generic-resource "NVIDIA-GPU=1" swarm-test, both containers use the same GPU.

Furthermore, nvidia-smi still shows all 8 GPUs (although only 2 are present in daemon.json). Is this file somehow ignored?

Instead I want each replica to use a dedicated GPU. How can I achieve this?

@nanotower commented Apr 16, 2021

Swarm with NVIDIA is a mess. The documentation is poor, and even today there are no straightforward steps to make it work properly.

I have two instances in Google Compute Engine, both with an NVIDIA Tesla T4. Suddenly one stopped working: CUDA in the swarm is gone. The other one, with exactly the same config, is still working. I checked the UUIDs with nvidia-smi -a and both have changed. daemon.json has an older UUID in both VMs, but one works and the other doesn't. Can someone explain this?
I can understand that Google may change the hardware across start and stop cycles. But why does it work if daemon.json has a different UUID?
