Add yourself to the docker group to be able to run containers as non-root (see Post-install steps for Linux).
sudo groupadd docker
sudo usermod -aG docker $USER
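The new group membership only takes effect in fresh login sessions; as a shortcut, you can activate it in the current shell:
newgrp docker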
Verify with docker run hello-world.
Start by installing the appropriate NVIDIA drivers. Then continue to install NVIDIA Docker.
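For Ubuntu, a minimal sketch looks like this, assuming the nvidia-docker package repository has already been added as described in NVIDIA's installation instructions:
sudo apt-get update
sudo apt-get install -y nvidia-docker2
sudo systemctl restart docker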
Verify with docker run --gpus all,capabilities=utility nvidia/cuda:10.0-base nvidia-smi.
The first step is to identify the GPU(s) available on your system. Docker will expose these as 'resources' to the swarm. This allows other nodes to place services (swarm-managed container deployments) on your machine.
These steps are currently for NVIDIA GPUs.
Docker identifies your GPU by its Universally Unique IDentifier (UUID). Find the GPU UUID for the GPU(s) in your machine.
nvidia-smi -a
A typical UUID looks like GPU-45cbf7b3-f919-7228-7a26-b06628ebefa1. Now take only the first two dash-separated parts, e.g. GPU-45cbf7b3.
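As an alternative to scanning the full nvidia-smi -a output, a small pipeline (assuming standard coreutils) lists the shortened IDs of all GPUs directly:
nvidia-smi --query-gpu=uuid --format=csv,noheader | cut -d- -f1,2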
Open up the Docker engine configuration file, typically at /etc/docker/daemon.json.
Add the GPU ID to the node-generic-resources. Make sure that the nvidia runtime is present and set the default-runtime to it. Keep any other configuration options in place if they are already there. Take care with the JSON syntax, which is not forgiving of single quotes and trailing commas.
{
  "runtimes": {
    "nvidia": {
      "path": "/usr/bin/nvidia-container-runtime",
      "runtimeArgs": []
    }
  },
  "default-runtime": "nvidia",
  "node-generic-resources": [
    "gpu=GPU-45cbf7b3"
  ]
}
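If the machine has more than one GPU, list each one as a separate entry; the second ID below is purely illustrative:
"node-generic-resources": [
  "gpu=GPU-45cbf7b3",
  "gpu=GPU-8e4c9a21"
]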
Now, make sure to enable GPU resource advertising by adding or uncommenting the following line in /etc/nvidia-container-runtime/config.toml:
swarm-resource = "DOCKER_RESOURCE_GPU"
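If the line is already present but commented out (as it is in many default configs), you can uncomment it in place:
sudo sed -i 's/^#swarm-resource/swarm-resource/' /etc/nvidia-container-runtime/config.toml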
Restart the service.
sudo systemctl restart docker.service
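To check that the configuration was picked up, confirm that nvidia is now the default runtime:
docker info | grep -i runtime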
Initialize a new swarm on a manager-to-be.
docker swarm init
Add new worker nodes, or additional manager nodes (shared masters). Run the following command on a node that is already part of the swarm:
docker swarm join-token (worker|manager)
Then, run the resulting command on a member-to-be.
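The resulting command has this form, with the token and manager address as placeholders:
docker swarm join --token <TOKEN> <MANAGER-IP>:2377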
Show who's in the swarm:
docker node ls
Now deploy a service that requests a GPU as a generic resource:
docker service create --replicas 1 \
  --name tensor-qs \
  --generic-resource "gpu=1" \
  tomlankhorst/tensorflow-quickstart
This deploys a TensorFlow quickstart image that follows the TensorFlow quickstart tutorial.
Show active services:
docker service ls
Inspect the service:
$ docker service inspect --pretty tensor-qs
ID:             vtjcl47xc630o6vndbup64c1i
Name:           tensor-qs
Service Mode:   Replicated
 Replicas:      1
Placement:
UpdateConfig:
 Parallelism:   1
 On failure:    pause
 Monitoring Period: 5s
 Max failure ratio: 0
 Update order:  stop-first
RollbackConfig:
 Parallelism:   1
 On failure:    pause
 Monitoring Period: 5s
 Max failure ratio: 0
 Rollback order: stop-first
ContainerSpec:
 Image:         tomlankhorst/tensorflow-quickstart:latest@sha256:1f793df87f00478d0c41ccc7e6177f9a214a5d3508009995447f3f25b45496fb
 Init:          false
Resources:
Endpoint Mode:  vip
Show the logs:
$ docker service logs tensor-qs
...
tensor-qs.1.3f9jl1emwe9l@tlws | 2020-03-16 08:45:15.495159: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
tensor-qs.1.3f9jl1emwe9l@tlws | 2020-03-16 08:45:15.621767: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
tensor-qs.1.3f9jl1emwe9l@tlws | Epoch 1, Loss: 0.132665216923, Accuracy: 95.9766693115, Test Loss: 0.0573637597263, Test Accuracy: 98.1399993896
tensor-qs.1.3f9jl1emwe9l@tlws | Epoch 2, Loss: 0.0415383689106, Accuracy: 98.6949996948, Test Loss: 0.0489368513227, Test Accuracy: 98.3499984741
tensor-qs.1.3f9jl1emwe9l@tlws | Epoch 3, Loss: 0.0211332384497, Accuracy: 99.3150024414, Test Loss: 0.0521399155259, Test Accuracy: 98.2900009155
tensor-qs.1.3f9jl1emwe9l@tlws | Epoch 4, Loss: 0.0140329506248, Accuracy: 99.5716705322, Test Loss: 0.053688980639, Test Accuracy: 98.4700012207
tensor-qs.1.3f9jl1emwe9l@tlws | Epoch 5, Loss: 0.00931495986879, Accuracy: 99.7116699219, Test Loss: 0.0681483447552, Test Accuracy: 98.1500015259
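When you are done experimenting, remove the service again:
docker service rm tensor-qs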
@lyze237 You can share GPUs across containers by not requesting them as generic resources (a generic resource is allocated to only a single container). Instead, run the containers without declaring resources; by default, all GPUs on a node are visible to every container. If you then want to limit which GPUs a particular container uses, set the same NVIDIA_VISIBLE_DEVICES values as environment variables for those containers (assuming you don't want the containers to use all the GPUs). This is Solution 1 that I wrote up here: https://gist.github.com/coltonbh/374c415517dbeb4a6aa92f462b9eb287
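For example, a minimal sketch of pinning a service to only the first GPU via that variable (reusing the quickstart image from above, and assuming the nvidia default runtime configured earlier):
docker service create --name tensor-qs-gpu0 \
  --env NVIDIA_VISIBLE_DEVICES=0 \
  tomlankhorst/tensorflow-quickstart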