Last active
March 17, 2025 06:40
Revisions
tomlankhorst revised this gist
Mar 16, 2020. 1 changed file with 6 additions and 0 deletions.
@@ -66,6 +66,12 @@ Take care of the JSON syntax, which is not forgiving of single quotes and laggin

Now, make sure to enable GPU resource advertising by adding or uncommenting the following in `/etc/nvidia-container-runtime/config.toml`:

```
swarm-resource = "DOCKER_RESOURCE_GPU"
```

Restart the service.
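The "adding or uncommenting" step can be scripted. The following is a minimal sketch, not an official tool: `enable_swarm_resource` is a hypothetical helper name, it assumes GNU `sed` (as on Ubuntu), and it assumes the commented-out line, if present, has the form `#swarm-resource = ...`. When the key is missing entirely, the sketch prepends it so that it stays a top-level TOML key rather than ending up inside a `[section]`.

```shell
#!/usr/bin/env bash

# enable_swarm_resource FILE  (hypothetical helper, see assumptions above)
# Make sure FILE contains an active top-level `swarm-resource` key.
enable_swarm_resource() {
  local file="$1" tmp
  if grep -Eq '^[[:space:]]*swarm-resource[[:space:]]*=' "$file"; then
    # Already enabled: nothing to do.
    return 0
  elif grep -Eq '^[[:space:]]*#[[:space:]]*swarm-resource[[:space:]]*=' "$file"; then
    # Present but commented out: strip the leading '#'.
    sed -i -E 's/^[[:space:]]*#[[:space:]]*(swarm-resource[[:space:]]*=)/\1/' "$file"
  else
    # Absent: prepend the key so it is not swallowed by a TOML table.
    tmp="$(mktemp)"
    { printf 'swarm-resource = "DOCKER_RESOURCE_GPU"\n'; cat "$file"; } > "$tmp"
    cat "$tmp" > "$file"
    rm -f "$tmp"
  fi
}
```

After backing up the file, the function would be called as root with the path from above: `enable_swarm_resource /etc/nvidia-container-runtime/config.toml`. Running it twice is harmless, since an already active key is left alone.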
tomlankhorst created this gist
Mar 16, 2020.
@@ -0,0 +1,155 @@

Setting up a Docker Swarm with GPUs
=====

Installing Docker
----

[Official instructions](https://docs.docker.com/install/linux/docker-ce/ubuntu/).

Add yourself to the `docker` group to be able to run containers as non-root (see [Post-install steps for Linux](https://docs.docker.com/install/linux/linux-postinstall/)).

```
sudo groupadd docker
sudo usermod -aG docker $USER
```

Log out and back in so that the new group membership takes effect. Verify with `docker run hello-world`.

Installing the NVIDIA Container Runtime
----

[Official instructions](https://github.com/NVIDIA/nvidia-docker).

Start by installing the appropriate NVIDIA drivers. Then continue to install NVIDIA Docker.

Verify with `docker run --gpus all,capabilities=utility nvidia/cuda:10.0-base nvidia-smi`.

Configuring Docker to work with your GPU(s)
----

The first step is to identify the GPU(s) available on your system. Docker will expose these as 'resources' to the swarm. This allows other nodes to place services (swarm-managed container deployments) on your machine.

_These steps are currently for NVIDIA GPUs._

Docker identifies your GPU by its Universally Unique IDentifier (UUID). Find the UUID of each GPU in your machine.

```
nvidia-smi -a
```

A typical UUID looks like `GPU-45cbf7b3-f919-7228-7a26-b06628ebefa1`. Take only the first two dash-separated parts, e.g. `GPU-45cbf7b3`.

Open up the Docker engine configuration file, typically at `/etc/docker/daemon.json`.

Add the GPU ID to `node-generic-resources`. Make sure that the `nvidia` runtime is present and set `default-runtime` to it. Keep any other configuration options that are already there.
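The truncation of the UUID to its first two dash-separated parts can be sketched in shell. Both function names below are hypothetical helpers, and `gpu_resource_entries` assumes that the installed `nvidia-smi` supports the `--query-gpu=uuid --format=csv,noheader` query.

```shell
#!/usr/bin/env bash

# Keep only the first two dash-separated parts of a GPU UUID,
# e.g. GPU-45cbf7b3-f919-7228-7a26-b06628ebefa1 -> GPU-45cbf7b3.
gpu_short_id() {
  printf '%s\n' "$1" | cut -d- -f1,2
}

# Print one "gpu=<short id>" entry per GPU in this machine (hypothetical helper).
# Assumption: nvidia-smi supports --query-gpu=uuid with CSV output.
gpu_resource_entries() {
  nvidia-smi --query-gpu=uuid --format=csv,noheader | while read -r uuid; do
    printf 'gpu=%s\n' "$(gpu_short_id "$uuid")"
  done
}
```

For the example UUID, `gpu_short_id GPU-45cbf7b3-f919-7228-7a26-b06628ebefa1` prints `GPU-45cbf7b3`. Each line printed by `gpu_resource_entries` is meant to become one string in the `node-generic-resources` list.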
Take care of the JSON syntax, which is not forgiving of single quotes and trailing commas.

```json
{
  "runtimes": {
    "nvidia": {
      "path": "/usr/bin/nvidia-container-runtime",
      "runtimeArgs": []
    }
  },
  "default-runtime": "nvidia",
  "node-generic-resources": [
    "gpu=GPU-45cbf7b3"
  ]
}
```

Restart the service.

```
sudo systemctl restart docker.service
```

Initializing the Docker Swarm
----

Initialize a new swarm on a manager-to-be.

```
docker swarm init
```

Add new worker nodes or additional manager nodes. Run the following command on a manager node of the swarm:

```
docker swarm join-token (worker|manager)
```

Then, run the resulting command on a member-to-be.

Show who's in the swarm:

```
docker node ls
```

A first deployment
---

```
docker service create --replicas 1 \
  --name tensor-qs \
  --generic-resource "gpu=1" \
  tomlankhorst/tensorflow-quickstart
```

This deploys [a TensorFlow quick start image](https://hub.docker.com/r/tomlankhorst/tensorflow-quickstart) that follows [the quick start](https://www.tensorflow.org/tutorials/quickstart/advanced).

Show active services:

```
docker service ls
```

Inspect the service:

```
$ docker service inspect --pretty tensor-qs
ID: vtjcl47xc630o6vndbup64c1i
Name: tensor-qs
Service Mode: Replicated
 Replicas: 1
Placement:
UpdateConfig:
 Parallelism: 1
 On failure: pause
 Monitoring Period: 5s
 Max failure ratio: 0
 Update order: stop-first
RollbackConfig:
 Parallelism: 1
 On failure: pause
 Monitoring Period: 5s
 Max failure ratio: 0
 Rollback order: stop-first
ContainerSpec:
 Image: tomlankhorst/tensorflow-quickstart:latest@sha256:1f793df87f00478d0c41ccc7e6177f9a214a5d3508009995447f3f25b45496fb
 Init: false
Resources:
Endpoint Mode: vip
```

Show the logs:

```
$ docker service logs tensor-qs
...
tensor-qs.1.3f9jl1emwe9l@tlws | 2020-03-16 08:45:15.495159: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
tensor-qs.1.3f9jl1emwe9l@tlws | 2020-03-16 08:45:15.621767: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
tensor-qs.1.3f9jl1emwe9l@tlws | Epoch 1, Loss: 0.132665216923, Accuracy: 95.9766693115, Test Loss: 0.0573637597263, Test Accuracy: 98.1399993896
tensor-qs.1.3f9jl1emwe9l@tlws | Epoch 2, Loss: 0.0415383689106, Accuracy: 98.6949996948, Test Loss: 0.0489368513227, Test Accuracy: 98.3499984741
tensor-qs.1.3f9jl1emwe9l@tlws | Epoch 3, Loss: 0.0211332384497, Accuracy: 99.3150024414, Test Loss: 0.0521399155259, Test Accuracy: 98.2900009155
tensor-qs.1.3f9jl1emwe9l@tlws | Epoch 4, Loss: 0.0140329506248, Accuracy: 99.5716705322, Test Loss: 0.053688980639, Test Accuracy: 98.4700012207
tensor-qs.1.3f9jl1emwe9l@tlws | Epoch 5, Loss: 0.00931495986879, Accuracy: 99.7116699219, Test Loss: 0.0681483447552, Test Accuracy: 98.1500015259
```
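If the service stays pending instead of producing logs, one thing worth checking is whether the node actually advertises its GPU to the swarm. The following is a minimal sketch under stated assumptions: the function names are hypothetical, the commands must run on a manager node, and the sketch assumes that `docker node inspect` exposes generic resources under `.Description.Resources.GenericResources` with a `Kind` field, as in the Docker Engine API's node description.

```shell
#!/usr/bin/env bash

# Print the generic resources advertised by the current node, as JSON.
# "self" refers to the node the command runs on (manager nodes only).
node_generic_resources() {
  docker node inspect self \
    --format '{{ json .Description.Resources.GenericResources }}'
}

# Succeed only if a generic resource of kind "gpu" is advertised.
node_advertises_gpu() {
  node_generic_resources | grep -Eq '"Kind":[[:space:]]*"gpu"'
}
```

A quick check would be `node_advertises_gpu && echo "GPU resource advertised"`. If nothing is advertised, the `node-generic-resources` entry in `/etc/docker/daemon.json` and the Docker restart are the first places to look.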