@tomlankhorst
Last active March 17, 2025 06:40

Revisions

  1. tomlankhorst revised this gist Mar 16, 2020: documented enabling GPU resource advertising via `swarm-resource = "DOCKER_RESOURCE_GPU"` in `/etc/nvidia-container-runtime/config.toml`.
  2. tomlankhorst created this gist Mar 16, 2020.

    Setting up a Docker Swarm with GPUs
    =====

    Installing Docker
    ----

    [Official instructions](https://docs.docker.com/install/linux/docker-ce/ubuntu/).

    Add yourself to the `docker` group to be able to run containers as non-root (see [Post-install steps for Linux](https://docs.docker.com/install/linux/linux-postinstall/)).

    ```
    sudo groupadd docker
    sudo usermod -aG docker $USER
    ```

    Verify with `docker run hello-world`.

    Installing the NVidia Container Runtime
    ----

    [Official instructions](https://github.com/NVIDIA/nvidia-docker).

    Start by installing the appropriate NVidia drivers. Then continue to install NVidia Docker.

    Verify with `docker run --gpus all,capabilities=utility nvidia/cuda:10.0-base nvidia-smi`.

    Configuring Docker to work with your GPU(s)
    ----

    The first step is to identify the GPU(s) available on your system.
    Docker will expose these as 'resources' to the swarm.
    This allows other nodes to place services (swarm-managed container deployments) on your machine.

    _These steps are currently for NVidia GPUs._

    Docker identifies your GPU by its Universally Unique IDentifier (UUID).
    Find the GPU UUID for the GPU(s) in your machine.

    ```
    nvidia-smi -a
    ```

    A typical UUID looks like `GPU-45cbf7b3-f919-7228-7a26-b06628ebefa1`.
Take only the first two dash-separated parts, e.g. `GPU-45cbf7b3`.
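The two steps above can be sketched in the shell. The `--query-gpu` flags are standard `nvidia-smi` options; the sample UUID below is the one used in this guide, so the pipeline runs even without a GPU present:

```shell
# On a machine with the NVidia drivers installed, list the full GPU UUIDs with:
#   nvidia-smi --query-gpu=uuid --format=csv,noheader
# Shorten a full UUID to its first two dash-separated parts:
uuid="GPU-45cbf7b3-f919-7228-7a26-b06628ebefa1"
echo "$uuid" | cut -d- -f1-2
```

The last command prints `GPU-45cbf7b3`, the shortened form used in the Docker configuration below.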

    Open up the Docker engine configuration file, typically at `/etc/docker/daemon.json`.

Add the GPU ID to the `node-generic-resources` list.
Make sure that the `nvidia` runtime is present and set it as the `default-runtime`.
Keep any other configuration options in place, if they are there.
Take care of the JSON syntax, which is not forgiving of single quotes and trailing commas.
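As a quick illustration of that strictness, a single trailing comma is enough to make a parser reject the whole file (checked here with Python's stdlib `json.tool`):

```shell
# A trailing comma makes the document invalid JSON:
echo '{"runtimes": {},}' | python3 -m json.tool || echo "invalid JSON"
```

The parser exits with an error, so `invalid JSON` is printed.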

```json
{
  "runtimes": {
    "nvidia": {
      "path": "/usr/bin/nvidia-container-runtime",
      "runtimeArgs": []
    }
  },
  "default-runtime": "nvidia",
  "node-generic-resources": [
    "gpu=GPU-45cbf7b3"
  ]
}
```

Now, enable GPU resource advertising by adding or uncommenting the following line in `/etc/nvidia-container-runtime/config.toml`:

```
swarm-resource = "DOCKER_RESOURCE_GPU"
```

    Restart the service.

    ```
    sudo systemctl restart docker.service
    ```
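Since a syntax error in `daemon.json` will keep Docker from starting, it can help to validate the file before restarting the service; a minimal sketch, assuming `python3` is available:

```shell
# Parse the config; prints the normalized JSON on success,
# or an error with the offending line and column on failure.
python3 -m json.tool /etc/docker/daemon.json
```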

    Initializing the Docker Swarm
    ----

    Initialize a new swarm on a manager-to-be.

    ```
    docker swarm init
    ```

Add worker nodes, or additional manager nodes, to the swarm.
To get the join command, run the following on a node that is already part of the swarm:

    ```
    docker swarm join-token (worker|manager)
    ```

    Then, run the resulting command on a member-to-be.

    Show who's in the swarm:
    ```
    docker node ls
    ```

    A first deployment
----

    ```
docker service create --replicas 1 \
  --name tensor-qs \
  --generic-resource "gpu=1" \
  tomlankhorst/tensorflow-quickstart
    ```

This deploys [a TensorFlow quick start image](https://hub.docker.com/r/tomlankhorst/tensorflow-quickstart) that follows [the TensorFlow quick start](https://www.tensorflow.org/tutorials/quickstart/advanced).
The `--generic-resource "gpu=1"` flag tells the scheduler to place the service on a node that advertises at least one `gpu` resource.

    Show active services:

    ```
    docker service ls
    ```

Inspect the service:

    ```
    $ docker service inspect --pretty tensor-qs
ID:             vtjcl47xc630o6vndbup64c1i
Name:           tensor-qs
Service Mode:   Replicated
 Replicas:      1
Placement:
UpdateConfig:
 Parallelism:   1
 On failure:    pause
 Monitoring Period: 5s
 Max failure ratio: 0
 Update order:      stop-first
RollbackConfig:
 Parallelism:   1
 On failure:    pause
 Monitoring Period: 5s
 Max failure ratio: 0
 Rollback order:    stop-first
ContainerSpec:
 Image:         tomlankhorst/tensorflow-quickstart:latest@sha256:1f793df87f00478d0c41ccc7e6177f9a214a5d3508009995447f3f25b45496fb
 Init:          false
Resources:
Endpoint Mode:  vip
    ```

Show the logs:
    ```
    $ docker service logs tensor-qs
    ...
    tensor-qs.1.3f9jl1emwe9l@tlws | 2020-03-16 08:45:15.495159: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
    tensor-qs.1.3f9jl1emwe9l@tlws | 2020-03-16 08:45:15.621767: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
    tensor-qs.1.3f9jl1emwe9l@tlws | Epoch 1, Loss: 0.132665216923, Accuracy: 95.9766693115, Test Loss: 0.0573637597263, Test Accuracy: 98.1399993896
    tensor-qs.1.3f9jl1emwe9l@tlws | Epoch 2, Loss: 0.0415383689106, Accuracy: 98.6949996948, Test Loss: 0.0489368513227, Test Accuracy: 98.3499984741
    tensor-qs.1.3f9jl1emwe9l@tlws | Epoch 3, Loss: 0.0211332384497, Accuracy: 99.3150024414, Test Loss: 0.0521399155259, Test Accuracy: 98.2900009155
    tensor-qs.1.3f9jl1emwe9l@tlws | Epoch 4, Loss: 0.0140329506248, Accuracy: 99.5716705322, Test Loss: 0.053688980639, Test Accuracy: 98.4700012207
    tensor-qs.1.3f9jl1emwe9l@tlws | Epoch 5, Loss: 0.00931495986879, Accuracy: 99.7116699219, Test Loss: 0.0681483447552, Test Accuracy: 98.1500015259
    ```