- Inside a docker container
- Using NCCL and TCP or Shared file-system
- PyTorch version: 1.0.1.post2
- 2 nodes / 3 GPUs (2 GPUs on Node 0, 1 GPU on Node 1)
We need to run the container with the --network=host option:
docker run \
-it \
--rm \
--runtime="nvidia" \
--name "${name}" \
--network=host \
--shm-size 16G \
${image} \
/bin/bash
We need to specify the correct network interface (NCCL_SOCKET_IFNAME=eno1),
the IP of Node 0 (192.168.3.22), and a free port (23456).
On Node 0, start the first process (rank 0):
NCCL_SOCKET_IFNAME=eno1 NCCL_DEBUG=INFO python mnist_dist.py --dist_method="tcp://192.168.3.22:23456" --dist_backend="nccl" --world_size=3 --rank=0 --gpu=0
On Node 0, start the second process (rank 1):
NCCL_SOCKET_IFNAME=eno1 NCCL_DEBUG=INFO python mnist_dist.py --dist_method="tcp://192.168.3.22:23456" --dist_backend="nccl" --world_size=3 --rank=1 --gpu=1
On Node 1, start the third process (rank 2):
NCCL_SOCKET_IFNAME=eno1 NCCL_DEBUG=INFO python mnist_dist.py --dist_method="tcp://192.168.3.22:23456" --dist_backend="nccl" --world_size=3 --rank=2 --gpu=0
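The mnist_dist.py script itself is not shown here; the following is a minimal sketch of the distributed setup it presumably performs, with flag names taken from the commands above (everything else is an assumption):

```python
# Hypothetical sketch of the setup in mnist_dist.py; the flag names
# (--dist_method, --dist_backend, --world_size, --rank, --gpu) match
# the launch commands above, the rest is illustrative.
import argparse

import torch
import torch.distributed as dist


def init_distributed(dist_method, dist_backend, world_size, rank):
    """Join the process group via the TCP rendezvous address on Node 0."""
    dist.init_process_group(backend=dist_backend,
                            init_method=dist_method,
                            world_size=world_size,
                            rank=rank)


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--dist_method", default="tcp://192.168.3.22:23456")
    parser.add_argument("--dist_backend", default="nccl")
    parser.add_argument("--world_size", type=int, default=3)
    parser.add_argument("--rank", type=int, default=0)
    parser.add_argument("--gpu", type=int, default=0)
    args = parser.parse_args()

    # Bind this process to its local GPU before creating any CUDA tensors.
    torch.cuda.set_device(args.gpu)
    init_distributed(args.dist_method, args.dist_backend,
                     args.world_size, args.rank)
    # ... build the model, wrap it in
    # torch.nn.parallel.DistributedDataParallel, and train as usual ...
```

All three processes block inside init_process_group until every rank (world_size of them) has connected to 192.168.3.22:23456, which is why the commands must be launched on both nodes.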
- https://pytorch.org/docs/master/distributed.html
- https://docs.nvidia.com/deeplearning/sdk/nccl-developer-guide/docs/overview.html
World size, rank, and local rank concepts:
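As an illustration of how these fit together in the 2-node setup above (assuming 2 GPUs on Node 0 and 1 GPU on Node 1, as in the launch commands):

```python
# World size = total number of processes across all nodes (3 here).
# Rank       = a process's unique global index, 0 .. world_size - 1.
# Local rank = a process's index within its own node (the GPU it uses).
gpus_per_node = [2, 1]  # Node 0 runs 2 processes, Node 1 runs 1

world_size = sum(gpus_per_node)  # 3


def global_rank(node, local_rank):
    """Global rank of a process given its node index and local rank."""
    return sum(gpus_per_node[:node]) + local_rank


assert world_size == 3
assert global_rank(0, 0) == 0  # Node 0, GPU 0  -> --rank=0 --gpu=0
assert global_rank(0, 1) == 1  # Node 0, GPU 1  -> --rank=1 --gpu=1
assert global_rank(1, 0) == 2  # Node 1, GPU 0  -> --rank=2 --gpu=0
```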