- Inside a docker container
- Using NCCL and TCP or Shared file-system
- PyTorch version: 1.0.1.post2
- 2 nodes / 3 GPUs (2 GPUs on Node 0, 1 GPU on Node 1)
We need to run the container with the --network=host option:
docker run \
-it \
--rm \
--runtime="nvidia" \
--name "${name}" \
--network=host \
--shm-size 16G \
${image} \
/bin/bash
We need to specify the correct network interface (NCCL_SOCKET_IFNAME=eno1),
the IP of Node 0 (192.168.3.22), and a free port (23456).
On Node 0, start the first process (rank 0):
NCCL_SOCKET_IFNAME=eno1 NCCL_DEBUG=INFO python mnist_dist.py --dist_method="tcp://192.168.3.22:23456" --dist_backend="nccl" --world_size=3 --rank=0 --gpu=0
On Node 0, start the second process (rank 1):
NCCL_SOCKET_IFNAME=eno1 NCCL_DEBUG=INFO python mnist_dist.py --dist_method="tcp://192.168.3.22:23456" --dist_backend="nccl" --world_size=3 --rank=1 --gpu=1
On Node 1, start the third process (rank 2):
NCCL_SOCKET_IFNAME=eno1 NCCL_DEBUG=INFO python mnist_dist.py --dist_method="tcp://192.168.3.22:23456" --dist_backend="nccl" --world_size=3 --rank=2 --gpu=0
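The mnist_dist.py script itself is not shown here; the following is a minimal sketch of the distributed setup it presumably performs, with flag names taken from the commands above (everything else is an assumption):

```python
# Hypothetical sketch of the setup in mnist_dist.py; the flag names
# (--dist_method, --dist_backend, --world_size, --rank, --gpu) match
# the launch commands above, the rest is illustrative.
import argparse

import torch
import torch.distributed as dist


def init_distributed(dist_method, dist_backend, world_size, rank):
    """Join the process group via the TCP rendezvous address on Node 0."""
    dist.init_process_group(backend=dist_backend,
                            init_method=dist_method,
                            world_size=world_size,
                            rank=rank)


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--dist_method", default="tcp://192.168.3.22:23456")
    parser.add_argument("--dist_backend", default="nccl")
    parser.add_argument("--world_size", type=int, default=3)
    parser.add_argument("--rank", type=int, default=0)
    parser.add_argument("--gpu", type=int, default=0)
    args = parser.parse_args()

    # Bind this process to its local GPU before creating any CUDA tensors.
    torch.cuda.set_device(args.gpu)
    init_distributed(args.dist_method, args.dist_backend,
                     args.world_size, args.rank)
    # ... build the model, wrap it in
    # torch.nn.parallel.DistributedDataParallel, and train as usual ...
```

All three processes block inside init_process_group until every rank (world_size of them) has connected to 192.168.3.22:23456, which is why the commands must be launched on both nodes.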
- https://pytorch.org/docs/master/distributed.html
- https://docs.nvidia.com/deeplearning/sdk/nccl-developer-guide/docs/overview.html
World size, rank, and local rank concepts:
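As an illustration of how these fit together in the 2-node setup above (assuming 2 GPUs on Node 0 and 1 GPU on Node 1, as in the launch commands):

```python
# World size = total number of processes across all nodes (3 here).
# Rank       = a process's unique global index, 0 .. world_size - 1.
# Local rank = a process's index within its own node (the GPU it uses).
gpus_per_node = [2, 1]  # Node 0 runs 2 processes, Node 1 runs 1

world_size = sum(gpus_per_node)  # 3


def global_rank(node, local_rank):
    """Global rank of a process given its node index and local rank."""
    return sum(gpus_per_node[:node]) + local_rank


assert world_size == 3
assert global_rank(0, 0) == 0  # Node 0, GPU 0  -> --rank=0 --gpu=0
assert global_rank(0, 1) == 1  # Node 0, GPU 1  -> --rank=1 --gpu=1
assert global_rank(1, 0) == 2  # Node 1, GPU 0  -> --rank=2 --gpu=0
```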