Notes on PyTorch distributed

Some notes on launching distributed computations with PyTorch

  • Inside a docker container
  • Using the NCCL backend with TCP or shared file-system initialization
  • PyTorch version: 1.0.1.post2
  • 2 Nodes / 3 GPUs

Docker container

We need to run the container with the --network=host option:

docker run \
    -it \
    --rm \
    --runtime="nvidia" \
    --name "${name}" \
    --network=host \
    --shm-size 16G \
    ${image} \
    /bin/bash

Run the code

We need to specify the correct network interface (NCCL_SOCKET_IFNAME=eno1), the IP of Node 0 (192.168.3.22), and a free port (23456).

On Node 0, start the first process:

NCCL_SOCKET_IFNAME=eno1 NCCL_DEBUG=INFO python mnist_dist.py --dist_method="tcp://192.168.3.22:23456" --dist_backend="nccl" --world_size=3 --rank=0 --gpu=0

On Node 0, start the second process:

NCCL_SOCKET_IFNAME=eno1 NCCL_DEBUG=INFO python mnist_dist.py --dist_method="tcp://192.168.3.22:23456" --dist_backend="nccl" --world_size=3 --rank=1 --gpu=1

On Node 1, start the third process:

NCCL_SOCKET_IFNAME=eno1 NCCL_DEBUG=INFO python mnist_dist.py --dist_method="tcp://192.168.3.22:23456" --dist_backend="nccl" --world_size=3 --rank=2 --gpu=0
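For context, here is a minimal sketch of the process-group setup such a script might perform. The actual mnist_dist.py is not included in this gist, so the argument handling below is an assumption that simply mirrors the flags used in the commands above.

# Hypothetical sketch (not the actual mnist_dist.py): argument names mirror
# the launch commands above.
import argparse

import torch
import torch.distributed as dist

parser = argparse.ArgumentParser()
parser.add_argument("--dist_method", type=str)    # e.g. "tcp://192.168.3.22:23456"
parser.add_argument("--dist_backend", type=str, default="nccl")
parser.add_argument("--world_size", type=int)
parser.add_argument("--rank", type=int)           # global rank: 0, 1, 2
parser.add_argument("--gpu", type=int)            # GPU index on the local node
args = parser.parse_args()

# All processes rendezvous at Node 0's address. For the shared file-system
# variant, the init method would instead look like "file:///mnt/shared/pytorch_init".
dist.init_process_group(
    backend=args.dist_backend,
    init_method=args.dist_method,
    world_size=args.world_size,
    rank=args.rank,
)

# With the NCCL backend, each process drives exactly one GPU.
torch.cuda.set_device(args.gpu)

model = torch.nn.Linear(784, 10).cuda(args.gpu)
model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[args.gpu])

With NCCL, calling torch.cuda.set_device before any collective keeps each process pinned to its own GPU, which matches the --gpu values passed on the command lines above.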

Refs

World size, rank, and local rank concepts:
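For the run above, these concepts map as follows (here the --gpu flag plays the role of the local rank):

world_size = 3        # total number of processes across the two nodes
rank       = 0, 1, 2  # global process index, unique across all nodes
local rank = --gpu    # process/GPU index on each node: 0 and 1 on Node 0, 0 on Node 1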
