Tutorial to set up Distributed Data Parallel (DDP) training in PyTorch using mpirun instead of torchrun

To launch distributed training in PyTorch with mpirun we have to:

  1. Configure a passwordless SSH connection between the nodes
  2. Set up the distributed environment inside the training script, in this case train.py
  3. Launch the training from the MASTER node with mpirun

For the first step, this is the pipeline:

# generate a public/private ssh key pair and make sure NOT to insert a passphrase

ssh-keygen -t rsa

# copy the public key 'id_rsa.pub' to the MASTER and SLAVE nodes

ssh-copy-id -i ~/.ssh/id_rsa.pub username@master && ssh-copy-id -i ~/.ssh/id_rsa.pub username@slave

# add the private key to the ssh agent on the MASTER node

ssh-add ~/.ssh/id_rsa

Then we need to correctly set up the training script in this way:

  1. Initialize the distributed process group
import os
import torch
import torch.distributed as dist

# mpirun (Open MPI) exposes rank information through these environment variables
rank = int(os.environ["OMPI_COMM_WORLD_RANK"])
local_rank = int(os.environ["OMPI_COMM_WORLD_LOCAL_RANK"])
world_size = int(os.environ["OMPI_COMM_WORLD_SIZE"])

# bind this process to its own GPU before initializing NCCL
torch.cuda.set_device(local_rank)

# MASTER_ADDR and MASTER_PORT are read from the environment (exported via mpirun -x)
dist.init_process_group(backend="nccl", rank=rank, world_size=world_size)
  2. Wrap the model with DistributedDataParallel
# move the model to the local GPU before wrapping it
model = model.to(local_rank)

ddp_model = torch.nn.parallel.DistributedDataParallel(
    model,
    device_ids=[local_rank],
    output_device=local_rank,
)
  3. Use a DistributedSampler for the DataLoaders (see the combined training-loop sketch after this list)
from torch.utils.data import DataLoader, DistributedSampler

# each rank gets a different, non-overlapping shard of the dataset
train_sampler = DistributedSampler(dataset=train_ds)
val_sampler = DistributedSampler(dataset=val_ds)

train_dl = DataLoader(..., sampler=train_sampler)
val_dl = DataLoader(..., sampler=val_sampler)
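
Putting the pieces together: when shuffling is enabled, the sampler's epoch has to be updated at every epoch so that the shuffling order changes across epochs, and the process group should be destroyed at the end of the script. A minimal sketch, where num_epochs and train_one_epoch are hypothetical placeholders:

for epoch in range(num_epochs):
    # tell the DistributedSampler which epoch we are in so it reshuffles differently
    train_sampler.set_epoch(epoch)
    train_one_epoch(ddp_model, train_dl)  # hypothetical per-epoch training function

# wait for all ranks, then tear down the process group
dist.barrier()
dist.destroy_process_group()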

To log in a distributed setting you can follow one of these strategies:

A) log only from the process with global rank 0:

if rank == 0:
  # code to log whatever

B) reduce results on process with global rank 0:

dist.reduce(tensor_name_to_reduce, dst=0, op=dist.ReduceOp.SUM) # in-place operation, result valid only on rank 0

if rank == 0:
    avg = tensor_name_to_reduce.item() / world_size # average over the number of GPUs
    # code to log the aggregated value
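
For example, to log a validation loss averaged over all GPUs (a minimal sketch; local_val_loss is assumed to be the scalar loss computed locally on each rank):

# every rank builds a tensor on its own GPU with its local result
val_loss = torch.tensor(local_val_loss, device=f"cuda:{local_rank}")

# sum the values onto global rank 0
dist.reduce(val_loss, dst=0, op=dist.ReduceOp.SUM)

if rank == 0:
    print(f"avg val loss: {val_loss.item() / world_size:.4f}")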

Now to launch the training you can use the following command on the MASTER node:

mpirun -np 2 \
       -H xxx.xxx.xxx.xxx:1,xxx.xxx.xxx.xxy:1 \
       -x MASTER_ADDR=xxx.xxx.xxx.xxx \
       -x MASTER_PORT=1234 \
       -x PATH=$PATH:/path/to/venv/my_env/bin \
       -bind-to none -map-by slot \
       python train.py --additional_script_args

The arguments are:

  • np: total number of processes to launch (one per GPU), 2 in this case
  • H: list of hosts and available slots (:1 means one slot, i.e. one GPU, on that host); a hostfile can be used instead (--hostfile)
  • x: environment variables to export to the launched processes, in this case the master address & port and the PATH to the virtualenv, which has to be the same on both machines
  • bind-to none: disable CPU binding so each process can use multiple threads
  • map-by slot: how to map processes, in this case one process per slot (i.e. per GPU)
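
Before launching the full training, it can help to verify that mpirun, the SSH setup and the exported variables are wired correctly by running a tiny script with the same command (a minimal sketch; check_env.py is a hypothetical file name):

# check_env.py - print what every rank sees
import os
import socket

rank = int(os.environ["OMPI_COMM_WORLD_RANK"])
local_rank = int(os.environ["OMPI_COMM_WORLD_LOCAL_RANK"])
world_size = int(os.environ["OMPI_COMM_WORLD_SIZE"])

print(f"host={socket.gethostname()} rank={rank} local_rank={local_rank} "
      f"world_size={world_size} "
      f"master={os.environ.get('MASTER_ADDR')}:{os.environ.get('MASTER_PORT')}")

If every rank prints the expected hostname, ranks and master address, the passwordless SSH connection and the environment forwarding are working.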