The distrib_train function does all the setup required for distributed training:
import torch

def distrib_train(gpu):
    # Nothing to set up when no GPU is assigned (non-distributed run)
    if gpu is None: return gpu
    gpu = int(gpu)
    # Bind this process to its GPU before joining the process group
    torch.cuda.set_device(gpu)
    # NCCL backend; rendezvous settings are read from environment variables
    torch.distributed.init_process_group(backend='nccl', init_method='env://')
    return gpu
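Because init_method='env://' is used, each worker process must find the rendezvous settings in its environment before calling distrib_train. The sketch below lists the variables PyTorch reads in that mode; the concrete values and the idea of setting them inline are illustrative only, since in practice a launcher (for example python -m torch.distributed.launch, or mpirun with a wrapper script) exports them for each process.

import os

# Illustrative values only -- a real launcher sets these per process
# before the training script starts.
os.environ.setdefault('MASTER_ADDR', '127.0.0.1')  # address of the rank-0 host
os.environ.setdefault('MASTER_PORT', '29500')      # free TCP port on that host
os.environ.setdefault('RANK', '0')                 # global rank of this process
os.environ.setdefault('WORLD_SIZE', '1')           # total number of processes

gpu = distrib_train('0')   # local GPU index assigned to this process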
torch.distributed provides an MPI-like interface for exchanging tensor data across multi-machine networks. It supports a few different backends and initialization methods. We are using the NVIDIA Collective Communications Library (NCCL) as the backend.
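To make the MPI-like interface concrete, here is a small, hypothetical example of a collective operation; all_reduce is not part of the setup code above and is shown only to illustrate how tensor data is exchanged once the process group has been initialized.

import torch
import torch.distributed as dist

# Assumes init_process_group has already been called (e.g. via distrib_train)
# and that this process has been bound to its GPU with torch.cuda.set_device.
t = torch.ones(3, device='cuda')

# Every process contributes its tensor; afterwards each process holds the
# elementwise sum across the whole group (the equivalent of MPI allreduce).
dist.all_reduce(t, op=dist.ReduceOp.SUM)

rank = dist.get_rank()
world_size = dist.get_world_size()
print(f'rank {rank}: after all_reduce each element equals {world_size}')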