The distrib_train function does all the setup required for distributed training:
import torch

def distrib_train(gpu):
    # Nothing to set up when no GPU is assigned (non-distributed run)
    if gpu is None: return gpu
    gpu = int(gpu)
    # Bind this process to its GPU before joining the process group
    torch.cuda.set_device(gpu)
    # NCCL backend; rendezvous settings are read from environment variables
    torch.distributed.init_process_group(backend='nccl', init_method='env://')
    return gpu
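Because init_method='env://' is used, each worker process must find the rendezvous settings in its environment before calling distrib_train. The sketch below lists the variables PyTorch reads in that mode; the concrete values and the idea of setting them inline are illustrative only, since in practice a launcher (for example python -m torch.distributed.launch, or mpirun with a wrapper script) exports them for each process.

import os

# Illustrative values only -- a real launcher sets these per process
# before the training script starts.
os.environ.setdefault('MASTER_ADDR', '127.0.0.1')  # address of the rank-0 host
os.environ.setdefault('MASTER_PORT', '29500')      # free TCP port on that host
os.environ.setdefault('RANK', '0')                 # global rank of this process
os.environ.setdefault('WORLD_SIZE', '1')           # total number of processes

gpu = distrib_train('0')   # local GPU index assigned to this process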
torch.distributed provides an MPI-like interface for exchanging tensor data across multi-machine networks. It supports a few different backends and initialization methods. We are using the NVIDIA Collective Communications Library (NCCL) as the backend.
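To make the MPI-like interface concrete, here is a small, hypothetical example of a collective operation; all_reduce is not part of the setup code above and is shown only to illustrate how tensor data is exchanged once the process group has been initialized.

import torch
import torch.distributed as dist

# Assumes init_process_group has already been called (e.g. via distrib_train)
# and that this process has been bound to its GPU with torch.cuda.set_device.
t = torch.ones(3, device='cuda')

# Every process contributes its tensor; afterwards each process holds the
# elementwise sum across the whole group (the equivalent of MPI allreduce).
dist.all_reduce(t, op=dist.ReduceOp.SUM)

rank = dist.get_rank()
world_size = dist.get_world_size()
print(f'rank {rank}: after all_reduce each element equals {world_size}')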