DeepSpeed Multi-node Training Setup

In this tutorial we assume we are launching a distributed training job on 2 nodes using DeepSpeed with the OpenMPI launcher.

  1. First of all, DeepSpeed needs a passwordless ssh connection to all the nodes, MASTER included:
# generate a public/private ssh key and make sure to NOT insert a passphrase

ssh-keygen -t rsa

# copy the public key 'id_rsa.pub' to the MASTER and SLAVE

ssh-copy-id -i ~/.ssh/id_rsa.pub username@master && ssh-copy-id -i ~/.ssh/id_rsa.pub username@slave

# add the key to the ssh agent on the MASTER (not sure if needed, but better safe than sorry):

ssh-add ~/.ssh/id_rsa
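
The hostfile in the next step refers to the nodes by the aliases master and slave; if you prefer aliases over raw IP addresses, a minimal ~/.ssh/config sketch on the MASTER could look like this (the HostName values and the username are placeholders for your own setup):

# ~/.ssh/config
Host master
    HostName xxx.xxx.xxx.xxx
    User username

Host slave
    HostName xxx.xxx.xxx.xxx
    User username

# sanity check: both commands should print the remote hostname without asking for a password
ssh master hostname && ssh slave hostname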
  2. We need a hostfile and a ds_config.json that look like this (adapt them to your setup, these are just examples):
----------------- hostfile -----------------

# it works with both ssh aliases and plain IP addresses; slots is the number of GPUs on the node

master slots=1
slave slots=1 

--------------- ds_config.json --------------

{
  "train_micro_batch_size_per_gpu": 1,
  "gradient_accumulation_steps": 8,
  "optimizer": {
    "type": "Adam",
    "params": {
      "lr": 1e-4,
      "betas": [
        0.9,
        0.99
      ]
    }
  },
  "scheduler": {
    "type": "WarmupDecayLR",
    "params": {
      "total_num_steps": 500000,
      "warmup_min_lr": 0,
      "warmup_max_lr": 1e-4,
      "warmup_num_steps": 1000
    }
  }
}
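
As a quick sanity check on the numbers above (not part of the config itself), with 2 nodes of 1 GPU each the effective train batch size works out to 16:

# effective batch size = micro batch per GPU * gradient accumulation steps * number of GPUs
train_micro_batch_size_per_gpu = 1
gradient_accumulation_steps = 8
world_size = 2  # 2 nodes x 1 slot each, as in the hostfile

train_batch_size = train_micro_batch_size_per_gpu * gradient_accumulation_steps * world_size
print(train_batch_size)  # 16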
  3. Prepare the training script correctly:
# imports needed by the training snippet below

import os

import deepspeed
import torch.distributed as dist
from torch.utils.data import DataLoader, DistributedSampler

# Initialize the distributed process group using the ranks exported by OpenMPI

rank = int(os.environ["OMPI_COMM_WORLD_RANK"])
local_rank = int(os.environ["OMPI_COMM_WORLD_LOCAL_RANK"])
world_size = int(os.environ["OMPI_COMM_WORLD_SIZE"])

deepspeed.init_distributed(dist_backend="nccl", rank=rank, world_size=world_size)

# initialize the model engine that will take care of everything else;
# the train and validation dataloaders are built with a DistributedSampler
# so that every rank sees a different shard of the data

model_engine, optimizer, _, scheduler = deepspeed.initialize(
        args=args, model=model, model_parameters=create_moe_param_groups(model))
        
batch_size = model_engine.train_micro_batch_size_per_gpu()

train_sampler = DistributedSampler(dataset=train_ds)
val_sampler = DistributedSampler(dataset=val_ds)

train_dl = DataLoader(..., batch_size=batch_size, sampler=train_sampler)
val_dl = DataLoader(..., batch_size=batch_size, sampler=val_sampler)

# compute and backpropagate the loss (in this case we use mixed precision);
# x and y come from a batch of the train dataloader

y_hat = model_engine(x.half())
loss = compute_loss(y_hat, y)

model_engine.backward(loss)
model_engine.step()

# for a better validation assessment, aggregate the metrics across devices during evaluation
# (dist here is torch.distributed, imported above)

dist.reduce(loss, dst=0, op=dist.ReduceOp.SUM)
val_loss += loss.detach().item() / world_size
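
For completeness, the args object passed to deepspeed.initialize above can be built with a standard argparse parser plus DeepSpeed's helper that registers the --deepspeed and --deepspeed_config flags used in the launch command below; this is just a minimal sketch, your script may already have its own parser and extra arguments:

import argparse
import deepspeed

parser = argparse.ArgumentParser(description="DeepSpeed multi-node training")
parser.add_argument("--local_rank", type=int, default=-1,
                    help="local rank, normally filled in by the launcher")

# registers --deepspeed, --deepspeed_config and the other DeepSpeed CLI flags
parser = deepspeed.add_config_arguments(parser)
args = parser.parse_args()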
  4. Finally, this is the launch command for DeepSpeed with OpenMPI; it has to be run exclusively on the MASTER node:
deepspeed --hostfile=path/to/hostfile \
          --master_addr=xxx.xxx.xxx.xxx \
          --master_port=1234 \
          --launcher=openmpi \
          --launcher_args="-bind-to none -map-by slot -x PATH=$PATH:/path/to/venv/my_venv/bin" \
          train.py --deepspeed --deepspeed_config "path/to/ds_config.json" 
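
Since the OpenMPI launcher relies on mpirun under the hood, it is worth checking beforehand that OpenMPI is installed and on the PATH of both nodes (a quick check, assuming the same slave alias as above):

# on the MASTER
which mpirun && mpirun --version

# and on the SLAVE, through the passwordless connection set up in step 1
ssh slave 'which mpirun'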