DeepSpeed Multi-node Training Setup

In this tutorial we assume we are launching a distributed training job on 2 nodes using DeepSpeed with the OpenMPI launcher.

  1. First of all, DeepSpeed needs a passwordless ssh connection to all the nodes, MASTER included:
# generate a public/private ssh key and make sure to NOT insert a passphrase

ssh-keygen -t rsa

# copy the public key 'id_rsa.pub' to the MASTER and SLAVE

ssh-copy-id -i ~/.ssh/id_rsa.pub username@master && ssh-copy-id -i ~/.ssh/id_rsa.pub username@slave

# add the key to the ssh agent on the MASTER (not sure if needed, but better safe than sorry):

ssh-add ~/.ssh/id_rsa
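
The hostfile in the next step refers to the nodes by the aliases master and slave; if you prefer aliases over raw IP addresses, a minimal ~/.ssh/config sketch on the MASTER could look like this (the HostName values and the username are placeholders for your own setup):

# ~/.ssh/config
Host master
    HostName xxx.xxx.xxx.xxx
    User username

Host slave
    HostName xxx.xxx.xxx.xxx
    User username

# sanity check: both commands should print the remote hostname without asking for a password
ssh master hostname && ssh slave hostname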
  2. We need a hostfile and a ds_config.json that look like this (adapt them to your setup, these are just examples):
----------------- hostfile -----------------

# it works with both ssh aliases and plain IP addresses; slots is the number of GPUs on the node

master slots=1
slave slots=1 

--------------- ds_config.json --------------

{
  "train_micro_batch_size_per_gpu": 1,
  "gradient_accumulation_steps": 8,
  "optimizer": {
    "type": "Adam",
    "params": {
      "lr": 1e-4,
      "betas": [
        0.9,
        0.99
      ]
    }
  },
  "scheduler": {
    "type": "WarmupDecayLR",
    "params": {
      "total_num_steps": 500000,
      "warmup_min_lr": 0,
      "warmup_max_lr": 1e-4,
      "warmup_num_steps": 1000
    }
  }
}
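
As a quick sanity check on the numbers above (not part of the config itself), with 2 nodes of 1 GPU each the effective train batch size works out to 16:

# effective batch size = micro batch per GPU * gradient accumulation steps * number of GPUs
train_micro_batch_size_per_gpu = 1
gradient_accumulation_steps = 8
world_size = 2  # 2 nodes x 1 slot each, as in the hostfile

train_batch_size = train_micro_batch_size_per_gpu * gradient_accumulation_steps * world_size
print(train_batch_size)  # 16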
  3. Prepare the training script correctly:
# imports needed by the training snippet below

import os

import deepspeed
import torch.distributed as dist
from torch.utils.data import DataLoader, DistributedSampler

# Initialize the distributed process group using the ranks exported by OpenMPI

rank = int(os.environ["OMPI_COMM_WORLD_RANK"])
local_rank = int(os.environ["OMPI_COMM_WORLD_LOCAL_RANK"])
world_size = int(os.environ["OMPI_COMM_WORLD_SIZE"])

deepspeed.init_distributed(dist_backend="nccl", rank=rank, world_size=world_size)

# initialize the model engine that will take care of everything else;
# the train and validation dataloaders are built with a DistributedSampler
# so that every rank sees a different shard of the data

model_engine, optimizer, _, scheduler = deepspeed.initialize(
        args=args, model=model, model_parameters=create_moe_param_groups(model))
        
batch_size = model_engine.train_micro_batch_size_per_gpu()

train_sampler = DistributedSampler(dataset=train_ds)
val_sampler = DistributedSampler(dataset=val_ds)

train_dl = DataLoader(..., batch_size=batch_size, sampler=train_sampler)
val_dl = DataLoader(..., batch_size=batch_size, sampler=val_sampler)

# compute and backpropagate the loss (in this case we use mixed precision);
# x and y come from a batch of the train dataloader

y_hat = model_engine(x.half())
loss = compute_loss(y_hat, y)

model_engine.backward(loss)
model_engine.step()

# for a better validation assessment, aggregate the metrics across devices during evaluation
# (dist here is torch.distributed, imported above)

dist.reduce(loss, dst=0, op=dist.ReduceOp.SUM)
val_loss += loss.detach().item() / world_size
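
For completeness, the args object passed to deepspeed.initialize above can be built with a standard argparse parser plus DeepSpeed's helper that registers the --deepspeed and --deepspeed_config flags used in the launch command below; this is just a minimal sketch, your script may already have its own parser and extra arguments:

import argparse
import deepspeed

parser = argparse.ArgumentParser(description="DeepSpeed multi-node training")
parser.add_argument("--local_rank", type=int, default=-1,
                    help="local rank, normally filled in by the launcher")

# registers --deepspeed, --deepspeed_config and the other DeepSpeed CLI flags
parser = deepspeed.add_config_arguments(parser)
args = parser.parse_args()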
  4. Finally, this is the launch command for DeepSpeed with OpenMPI; it has to be run exclusively on the MASTER node:
deepspeed --hostfile=path/to/hostfile \
          --master_addr=xxx.xxx.xxx.xxx \
          --master_port=1234 \
          --launcher=openmpi \
          --launcher_args="-bind-to none -map-by slot -x PATH=$PATH:/path/to/venv/my_venv/bin" \
          train.py --deepspeed --deepspeed_config "path/to/ds_config.json" 
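
Since the OpenMPI launcher relies on mpirun under the hood, it is worth checking beforehand that OpenMPI is installed and on the PATH of both nodes (a quick check, assuming the same slave alias as above):

# on the MASTER
which mpirun && mpirun --version

# and on the SLAVE, through the passwordless connection set up in step 1
ssh slave 'which mpirun'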