In this tutorial we assume we are launching distributed training on 2 nodes using DeepSpeed with the OpenMPI launcher.
- First of all, DeepSpeed needs passwordless SSH access to all the nodes, the MASTER included:
# generate a public/private ssh key pair and make sure NOT to insert a passphrase
ssh-keygen -t rsa
# copy the public key 'id_rsa.pub' to both the MASTER and the SLAVE
ssh-copy-id -i ~/.ssh/id_rsa.pub username@master && ssh-copy-id -i ~/.ssh/id_rsa.pub username@slave
# add the key to the ssh agent on the MASTER (not sure if needed, but better safe than sorry):
ssh-add ~/.ssh/id_rsa
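To double-check the passwordless setup before launching anything, a minimal sanity check could look like this (a sketch; 'master' and 'slave' are the aliases from the example hostfile below):
# every host must answer without prompting for a password
import subprocess

for host in ("master", "slave"):  # hostnames/aliases from the hostfile below
    # BatchMode=yes makes ssh fail immediately instead of asking for a password
    subprocess.run(["ssh", "-o", "BatchMode=yes", host, "hostname"], check=True)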
- We need a hostfile and a ds_config.json that look like this (make the required modifications depending on what you need, these are just examples):
----------------- hostfile -----------------
# works with either SSH aliases or IP addresses directly; slots is the number of GPUs on the node
master slots=1
slave slots=1
--------------- ds_config.json --------------
{
    "train_micro_batch_size_per_gpu": 1,
    "gradient_accumulation_steps": 8,
    "optimizer": {
        "type": "Adam",
        "params": {
            "lr": 1e-4,
            "betas": [0.9, 0.99]
        }
    },
    "scheduler": {
        "type": "WarmupDecayLR",
        "params": {
            "total_num_steps": 500000,
            "warmup_min_lr": 0,
            "warmup_max_lr": 1e-4,
            "warmup_num_steps": 1000
        }
    }
}
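Note: since the training snippet below feeds the model half-precision inputs (x.half()), you will most likely also want DeepSpeed's fp16 mode enabled in the same config; a minimal stanza (with default loss scaling; this is an assumption on my side, it was not part of the example above) would be:
"fp16": {
    "enabled": true
}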
- Prepare the training script correctly:
import os

import deepspeed
import torch.distributed as dist
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler

# Initialize the distributed process group from the OpenMPI environment variables
rank = int(os.environ["OMPI_COMM_WORLD_RANK"])
local_rank = int(os.environ["OMPI_COMM_WORLD_LOCAL_RANK"])
world_size = int(os.environ["OMPI_COMM_WORLD_SIZE"])
deepspeed.init_distributed(dist_backend="nccl", rank=rank, world_size=world_size)
# initialize the model engine that will take care of everything else
# (create_moe_param_groups is a user-defined helper that builds the parameter groups)
model_engine, optimizer, _, scheduler = deepspeed.initialize(
    args=args, model=model, model_parameters=create_moe_param_groups(model))
# build the train and validation dataloaders with a distributed sampler,
# using the micro batch size taken from the DeepSpeed config
batch_size = model_engine.train_micro_batch_size_per_gpu()
train_sampler = DistributedSampler(dataset=train_ds)
val_sampler = DistributedSampler(dataset=val_ds)
train_dl = DataLoader(..., batch_size=batch_size, sampler=train_sampler)
val_dl = DataLoader(..., batch_size=batch_size, sampler=val_sampler)
# compute and backpropagate the loss (in this case we use mixed precision,
# hence the inputs are cast to half precision)
y_hat = model_engine(x.half())
loss = compute_loss(y_hat, y)
model_engine.backward(loss)
model_engine.step()
# to have a better validation assessment, aggregate metrics across devices in evaluation;
# dist.reduce sums the losses onto rank 0, so the averaged value is only meaningful there
dist.reduce(loss, dst=0, op=dist.ReduceOp.SUM)
val_loss += loss.detach().item() / world_size
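Putting the pieces together, a minimal epoch loop could look like the sketch below. The call to train_sampler.set_epoch is needed so that the DistributedSampler reshuffles differently at every epoch; num_epochs and save_dir are placeholders I'm assuming for illustration:
# minimal epoch loop sketch (num_epochs and save_dir are placeholder assumptions)
for epoch in range(num_epochs):
    # make the DistributedSampler shuffle differently at every epoch
    train_sampler.set_epoch(epoch)
    model_engine.train()
    for x, y in train_dl:
        x, y = x.to(model_engine.device), y.to(model_engine.device)
        y_hat = model_engine(x.half())
        loss = compute_loss(y_hat, y)
        model_engine.backward(loss)
        model_engine.step()
    # save a checkpoint on every rank (DeepSpeed coordinates the writes)
    model_engine.save_checkpoint(save_dir, tag=f"epoch_{epoch}")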
- Finally, this is the launch command for DeepSpeed with OpenMPI; it has to be run exclusively on the MASTER node:
deepspeed --hostfile=path/to/hostfile \
--master_addr=xxx.xxx.xxx.xxx \
--master_port=1234 \
--launcher=openmpi \
--launcher_args="-bind-to none -map-by slot -x PATH=$PATH:/path/to/venv/my_venv/bin" \
train.py --deepspeed --deepspeed_config "path/to/ds_config.json"
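For the --deepspeed and --deepspeed_config flags above to reach the script, train.py has to parse them; a minimal sketch using DeepSpeed's helper for this (the --local_rank argument is a common addition I'm including as an assumption, since the OpenMPI env vars are used above anyway):
import argparse

import deepspeed

parser = argparse.ArgumentParser(description="DeepSpeed training script")
# local rank placeholder, commonly expected by launchers (assumption)
parser.add_argument("--local_rank", type=int, default=-1)
# adds --deepspeed, --deepspeed_config and the other DeepSpeed-specific flags
parser = deepspeed.add_config_arguments(parser)
args = parser.parse_args()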