Multi-node-training on slurm with PyTorch

What's this?

  • A simple note on how to start multi-node training on the SLURM scheduler with PyTorch.
  • Useful especially when the scheduler is so busy that you cannot get multiple GPUs allocated on a single node, or when you need more than 4 GPUs for a single job.
  • Requirement: you have to use PyTorch DistributedDataParallel (DDP) for this purpose (a minimal setup sketch follows this list).
  • Warning: you might need to refactor your own code.
  • Warning: you might be secretly condemned by your colleagues for using too many GPUs.
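
As a preview of what the note covers, here is a minimal sketch of the DDP initialization this workflow relies on. It assumes one srun task per GPU and that MASTER_ADDR/MASTER_PORT are exported in the sbatch script; the helper name `init_distributed` and the toy model are illustrative, not part of the original note.

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def init_distributed():
    # SLURM sets these for every task launched by srun.
    rank = int(os.environ["SLURM_PROCID"])        # global rank across all nodes
    world_size = int(os.environ["SLURM_NTASKS"])  # total number of processes
    local_rank = int(os.environ["SLURM_LOCALID"]) # rank within this node -> GPU index

    # "env://" reads MASTER_ADDR / MASTER_PORT, which the sbatch script must export.
    dist.init_process_group(backend="nccl", init_method="env://",
                            rank=rank, world_size=world_size)
    torch.cuda.set_device(local_rank)
    return rank, local_rank, world_size

if __name__ == "__main__":
    rank, local_rank, world_size = init_distributed()
    model = torch.nn.Linear(10, 10).cuda(local_rank)
    ddp_model = DDP(model, device_ids=[local_rank])
    # ... training loop: pair this with a DistributedSampler so each rank sees its own data shard
```

Launched with one srun task per GPU under an sbatch allocation, every process reads its own rank from the SLURM environment, and the DDP wrapper handles gradient synchronization across nodes.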