TengdaHan / ddp_notes.md (last active May 21, 2024)

# Multi-node training on slurm with PyTorch

## What's this?

- A simple note on how to start multi-node training on a slurm scheduler with PyTorch.
- Useful especially when the scheduler is so busy that you cannot get multiple GPUs allocated on a single node, or when you need more than 4 GPUs for a single job.
- Requirement: you have to use PyTorch DistributedDataParallel (DDP) for this purpose (a minimal sketch follows this list).
- Warning: you might need to refactor your own code.
- Warning: your colleagues might secretly condemn you for using too many GPUs.
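
As a rough idea of what the DDP requirement looks like in practice, here is a minimal sketch (not the full recipe from this note). It assumes the processes are launched with `srun` so slurm's per-task environment variables (`SLURM_PROCID`, `SLURM_NTASKS`, `SLURM_LOCALID`) are set, and that `MASTER_ADDR`/`MASTER_PORT` were exported in the sbatch script; the toy model is purely illustrative.

```python
# Minimal sketch: initialize DDP from the environment variables slurm sets
# for each task. Assumes launch via `srun` and that MASTER_ADDR / MASTER_PORT
# were exported in the sbatch script.
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP


def setup_distributed():
    rank = int(os.environ["SLURM_PROCID"])         # global rank across all nodes
    world_size = int(os.environ["SLURM_NTASKS"])   # total number of processes
    local_rank = int(os.environ["SLURM_LOCALID"])  # rank within this node
    dist.init_process_group(
        backend="nccl",
        init_method="env://",  # reads MASTER_ADDR / MASTER_PORT from the environment
        rank=rank,
        world_size=world_size,
    )
    torch.cuda.set_device(local_rank)
    return rank, local_rank


if __name__ == "__main__":
    rank, local_rank = setup_distributed()
    model = torch.nn.Linear(10, 10).cuda(local_rank)  # toy model for illustration
    model = DDP(model, device_ids=[local_rank])
    if rank == 0:
        print(f"DDP initialized with world size {dist.get_world_size()}")
```

Each process then sees only its own GPU and wraps the model in DDP; gradients are averaged across all processes on every backward pass, which is what makes the multi-node setup behave like one large single-node job.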