@pemagrg1
pemagrg1 / convert_envyml_to_reqtxt
Created April 22, 2020 17:51
convert environment.yml to requirement.txt
import ruamel.yaml

yaml = ruamel.yaml.YAML()
data = yaml.load(open('environment.yml'))

requirements = []
for dep in data['dependencies']:
    if isinstance(dep, str):
        # conda pins look like "name=version=build"
        package, package_version, python_version = dep.split('=')
        if python_version == '0':
            continue
        requirements.append(package + '==' + package_version)

with open('requirements.txt', 'w') as fp:
    fp.write('\n'.join(requirements) + '\n')
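As written, the script reads environment.yml from the working directory, keeps only the pinned conda entries of the form name=version=build (skipping any entry whose third field is '0'), and writes them as name==version lines to requirements.txt; anything nested under a pip: key in the YAML is not carried over.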
@TengdaHan
TengdaHan / ddp_notes.md
Last active July 2, 2024 06:39
Multi-node-training on slurm with PyTorch

What's this?

  • A simple note on how to start multi-node training on the slurm scheduler with PyTorch.
  • Especially useful when the scheduler is too busy for you to get multiple GPUs allocated, or when you need more than 4 GPUs for a single job.
  • Requirement: you have to use PyTorch DistributedDataParallel (DDP) for this purpose; a minimal initialization sketch follows this list.
  • Warning: you might need to refactor your own code.
  • Warning: you might be secretly condemned by your colleagues for using too many GPUs.
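
A minimal sketch of the pattern these notes describe, assuming each training process is launched with srun so that slurm exposes SLURM_PROCID, SLURM_NTASKS, and SLURM_LOCALID to it; the helper name, master-address argument, and port below are placeholders, not part of the original notes:

import os
import torch
import torch.distributed as dist

def init_distributed_from_slurm(master_addr, port=29500):
    """Set up one DDP process per GPU from slurm's environment variables."""
    rank = int(os.environ['SLURM_PROCID'])         # global rank across all nodes
    world_size = int(os.environ['SLURM_NTASKS'])   # total number of processes
    local_rank = int(os.environ['SLURM_LOCALID'])  # rank within this node -> GPU index

    os.environ['MASTER_ADDR'] = master_addr        # hostname of the rank-0 node
    os.environ['MASTER_PORT'] = str(port)          # any free port, identical on every node

    dist.init_process_group(backend='nccl', rank=rank, world_size=world_size)
    torch.cuda.set_device(local_rank)
    return rank, local_rank, world_size

Each process then wraps its model in DistributedDataParallel(model, device_ids=[local_rank]) and feeds it with a DistributedSampler so every rank trains on a distinct shard of the data.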
@sgraaf
sgraaf / ddp_example.py
Last active June 7, 2024 16:26
PyTorch Distributed Data Parallel (DDP) example
#!/usr/bin/env python
# -*- coding: utf-8 -*-
from argparse import ArgumentParser
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, Dataset
from torch.utils.data.distributed import DistributedSampler
from transformers import BertForMaskedLM
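
The listing cuts the gist off at its imports. Below is a minimal sketch of how a script with these imports is typically wired together; the toy dataset, the --local_rank argument (as passed by torch.distributed.launch), and the training-loop details are assumptions for illustration, not the gist's actual body. It reuses the imports shown above.

class ToyMaskedLMDataset(Dataset):
    """Random token ids standing in for real tokenized text (illustrative only)."""
    def __init__(self, n_examples=256, seq_len=32, vocab_size=30522):
        self.data = torch.randint(0, vocab_size, (n_examples, seq_len))

    def __len__(self):
        return self.data.size(0)

    def __getitem__(self, idx):
        ids = self.data[idx]
        return {'input_ids': ids, 'labels': ids}


def main():
    parser = ArgumentParser()
    parser.add_argument('--local_rank', type=int, default=0)
    args = parser.parse_args()

    # one process per GPU; torch.distributed.launch sets RANK/WORLD_SIZE in the environment
    dist.init_process_group(backend='nccl')
    torch.cuda.set_device(args.local_rank)

    model = BertForMaskedLM.from_pretrained('bert-base-uncased').cuda(args.local_rank)
    model = DDP(model, device_ids=[args.local_rank])
    optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

    dataset = ToyMaskedLMDataset()
    sampler = DistributedSampler(dataset)          # each rank sees a distinct shard
    loader = DataLoader(dataset, batch_size=8, sampler=sampler)

    for epoch in range(3):
        sampler.set_epoch(epoch)                   # reshuffle differently every epoch
        for batch in loader:
            batch = {k: v.cuda(args.local_rank) for k, v in batch.items()}
            loss = model(**batch).loss
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()


if __name__ == '__main__':
    main()

Launched with something like python -m torch.distributed.launch --nproc_per_node=NUM_GPUS ddp_example.py, this runs one copy of main() per GPU, and DDP averages gradients across ranks during backward().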