Daniil Larionov (Rexhaif)

  • @NL2G Universität Mannheim
  • Germany
@TengdaHan
TengdaHan / ddp_notes.md
Last active July 2, 2024 06:39
Multi-node-training on slurm with PyTorch

What's this?

  • A short note on how to start multi-node training on the slurm scheduler with PyTorch.
  • Especially useful when the scheduler is so busy that you cannot get multiple GPUs allocated on a single node, or when you need more than 4 GPUs for one job.
  • Requirement: you have to use PyTorch DistributedDataParallel (DDP) for this.
  • Warning: you might need to refactor your own code.
  • Warning: you might be secretly condemned by your colleagues for using too many GPUs.
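The pattern the note describes can be sketched as follows. This is not code from the gist, just a minimal illustration assuming env://-style initialization, one slurm task per GPU, and a hypothetical helper name `setup_ddp`:

```python
import os

def slurm_env_to_ddp():
    """Map slurm's per-task environment onto DDP rank variables.

    Assumes one slurm task per GPU (e.g. srun --ntasks-per-node=4);
    this mapping is an illustrative assumption, not code from the gist.
    """
    rank = int(os.environ["SLURM_PROCID"])         # global rank across all nodes
    world_size = int(os.environ["SLURM_NTASKS"])   # total number of tasks (= GPUs)
    local_rank = int(os.environ["SLURM_LOCALID"])  # task index within this node
    return rank, world_size, local_rank

def setup_ddp(master_addr, master_port=29500):
    # torch is imported here so the env-parsing helper above stays
    # importable on machines without PyTorch installed.
    import torch
    import torch.distributed as dist

    rank, world_size, local_rank = slurm_env_to_ddp()
    os.environ["MASTER_ADDR"] = master_addr        # hostname of the rank-0 node
    os.environ["MASTER_PORT"] = str(master_port)
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(local_rank)
    return local_rank
```

In a slurm batch script, the master address is typically the first hostname in `scontrol show hostnames "$SLURM_JOB_NODELIST"`, passed to every task so all ranks rendezvous on the same node.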
@lantiga
lantiga / export_trace.py
Last active September 18, 2022 03:03
🤗 Huggingface Bert on RedisAI
from transformers import BertForQuestionAnswering
import torch

bert_name = "bert-large-uncased-whole-word-masking-finetuned-squad"
model = BertForQuestionAnswering.from_pretrained(bert_name, torchscript=True)
model.eval()

# Dummy (input_ids, attention_mask) pair, used only to trace the graph.
inputs = [torch.ones(1, 2, dtype=torch.int64),
          torch.ones(1, 2, dtype=torch.int64)]

# The snippet is cut off here; tracing and saving is an assumed completion.
traced_model = torch.jit.trace(model, tuple(inputs))
traced_model.save("bert_qa_traced.pt")
@mohanpedala
mohanpedala / bash_strict_mode.md
Last active July 4, 2024 12:40
set -e, -u, -o, -x pipefail explanation