@jiahao87
Last active May 29, 2024 18:00
PyTorch script for fine-tuning the Pegasus Large model
"""Script for fine-tuning Pegasus
Example usage:
# use XSum dataset as example, with first 1000 docs as training data
from datasets import load_dataset
dataset = load_dataset("xsum")
train_texts, train_labels = dataset['train']['document'][:1000], dataset['train']['summary'][:1000]
# use Pegasus Large model as base for fine-tuning
model_name = 'google/pegasus-large'
train_dataset, _, _, tokenizer = prepare_data(model_name, train_texts, train_labels)
trainer = prepare_fine_tuning(model_name, tokenizer, train_dataset)
trainer.train()
Reference:
https://huggingface.co/transformers/master/custom_datasets.html
"""
from transformers import PegasusForConditionalGeneration, PegasusTokenizer, Trainer, TrainingArguments
import torch
class PegasusDataset(torch.utils.data.Dataset):
def __init__(self, encodings, labels):
self.encodings = encodings
self.labels = labels
def __getitem__(self, idx):
item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
item['labels'] = torch.tensor(self.labels['input_ids'][idx]) # torch.tensor(self.labels[idx])
return item
def __len__(self):
return len(self.labels['input_ids']) # len(self.labels)
def prepare_data(model_name,
train_texts, train_labels,
val_texts=None, val_labels=None,
test_texts=None, test_labels=None):
"""
Prepare input data for model fine-tuning
"""
tokenizer = PegasusTokenizer.from_pretrained(model_name)
prepare_val = False if val_texts is None or val_labels is None else True
prepare_test = False if test_texts is None or test_labels is None else True
def tokenize_data(texts, labels):
encodings = tokenizer(texts, truncation=True, padding=True)
decodings = tokenizer(labels, truncation=True, padding=True)
dataset_tokenized = PegasusDataset(encodings, decodings)
return dataset_tokenized
train_dataset = tokenize_data(train_texts, train_labels)
val_dataset = tokenize_data(val_texts, val_labels) if prepare_val else None
test_dataset = tokenize_data(test_texts, test_labels) if prepare_test else None
return train_dataset, val_dataset, test_dataset, tokenizer
def prepare_fine_tuning(model_name, tokenizer, train_dataset, val_dataset=None, freeze_encoder=False, output_dir='./results'):
"""
Prepare configurations and base model for fine-tuning
"""
torch_device = 'cuda' if torch.cuda.is_available() else 'cpu'
model = PegasusForConditionalGeneration.from_pretrained(model_name).to(torch_device)
if freeze_encoder:
for param in model.model.encoder.parameters():
param.requires_grad = False
if val_dataset is not None:
training_args = TrainingArguments(
output_dir=output_dir, # output directory
num_train_epochs=2000, # total number of training epochs
per_device_train_batch_size=1, # batch size per device during training, can increase if memory allows
per_device_eval_batch_size=1, # batch size for evaluation, can increase if memory allows
save_steps=500, # number of updates steps before checkpoint saves
save_total_limit=5, # limit the total amount of checkpoints and deletes the older checkpoints
evaluation_strategy='steps', # evaluation strategy to adopt during training
eval_steps=100, # number of update steps before evaluation
warmup_steps=500, # number of warmup steps for learning rate scheduler
weight_decay=0.01, # strength of weight decay
logging_dir='./logs', # directory for storing logs
logging_steps=10,
)
trainer = Trainer(
model=model, # the instantiated 🤗 Transformers model to be trained
args=training_args, # training arguments, defined above
train_dataset=train_dataset, # training dataset
eval_dataset=val_dataset, # evaluation dataset
tokenizer=tokenizer
)
else:
training_args = TrainingArguments(
output_dir=output_dir, # output directory
num_train_epochs=2000, # total number of training epochs
per_device_train_batch_size=1, # batch size per device during training, can increase if memory allows
save_steps=500, # number of updates steps before checkpoint saves
save_total_limit=5, # limit the total amount of checkpoints and deletes the older checkpoints
warmup_steps=500, # number of warmup steps for learning rate scheduler
weight_decay=0.01, # strength of weight decay
logging_dir='./logs', # directory for storing logs
logging_steps=10,
)
trainer = Trainer(
model=model, # the instantiated 🤗 Transformers model to be trained
args=training_args, # training arguments, defined above
train_dataset=train_dataset, # training dataset
tokenizer=tokenizer
)
return trainer
if __name__=='__main__':
# use XSum dataset as example, with first 1000 docs as training data
from datasets import load_dataset
dataset = load_dataset("xsum")
train_texts, train_labels = dataset['train']['document'][:1000], dataset['train']['summary'][:1000]
# use Pegasus Large model as base for fine-tuning
model_name = 'google/pegasus-large'
train_dataset, _, _, tokenizer = prepare_data(model_name, train_texts, train_labels)
trainer = prepare_fine_tuning(model_name, tokenizer, train_dataset)
trainer.train()
@superlyc

superlyc commented Jun 9, 2021

Hi, I am experimenting with your script. I am quite new to the Hugging Face Trainer. Would you help answer one question? I don't understand why I get OOM on CUDA if I use the whole dataset instead of just the first 1000 examples ([:1000]) as in the script. Without changing any other parameters (especially batch_size), why does training on more data cause OOM? Thank you in advance.

@jiahao87
Author

Hi, I am experimenting with your script. I am quite new to the Hugging Face Trainer. Would you help answer one question? I don't understand why I get OOM on CUDA if I use the whole dataset instead of just the first 1000 examples ([:1000]) as in the script. Without changing any other parameters (especially batch_size), why does training on more data cause OOM? Thank you in advance.

Hi @superlyc, the memory issue is probably not because of Hugging Face's Trainer but because of our custom PegasusDataset.

Currently, all of the encodings are loaded into our PegasusDataset at once. Compare that against the way the Dataset class is written here. Hence, you may need to rewrite this portion of the code and the way the data is loaded in order to reduce memory usage; a lazily-tokenizing sketch follows the snippet below. Hope this clarifies.

class PegasusDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
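For illustration, here is a minimal sketch (not the original gist code; the class name and max lengths are placeholders) of a dataset that keeps the raw texts and tokenizes one example at a time in __getitem__, so only a single example's tensors are materialized per access:

import torch

# Lazily-tokenizing variant: assumes `tokenizer`, `train_texts` and `train_labels`
# as prepared elsewhere in this gist; tokenization happens per item instead of all up front.
class LazyPegasusDataset(torch.utils.data.Dataset):
    def __init__(self, texts, labels, tokenizer, max_input_len=512, max_target_len=64):
        self.texts = texts
        self.labels = labels
        self.tokenizer = tokenizer
        self.max_input_len = max_input_len
        self.max_target_len = max_target_len

    def __getitem__(self, idx):
        enc = self.tokenizer(self.texts[idx], truncation=True, padding='max_length',
                             max_length=self.max_input_len, return_tensors='pt')
        dec = self.tokenizer(self.labels[idx], truncation=True, padding='max_length',
                             max_length=self.max_target_len, return_tensors='pt')
        return {
            'input_ids': enc['input_ids'].squeeze(0),
            'attention_mask': enc['attention_mask'].squeeze(0),
            'labels': dec['input_ids'].squeeze(0),
        }

    def __len__(self):
        return len(self.texts)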

@superlyc

Hi, I am experimenting with your script. I am quite new to the Hugging Face Trainer. Would you help answer one question? I don't understand why I get OOM on CUDA if I use the whole dataset instead of just the first 1000 examples ([:1000]) as in the script. Without changing any other parameters (especially batch_size), why does training on more data cause OOM? Thank you in advance.

Hi @superlyc, the memory issue is probably not because of Hugging Face's Trainer but because of our custom PegasusDataset.

Currently, all of the encodings are loaded into our PegasusDataset at once. Compare that against the way the Dataset class is written here. Hence, you may need to rewrite this portion of the code and the way the data is loaded in order to reduce memory usage. Hope this clarifies.

class PegasusDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings

Thank you

@rishav2416

Hi @jiahao87, I have been trying to run your script in a notebook instance on AWS SageMaker which has 8 GPUs, each with 12 GB. Every time I try to run your script with absolutely no changes, I get the following error: RuntimeError: CUDA out of memory. Tried to allocate 64.00 MiB (GPU 0; 11.17 GiB total capacity; 10.49 GiB already allocated; 61.44 MiB free; 10.65 GiB reserved in total by PyTorch)
Could you please help?

@slvcsl

slvcsl commented Jun 24, 2021

Hi @jiahao87, I have a couple of questions:

  1. What is the difference between this script and the summarization example (transformers/examples/seq2seq/run_summarization.py)? Is the example supposed to work with pegasus-large?
  2. How much GPU RAM is Pegasus large expected to use? I am currently trying to fine-tune the model (using the example, not this script, but I'll try it out) on big_patent on two RTX2080ti (11GB each) and get the OOM error even with input/output max_length = 10 and batch size = 1. Is this expected?

RuntimeError: CUDA out of memory. Tried to allocate 16.00 MiB (GPU 0; 10.76 GiB total capacity; 9.58 GiB already allocated; 2.62 MiB free; 9.77 GiB reserved in total by PyTorch)

@jiahao87
Author

Hi @jiahao87, I have been trying to run your script in a notebook instance on AWS SageMaker which has 8 GPUs, each with 12 GB. Every time I try to run your script with absolutely no changes, I get the following error: RuntimeError: CUDA out of memory. Tried to allocate 64.00 MiB (GPU 0; 11.17 GiB total capacity; 10.49 GiB already allocated; 61.44 MiB free; 10.65 GiB reserved in total by PyTorch)
Could you please help?

Hi @rishav2416, fine-tuning the full Pegasus large model is indeed resource intensive. I was only able to run the fine-tuning on Colab (GPU with 12GB RAM) when I freeze the encoder (see line below). Which notebook instance type are you using? You may wish to experiment with other instance types.

trainer = prepare_fine_tuning(model_name, tokenizer, train_dataset, freeze_encoder=True)
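In case it is useful, here is a quick sketch (not part of the gist) to confirm what ended up frozen; it assumes access to the returned trainer:

# Count trainable vs. total parameters to verify the encoder is indeed frozen.
model = trainer.model
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"trainable params: {trainable:,} of {total:,}")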

@jiahao87
Author

Hi @jiahao87, I have a couple of questions:

1. What is the difference between this script and the summarization example (transformers/examples/seq2seq/run_summarization.py)? Is the example supposed to work with pegasus-large?

2. How much GPU RAM is Pegasus large expected to use? I am currently trying to fine-tune the model (using the example, not this script, but I'll try it out) on big_patent on two RTX2080ti (11GB each) and get the OOM error even with input/output max_length = 10 and batch size = 1. Is this expected?

RuntimeError: CUDA out of memory. Tried to allocate 16.00 MiB (GPU 0; 10.76 GiB total capacity; 9.58 GiB already allocated; 2.62 MiB free; 9.77 GiB reserved in total by PyTorch)

Hi @slvcsl, the main difference seems to be that the summarization example uses the Seq2SeqTrainer class, while this script uses the Trainer class. As pointed out here, the difference between these 2 classes is that Seq2SeqTrainer is a subclass of Trainer. You can read the link provided for details.

As for the memory usage, you may wish to refer to the above replies. Unfortunately, I do not have a specific number for the amount of GPU RAM that Pegasus large is expected to use. If anyone else reading this comment is able to chip in, please do so. Hope the earlier reply was able to help you to some extent.
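For anyone curious, a rough sketch of what switching to Seq2SeqTrainer could look like (argument values are illustrative; model, tokenizer, train_dataset and val_dataset are assumed to come from the functions in this gist):

from transformers import Seq2SeqTrainer, Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir='./results',
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,
    predict_with_generate=True,   # run model.generate() during evaluation
    logging_steps=10,
)
trainer = Seq2SeqTrainer(
    model=model,                  # model, tokenizer and datasets as prepared above
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    tokenizer=tokenizer,
)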

@karimfayed

Hi @jiahao87, I have a couple of questions:

1. What is the difference between this script and the summarization example (transformers/examples/seq2seq/run_summarization.py)? Is the example supposed to work with pegasus-large?

2. How much GPU RAM is Pegasus large expected to use? I am currently trying to fine-tune the model (using the example, not this script, but I'll try it out) on big_patent on two RTX2080ti (11GB each) and get the OOM error even with input/output max_length = 10 and batch size = 1. Is this expected?

RuntimeError: CUDA out of memory. Tried to allocate 16.00 MiB (GPU 0; 10.76 GiB total capacity; 9.58 GiB already allocated; 2.62 MiB free; 9.77 GiB reserved in total by PyTorch)

Hi @slvcsl, the main difference seems to be that the summarization example uses the Seq2SeqTrainer class, while this script uses the Trainer class. As pointed out here, the difference between these 2 classes is that Seq2SeqTrainer is a subclass of Trainer. You can read the link provided for details.

As for the memory usage, you may wish to refer to the above replies. Unfortunately, I do not have a specific number for the amount of GPU RAM that Pegasus large is expected to use. If anyone else reading this comment is able to chip in, please do so. Hope the earlier reply was able to help you to some extent.

I had this problem early on and I was told that 16 GB or more is recommended; for further help, this is the issue, which also has other recommendations.

@MariaMegalli

Hello @jiahao87, can you please help me and explain why the number of steps is sometimes double the number of epochs and sometimes the same? For example:
For batch size = 1, training dataset = 1000 and epochs = 2000, the steps = 4000
while
for batch size = 2, training dataset = 1000 and epochs = 2000, the steps = 2000.
Can you also explain steps and their role? These scenarios left me confused.

@jiahao87
Author

jiahao87 commented Jul 2, 2021

Hi @jiahao87, I have a couple of questions:

1. What is the difference between this script and the summarization example (transformers/examples/seq2seq/run_summarization.py)? Is the example supposed to work with pegasus-large?

2. How much GPU RAM is Pegasus large expected to use? I am currently trying to fine-tune the model (using the example, not this script, but I'll try it out) on big_patent on two RTX2080ti (11GB each) and get the OOM error even with input/output max_length = 10 and batch size = 1. Is this expected?

RuntimeError: CUDA out of memory. Tried to allocate 16.00 MiB (GPU 0; 10.76 GiB total capacity; 9.58 GiB already allocated; 2.62 MiB free; 9.77 GiB reserved in total by PyTorch)

Hi @slvcsl, the main difference seems to be that the summarization example uses the Seq2SeqTrainer class, while this script uses the Trainer class. As pointed out here, the difference between these 2 classes is that Seq2SeqTrainer is a subclass of Trainer. You can read the link provided for details.
As for the memory usage, you may wish to refer to the above replies. Unfortunately, I do not have a specific number for the amount of GPU RAM that Pegasus large is expected to use. If anyone else reading this comment is able to chip in, please do so. Hope the earlier reply was able to help you to some extent.

I had this problem early on and I was told that 16 GB or more is recommended; for further help, this is the issue, which also has other recommendations.

@karimfayed, thank you for the link. That was useful.

@jiahao87
Author

jiahao87 commented Jul 2, 2021

Hello @jiahao87, can you please help me and explain why the number of steps is sometimes double the number of epochs and sometimes the same? For example:
For batch size = 1, training dataset = 1000 and epochs = 2000, the steps = 4000
while
for batch size = 2, training dataset = 1000 and epochs = 2000, the steps = 2000.
Can you also explain steps and their role? These scenarios left me confused.

@MariaMegalli, thank you for pointing this issue out. I have edited the code (see below); with the new code, the number of steps should now make sense. Let me know if you still encounter issues. Thank you.

    def __len__(self):
        return len(self.labels['input_ids'])
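For reference, with the fixed __len__ the step count reported by Trainer should roughly follow this arithmetic (single device and no gradient accumulation assumed):

import math

# steps_per_epoch = ceil(dataset_size / per_device_train_batch_size)
# total_steps     = steps_per_epoch * num_train_epochs
dataset_size, batch_size, num_epochs = 1000, 1, 2        # illustrative numbers
steps_per_epoch = math.ceil(dataset_size / batch_size)   # 1000
total_steps = steps_per_epoch * num_epochs               # 2000
print(total_steps)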

@karimfayed

@jiahao87, is there a way to convert the model after fine-tuning from PyTorch to TensorFlow so I can use it in a JavaScript backend?

@MariaMegalli

hi @jiahao87, what is the default maximum input and output length in this script?

@jiahao87
Author

@jiahao87, is there a way to convert the model after fine-tuning from PyTorch to TensorFlow so I can use it in a JavaScript backend?

@karimfayed, try ONNX

@jiahao87
Author

hi @jiahao87, what is the default maximum input and output length in this script?

Hi @MariaMegalli, please see Hugging Face's config here.
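If it helps, here is a quick way to inspect the limits yourself; the commented values are what I would expect for google/pegasus-large, but do verify against the actual config:

from transformers import PegasusConfig, PegasusTokenizer

config = PegasusConfig.from_pretrained('google/pegasus-large')
tokenizer = PegasusTokenizer.from_pretrained('google/pegasus-large')

print(tokenizer.model_max_length)       # maximum input length the tokenizer truncates to (1024)
print(config.max_position_embeddings)   # position-embedding limit of the model (1024)
print(config.max_length)                # default maximum generation length (256)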

@Oxi84

Oxi84 commented Sep 15, 2021

I do not understand where the saved model is. I specified

Hi @jiahao87, I have been trying to run your script in a notebook instance on AWS SageMaker which has 8 GPUs, each with 12 GB. Every time I try to run your script with absolutely no changes, I get the following error: RuntimeError: CUDA out of memory. Tried to allocate 64.00 MiB (GPU 0; 11.17 GiB total capacity; 10.49 GiB already allocated; 61.44 MiB free; 10.65 GiB reserved in total by PyTorch)
Could you please help?

Hi @rishav2416, fine-tuning the full Pegasus large model is indeed resource intensive. I was only able to run the fine-tuning on Colab (GPU with 12GB RAM) when I freeze the encoder (see line below). Which notebook instance type are you using? You may wish to experiment with other instance types.

trainer = prepare_fine_tuning(model_name, tokenizer, train_dataset, freeze_encoder=True)

If you freeze the encoder like this, does it decrease performance? And would this also freeze the input embedding?

Also, you should add trainer.save_model("output_dir"). And these checkpoints do use a lot of space.
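For reference, a minimal save-and-reload sketch (the directory name is just a placeholder):

from transformers import PegasusForConditionalGeneration, PegasusTokenizer

# Persist the fine-tuned weights after trainer.train(), then reload them later.
trainer.save_model('./fine_tuned_pegasus')           # writes model weights + config
tokenizer.save_pretrained('./fine_tuned_pegasus')    # keep the tokenizer alongside the model

model = PegasusForConditionalGeneration.from_pretrained('./fine_tuned_pegasus')
tokenizer = PegasusTokenizer.from_pretrained('./fine_tuned_pegasus')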

@AnveshAeturi

AnveshAeturi commented Sep 16, 2021

It is throwing the error below; can anybody help me out?

-> 16 batch = tokenizer.prepare_seq2seq_batch(src_text, truncation=True, padding='longest').to(torch_device)
17 translated = model.generate(**batch)
18 tgt_text = tokenizer.batch_decode(translated, skip_special_tokens=True)

AttributeError: 'NoneType' object has no attribute 'prepare_seq2seq_batch'

@Berowne

Berowne commented Dec 9, 2021

Thank you for the script... very helpful!
I was struggling with Hugging Face's samples based on T5 before.

@keloemma

Is it possible to modify the PegasusDataset class to use a dataset with no labels? If yes, which modifications should be made?

@jiahao87
Author

Is it possible to modify the PegasusDataset class to use a dataset with no labels? If yes, which modifications should be made?

@keloemma, I don't think so. Labels are needed since we are doing supervised training, be it manually created labels or auto-generated labels.
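To illustrate why: the Trainer optimizes the loss returned by the model, and PegasusForConditionalGeneration only computes a loss when labels is passed. A minimal sketch, assuming model and train_dataset from the script above:

# With labels, the model returns a cross-entropy loss over the summary tokens;
# without them there is nothing for the Trainer to optimize.
# (Move the tensors to the model's device first if it is on GPU.)
batch = train_dataset[0]
outputs = model(input_ids=batch['input_ids'].unsqueeze(0),
                attention_mask=batch['attention_mask'].unsqueeze(0),
                labels=batch['labels'].unsqueeze(0))
print(outputs.loss)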

@keloemma

keloemma commented Dec 11, 2021 via email

@Darshan2104

Hey, I have been trying to fine-tune PEGASUS-large in Google Colab (basic, with 12 GB RAM) but it crashes every time. Can anybody help me with this issue and also suggest the best arguments so that it can run on Colab?

Thanks :)

@karimfayed

Hey @Darshan2104,
I have faced this problem before; unfortunately, you won't be able to fine-tune Pegasus using basic Colab. You will need to subscribe to Colab Pro, as the computational power needed for fine-tuning Pegasus is quite large. Also make sure to use a GPU with 16280 MB of memory.

@dzombiee

@Darshan2104
Freeze the encoder as shown below:
def prepare_fine_tuning(model_name, tokenizer, train_dataset, val_dataset=None, freeze_encoder=True, output_dir='results')

Also reduce the batch size to 2/4/8, and try reducing the training data to the first 1000 rows.

I was able to fine-tune PEGASUS-large by doing the things mentioned.

@KFati

KFati commented Jul 20, 2022

Hello,
Do you have separate code for the encoder-decoder model used in the Pegasus model?

@aayushee-gooru

Hi

Thank you for this easy-to-understand fine-tuning script. I am fine-tuning pegasus-wikihow on Google Colab with 1000 examples from a custom dataset and a per-device batch size of 2.
I was wondering whether anyone has experimented with the fp16 training parameter to train faster. Please let me know if it worked for you.
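In case it helps anyone trying this, mixed precision is a single flag in TrainingArguments; the sketch below is untested on pegasus-wikihow and requires a CUDA GPU with fp16 support:

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=2,              # illustrative value
    per_device_train_batch_size=2,
    fp16=True,                       # 16-bit mixed precision to cut memory use and speed up training
    logging_steps=10,
)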

@NamraRehman

Thank you for the script.
Here are a few concerns I am facing while tuning.

  1. The tuning consumes too much space (I ran it on Kaggle); the output directory got full after only 8 epochs with just 400 training samples.
  2. Additionally, I can't see where the model is saved; rather, all the space is taken up by checkpoints.
  3. Lastly, do you have code for the inference/validation part?

@NamraRehman

I do not understand where the saved model is. I specified

Hi @jiahao87, I have been trying to run your script in a notebook instance on AWS SageMaker which has 8 GPUs, each with 12 GB. Every time I try to run your script with absolutely no changes, I get the following error: RuntimeError: CUDA out of memory. Tried to allocate 64.00 MiB (GPU 0; 11.17 GiB total capacity; 10.49 GiB already allocated; 61.44 MiB free; 10.65 GiB reserved in total by PyTorch)
Could you please help?

Hi @rishav2416, fine-tuning the full Pegasus large model is indeed resource intensive. I was only able to run the fine-tuning on Colab (GPU with 12GB RAM) when I freeze the encoder (see line below). Which notebook instance type are you using? You may wish to experiment with other instance types.

trainer = prepare_fine_tuning(model_name, tokenizer, train_dataset, freeze_encoder=True)

If you freeze the encoder like this, does it decrease performance? And would this also freeze the input embedding?

Also, you should add trainer.save_model("output_dir"). And these checkpoints do use a lot of space.

The trainer.save_model() call gave me the CUDA error below.

RuntimeError: CUDA error: device-side assert triggered CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect. For debugging consider passing CUDA_LAUNCH_BLOCKING=1.

@Iqra-be-coder

Iqra-be-coder commented Jun 2, 2023

AoA, I am an MSCS student doing my thesis on abstractive text summarization with the PubMed dataset. Could someone please suggest a platform with unlimited-time access or high memory for training models? I tried Kaggle notebooks and Colab, but they do not fulfill my requirements. Please suggest a platform that is free or low budget.

@gbekss

gbekss commented Dec 1, 2023

Hi @jiahao87,

I tried your code with 1000 rows from cnn_dailymail, but on every try I keep getting a very high validation loss that goes down for the first few epochs and then starts growing again. Moreover, the improvement in the results is not as remarkable as expected from a decent fine-tuning.
Regarding the parameters, I tried yours and all combinations that could be compatible with Kaggle Notebooks/Colab Free limitations. The encoder is frozen.
Do you have any advice to reduce the loss and improve the results?
