PyTorch script for fine-tuning the Pegasus Large model
"""Script for fine-tuning Pegasus
Example usage:
# use XSum dataset as example, with first 1000 docs as training data
from datasets import load_dataset
dataset = load_dataset("xsum")
train_texts, train_labels = dataset['train']['document'][:1000], dataset['train']['summary'][:1000]
# use Pegasus Large model as base for fine-tuning
model_name = 'google/pegasus-large'
train_dataset, _, _, tokenizer = prepare_data(model_name, train_texts, train_labels)
trainer = prepare_fine_tuning(model_name, tokenizer, train_dataset)
trainer.train()
Reference:
https://huggingface.co/transformers/master/custom_datasets.html
"""
from transformers import PegasusForConditionalGeneration, PegasusTokenizer, Trainer, TrainingArguments
import torch
class PegasusDataset(torch.utils.data.Dataset):
def __init__(self, encodings, labels):
self.encodings = encodings
self.labels = labels
def __getitem__(self, idx):
item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
item['labels'] = torch.tensor(self.labels['input_ids'][idx]) # torch.tensor(self.labels[idx])
return item
def __len__(self):
return len(self.labels['input_ids']) # len(self.labels)
def prepare_data(model_name,
train_texts, train_labels,
val_texts=None, val_labels=None,
test_texts=None, test_labels=None):
"""
Prepare input data for model fine-tuning
"""
tokenizer = PegasusTokenizer.from_pretrained(model_name)
prepare_val = False if val_texts is None or val_labels is None else True
prepare_test = False if test_texts is None or test_labels is None else True
def tokenize_data(texts, labels):
encodings = tokenizer(texts, truncation=True, padding=True)
decodings = tokenizer(labels, truncation=True, padding=True)
dataset_tokenized = PegasusDataset(encodings, decodings)
return dataset_tokenized
train_dataset = tokenize_data(train_texts, train_labels)
val_dataset = tokenize_data(val_texts, val_labels) if prepare_val else None
test_dataset = tokenize_data(test_texts, test_labels) if prepare_test else None
return train_dataset, val_dataset, test_dataset, tokenizer
def prepare_fine_tuning(model_name, tokenizer, train_dataset, val_dataset=None, freeze_encoder=False, output_dir='./results'):
"""
Prepare configurations and base model for fine-tuning
"""
torch_device = 'cuda' if torch.cuda.is_available() else 'cpu'
model = PegasusForConditionalGeneration.from_pretrained(model_name).to(torch_device)
if freeze_encoder:
for param in model.model.encoder.parameters():
param.requires_grad = False
if val_dataset is not None:
training_args = TrainingArguments(
output_dir=output_dir, # output directory
num_train_epochs=2000, # total number of training epochs
per_device_train_batch_size=1, # batch size per device during training, can increase if memory allows
per_device_eval_batch_size=1, # batch size for evaluation, can increase if memory allows
save_steps=500, # number of updates steps before checkpoint saves
save_total_limit=5, # limit the total amount of checkpoints and deletes the older checkpoints
evaluation_strategy='steps', # evaluation strategy to adopt during training
eval_steps=100, # number of update steps before evaluation
warmup_steps=500, # number of warmup steps for learning rate scheduler
weight_decay=0.01, # strength of weight decay
logging_dir='./logs', # directory for storing logs
logging_steps=10,
)
trainer = Trainer(
model=model, # the instantiated 🤗 Transformers model to be trained
args=training_args, # training arguments, defined above
train_dataset=train_dataset, # training dataset
eval_dataset=val_dataset, # evaluation dataset
tokenizer=tokenizer
)
else:
training_args = TrainingArguments(
output_dir=output_dir, # output directory
num_train_epochs=2000, # total number of training epochs
per_device_train_batch_size=1, # batch size per device during training, can increase if memory allows
save_steps=500, # number of updates steps before checkpoint saves
save_total_limit=5, # limit the total amount of checkpoints and deletes the older checkpoints
warmup_steps=500, # number of warmup steps for learning rate scheduler
weight_decay=0.01, # strength of weight decay
logging_dir='./logs', # directory for storing logs
logging_steps=10,
)
trainer = Trainer(
model=model, # the instantiated 🤗 Transformers model to be trained
args=training_args, # training arguments, defined above
train_dataset=train_dataset, # training dataset
tokenizer=tokenizer
)
return trainer
if __name__=='__main__':
# use XSum dataset as example, with first 1000 docs as training data
from datasets import load_dataset
dataset = load_dataset("xsum")
train_texts, train_labels = dataset['train']['document'][:1000], dataset['train']['summary'][:1000]
# use Pegasus Large model as base for fine-tuning
model_name = 'google/pegasus-large'
train_dataset, _, _, tokenizer = prepare_data(model_name, train_texts, train_labels)
trainer = prepare_fine_tuning(model_name, tokenizer, train_dataset)
trainer.train()
@karimfayed commented Mar 31, 2021

Why does the prepare_fine_tuning function only have a val_dataset parameter and not a test_dataset too?

@jiahao87 commented Apr 1, 2021

Hi @karimfayed, please feel free to add in test_dataset to the function. I chose to leave it out in the end because I did not find the test metrics useful when the dataset is small. Hope this clarifies.

@karrtikiyer commented Apr 1, 2021

Hi @jiahao87, if we are training or fine-tuning with our own custom dataset, is there a specific format the documents should be in? For instance, if we have raw article text that we want to train on, and it contains headings, paragraphs, etc., does it need to be converted or preprocessed? I also see that in the XSum dataset the literal characters \n appear between lines instead of actual newline characters. Can you please advise if you have any information on this?

@gautierdag commented Apr 1, 2021

Just found this gist randomly, really helpful!

Side note: if you switch out the tokenizer for PegasusTokenizerFast, you can speed up tokenization considerably. The output of the Fast version can be slightly different, though, due to the current underlying implementation. I still haven't compared how the two versions fare in accuracy, but if you are fine-tuning anyway that might be a good switch to make early on.

Additionally, you can do:

        encodings = tokenizer(texts, truncation=True, padding=True, return_tensors="pt")
        decodings = tokenizer(labels, truncation=True, padding=True, return_tensors="pt")

And then discard the casting to torch.tensor in the dataset __getitem__.
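A minimal sketch of what that change could look like (my own illustration, not the gist's code; the placeholder texts/labels and the PegasusDatasetPT name are made up):

    import torch
    from transformers import PegasusTokenizerFast

    texts = ["first source document", "second source document"]    # placeholder inputs
    labels = ["first target summary", "second target summary"]     # placeholder summaries

    tokenizer = PegasusTokenizerFast.from_pretrained('google/pegasus-large')
    encodings = tokenizer(texts, truncation=True, padding=True, return_tensors="pt")
    decodings = tokenizer(labels, truncation=True, padding=True, return_tensors="pt")

    class PegasusDatasetPT(torch.utils.data.Dataset):
        """Variant of PegasusDataset that indexes the pre-built tensors directly."""
        def __init__(self, encodings, labels):
            self.encodings = encodings
            self.labels = labels
        def __getitem__(self, idx):
            item = {key: val[idx] for key, val in self.encodings.items()}  # already tensors, no casting needed
            item['labels'] = self.labels['input_ids'][idx]
            return item
        def __len__(self):
            return len(self.labels['input_ids'])

    train_dataset = PegasusDatasetPT(encodings, decodings)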

@jiahao87 commented Apr 1, 2021

Hi @jiahao87, if we are training or fine-tuning with our own custom dataset, is there a specific format the documents should be in? For instance, if we have raw article text that we want to train on, and it contains headings, paragraphs, etc., does it need to be converted or preprocessed? I also see that in the XSum dataset the literal characters \n appear between lines instead of actual newline characters. Can you please advise if you have any information on this?

Hi @karrtikiyer, my sense is that the "difference" between a newline character and '\n' is probably an illusion; both end up being treated the same. You can test this out with the code below:

model_name = 'google/pegasus-large'
tokenizer = PegasusTokenizer.from_pretrained(model_name)

text = """line1
line2"""
tokenizer([text], truncation=True, padding=True)
# output: {'input_ids': [[540, 740, 540, 522, 1]], 'attention_mask': [[1, 1, 1, 1, 1]]}

text = """line1\nline2"""
tokenizer([text], truncation=True, padding=True)
# output: {'input_ids': [[540, 740, 540, 522, 1]], 'attention_mask': [[1, 1, 1, 1, 1]]}
@karrtikiyer commented Apr 1, 2021

Thanks a lot @jiahao87, I will try this out.

@jiahao87 commented Apr 1, 2021

Just found this gist randomly, really helpful!

Side note: if you switch out the tokenizer for PegasusTokenizerFast, you can speed up tokenization considerably. The output of the Fast version can be slightly different, though, due to the current underlying implementation. I still haven't compared how the two versions fare in accuracy, but if you are fine-tuning anyway that might be a good switch to make early on.

Additionally, you can do:

        encodings = tokenizer(texts, truncation=True, padding=True, return_tensors="pt")
        decodings = tokenizer(labels, truncation=True, padding=True, return_tensors="pt")

And then discard the casting to torch.tensor in the dataset __getitem__.

Hi @gautierdag,
Glad that you find the gist useful. Also, thanks a lot for your suggestions! Regarding PegasusTokenizerFast, if you manage to test it out, it would be great if you could share your results here so that everyone can get a sense of how much faster it is and how the accuracy differs.

@karimfayed commented Apr 1, 2021

Hi @karimfayed, please feel free to add in test_dataset to the function. I chose to leave it out in the end because I did not find the test metrics useful when the dataset is small. Hope this clarifies.

Thank you for the clarification, but upon reading the documentation of the Trainer class there wasn't any parameter that would take a testing dataset.
Only train_dataset and eval_dataset are present.
Is there any other way to add testing dataset?

@jiahao87 commented Apr 2, 2021

Hi @karimfayed, please feel free to add in test_dataset to the function. I chose to leave it out in the end because I did not find the test metrics useful when the dataset is small. Hope this clarifies.

Thank you for the clarification, but upon reading the documentation of the Trainer class there wasn't any parameter that would take a testing dataset.
Only train_dataset and eval_dataset are present.
Is there any other way to add testing dataset?

Hi @karimfayed, you may wish to add this line after the model is fine-tuned:

trainer.evaluate(test_dataset)
@karimfayed commented Apr 4, 2021

Hi @karimfayed, please feel free to add in test_dataset to the function. I chose to leave it out in the end because I did not find the test metrics useful when the dataset is small. Hope this clarifies.

Thank you for the clarification, but upon reading the documentation of the Trainer class there wasn't any parameter that would take a testing dataset.
Only train_dataset and eval_dataset are present.
Is there any other way to add testing dataset?

Hi @karimfayed, you may wish to add this line after the model is fine-tuned:

trainer.evaluate(test_dataset)

Thank you for your help!

@MariaMegalli commented Apr 4, 2021

Hi @jiahao87, I would like to ask whether the training loss is a percentage or whether it has other units.

@jiahao87 commented Apr 5, 2021

Hi @jiahao87, I would like to ask whether the training loss is a percentage or whether it has other units.

Hi @MariaMegalli,

If we look at the source code of Hugging Face, we will notice that the loss is actually cross entropy loss. In the documentation, the loss is stated as language modeling loss, which is typically reported as perplexity. But if we look up the relationship between cross entropy and perplexity, we will realize that one is just the exponential of the other. Hope this helps.

if labels is not None:
    loss_fct = CrossEntropyLoss()
    masked_lm_loss = loss_fct(lm_logits.view(-1, self.config.vocab_size), labels.view(-1))
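Since perplexity is just the exponential of cross entropy, converting between the two is a one-liner; a quick illustration (the loss value is made up):

    import math

    cross_entropy_loss = 2.1                    # example value as it might appear in the Trainer logs
    perplexity = math.exp(cross_entropy_loss)   # perplexity = e^(cross entropy)
    print(round(perplexity, 2))                 # 8.17
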
@MariaMegalli commented Apr 5, 2021

Hi @jiahao87, I would like to ask whether the training loss is a percentage or whether it has other units.

Hi @MariaMegalli,

If we look at the source code of Hugging Face, we will notice that the loss is actually cross entropy loss. In the documentation, the loss is stated as language modeling loss, which is typically reported as perplexity. But if we look up the relationship between cross entropy and perplexity, we will realize that one is just the exponential of the other. Hope this helps.

if labels is not None:
    loss_fct = CrossEntropyLoss()
    masked_lm_loss = loss_fct(lm_logits.view(-1, self.config.vocab_size), labels.view(-1))

@jiahao87, so basically a lower training loss here is better, because if it's cross entropy loss (which is what's actually computed) it means lower uncertainty, and if it's perplexity it means less randomness, with perplexity = e^(cross entropy loss); am I getting it right?
Another question: how can I evaluate the training or the results using ROUGE scores, as was done in the PEGASUS paper?

@karimfayed commented Apr 6, 2021

Hi @karimfayed, please feel free to add in test_dataset to the function. I chose to leave it out in the end because I did not find the test metrics useful when the dataset is small. Hope this clarifies.

Thank you for the clarification, but upon reading the documentation of the Trainer class there wasn't any parameter that would take a testing dataset.
Only train_dataset and eval_dataset are present.
Is there any other way to add testing dataset?

Hi @karimfayed, you may wish to add this line after the model is fine-tuned:

trainer.evaluate(test_dataset)

Hi @jiahao87, I tried updating the code to add validation and testing datasets, but I get the following error:
The updated script: https://drive.google.com/file/d/1Q1nYXFvPl6QsBbxd-AK4MmLJMEdc3D_N/view?usp=sharing
[screenshot of the error]

@jiahao87 commented Apr 6, 2021

Hi @jiahao87, I would like to ask whether the training loss is a percentage or whether it has other units.

Hi @MariaMegalli,
If we look at the source code of Hugging Face, we will notice that the loss is actually cross entropy loss. In the documentation, the loss is stated as language modeling loss, which is typically reported as perplexity. But if we look up the relationship between cross entropy and perplexity, we will realize that one is just the exponential of the other. Hope this helps.

if labels is not None:
    loss_fct = CrossEntropyLoss()
    masked_lm_loss = loss_fct(lm_logits.view(-1, self.config.vocab_size), labels.view(-1))

@jiahao87, so basically a lower training loss here is better, because if it's cross entropy loss (which is what's actually computed) it means lower uncertainty, and if it's perplexity it means less randomness, with perplexity = e^(cross entropy loss); am I getting it right?
Another question: how can I evaluate the training or the results using ROUGE scores, as was done in the PEGASUS paper?

Hi @MariaMegalli, yes, you are right. To get ROUGE, you can use the compute_metrics parameter in Hugging Face's Trainer. For more details, please refer to the documentation here.
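For reference, a hedged sketch of what such a compute_metrics function could look like; it assumes a tokenizer in scope and generation-based evaluation (e.g. Seq2SeqTrainer with predict_with_generate=True), so that the predictions are token ids rather than raw logits, and it uses the rouge metric from the datasets library:

    import numpy as np
    from datasets import load_metric

    rouge = load_metric("rouge")

    def compute_metrics(eval_pred):
        predictions, labels = eval_pred
        decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)
        # replace any -100 placeholders (ignored positions) with the pad token id before decoding
        labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
        decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)
        scores = rouge.compute(predictions=decoded_preds, references=decoded_labels, use_stemmer=True)
        return {key: value.mid.fmeasure * 100 for key, value in scores.items()}

The function would then be passed as compute_metrics=compute_metrics when constructing the trainer.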

@jiahao87 commented Apr 6, 2021

Hi @karimfayed, please feel free to add in test_dataset to the function. I chose to leave it out in the end because I did not find the test metrics useful when the dataset is small. Hope this clarifies.

Thank you for the clarification, but upon reading the documentation of the Trainer class there wasn't any parameter that would take a testing dataset.
Only train_dataset and eval_dataset are present.
Is there any other way to add testing dataset?

Hi @karimfayed, you may wish to add this line after the model is fine-tuned:

trainer.evaluate(test_dataset)

Hi @jiahao87, I tried updating the code to add validation and testing datasets, but I get the following error:
The updated script: https://drive.google.com/file/d/1Q1nYXFvPl6QsBbxd-AK4MmLJMEdc3D_N/view?usp=sharing
[screenshot of the error]

Hi @karimfayed, please modify the following lines:

  • line 124 to get the test dataset
  • after line 126, add in trainer.evaluate(test_dataset)

Could you also delete one of your earlier comments, since it's duplicated? Thank you.

@karimfayed commented Apr 6, 2021

  • line 124 to get the test dataset

Would you please elaborate, as I don't understand?
Note: I deleted the duplicated comment.

@jiahao87 commented Apr 7, 2021

Would you please elaborate, as I don't understand?
Note: I deleted the duplicated comment.

Hi @karimfayed, please refer to the following example code:

if __name__=='__main__':
  # use XSum dataset as example, with first 1000 docs as training data
  from datasets import load_dataset
  dataset = load_dataset("xsum")
  train_texts, train_labels = dataset['train']['document'][:1000], dataset['train']['summary'][:1000]
  test_texts, test_labels = dataset['test']['document'][:1000], dataset['test']['summary'][:1000]
  
  # use Pegasus Large model as base for fine-tuning
  model_name = 'google/pegasus-large'
  train_dataset, _, test_dataset, tokenizer = prepare_data(model_name, train_texts, train_labels, test_texts=test_texts, test_labels=test_labels)
  trainer = prepare_fine_tuning(model_name, tokenizer, train_dataset)
  trainer.train()
  trainer.evaluate(test_dataset)
@MariaMegalli commented Apr 7, 2021

Hi @jiahao87, I used the script to fine-tune the model on a dataset of mine, but how can I now use the fine-tuned model to generate an output for a given input?

@jiahao87 commented Apr 8, 2021

Hi @jiahao87, I used the script to fine-tune the model on a dataset of mine, but how can I now use the fine-tuned model to generate an output for a given input?

Hi @MariaMegalli, please refer to this link on how to perform inference. You will also need to modify the inference script to load the saved fine-tuned model.
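For anyone else looking for this, a minimal inference sketch along those lines (the checkpoint path is hypothetical; adjust it to wherever your fine-tuned weights were saved):

    import torch
    from transformers import PegasusForConditionalGeneration, PegasusTokenizer

    torch_device = 'cuda' if torch.cuda.is_available() else 'cpu'
    checkpoint_dir = './results/checkpoint-2000'   # hypothetical path to a saved checkpoint

    tokenizer = PegasusTokenizer.from_pretrained('google/pegasus-large')
    model = PegasusForConditionalGeneration.from_pretrained(checkpoint_dir).to(torch_device)

    src_text = ["Replace this with the document you want to summarize."]
    batch = tokenizer(src_text, truncation=True, padding='longest', return_tensors='pt').to(torch_device)
    generated = model.generate(**batch)
    print(tokenizer.batch_decode(generated, skip_special_tokens=True))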

@megadaabiss commented May 2, 2021

Hi @jiahao87,
Thank for this script it's very helpful but I have 2 questions:

  1. Which pre-trained model are we fine-tuning with this script (C4 , HugeNews or Mixed & Stochastic) ?
  2. Do I have to upload my model to hugging face so that I could use this code ? if yes, which file from the checkpoint folder should I upload?
@jiahao87 commented May 4, 2021

Hi @jiahao87,
Thank for this script it's very helpful but I have 2 questions:

1. Which pre-trained model are we fine-tuning with this script (C4 , HugeNews or Mixed & Stochastic) ?

2. Do I have to upload my model to hugging face so that I could use [this](https://huggingface.co/transformers/model_doc/pegasus.html#usage-example) code ? if yes, which file from the checkpoint folder should I upload?

Hi @megadaabiss,

  1. The model we are using is pegasus-large, which according to Hugging Face's documentation, should be Mixed & Stochastic.
  2. No, you do not have to upload your fine-tuned model to Hugging Face. Just point the code to the checkpoint folder. E.g. below.
model = PegasusForConditionalGeneration.from_pretrained("results/checkpoint-2000").to(torch_device)
@megadaabiss commented May 4, 2021

Hi @jiahao87,
Thank for this script it's very helpful but I have 2 questions:

1. Which pre-trained model are we fine-tuning with this script (C4 , HugeNews or Mixed & Stochastic) ?

2. Do I have to upload my model to hugging face so that I could use [this](https://huggingface.co/transformers/model_doc/pegasus.html#usage-example) code ? if yes, which file from the checkpoint folder should I upload?

Hi @megadaabiss,

  1. The model we are using is pegasus-large, which according to Hugging Face's documentation, should be Mixed & Stochastic.
  2. No, you do not have to upload your fine-tuned model to Hugging Face. Just point the code to the checkpoint folder. E.g. below.
model = PegasusForConditionalGeneration.from_pretrained("results/checkpoint-2000").to(torch_device)

I did as you suggested but unfortunately this error keeps showing up.
[screenshot of the error]

@jiahao87 commented May 6, 2021

Hi @jiahao87,
Thank for this script it's very helpful but I have 2 questions:

1. Which pre-trained model are we fine-tuning with this script (C4 , HugeNews or Mixed & Stochastic) ?

2. Do I have to upload my model to hugging face so that I could use [this](https://huggingface.co/transformers/model_doc/pegasus.html#usage-example) code ? if yes, which file from the checkpoint folder should I upload?

Hi @megadaabiss,

  1. The model we are using is pegasus-large, which according to Hugging Face's documentation, should be Mixed & Stochastic.
  2. No, you do not have to upload your fine-tuned model to Hugging Face. Just point the code to the checkpoint folder. E.g. below.
model = PegasusForConditionalGeneration.from_pretrained("results/checkpoint-2000").to(torch_device)

I did as you suggested but unfortunately this error keeps showing up.
[screenshot of the error]

Hi @megadaabiss, hope you've managed to figure it out. It's actually very simple: the only line of code that needs to be modified is the one I gave in the example above. The rest remains unchanged.

@karimfayed commented May 6, 2021

Hi @jiahao87,
Thank for this script it's very helpful but I have 2 questions:

1. Which pre-trained model are we fine-tuning with this script (C4 , HugeNews or Mixed & Stochastic) ?

2. Do I have to upload my model to hugging face so that I could use [this](https://huggingface.co/transformers/model_doc/pegasus.html#usage-example) code ? if yes, which file from the checkpoint folder should I upload?

Hi @megadaabiss,

  1. The model we are using is pegasus-large, which according to Hugging Face's documentation, should be Mixed & Stochastic.
  2. No, you do not have to upload your fine-tuned model to Hugging Face. Just point the code to the checkpoint folder. E.g. below.
model = PegasusForConditionalGeneration.from_pretrained("results/checkpoint-2000").to(torch_device)

I did as you suggested but unfortunately this error keeps showing up.
[screenshot of the error]

Hi @megadaabiss, hope you've managed to figure it out. It's actually very simple: the only line of code that needs to be modified is the one I gave in the example above. The rest remains unchanged.

I have the same problem even though I uploaded my model to Hugging Face. I think the problem may be that the tokenizer files are not present in the checkpoint folders?
Model: Karimfayed/pegasus_SAMSum

Note: I uploaded all files except the optimizer.pt

@MartinEmilEshack commented May 8, 2021

Hello @jiahao87
Why is it that in the results folder there are no tokenizer files in each checkpoint?

@jiahao87 commented May 11, 2021

Hello @jiahao87
Why is it that in the results folder there are no tokenizer files in each checkpoint?

Hi @MartinEmilEshack, thank you for pointing that out. I have reviewed and edited the code accordingly so that the tokenizer will be saved along in the checkpoint folder. This should then make it more convenient to use the fine-tuned model.

Hi @megadaabiss, your previous approach for the model inference should now work with the new code since the tokenizer will now be saved in the checkpoint folder. Hope this helps.

@seregadgl20-oss commented May 18, 2021

Could you tell me how to fine-tune the model for another language?

@karimfayed commented Jun 3, 2021

Hi @jiahao87,
I want to thank you for this script and your replies to all the previous questions; they have been a great help.
I have been trying for a long time to fine-tune PEGASUS on a specific dataset, but unfortunately the output so far is not up to standard.

So I was wondering whether changing the warmup steps and weight decay values may provide better results, and if so, by what rate should I change them?

@jiahao87 commented Jun 4, 2021

Hi @jiahao87,
I want to thank you for this script and your replies to all the previous questions; they have been a great help.
I have been trying for a long time to fine-tune PEGASUS on a specific dataset, but unfortunately the output so far is not up to standard.

So I was wondering whether changing the warmup steps and weight decay values may provide better results, and if so, by what rate should I change them?

Hi @karimfayed, glad that the script has been helpful.

Would like to help you where possible, but unfortunately your question is really better suited to StackExchange or Cross Validated. This comments section is more for script-related issues, whereas your question is much more general, about getting the desired model performance. You would also benefit from the support of a larger community if you post on those forums.

A minor suggestion from me: you would be doing yourself a favour by elaborating your question and being as helpfully specific as possible. Provide more context (e.g., what dataset are you using? What are the current model training settings? If possible, specify more clearly what performance issue you think your model is facing). Your assumption is that changing warmup steps and weight decay may provide better results, but unless you have strong reason to believe so (in which case you would already have experimented with training under different weight decay or warmup step values), boxing in on those two parameters in your question is unlikely to help you solve your model performance issue.

All the best to you.

@pablo14 commented Jun 5, 2021

Many thanks for this script :)
Do you think it is possible to train just the 'language model part' of Pegasus, skipping the labeled data? This way the model would become more familiar with a custom corpus, without the need to provide summaries.

@jiahao87 commented Jun 6, 2021

Many thanks for this script :)
Do you think it is possible to train just the 'language model part' of Pegasus, skipping the labeled data? This way the model would become more familiar with a custom corpus, without the need to provide summaries.

Hi @pablo14, as with the previous question, this is not really script-related, so I shall not comment too much here. All I can say is that the next best option without providing labels is probably to do something similar to what the PEGASUS authors did, i.e., use the ROUGE-F1 score to derive the "labels" from your custom corpus. However, this will probably only help with extractive summarization and not abstractive summarization.
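Purely as an illustration of that idea (my own rough sketch, not something from the gist or the paper's exact procedure), one could score each sentence by ROUGE1-F1 against the rest of the document and keep the top-scoring ones as a pseudo-summary:

    from rouge_score import rouge_scorer

    def pseudo_summary(sentences, top_k=3):
        """Return the top_k sentences that best 'cover' the document by ROUGE1-F1."""
        scorer = rouge_scorer.RougeScorer(['rouge1'], use_stemmer=True)
        scored = []
        for i, sent in enumerate(sentences):
            rest = ' '.join(s for j, s in enumerate(sentences) if j != i)
            scored.append((scorer.score(rest, sent)['rouge1'].fmeasure, i))
        keep = sorted(i for _, i in sorted(scored, reverse=True)[:top_k])
        return ' '.join(sentences[i] for i in keep)

    doc_sentences = ["First sentence of the document.",        # placeholder document
                     "Second sentence with more detail.",
                     "Third sentence that repeats the key ideas.",
                     "A final unrelated remark."]
    print(pseudo_summary(doc_sentences, top_k=2))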

@superlyc commented Jun 9, 2021

Hi, I am experimenting with your script. I am quite new to the Hugging Face Trainer. Would you help answer one question? I don't understand why I get OOM on CUDA if I use the whole dataset instead of just [:1000] as in the script. Without changing any other parameters (especially batch_size), why does training on more data cause OOM? Thank you in advance.

@jiahao87 commented Jun 10, 2021

Hi, I am experimenting with your script. I am quite new to the Hugging Face Trainer. Would you help answer one question? I don't understand why I get OOM on CUDA if I use the whole dataset instead of just [:1000] as in the script. Without changing any other parameters (especially batch_size), why does training on more data cause OOM? Thank you in advance.

Hi @superlyc, the memory issue is probably not because of Hugging Face's Trainer but because of our custom PegasusDataset.

Currently, all the encodings are loaded into our PegasusDataset. Compare that against the way the Dataset class is written here. Hence, you may need to rewrite this portion of the code and the way the data is loaded in order to reduce memory usage. Hope this clarifies.

class PegasusDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
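For illustration, a rough sketch (an assumption on my part, not the gist's code) of a dataset that keeps only the raw strings in memory and tokenizes lazily in __getitem__; the class name and the max-length values are made up:

    import torch

    class LazyPegasusDataset(torch.utils.data.Dataset):
        def __init__(self, texts, labels, tokenizer, max_input_len=512, max_output_len=64):
            self.texts = texts
            self.labels = labels
            self.tokenizer = tokenizer
            self.max_input_len = max_input_len
            self.max_output_len = max_output_len

        def __getitem__(self, idx):
            # tokenize one example at a time instead of holding all encodings in memory
            enc = self.tokenizer(self.texts[idx], truncation=True, padding='max_length',
                                 max_length=self.max_input_len, return_tensors='pt')
            dec = self.tokenizer(self.labels[idx], truncation=True, padding='max_length',
                                 max_length=self.max_output_len, return_tensors='pt')
            item = {key: val.squeeze(0) for key, val in enc.items()}
            item['labels'] = dec['input_ids'].squeeze(0)
            return item

        def __len__(self):
            return len(self.texts)
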
@superlyc commented Jun 10, 2021

Hi, I am experimenting with your script. I am quite new to the Hugging Face Trainer. Would you help answer one question? I don't understand why I get OOM on CUDA if I use the whole dataset instead of just [:1000] as in the script. Without changing any other parameters (especially batch_size), why does training on more data cause OOM? Thank you in advance.

Hi @superlyc, the memory issue is probably not because of Hugging Face's Trainer but because of our custom PegasusDataset.

Currently, all the encodings are loaded into our PegasusDataset. Compare that against the way the Dataset class is written here. Hence, you may need to rewrite this portion of the code and the way the data is loaded in order to reduce memory usage. Hope this clarifies.

class PegasusDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings

Thank you

@rishav2416 commented Jun 23, 2021

Hi @jiahao87, I have been trying to run your script in a notebook instance on AWS SageMaker, which has 8 GPUs of 12 GB each. Every time I try to run your script with absolutely no changes, I get the following error: RuntimeError: CUDA out of memory. Tried to allocate 64.00 MiB (GPU 0; 11.17 GiB total capacity; 10.49 GiB already allocated; 61.44 MiB free; 10.65 GiB reserved in total by PyTorch)
Could you please help?

@slvcsl commented Jun 24, 2021

Hi @jiahao87, I have a couple of questions:

  1. What is the difference between this script and the summarization example (transformers/examples/seq2seq/run_summarization.py)? Is the example supposed to work with pegasus-large?
  2. How much GPU RAM is Pegasus large expected to use? I am currently trying to fine-tune the model (using the example, not this script, but I'll try it out) on big_patent on two RTX2080ti (11GB each) and get the OOM error even with input/output max_length = 10 and batch size = 1. Is this expected?

RuntimeError: CUDA out of memory. Tried to allocate 16.00 MiB (GPU 0; 10.76 GiB total capacity; 9.58 GiB already allocated; 2.62 MiB free; 9.77 GiB reserved in total by PyTorch)

@jiahao87 commented Jun 26, 2021

Hi @jiahao87, I have been trying to run your script in a notebook instance on AWS SageMaker, which has 8 GPUs of 12 GB each. Every time I try to run your script with absolutely no changes, I get the following error: RuntimeError: CUDA out of memory. Tried to allocate 64.00 MiB (GPU 0; 11.17 GiB total capacity; 10.49 GiB already allocated; 61.44 MiB free; 10.65 GiB reserved in total by PyTorch)
Could you please help?

Hi @rishav2416, fine-tuning the full Pegasus large model is indeed resource intensive. I was only able to run the fine-tuning on Colab (GPU with 12GB RAM) when I freeze the encoder (see line below). Which notebook instance type are you using? You may wish to experiment with other instance types.

trainer = prepare_fine_tuning(model_name, tokenizer, train_dataset, freeze_encoder=True)
@jiahao87 commented Jun 26, 2021

Hi @jiahao87, I have a couple of questions:

1. What is the difference between this script and the summarization example (transformers/examples/seq2seq/run_summarization.py)? Is the example supposed to work with pegasus-large?

2. How much GPU RAM is Pegasus large expected to use? I am currently trying to fine-tune the model (using the example, not this script, but I'll try it out) on big_patent on two RTX2080ti (11GB each) and get the OOM error even with input/output max_length = 10 and batch size = 1. Is this expected?

RuntimeError: CUDA out of memory. Tried to allocate 16.00 MiB (GPU 0; 10.76 GiB total capacity; 9.58 GiB already allocated; 2.62 MiB free; 9.77 GiB reserved in total by PyTorch)

Hi @slvcsl, the main difference seems to be that the summarization example uses the Seq2SeqTrainer class, while this script uses the Trainer class. As pointed out here, the difference between these 2 classes is that Seq2SeqTrainer is a subclass of Trainer. You can read the link provided for details.

As for the memory usage, you may wish to refer to the above replies. Unfortunately, I do not have a specific number for the amount of GPU RAM that Pegasus large is expected to use. If anyone else reading this comment is able to chip in, please do so. Hope the earlier reply was able to help you to some extent.

@karimfayed commented Jul 1, 2021

Hi @jiahao87, I have a couple of questions:

1. What is the difference between this script and the summarization example (transformers/examples/seq2seq/run_summarization.py)? Is the example supposed to work with pegasus-large?

2. How much GPU RAM is Pegasus large expected to use? I am currently trying to fine-tune the model (using the example, not this script, but I'll try it out) on big_patent on two RTX2080ti (11GB each) and get the OOM error even with input/output max_length = 10 and batch size = 1. Is this expected?

RuntimeError: CUDA out of memory. Tried to allocate 16.00 MiB (GPU 0; 10.76 GiB total capacity; 9.58 GiB already allocated; 2.62 MiB free; 9.77 GiB reserved in total by PyTorch)

Hi @slvcsl, the main difference seems to be that the summarization example uses the Seq2SeqTrainer class, while this script uses the Trainer class. As pointed out here, the difference between these 2 classes is that Seq2SeqTrainer is a subclass of Trainer. You can read the link provided for details.

As for the memory usage, you may wish to refer to the above replies. Unfortunately, I do not have a specific number for the amount of GPU RAM that Pegasus large is expected to use. If anyone else reading this comment is able to chip in, please do so. Hope the earlier reply was able to help you to some extent.

I had this problem early on and I was told that it is recommended to have 16 GB or more; for further help, this is the issue, which also has other recommendations.

@MariaMegalli commented Jul 2, 2021

Hello @jiahao87, can you please help me and explain why the number of steps is sometimes double the number of epochs and sometimes the same? For example:
for batch size = 1, training dataset = 1000 and epochs = 2000, the steps = 4000,
while
for batch size = 2, training dataset = 1000 and epochs = 2000, the steps = 2000.
Can you also explain steps and their role, as these scenarios left me confused?

@jiahao87 commented Jul 2, 2021

Hi @jiahao87, I have a couple of questions:

1. What is the difference between this script and the summarization example (transformers/examples/seq2seq/run_summarization.py)? Is the example supposed to work with pegasus-large?

2. How much GPU RAM is Pegasus large expected to use? I am currently trying to fine-tune the model (using the example, not this script, but I'll try it out) on big_patent on two RTX2080ti (11GB each) and get the OOM error even with input/output max_length = 10 and batch size = 1. Is this expected?

RuntimeError: CUDA out of memory. Tried to allocate 16.00 MiB (GPU 0; 10.76 GiB total capacity; 9.58 GiB already allocated; 2.62 MiB free; 9.77 GiB reserved in total by PyTorch)

Hi @slvcsl, the main difference seems to be that the summarization example uses the Seq2SeqTrainer class, while this script uses the Trainer class. As pointed out here, the difference between these 2 classes is that Seq2SeqTrainer is a subclass of Trainer. You can read the link provided for details.
As for the memory usage, you may wish to refer to the above replies. Unfortunately, I do not have a specific number for the amount of GPU RAM that Pegasus large is expected to use. If anyone else reading this comment is able to chip in, please do so. Hope the earlier reply was able to help you to some extent.

I had this problem early on and I was told that it is recommended to have 16 GB or more; for further help, this is the issue, which also has other recommendations.

@karimfayed, thank you for the link. That was useful.

@jiahao87 commented Jul 2, 2021

Hello @jiahao87, can you please help me and explain why the number of steps is sometimes double the number of epochs and sometimes the same? For example:
for batch size = 1, training dataset = 1000 and epochs = 2000, the steps = 4000,
while
for batch size = 2, training dataset = 1000 and epochs = 2000, the steps = 2000.
Can you also explain steps and their role, as these scenarios left me confused?

@MariaMegalli, thank you for pointing this issue out. I have edited the code below; after running the new code, the number of steps should now make sense. Let me know if you still encounter issues. Thank you.

    def __len__(self):
        return len(self.labels['input_ids'])
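To spell out the arithmetic after that fix (assuming the script's 1000 training examples, a single device and no gradient accumulation):

    import math

    steps_per_epoch = math.ceil(1000 / 1)    # ceil(len(train_dataset) / per_device_train_batch_size)
    total_steps = steps_per_epoch * 2000     # 2,000,000 for num_train_epochs=2000
    print(total_steps)

    # Before the fix, __len__ returned len(self.labels), i.e. the number of keys in the tokenizer
    # output (2: 'input_ids' and 'attention_mask'), which is why the reported totals were
    # 4000 steps for batch size 1 and 2000 steps for batch size 2 over 2000 epochs.
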
@karimfayed commented Jul 6, 2021

@jiahao87, is there a way to convert the model after fine-tuning from PyTorch to TensorFlow so I can use it in a JavaScript backend?

@MariaMegalli commented Jul 11, 2021

hi @jiahao87, what is the default maximum input and output length in this script?

@jiahao87 commented Jul 14, 2021

@jiahao87, is there a way to convert the model after fine-tuning from PyTorch to TensorFlow so I can use it in a JavaScript backend?

@karimfayed, try ONNX

@jiahao87 commented Jul 14, 2021

hi @jiahao87, what is the default maximum input and output length in this script?

Hi @MariaMegalli, please see Hugging Face's config here.
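For anyone who wants to check programmatically, one way to inspect the relevant defaults (assuming max_position_embeddings bounds the encoder input and max_length is the default generation length):

    from transformers import PegasusConfig, PegasusTokenizer

    config = PegasusConfig.from_pretrained('google/pegasus-large')
    tokenizer = PegasusTokenizer.from_pretrained('google/pegasus-large')
    print(config.max_position_embeddings)   # maximum input length supported by the position embeddings
    print(config.max_length)                # default maximum length used by generate()
    print(tokenizer.model_max_length)       # length that truncation=True truncates inputs to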
