Created February 22, 2020 13:41
First off, thanks so much for sharing this; it definitely helped me get a lot further along!
I was hoping to use my own tokenizer though, so I'm guessing the only way would be to write the tokenizer, then just replace the LineByLineTextDataset() call in load_and_cache_examples() with my custom dataset, yes?

I think you mean a custom “Dataset Loader”, because the code above already uses a custom tokenizer.

jbmaxwell commented Feb 24, 2020

I see what you mean—a custom parametrization of the BPE tokenizer. But my use case is very specialized (music), so I actually want a very specific tokenization. But yes, I may be able to specify it the way you've done here. I'll think more about that. Thanks!
ps - I did get it running with the given tokenizer, so that's a huge step forward!

You’re welcome :)
Also, do share this gist on your network

jbmaxwell commented Feb 24, 2020

I'm struggling with trying to use a fixed vocabulary. My vocab.txt (for music) is small, and I want to avoid wordpieces, so that I don't have to predict multiple, adjacent pieces/tokens to get a "complete" prediction/"word". So all I want to do is load a vocab.txt and tokenize. Super simple, but I can't find a way to do that.
(If I can't find a way to do this, I'll just settle with the BPE tokenizer and figure out a way around the problems when I deploy it.)
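
For readers with the same need, here is a minimal sketch of the fixed-vocabulary idea described above: read a vocab.txt with one token per line and map whitespace-separated tokens to ids, with no sub-word splitting. It is plain Python rather than the tokenizers API; the function names, the `[UNK]` entry, and the `NOTE_60`-style music tokens in the usage note are assumptions made for illustration only.

```python
from pathlib import Path
from typing import Dict, List


def load_vocab(vocab_path: str) -> Dict[str, int]:
    """Read one token per line; ids follow the order of the non-empty lines."""
    lines = Path(vocab_path).read_text(encoding="utf-8").splitlines()
    tokens = [line.strip() for line in lines if line.strip()]
    return {token: idx for idx, token in enumerate(tokens)}


def encode(text: str, vocab: Dict[str, int], unk_token: str = "[UNK]") -> List[int]:
    """Split on whitespace and look up each whole token; unknown tokens map to unk_token."""
    unk_id = vocab[unk_token]  # assumes the vocab file contains the unknown token
    return [vocab.get(token, unk_id) for token in text.split()]


def decode(ids: List[int], vocab: Dict[str, int]) -> str:
    """Map ids back to tokens and join them with single spaces."""
    id_to_token = {idx: token for token, idx in vocab.items()}
    return " ".join(id_to_token[i] for i in ids)
```

With a vocab.txt containing `[UNK]`, `NOTE_60`, `NOTE_62` and `REST` on separate lines, `encode("NOTE_60 REST", vocab)` would yield one id per event, so each prediction corresponds to one complete "word". The tokenizers library also ships a word-level model that may cover this case, but its setup is not shown here.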

Curious; is there a simple way to load weights and continue training?

mrm8488 commented Feb 28, 2020

Great work! I have executed the Colab you provided
and I got this error:

@mrm8488 should be fixed now thanks to huggingface/blog#8

008karan commented Mar 6, 2020

I want to train ALBERT instead. What changes would that require?

@julien-c Training an ALBERT-like model requires generating pre-training data. Is that data generated during training itself?

Nix07 commented Mar 17, 2020

@jbmaxwell You can try other tokenizers like CharBPETokenizer, SentencePieceBPETokenizer, etc to check if that works for you.

To load weights and continue training, you can use the model_name_or_path parameter and point it to the latest checkpoint.
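
A small sketch of how the "latest checkpoint" could be located before passing it to --model_name_or_path. It assumes that checkpoints are saved as sub-directories named checkpoint-<step> inside the output directory; the helper names `latest_checkpoint` and `resume_args` are hypothetical and not part of transformers.

```python
import os
import re
from typing import List, Optional

# Assumed naming scheme for saved checkpoints: checkpoint-<global step>
CHECKPOINT_RE = re.compile(r"^checkpoint-(\d+)$")


def latest_checkpoint(output_dir: str) -> Optional[str]:
    """Return the checkpoint-<step> sub-directory with the highest step, or None."""
    best_step, best_path = -1, None
    for name in os.listdir(output_dir):
        match = CHECKPOINT_RE.match(name)
        path = os.path.join(output_dir, name)
        # Compare steps numerically so that checkpoint-10000 beats checkpoint-2000
        if match and os.path.isdir(path) and int(match.group(1)) > best_step:
            best_step, best_path = int(match.group(1)), path
    return best_path


def resume_args(output_dir: str) -> List[str]:
    """Extra command-line arguments for resuming, or an empty list to start fresh."""
    checkpoint = latest_checkpoint(output_dir)
    return ["--model_name_or_path", checkpoint] if checkpoint else []
```

The returned list can be appended to the rest of the training command. Note that with --save_total_limit set, older checkpoints are deleted, so only the most recent ones will be found.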

How do I have to preprocess the corpus when I want to train my own LM for RoBERTa? I think it must be one sentence per line. But does it need empty lines between documents? Is it OK to shuffle the text line by line?

I get the following error

After running this code:

python /content/transformers/examples/language-modeling/
--model_type roberta
--train_data_file {1}
--eval_data_file {2}
--config_name /content/models/smallBERTa
--tokenizer_name /content/models/smallBERTa
--block_size 256
--learning_rate 1e-4
--num_train_epochs 5
--save_total_limit 2
--save_steps 2000
--logging_steps 500
--per_gpu_eval_batch_size 32
--per_gpu_train_batch_size 32
--seed 42
'''.format(weights_dir, train_path, eval_path)

Please let me know how to fix this error

I hope this isn't a silly question because I'm very new to NLP and AI in general. I find the advantages of a bytepiece encoder very enticing, and I am hoping to continue pretraining DistilBERT on a custom corpus.

Is it possible to:

  1. Train that bytepiece encoder on the dataset
  2. Load it in with DistilBERT (from HF's checkpoint)
  3. Continue pretraining DistilBERT with the bytepiece tokenizer on a custom corpus?

NianzuMa commented Aug 2, 2020

Hi, I have a question regarding the training file for the tokenizer.
At the beginning of the tutorial, it says:

To the Tokenizer:
LM data in a directory containing all samples in separate *.txt files.

There is also this code snippet:

for i, row in enumerate(tqdm(data.to_list())):
  file_name = os.path.join(txt_files_dir, str(i) + '.txt')
  try:
    # write each row (one sentence) to its own .txt file
    with open(file_name, 'w') as f:
      f.write(row)
  except Exception as e:  # catch exceptions (e.g. empty rows)
    print(row, e)

What this does is to separate each sentence into a single file, rather than put 200_000 sentences line by line in a single file.

In contrast, in this tutorial:

the file oscar.eo.txt contains all sentences line by line in a single file.

I tried to search the documentation but have no clue which way is correct.

Is it necessary to split each sentence into one file, which results in 200_000 files?

Thank you for your answer.
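
As far as I know, the train() method of the tokenizers classes accepts either a single path or a list of paths, so both layouts should work and one file per sentence is probably not required. If a single line-by-line file is preferred, the per-sentence files can be merged with a short helper like the sketch below; `merge_txt_files` is a hypothetical name, not a function from the gist or from any library.

```python
from pathlib import Path


def merge_txt_files(txt_files_dir: str, merged_path: str) -> int:
    """Copy every non-empty line from the *.txt files into one file; return the line count."""
    target = Path(merged_path).resolve()
    count = 0
    with open(merged_path, "w", encoding="utf-8") as out:
        for txt_file in sorted(Path(txt_files_dir).glob("*.txt")):
            # Skip the output file itself if it lives in the same directory
            if txt_file.resolve() == target:
                continue
            for line in txt_file.read_text(encoding="utf-8").splitlines():
                if line.strip():
                    out.write(line.strip() + "\n")
                    count += 1
    return count
```

Files are merged in sorted filename order, which is alphabetical rather than numeric (so 10.txt comes before 2.txt); that should not matter if the lines are shuffled later anyway.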

I'm kinda new to this, but while playing around a bit with the code I noticed that the function call "" should be changed to "tokenizer.save_model()".

Let me know whether my hunch is correct. :)

I get this error in line 20

@carlstrath In recent versions of tokenizers I think you can just call .save(path) (cc @n1t0)

carlstrath commented Jan 29, 2021

Sorry to bother everyone again. I am now getting this error in ln27

python3: can't open file '/content/transformers/examples/': [Errno 2] No such file or directory

Hi @carlstrath,
(Sorry I’ve been a bit busy lately so wasn’t active).

This gist was made for specific versions of the transformers and tokenizers libraries. Can you try using it with the versions mentioned at the start?

Meanwhile, I guess it’s about time now that I update this gist to reflect changes in the dependencies.

aditya-malte commented Jan 29, 2021

Also, when cloning from git, please ensure you use this GitHub repo instead. (The gist is compatible with that version of huggingface; the newer one probably doesn’t contain the required run_language_modeling file.)

sv-v5 commented Sep 11, 2021

I ran into issues while following the directions from the 2020 blog post. This gist was more helpful. Thank you 👍

For anyone interested in running through training with an updated transformers: I have a write-up here of a complete example on training from scratch using transformers 4.10 and the updated script committed on Jun 25, 2021.

Python package versions are locked with pipenv so the example remains reproducible. Tested on Linux and Windows on GPU and CPU.

Happy training

That’s great to hear! Also, thanks a lot for making an updated training script. I’ve been busy lately (earlier with work and now my Master’s), so your updated script is much appreciated.

I had to update step #26 from to tokenizer.save_model. FYI

tokenizer.save_model("/content/models/smallBERTa", "smallBERTa")

