@hsm207
Created October 1, 2019 23:50
BERT_pretraining_share.ipynb
@NeerajAI

NeerajAI commented Dec 13, 2019

Hi Himanshu,
Thank you for the code. This is really helpful :)

@hsm207
Author

hsm207 commented Dec 14, 2019

Who is Himanshu?

@frank-lin-liu

Hi hsm207,

In the code below from cell 13, you use deep_vocab.txt as the vocab_file. Could you please let me know how you constructed this file?

!python /content/bert_repo/create_pretraining_data.py \
  --input_file=/content/training_data/training_data.txt \
  --output_file=/tmp/tf_examples.tfrecord \
  --vocab_file=/content/training_data/deep_vocab.txt \
  --do_lower_case=True \
  --max_seq_length=128 \
  --max_predictions_per_seq=20 \
  --masked_lm_prob=0.15 \
  --random_seed=12345 \
  --dupe_factor=5

@hsm207
Author

hsm207 commented Dec 31, 2019

@frank-lin-liu

Hi,

I was helping @gsasikiran debug his code. All the files are from him, so you should ask him for the details.

@gsasikiran

Hello @frank-lin-liu,
You can check it out here

@frank-lin-liu

Hi @gsasikiran

Thanks a lot.

@sahelimukherjee92

Hi @gsasikiran, I referred to https://github.com/kwonmha/bert-vocab-builder to generate deep_vocab.txt as you did, but the result contains junk. I checked your files here and I am unable to replicate them. Could you please help?

@gsasikiran

@sahelimukherjee92 What do you mean by junks?


@sahelimukherjee92

@gsasikiran The vocab file that I generate has an issue with punctuation: it stays attached to the words.
Here is a section of the vocabulary:
-(Q).
(Proc.
(Price,
(Poon
(Polyak,
(Polyak
(PoPPCA)
(Pinto
(Photo
(Pham
(Petersen
(Perron,
(Pearl,
(Pati
(Palatucci
(Paccanaro
(PSD)
(PMF).

Could you please suggest how I can strip the punctuation from the words and keep it as separate entries, as in your vocab file? What preprocessing steps, if any, are involved?
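
One possible preprocessing step, sketched here as an assumption rather than as the method used to build the original deep_vocab.txt, is to split every punctuation character into its own token before counting vocabulary entries. This is similar in spirit to the punctuation splitting that BERT's basic tokenizer applies at tokenization time. The helper names `is_punctuation` and `split_on_punctuation` are hypothetical.

```python
import unicodedata


def is_punctuation(ch):
    """Return True for ASCII symbols and for Unicode punctuation characters."""
    cp = ord(ch)
    # Treat every non-alphanumeric printable ASCII symbol as punctuation,
    # including characters such as "$" or "^" that Unicode files elsewhere.
    if 33 <= cp <= 47 or 58 <= cp <= 64 or 91 <= cp <= 96 or 123 <= cp <= 126:
        return True
    return unicodedata.category(ch).startswith("P")


def split_on_punctuation(text):
    """Split on whitespace, then emit each punctuation character as its own token."""
    tokens = []
    for word in text.split():
        current = []
        for ch in word:
            if is_punctuation(ch):
                # Flush the word collected so far, then emit the punctuation mark.
                if current:
                    tokens.append("".join(current))
                    current = []
                tokens.append(ch)
            else:
                current.append(ch)
        if current:
            tokens.append("".join(current))
    return tokens


# Entries taken from the vocabulary excerpt above.
sample_tokens = split_on_punctuation("(Polyak, (PSD) (PMF).")
```

Running the corpus through a step like this before the vocab builder should yield entries such as `Polyak`, `(` and `,` instead of `(Polyak,`. Whether the builder then needs further options to keep those symbols is not something this sketch covers.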

@mshivasharan

I am getting errors while running the vocab builder.

Issue 1 (a version issue, fixed by replacing 'tf.flags' with 'tf.compat.v1.flags'):
Traceback (most recent call last):
  File "./bert-vocab-builder/subword_builder.py", line 37, in <module>
    tf.flags.DEFINE_string('output_filename', '/tmp/my.subword_text_encoder',
AttributeError: module 'tensorflow' has no attribute 'flags'
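
The fix for Issue 1 can be made tolerant of both TensorFlow layouts. The sketch below is only an illustration: `resolve_flags` is a hypothetical helper, and the two `SimpleNamespace` objects are stand-ins so the example runs without TensorFlow installed. It tries `flags` first (the TF 1.x location) and falls back to `compat.v1.flags` (where TF 2.x keeps the old API).

```python
from types import SimpleNamespace


def resolve_flags(tf_module):
    """Return the flags object from either a TF 1.x or a TF 2.x style module."""
    # TF 1.x exposes the flags module directly as tf.flags.
    flags = getattr(tf_module, "flags", None)
    if flags is not None:
        return flags
    # TF 2.x keeps the old API under tf.compat.v1.
    compat = getattr(tf_module, "compat", None)
    v1 = getattr(compat, "v1", None)
    flags = getattr(v1, "flags", None)
    if flags is None:
        raise AttributeError("no flags module found on the given object")
    return flags


# Stand-ins for the two module layouts (not real TensorFlow objects).
fake_tf1 = SimpleNamespace(flags="tf1-flags")
fake_tf2 = SimpleNamespace(
    compat=SimpleNamespace(v1=SimpleNamespace(flags="tf2-compat-flags"))
)

flags_from_tf1 = resolve_flags(fake_tf1)
flags_from_tf2 = resolve_flags(fake_tf2)
```

With a real installation, the call would be `resolve_flags(tf)` after `import tensorflow as tf`, and the returned object would be used in place of `tf.flags` in subword_builder.py.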

Issue 2:
The number of files to read : 1
Traceback (most recent call last):
  File "./bert-vocab-builder/subword_builder.py", line 86, in <module>
    tf.app.run()
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/platform/app.py", line 125, in run
    _sys.exit(main(argv))
  File "./bert-vocab-builder/subword_builder.py", line 67, in main
    split_on_newlines=FLAGS.split_on_newlines, additional_chars=FLAGS.additional_chars)
  File "/content/bert-vocab-builder/tokenizer.py", line 191, in corpus_token_counts
    split_on_newlines=split_on_newlines):
  File "/content/bert-vocab-builder/tokenizer.py", line 139, in _read_filepattern
    tf.logging.INFO("Start reading ", filename)
TypeError: 'int' object is not callable

Could anyone please help me out with this issue? Thanks in advance.
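
The Issue 2 traceback points at `tf.logging.INFO(...)` being called like a function. `INFO` is an integer severity level, while the lowercase `info` is the logging function, so a likely fix in tokenizer.py is `tf.logging.info("Start reading %s", filename)`. The sketch below reproduces the same mistake with Python's standard `logging` module, used here as a stand-in so it runs without TensorFlow. The logger name and the `ListHandler` class are illustrative assumptions.

```python
import logging


class ListHandler(logging.Handler):
    """Collect formatted log messages in a list so they can be inspected."""

    def __init__(self):
        super().__init__()
        self.messages = []

    def emit(self, record):
        self.messages.append(record.getMessage())


logger = logging.getLogger("vocab_builder_demo")
logger.setLevel(logging.INFO)
logger.propagate = False
handler = ListHandler()
logger.addHandler(handler)


def broken_log(filename):
    """Reproduce the bug: logging.INFO is an int level, not a callable."""
    try:
        logging.INFO("Start reading ", filename)
    except TypeError as exc:
        return str(exc)
    return None


def fixed_log(filename):
    """Use the lowercase logging function with a %-style format string."""
    logger.info("Start reading %s", filename)


error_text = broken_log("corpus.txt")
fixed_log("corpus.txt")
```

Since the faulty line lives in the bert-vocab-builder repository, a local patch like this only unblocks the run. Reporting it upstream is still worthwhile.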

@hsm207
Author

hsm207 commented Dec 1, 2020

@mshivasharan what version of tensorflow are you using?

@mshivasharan

@hsm207 Initially I was running with the default TensorFlow version in Colab, 2.x.x. Later I changed to 1.11.

@hsm207
Author

hsm207 commented Dec 2, 2020

@mshivasharan do you know where the bert-vocab-builder folder comes from?

@mshivasharan

mshivasharan commented Dec 2, 2020 via email

@hsm207
Author

hsm207 commented Dec 2, 2020

I think it's best to raise an issue on that repo.

@mshivasharan

mshivasharan commented Dec 2, 2020 via email
