Who is Himanshu?
Hi hsm207,
In the code in cell 13 below, you use deep_vocab.txt as the vocab_file. Could you please let me know how you constructed this file?
!python /content/bert_repo/create_pretraining_data.py \
  --input_file=/content/training_data/training_data.txt \
  --output_file=/tmp/tf_examples.tfrecord \
  --vocab_file=/content/training_data/deep_vocab.txt \
  --do_lower_case=True \
  --max_seq_length=128 \
  --max_predictions_per_seq=20 \
  --masked_lm_prob=0.15 \
  --random_seed=12345 \
  --dupe_factor=5
Hi,
I was helping @gsasikiran debug his code. All the files are from him, so you should ask him for the details.
Hello @frank-lin-liu,
You can check it out here.
Hi @gsasikiran
Thanks a lot.
Hi @gsasikiran, I referred to https://github.com/kwonmha/bert-vocab-builder to generate deep_vocab.txt as you did, but the result contains junk. I checked your files here and I am unable to replicate them. Could you please help?
@sahelimukherjee92 What do you mean by junk?
@gsasikiran The vocab file that I generate has issues with punctuation.
Here is a section of the vocabulary:
-(Q).
(Proc.
(Price,
(Poon
(Polyak,
(Polyak
(PoPPCA)
(Pinto
(Photo
(Pham
(Petersen
(Perron,
(Pearl,
(Pati
(Palatucci
(Paccanaro
(PSD)
(PMF).
Could you please suggest how I can strip the punctuation from the words and keep it as separate entries, as in your vocab file? What preprocessing steps are involved, if any?
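One possible preprocessing step, sketched here as an assumption rather than as the method actually used to build deep_vocab.txt, is to split every punctuation character into its own token before the corpus is passed to the vocab builder. BERT's own basic tokenizer also separates punctuation from words before WordPiece is applied, so entries such as "(Polyak," would never be looked up as a single piece. The function name split_punctuation is hypothetical, and the sketch only covers ASCII punctuation:

```python
import re
import string

# ASCII punctuation only; this is a simplification and an assumption,
# since real corpora may also contain Unicode punctuation.
_PUNCT = re.escape(string.punctuation)

# Match either a single punctuation character or a run of characters
# that are neither whitespace nor punctuation.
_TOKEN_RE = re.compile(rf"[{_PUNCT}]|[^\s{_PUNCT}]+")


def split_punctuation(text):
    """Return the tokens of `text` with each punctuation mark separated."""
    return _TOKEN_RE.findall(text)


if __name__ == "__main__":
    for entry in ["(Polyak,", "(PMF).", "-(Q)."]:
        print(entry, "->", " ".join(split_punctuation(entry)))
```

Writing the space-joined tokens back to the training text before running the vocab builder should leave words and punctuation marks as separate vocabulary candidates.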
I am getting errors while running the vocab builder.
Issue 1: fixed by replacing 'tf.flags' with 'tf.compat.v1.flags' (version issue)
Traceback (most recent call last):
  File "./bert-vocab-builder/subword_builder.py", line 37, in <module>
    tf.flags.DEFINE_string('output_filename', '/tmp/my.subword_text_encoder',
AttributeError: module 'tensorflow' has no attribute 'flags'
Issue 2:
The number of files to read : 1
Traceback (most recent call last):
  File "./bert-vocab-builder/subword_builder.py", line 86, in <module>
    tf.app.run()
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/platform/app.py", line 125, in run
    _sys.exit(main(argv))
  File "./bert-vocab-builder/subword_builder.py", line 67, in main
    split_on_newlines=FLAGS.split_on_newlines, additional_chars=FLAGS.additional_chars)
  File "/content/bert-vocab-builder/tokenizer.py", line 191, in corpus_token_counts
    split_on_newlines=split_on_newlines):
  File "/content/bert-vocab-builder/tokenizer.py", line 139, in _read_filepattern
    tf.logging.INFO("Start reading ", filename)
TypeError: 'int' object is not callable
Could anyone please help me out with this issue? Thanks in advance.
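A note on Issue 2: the TypeError suggests that tf.logging.INFO is an integer log-level constant rather than a function, so calling it fails. The lowercase tf.logging.info is the logging function in TensorFlow 1.x, so changing line 139 of tokenizer.py to something like tf.logging.info("Start reading %s", filename) would presumably resolve it. That change is untested against the bert-vocab-builder repo. The sketch below reproduces the same mistake with Python's standard logging module, which uses the same naming pattern, so it runs without TensorFlow; the helper name call_level_constant is hypothetical:

```python
import logging

# logging.INFO is an integer level constant, not a function.
print(type(logging.INFO).__name__, logging.INFO)


def call_level_constant():
    """Reproduce the mistake: calling the level constant raises TypeError."""
    try:
        logging.INFO("Start reading ", "file.txt")
    except TypeError as err:
        return str(err)
    return None


print(call_level_constant())

# The lowercase function is the one that emits a message. Note that the
# message uses a format string; extra positional arguments are format
# arguments, not pieces to be concatenated.
logging.basicConfig(level=logging.INFO)
logging.info("Start reading %s", "file.txt")
```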
@mshivasharan what version of tensorflow are you using?
@hsm207 Initially I was running with the default TensorFlow version in Colab, 2.x.x. Later I changed to 1.11.
@mshivasharan Do you know where the bert-vocab-builder folder comes from?
I think it's best to raise an issue on that repo.
Hi Himanshu,
Thank you for the code. This is really helpful :)