@Shivampanwar
Last active September 12, 2019 07:38
Fine-tunes the BERT language model on Google Colab
## Combine train and test reviews into a single corpus for LM fine-tuning
import os
import pandas as pd
from tqdm import tqdm

lm_df = pd.concat([train_df[['review']], test_df[['review']]])
lm_df.review = lm_df.review.str.lower()
tqdm.pandas()
## The pregeneration script expects one document per line, with a blank
## line separating consecutive documents
changed_text = lm_df.review.progress_apply(lambda x: x + "\n" + "\n")
with open(os.path.join(directory_path, 'data_lm.txt'), "w") as f:
    f.write(''.join(changed_text))
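As a sanity check, this is the file layout the snippet above produces; a minimal sketch with made-up reviews (the `reviews` list is illustrative, not from the notebook):

```python
# Each lower-cased review goes on its own line, followed by a blank line,
# which is the document separator the pregeneration script expects.
reviews = ["Great movie, loved it.", "Terrible plot."]  # made-up examples
corpus = ''.join(r.lower() + "\n" + "\n" for r in reviews)
with open("data_lm.txt", "w") as f:
    f.write(corpus)
```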
## We need the data in BERT's pregenerated training format; the command below does that
!python3 pregenerate_training_data.py --train_corpus data_lm.txt --bert_model bert-base-uncased --do_lower_case --output_dir training/ --epochs_to_generate 2 --max_seq_len 256
## We now use this data to fine-tune the model
!python3 finetune_on_pregenerated.py --pregenerated_data training/ --bert_model bert-base-uncased --do_lower_case --train_batch_size 16 --output_dir finetuned_lm/ --epochs 2
## Our fine-tuned model is now ready to be used.
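One possible way to load the weights saved in `finetuned_lm/` back into `pytorch_pretrained_bert` (the library these scripts ship with). This is a hedged sketch: the `load_finetuned` helper is mine, and it assumes the output directory contains the `bert_config.json` and `pytorch_model.bin` that the fine-tuning script saves, falling back to the base vocab if `vocab.txt` is absent:

```python
import os

MODEL_DIR = 'finetuned_lm/'  # the --output_dir used in the fine-tuning step

def load_finetuned(model_dir=MODEL_DIR):
    """Load the fine-tuned masked-LM weights (hypothetical helper)."""
    # Lazy import: pytorch_pretrained_bert is only needed at call time.
    from pytorch_pretrained_bert import BertTokenizer, BertForMaskedLM
    # Use the saved vocab if present, otherwise the base uncased vocab.
    vocab = model_dir if os.path.exists(os.path.join(model_dir, 'vocab.txt')) \
        else 'bert-base-uncased'
    tokenizer = BertTokenizer.from_pretrained(vocab, do_lower_case=True)
    model = BertForMaskedLM.from_pretrained(model_dir)
    model.eval()  # inference mode
    return tokenizer, model
```

The tokenizer and model can then be used for masked-token prediction or as a warm start for a downstream classifier.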