
@negedng
Created October 18, 2020 21:36
# Write each IMDB training example as one plain-text line
# (ds2_train: the IMDB training split, loaded elsewhere, e.g. with the `datasets` library)
with open("imdb_train_plain_lines.txt", 'w') as f:
    for example in ds2_train:
        f.write(example['text'])
        f.write('\n')
from tokenizers import BertWordPieceTokenizer, decoders, pre_tokenizers
# Initialize a tokenizer
tokenizer = BertWordPieceTokenizer()
# Customize pre-tokenization and decoding
tokenizer.pre_tokenizer = pre_tokenizers.BertPreTokenizer()
tokenizer.decoder = decoders.WordPiece()
# And then train
tokenizer.train(
    ["imdb_train_plain_lines.txt"],
    vocab_size=max_features,  # max_features: target vocabulary size, defined elsewhere
    min_frequency=1,
)
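
To illustrate what the trained tokenizer does at encode time, WordPiece can be sketched as greedy longest-match-first over the learned vocabulary. This is a simplified, stdlib-only sketch with a toy hand-made vocabulary; the real `BertWordPieceTokenizer` additionally handles lowercasing, punctuation splitting, and special tokens such as `[CLS]`/`[SEP]`:

```python
def wordpiece_encode(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first WordPiece encoding of a single word.

    Simplified illustration only; not the library's actual implementation.
    """
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while end > start:
            sub = word[start:end]
            if start > 0:
                sub = "##" + sub  # continuation pieces carry the ## prefix
            if sub in vocab:
                piece = sub
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece matches: the whole word maps to [UNK]
        tokens.append(piece)
        start = end
    return tokens

# Toy vocabulary (hypothetical, for illustration only)
vocab = {"un", "##aff", "##able", "cat"}
print(wordpiece_encode("unaffable", vocab))  # ['un', '##aff', '##able']
print(wordpiece_encode("xyz", vocab))        # ['[UNK]']
```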