@cstorm125
Created February 26, 2020 01:29
from pathlib import Path
from tokenizers import ByteLevelBPETokenizer
# Collect every .txt file under cleaned_data/oscar (recursively)
paths = [str(x) for x in Path("cleaned_data/oscar").glob("**/*.txt")]
# Initialize a byte-level BPE tokenizer
tokenizer = ByteLevelBPETokenizer()
# Customize training
tokenizer.train(files=paths, vocab_size=50_000, min_frequency=2, special_tokens=[
    "<s>",
    "<pad>",
    "</s>",
    "<unk>",
    "<mask>",
])
# Save vocab and merges files to disk (produces thai-vocab.json and thai-merges.txt)
# Note: in newer versions of the tokenizers library this method is tokenizer.save_model(".", "thai")
tokenizer.save(".", "thai")
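For context, the snippet above can be exercised end to end on a tiny throwaway corpus. This is a minimal sketch, not the author's training setup: the sample sentences, the temporary file path, and the small `vocab_size=300` (byte-level BPE needs at least the 256 base bytes plus the special tokens) are all assumptions made for illustration.

```python
import tempfile
from pathlib import Path

from tokenizers import ByteLevelBPETokenizer

# Assumption: a tiny throwaway corpus stands in for cleaned_data/oscar
corpus = "\n".join([
    "sawasdee krub",
    "sawasdee ka",
    "hello world",
])

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "sample.txt"
    path.write_text(corpus, encoding="utf-8")

    tokenizer = ByteLevelBPETokenizer()
    # vocab_size=300 is a toy value for this sketch; the gist uses 50_000
    tokenizer.train(files=[str(path)], vocab_size=300, min_frequency=1, special_tokens=[
        "<s>",
        "<pad>",
        "</s>",
        "<unk>",
        "<mask>",
    ])

    # Encode a string and round-trip it; byte-level BPE is lossless
    enc = tokenizer.encode("sawasdee krub")
    roundtrip = tokenizer.decode(enc.ids)
```

Because the tokenizer operates on raw bytes, decoding the token ids reproduces the original string exactly, with no unknown-token loss.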