download of Byte-Level-BPE_universal_tokenizer_but.ipynb
# Download Wikipedia in Portuguese (1.62 GB zip)
# Duration: 40m 30s
get_wiki(path_data, lang)
# Split the downloaded file into one text file per article
dest = split_wiki(path_data, lang)
# Count the files and tokens in the extracted data
num_files, num_tokens = get_num_tokens(dest)
print(f'{num_files} files - {num_tokens} tokens')
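
The cell above calls `get_num_tokens` without showing its definition. As a rough illustration of what such a count could involve, here is a minimal sketch. It assumes the split step wrote one UTF-8 `.txt` file per article into a single folder and that a "token" is simply a whitespace-separated word. The name `count_files_and_tokens` is hypothetical and is not the notebook's helper, which may count tokens differently.

```python
from pathlib import Path


def count_files_and_tokens(folder):
    """Return (number of .txt files, total whitespace-separated tokens).

    Hypothetical stand-in for the notebook's get_num_tokens; it assumes
    one UTF-8 text file per article, all directly inside `folder`.
    """
    num_files, num_tokens = 0, 0
    for path in sorted(Path(folder).glob("*.txt")):
        num_files += 1
        # Read line by line so a large article is never fully held in memory.
        with path.open(encoding="utf-8") as f:
            for line in f:
                num_tokens += len(line.split())
    return num_files, num_tokens
```

Under those assumptions, `count_files_and_tokens(dest)` would return a pair that can be unpacked the same way as `num_files, num_tokens` above.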