@lhoestq
Created June 15, 2020 19:08
english wikipedia length
from nlp import load_dataset
from tqdm.auto import tqdm
wiki = load_dataset('wikipedia', '20200501.en', split="train")
batch_size = 1000
total_length = 0
for i in tqdm(range(0, len(wiki), batch_size)):  # loop takes ~1min to run
    batch = wiki[i:i + batch_size]
    total_length += sum(len(sample_text) for sample_text in batch["text"])
print(total_length)
# >>> 18067578318
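
The batched summation above can be tried without downloading Wikipedia. The following is a minimal sketch that assumes a plain Python list stands in for the "text" column; the helper name total_text_length and the sample strings are hypothetical and not part of the original gist. Note that len() on a Python str counts Unicode code points, not bytes, so the total printed by the gist is a character count.

```python
def total_text_length(texts, batch_size=1000):
    """Sum len() of every string in `texts`, one slice of `batch_size` at a time."""
    total = 0
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]  # the last slice may be shorter
        total += sum(len(text) for text in batch)
    return total


# Hypothetical stand-in data; "h\u00e9llo" is 5 code points long.
sample = ["abc", "de", "", "h\u00e9llo"]
print(total_text_length(sample, batch_size=2))  # 3 + 2 + 0 + 5 = 10
```

The batch size changes only how many rows are read per slice, not the result, so it can be tuned for memory or speed.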
@thomwolf

To run this gist, install 🤗nlp with

pip install nlp

And check the details at https://github.com/huggingface/nlp
