@shelvacu
Created March 12, 2022 11:46
Just download all Google Ngrams eng 2020
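# Requires bash 4+ for zero-padded brace ranges such as {00000..00588}.
for f in {1-000{00..23}-of-00024,2-{00000..00588}-of-00589,3-{00000..06880}-of-06881,4-{00000..06667}-of-06668,5-{00000..19422}-of-19423}.gz
do
  if [ -f "$f" ]
  then
    # Already downloaded on a previous run.
    echo "Skipping $f"
  else
    # Fetch to a temporary name so an interrupted download is not skipped on the next run.
    wget -O "$f.part" "http://storage.googleapis.com/books/ngrams/books/20200217/eng/$f" && mv "$f.part" "$f" || break
  fi
done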
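
The brace expansion above packs the shard counts for every n-gram order into a single line. As a rough sketch, the same file names can be generated one order at a time, which makes it easier to fetch only part of the dataset. `ngram_files` is a hypothetical helper name, not part of the original script; the shard counts (24, 589, 6881, 6668, 19423) are copied from the loop above.

```shell
# Hypothetical helper: print the shard file names for one n-gram order.
# $1 = n-gram order (1..5), $2 = number of shards for that order.
ngram_files() {
  i=0
  while [ "$i" -lt "$2" ]; do
    # Names follow the pattern ORDER-SHARD-of-TOTAL.gz, zero-padded to 5 digits.
    printf '%d-%05d-of-%05d.gz\n' "$1" "$i" "$2"
    i=$((i + 1))
  done
}

# Example: list the first three 1-gram shards.
ngram_files 1 24 | head -n 3
```

Piping `ngram_files 2 589` into a download loop would then fetch only the 2-grams, assuming the same base URL as the script above.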

shelvacu commented Mar 31, 2022

Since Google's ngram docs never bother to mention a detail as minuscule as how many terabytes you'll need to download and store...

1-grams:  13.4GB (    13,424,928,417 bytes)
2-grams:   324GB (   323,824,694,291 bytes)
3-grams:  3.35TB ( 3,354,434,461,971 bytes)
4-grams:  3.02TB ( 3,022,999,522,388 bytes)
5-grams:  8.18TB ( 8,179,330,805,793 bytes)
  total: 14.89TB (14,894,014,412,860 bytes)
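
For anyone who wants to re-check the arithmetic, or total up just a subset of orders, here is a small sketch. The byte counts are copied from the table above; `to_tb` is a hypothetical helper that assumes decimal units (1 TB = 10^12 bytes), which is what the table appears to use, and a shell with 64-bit arithmetic.

```shell
# Byte counts for 1-grams through 5-grams, copied from the table above.
sizes="13424928417 323824694291 3354434461971 3022999522388 8179330805793"

# Hypothetical helper: print a byte count as decimal terabytes
# (1 TB = 10^12 bytes), rounded to two decimal places.
to_tb() {
  hundredths=$(( ($1 + 5000000000) / 10000000000 ))
  printf '%d.%02d\n' "$((hundredths / 100))" "$((hundredths % 100))"
}

# Sum the per-order byte counts.
total=0
for s in $sizes; do
  total=$((total + s))
done

echo "total bytes: $total"
echo "total TB: $(to_tb "$total")"
```

Dropping entries from `sizes` gives the requirement for a partial download, for example the first two values for 1-grams and 2-grams only.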