@Smerity
Created February 9, 2017 23:00
Count the number of unique tokens in WikiText-2 and/or WikiText-103
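vocab = set()
with open('wiki.train.tokens', encoding='utf-8') as f:
    for i, line in enumerate(f):
        # Split on single spaces and drop the empty strings left by repeated spaces.
        # Note: split(' ') does not strip newline characters, unlike split().
        words = [x for x in line.split(' ') if x]
        vocab.update(words)
        # Print the first 10 tokenized lines as a sanity check
        if i < 10:
            print(words)
print('Vocab size:', len(vocab))
# Returns 33,278 for WikiText-2
# Returns 267,735 for WikiText-103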
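
To run the same count against more than one file (for example, the WikiText-2 and WikiText-103 training splits), the loop can be wrapped in a function. This is a minimal sketch, not part of the original gist: `count_vocab` is a hypothetical helper name, and it assumes the files are UTF-8 encoded and tokenized with single spaces, as in the snippet above.

```python
def count_vocab(path):
    """Return the set of unique space-separated tokens in the file at `path`.

    Hypothetical helper: mirrors the loop above, including split(' '),
    so newline characters are kept as part of the tokens.
    """
    vocab = set()
    with open(path, encoding='utf-8') as f:
        for line in f:
            # Drop the empty strings produced by leading or repeated spaces
            vocab.update(x for x in line.split(' ') if x)
    return vocab
```

Calling `len(count_vocab('wiki.train.tokens'))` should then give the same number the script above prints for that file.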