@rajy4683
Last active February 7, 2021 10:22
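# Build the vocabularies from the training split only. Tokens seen fewer than
# min_freq times are left out of the vocab and resolve to the unknown token.
SRC.build_vocab(train_data, min_freq=2)
TRG.build_vocab(train_data, min_freq=2)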
#### Analyze the vocabulary
print("Length of German(SRC) Vocab:", len(SRC.vocab.stoi))
print("Length of English(TRG) Vocab:", len(TRG.vocab.stoi))
# most_common(10) returns the 10 highest-count (token, count) pairs
print("Top 10 frequent tokens in German Vocab", SRC.vocab.freqs.most_common(10))
print("Top 10 frequent tokens in English Vocab", TRG.vocab.freqs.most_common(10))
### Vocab output
"""
Length of German(SRC) Vocab: 7855
Length of English(TRG) Vocab: 5893
Top 10 frequent tokens in German Vocab [('.', 28821), ('ein', 18850), ('einem', 13711), ('in', 11893), ('eine', 9908),
(',', 8938), ('und', 8925), ('mit', 8843), ('auf', 8745), ('mann', 7805)]
Top 10 frequent tokens in English Vocab [('a', 49165), ('.', 27623), ('in', 14886), ('the', 10955),
('on', 8035), ('man', 7781), ('is', 7525), ('and', 7379), ('of', 6871), ('with', 6179)]
"""
"""