@morkapronczay
Created October 11, 2019 13:44
from nltk.tokenize import RegexpTokenizer

# Tokenize each text into word tokens; the \w+ pattern also strips punctuation.
tokenizer = RegexpTokenizer(r'\w+')
texts_split = {
    lan: {key: tokenizer.tokenize(text) for key, text in texts[lan].items()}
    for lan in languages
}
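The snippet assumes `languages` (an iterable of language codes) and `texts` (a dict mapping each language to a dict of named documents) were built earlier in the workflow. A minimal sketch with hypothetical inputs, just to show the shape of the data going in and out:

# Hypothetical inputs for illustration only; the real `texts` and `languages`
# come from earlier steps of the original workflow.
languages = ['english', 'hungarian']
texts = {
    'english': {'doc1': "Hello, world! This is a test."},
    'hungarian': {'doc1': "Szia, vilag! Ez egy teszt."},
}

tokenizer = RegexpTokenizer(r'\w+')
texts_split = {
    lan: {key: tokenizer.tokenize(text) for key, text in texts[lan].items()}
    for lan in languages
}

print(texts_split['english']['doc1'])
# ['Hello', 'world', 'This', 'is', 'a', 'test']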