@morkapronczay
Created October 14, 2019 11:19
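from itertools import chain

from nltk.corpus import stopwords  # needs a one-time nltk.download('stopwords')

# NOTE: `languages` (an iterable of language codes) and `texts_split`
# ({language: {key: list of tokens}}) must be defined before this snippet runs.

# create one big token list per language for easier handling
text_bylang = {lan: list(chain.from_iterable(texts_split[lan].values())) for lan in languages}
# long format of languages for stopword identification
languages_long = {'en': 'english', 'de': 'german', 'hu': 'hungarian', 'ro': 'romanian'}
# create dict of stopwords by language (sets give fast membership tests)
stopwords_bylang = {lan: set(stopwords.words(languages_long[lan])) for lan in languages}
# filter stopwords from text; tokens are compared as-is, NLTK stopwords are lowercase
text_bylang_stop = {lan: [tok for tok in text_bylang[lan] if tok not in stopwords_bylang[lan]] for lan in languages}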
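
The flatten-then-filter steps above can be tried without the rest of the pipeline. The following is a minimal sketch on toy data: the contents of `languages` and `texts_split`, and the tiny hand-written stopword sets, are made up for illustration and stand in for the real corpus and for `stopwords.words(...)`.

```python
from itertools import chain

# Hypothetical toy inputs (assumptions, not the original data):
# language codes and already-tokenized texts keyed by document name.
languages = ['en', 'de']
texts_split = {
    'en': {'doc1': ['the', 'cat', 'sat'], 'doc2': ['on', 'the', 'mat']},
    'de': {'doc1': ['die', 'katze', 'sitzt']},
}

# Hand-written stand-ins for the NLTK stopword lists.
stopwords_bylang = {'en': {'the', 'on'}, 'de': {'die'}}

# Flatten the per-document token lists into one list per language.
text_bylang = {lan: list(chain.from_iterable(texts_split[lan].values())) for lan in languages}

# Keep only the tokens that are not stopwords for that language.
text_bylang_stop = {
    lan: [tok for tok in text_bylang[lan] if tok not in stopwords_bylang[lan]]
    for lan in languages
}

print(text_bylang_stop)
# → {'en': ['cat', 'sat', 'mat'], 'de': ['katze', 'sitzt']}
```

Because the comparison is exact, tokens such as `'The'` would pass through the filter unless the text is lowercased beforehand.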