@ahmedshahriar
Created November 21, 2020 04:16
"""
@ github.com/ahmedshahriar
This code removes non-English words from text
"""
import nltk
# download the WordNet corpus (only needed once)
nltk.download('wordnet')
# build a set of WordNet lemma names for fast membership checks
wordnet = set(nltk.corpus.wordnet.words())
words = "love ে ী ে া ে ু ্ া োঁ ে"
# keep only tokens whose lowercase form appears in WordNet
words_cleaned = " ".join(w for w in nltk.wordpunct_tokenize(words) if w.lower() in wordnet)
print(words_cleaned)
# output: love
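
The same filtering step can be wrapped in a reusable function. The sketch below is an illustration, not part of the original gist: `keep_known_words` and its `vocabulary` parameter are hypothetical names, and the regular expression is assumed to mirror the pattern behind `nltk.wordpunct_tokenize`, so the block runs without downloading any corpus. In practice the `wordnet` set built above could be passed as `vocabulary`. Since WordNet covers nouns, verbs, adjectives, and adverbs, function words such as "the" are likely to be dropped as well.

```python
import re

# Assumed to mirror the pattern behind nltk.wordpunct_tokenize:
# runs of word characters, or runs of non-word, non-space characters.
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]+")


def keep_known_words(text, vocabulary):
    """Return text with only the tokens whose lowercase form is in vocabulary.

    `vocabulary` is a hypothetical stand-in for the WordNet word set;
    any set of lowercase strings works.
    """
    tokens = TOKEN_PATTERN.findall(text)
    return " ".join(tok for tok in tokens if tok.lower() in vocabulary)


if __name__ == "__main__":
    sample_vocabulary = {"love", "code"}  # tiny illustrative vocabulary
    print(keep_known_words("Love ে ী code!", sample_vocabulary))
```

Passing the vocabulary in as an argument keeps the function independent of NLTK, so a different word list can be swapped in without changing the filtering logic.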