@brandonko
Created September 3, 2020 03:52
Stop Word Removal, Lemmatization, and Stemming
import nltk
from nltk.corpus import stopwords
from nltk.stem import SnowballStemmer, WordNetLemmatizer

# Download the corpora used below
nltk.download('stopwords')
nltk.download('wordnet')

# Remove all stopwords (a set makes each membership test O(1))
stop_words = set(stopwords.words('english'))
def remove_stopwords(tokenized_sentences):
    for sentence in tokenized_sentences:
        yield [token for token in sentence if token not in stop_words]

# Lemmatize all words (lemmatize() treats each token as a noun by default)
wordnet_lemmatizer = WordNetLemmatizer()
def lemmatize_words(tokenized_sentences):
    for sentence in tokenized_sentences:
        yield [wordnet_lemmatizer.lemmatize(token) for token in sentence]

# Stem all words
snowball_stemmer = SnowballStemmer('english')
def stem_words(tokenized_sentences):
    for sentence in tokenized_sentences:
        yield [snowball_stemmer.stem(token) for token in sentence]

# `sentences` must already be a list of token lists, one list per sentence
sentences = list(remove_stopwords(sentences))
sentences = list(lemmatize_words(sentences))
sentences = list(stem_words(sentences))
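The three passes above each build a full intermediate list. As a rough sketch of the same idea in a single pass, the version below drops stop words and then applies any number of per-token normalizers to each sentence in turn. The names `preprocess`, `TOY_STOP_WORDS`, and `toy_normalize` are hypothetical and not part of NLTK or this gist; the toy stop-word set and the suffix-stripping function are stand-ins so the example runs without downloading any corpora.

```python
from typing import Callable, Iterable, Iterator, List, Set

# Toy stand-in for stopwords.words('english') (an assumption, not NLTK's list).
TOY_STOP_WORDS: Set[str] = {"the", "a", "is", "on"}


def toy_normalize(token: str) -> str:
    """Hypothetical stand-in for a lemmatizer or stemmer: strip a trailing 's'."""
    return token[:-1] if token.endswith("s") and len(token) > 3 else token


def preprocess(
    tokenized_sentences: Iterable[List[str]],
    stop_words: Set[str],
    normalizers: List[Callable[[str], str]],
) -> Iterator[List[str]]:
    """Drop stop words, then apply each normalizer in order, sentence by sentence."""
    for sentence in tokenized_sentences:
        tokens = [t for t in sentence if t not in stop_words]
        for normalize in normalizers:
            tokens = [normalize(t) for t in tokens]
        yield tokens


sample = [["the", "cats", "sat", "on", "a", "mat"], ["a", "dog", "barks"]]
result = list(preprocess(sample, TOY_STOP_WORDS, [toy_normalize]))
print(result)  # [['cat', 'sat', 'mat'], ['dog', 'bark']]
```

With NLTK set up as in the snippet above, the same function could presumably be called as `preprocess(sentences, stop_words, [wordnet_lemmatizer.lemmatize, snowball_stemmer.stem])`, which keeps the original order of lemmatizing before stemming.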