@MarynaLongnickel
Created June 16, 2018 21:07
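import re

import nltk
from nltk.stem import LancasterStemmer
from nltk.tokenize import word_tokenize

# assumes `data` is a pandas DataFrame with a 'text' column, and that the
# NLTK 'stopwords' and 'punkt' resources have already been downloaded
en_stopwords = set(nltk.corpus.stopwords.words('english'))

# remove punctuation from the text and lowercase it
clean = [re.sub(r'[^\w\s]', '', i).lower() for i in data['text']]

# tokenize the cleaned text rather than the raw text
tokens = [word_tokenize(x) for x in clean]

# keep only the tokens that are not stopwords
filtered_tokens = [[j for j in i if j not in en_stopwords] for i in tokens]

# initialize Lancaster Stemmer
LS = LancasterStemmer()

# note: this step is stemming, not lemmatization; the original name is kept
lemmatized = [[LS.stem(w) for w in l] for l in filtered_tokens]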
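
To see what the cleaning and stopword-filtering steps do to a single sentence, here is a minimal sketch that runs without NLTK. It is not the gist's code: `STOPWORDS` is a tiny hypothetical stand-in for NLTK's English stopword list, `clean_and_filter` is an illustrative helper name, and plain whitespace splitting replaces `word_tokenize`. The stemming step is omitted because it needs NLTK.

```python
import re

# Hypothetical stand-in for NLTK's English stopword list (illustration only).
STOPWORDS = {"the", "is", "a", "and", "on"}


def clean_and_filter(texts):
    """Strip punctuation, lowercase, split on whitespace, drop stopwords."""
    result = []
    for text in texts:
        # Same regex as the snippet above: keep word characters and whitespace.
        cleaned = re.sub(r"[^\w\s]", "", text).lower()
        # Whitespace split is an assumption standing in for word_tokenize.
        result.append([tok for tok in cleaned.split() if tok not in STOPWORDS])
    return result


sample = ["The cat is sleeping, and the dog barks!"]
filtered = clean_and_filter(sample)
print(filtered)
```

Lowercasing before filtering matters here: the stopword list is lowercase, so a capitalized "The" would otherwise slip through.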