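A Python script that cleans a raw text corpus: lowercases it, strips punctuation, removes English stopwords, and lemmatizes the remaining words.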
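import re

from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer

# Requires the NLTK data packages 'stopwords' and 'wordnet' (see nltk.download).

def clean(raw_str):
    """Lowercase the text, strip punctuation, drop English stopwords, and lemmatize."""
    en_stopwords = set(stopwords.words('english'))
    lemma = WordNetLemmatizer()
    lower_str = raw_str.lower()
    # Keep only runs of word characters, which discards punctuation
    punc_free_str = ' '.join(re.findall(r'\w+', lower_str))
    stop_free_str = ' '.join(i for i in punc_free_str.split() if i not in en_stopwords)
    # lemmatize() defaults to noun POS, so verb forms are mostly left unchanged
    cleaned_str = ' '.join(lemma.lemmatize(word) for word in stop_free_str.split())
    return cleaned_str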
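
To try the first steps of the pipeline without installing NLTK or its data, here is a minimal standard-library sketch. It is not the script above: `clean_without_nltk` and `DEMO_STOPWORDS` are hypothetical names, the stopword set is a tiny hand-picked sample rather than NLTK's English list, and lemmatization is skipped entirely.

```python
import re

# Hypothetical, deliberately tiny stopword set for illustration only;
# the script above uses NLTK's full English list instead.
DEMO_STOPWORDS = {"the", "a", "an", "is", "are", "of", "and", "over"}


def clean_without_nltk(raw_str, stop_words=DEMO_STOPWORDS):
    """Lowercase, strip punctuation, and drop stopwords (no lemmatization)."""
    # Same tokenization as clean(): keep only runs of word characters
    tokens = re.findall(r"\w+", raw_str.lower())
    return " ".join(tok for tok in tokens if tok not in stop_words)


print(clean_without_nltk("The quick, brown fox is jumping over the lazy dog!"))
# prints: quick brown fox jumping lazy dog
```

Because `\w+` treats an apostrophe as a separator, a contraction such as "Don't" comes out as the two tokens "don" and "t". The same applies to `clean`, which uses the same regular expression.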