@yanshengjia
Created March 1, 2018 15:14
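A Python script that cleans a raw text corpus: it lowercases the text, strips punctuation, removes English stopwords, and lemmatizes the remaining words.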
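import re

from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer


def clean(raw_str):
    # Requires the NLTK 'stopwords' and 'wordnet' data (see nltk.download).
    en_stopwords = set(stopwords.words('english'))
    lemma = WordNetLemmatizer()
    # Lowercase, then keep only runs of word characters (drops punctuation).
    lower_str = raw_str.lower()
    punc_free_str = ' '.join(re.findall(r'\w+', lower_str))
    # Remove English stopwords.
    stop_free_str = ' '.join(i for i in punc_free_str.split() if i not in en_stopwords)
    # Lemmatize each remaining word (as a noun, the lemmatizer's default).
    cleaned_str = ' '.join(lemma.lemmatize(word) for word in stop_free_str.split())
    return cleaned_str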
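
To see what each stage of the script produces, here is a minimal sketch of the same pipeline that runs without any NLTK data. `DEMO_STOPWORDS`, `clean_stages`, and the pass-through `lemmatize` default are hypothetical stand-ins introduced only for illustration; they are not part of the gist, and the tiny stopword set is not NLTK's real list.

```python
import re

# Hypothetical stand-in for NLTK's English stopword list, kept tiny so the
# sketch runs without downloading any NLTK data.
DEMO_STOPWORDS = {"the", "a", "is", "on", "and"}


def clean_stages(raw_str, stopwords=DEMO_STOPWORDS, lemmatize=lambda w: w):
    """Return the intermediate results of each cleaning stage.

    `lemmatize` defaults to a pass-through (an assumption for this sketch);
    with NLTK installed, WordNetLemmatizer().lemmatize could be passed instead.
    """
    lowered = raw_str.lower()
    # \w+ keeps letters, digits, and underscores; everything else is a separator.
    tokens = re.findall(r"\w+", lowered)
    kept = [t for t in tokens if t not in stopwords]
    lemmas = [lemmatize(t) for t in kept]
    return {"tokens": tokens, "kept": kept, "cleaned": " ".join(lemmas)}


stages = clean_stages("The cat's toy is on the mat, and it's RED!")
print(stages["tokens"])
print(stages["cleaned"])
```

One thing the sketch makes visible: because `\w+` treats the apostrophe as a separator, contractions and possessives such as `cat's` are split into `cat` and `s`. Whether the leftover fragments survive depends on the stopword list in use.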