@MaxHalford
Created November 14, 2016 16:17
Lemmatization
import string

import nltk
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import CountVectorizer


def tokenize(text):
    # Strip punctuation, tokenize, then lemmatize each token
    text = ''.join([ch for ch in text if ch not in string.punctuation])
    tokens = nltk.word_tokenize(text)
    lemmatizer = WordNetLemmatizer()
    return [lemmatizer.lemmatize(token) for token in tokens]


vectorizer = CountVectorizer(tokenizer=tokenize, stop_words='english')