@aneesha
Created September 1, 2016 00:13
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer

# `documents` is assumed to be a list of raw text strings defined earlier.
no_features = 1000  # upper bound on vocabulary size

# NMF can work with tf-idf weighted features.
tfidf_vectorizer = TfidfVectorizer(max_df=0.95, min_df=2, max_features=no_features, stop_words='english')
tfidf = tfidf_vectorizer.fit_transform(documents)
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it (available since 1.0).
tfidf_feature_names = tfidf_vectorizer.get_feature_names_out()

# LDA is a probabilistic graphical model of word counts, so it takes raw term counts rather than tf-idf weights.
tf_vectorizer = CountVectorizer(max_df=0.95, min_df=2, max_features=no_features, stop_words='english')
tf = tf_vectorizer.fit_transform(documents)
tf_feature_names = tf_vectorizer.get_feature_names_out()