@deansublett
Created June 5, 2019 18:44
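from sklearn.feature_extraction.text import TfidfVectorizer

# Using Abhishek Thakur's arguments for TF-IDF
tfv = TfidfVectorizer(
    min_df=3,                 # drop terms that occur in fewer than 3 documents
    max_features=None,        # no cap on vocabulary size
    strip_accents='unicode',
    analyzer='word',
    token_pattern=r'\w{1,}',  # any run of one or more word characters
    ngram_range=(1, 3),       # unigrams, bigrams, and trigrams
    use_idf=True, smooth_idf=True, sublinear_tf=True,
    stop_words='english',
)

# Filling NaNs with empty string
# (movies_clean is assumed to be a pandas DataFrame loaded earlier)
movies_clean['overview'] = movies_clean['overview'].fillna('')

# Fitting the TF-IDF on the 'overview' text
tfv_matrix = tfv.fit_transform(movies_clean['overview'])
tfv_matrix.shape  # (number of movies, number of n-gram features kept)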
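To see what these settings do, here is a minimal sketch that runs the same vectorizer on a tiny made-up corpus. The `toy` DataFrame and its five overviews are invented for illustration and are not part of the original movie data. With `min_df=3`, only terms that appear in at least three overviews survive, so the vocabulary shrinks quickly on small inputs.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical stand-in for movies_clean: five short overviews, one missing.
toy = pd.DataFrame({'overview': [
    "A young wizard discovers a hidden school of magic.",
    "A wizard and a young apprentice defend the school.",
    "The young wizard returns to the magic school.",
    "A detective investigates a murder in the city.",
    None,
]})

# Same arguments as the snippet above.
tfv = TfidfVectorizer(
    min_df=3, max_features=None, strip_accents='unicode',
    analyzer='word', token_pattern=r'\w{1,}', ngram_range=(1, 3),
    use_idf=True, smooth_idf=True, sublinear_tf=True,
    stop_words='english',
)

# Replace the missing overview with an empty string before vectorizing.
toy['overview'] = toy['overview'].fillna('')

tfv_matrix = tfv.fit_transform(toy['overview'])

# Only terms found in at least 3 overviews are kept as features.
print(tfv_matrix.shape)         # (5, 3)
print(sorted(tfv.vocabulary_))  # ['school', 'wizard', 'young']
```

The detective overview and the empty one share no surviving term, so their rows in `tfv_matrix` are all zeros. On a real dataset it may be worth checking how many rows end up empty after filtering.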