@alinazhanguwo
Created April 24, 2019 19:40
from sklearn.feature_extraction.text import TfidfVectorizer


def tfidf_features(X_train, X_val, X_test):
    """
    X_train, X_val, X_test - input text (iterables of strings)
    return the TF-IDF matrix for each dataset and the fitted vocabulary
    """
    # Filter out too-rare terms (found in fewer than 5 documents) and
    # too-frequent terms (found in more than 90% of the documents).
    # ngram_range=(1, 2) builds features from unigrams and bigrams;
    # token_pattern treats every run of non-whitespace characters as a token.
    tfidf_vectorizer = TfidfVectorizer(ngram_range=(1, 2), max_df=0.9, min_df=5, token_pattern=r'(\S+)')

    # Fit the vectorizer on the train set only, then transform it
    X_train_tfidf = tfidf_vectorizer.fit_transform(X_train)

    # Transform the val and test sets with the vocabulary learned from train
    X_val_tfidf = tfidf_vectorizer.transform(X_val)
    X_test_tfidf = tfidf_vectorizer.transform(X_test)

    return X_train_tfidf, X_val_tfidf, X_test_tfidf, tfidf_vectorizer.vocabulary_


# X_train, X_val and X_test must already be defined (they are not created in this snippet)
X_train_tfidf, X_val_tfidf, X_test_tfidf, tfidf_vocab = tfidf_features(X_train, X_val, X_test)
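
The effect of `min_df`, `max_df`, and the custom `token_pattern` is easier to see on a tiny corpus. The sketch below is not part of the original gist: the six documents are made up, and `min_df` is lowered to 2 (an assumption for illustration only) because no term in such a small corpus could reach the threshold of 5 used above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus, invented for this illustration
docs = [
    "python pandas merge",
    "python numpy array",
    "python pandas groupby",
    "python c++ bindings",
    "python pandas plot",
    "python #regex help",
]

# Same settings as tfidf_features, except min_df=2 (assumed) so that
# some terms survive the rare-term filter on only six documents.
vectorizer = TfidfVectorizer(ngram_range=(1, 2), max_df=0.9, min_df=2, token_pattern=r'(\S+)')
matrix = vectorizer.fit_transform(docs)
vocab = vectorizer.vocabulary_

# "python" occurs in 100% of the documents, so max_df=0.9 drops it;
# every term seen in only one document is dropped by min_df=2.
print(sorted(vocab))   # ['pandas', 'python pandas']
print(matrix.shape)    # (6, 2)

# The (\S+) pattern keeps symbols that the default pattern would strip.
tokens = vectorizer.build_analyzer()("python c++ bindings")
print(tokens)

# Terms unseen during fit (here "rust") are ignored at transform time.
new_row = vectorizer.transform(["rust pandas"])
print(new_row.nnz)     # 1
```

Rows for documents that contain no surviving term come out as all zeros, which is worth checking for when the thresholds are strict relative to the corpus size.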