@msjgriffiths
Created March 7, 2019 04:11
Quick check of scores on baseline logit
from tensorflow.keras.datasets import imdb
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
(x_train, y_train), (x_test, y_test) = imdb.load_data()
w2i = imdb.get_word_index()
i2w = {v: k for k, v in w2i.items()}
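# load_data() offsets every word index by 3 (0 = padding, 1 = start, 2 = unknown),
# so subtract 3 before looking a token up in the word index.
review_train = [" ".join(i2w.get(i - 3, "") for i in x) for x in x_train]
review_test = [" ".join(i2w.get(i - 3, "") for i in x) for x in x_test]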
# The corpus goes to fit_transform(), not to the constructor.
transformer = TfidfVectorizer(ngram_range=(1, 4))
sk_train = transformer.fit_transform(review_train)
sk_test = transformer.transform(review_test)
clf = LogisticRegression()
clf.fit(sk_train, y_train)
preds = clf.predict_proba(sk_test)
# Accuracy
print("Accuracy:", clf.score(sk_test, y_test))
# AUC
print("AUC:", roc_auc_score(y_test, preds[:, 1]))