@tanosaur
Created April 19, 2020 16:07
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline
from sklearn import metrics
# Load the predefined train/test splits (downloaded and cached on first use).
newsgroups_train = fetch_20newsgroups(subset='train')
newsgroups_test = fetch_20newsgroups(subset='test')
pipeline = Pipeline([
    ('vect', CountVectorizer(ngram_range=(1, 2), max_df=0.5)),
    ('tfidf', TfidfTransformer(sublinear_tf=True)),
    ('clf', SGDClassifier(loss='hinge', penalty='l2',
                          alpha=0.0001, max_iter=50, tol=0.0005)),
])
print('training')
pipeline.fit(newsgroups_train.data, newsgroups_train.target)
predicted = pipeline.predict(newsgroups_test.data)
accuracy = metrics.accuracy_score(newsgroups_test.target, predicted)
print(accuracy)  # 0.865 in the original run; no random_state is set, so expect small run-to-run variation
# Other things to explore via supervised bag-of-words method:
# - Truncated singular value decomposition and latent semantic analysis
# - CountVectorizer analyzer and n-grams
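A minimal sketch of the two ideas listed above, not part of the original gist. It assumes a small hypothetical in-memory corpus (`docs`, `labels`) in place of the 20 newsgroups data so that it runs without a download. The first part compares word n-grams with character n-grams through the `analyzer` parameter of `CountVectorizer`. The second part adds `TruncatedSVD` after TF-IDF, which is one common way to do latent semantic analysis. The value `n_components=2` and the added `Normalizer` step are illustrative choices, not tuned settings.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import Normalizer

# Hypothetical toy corpus standing in for the 20 newsgroups data.
docs = [
    "the rocket reached orbit after launch",
    "nasa plans a new rocket launch to orbit",
    "the goalie stopped the puck in the hockey game",
    "the hockey team won the game on the ice",
]
labels = [0, 0, 1, 1]  # 0 = space, 1 = hockey (made-up classes)

# 1) Analyzer and n-grams: word unigrams + bigrams vs. character 2- and 3-grams.
word_vect = CountVectorizer(analyzer='word', ngram_range=(1, 2))
char_vect = CountVectorizer(analyzer='char_wb', ngram_range=(2, 3))
word_counts = word_vect.fit_transform(docs)
char_counts = char_vect.fit_transform(docs)
print(word_counts.shape, char_counts.shape)

# 2) LSA: project the sparse TF-IDF matrix onto a few dense components
#    with truncated SVD, re-normalize the rows, then classify.
lsa_pipeline = Pipeline([
    ('vect', CountVectorizer(ngram_range=(1, 2))),
    ('tfidf', TfidfTransformer(sublinear_tf=True)),
    ('svd', TruncatedSVD(n_components=2, random_state=0)),  # illustrative size
    ('norm', Normalizer(copy=False)),
    ('clf', SGDClassifier(loss='hinge', penalty='l2', alpha=0.0001,
                          max_iter=50, tol=0.0005, random_state=0)),
])
lsa_pipeline.fit(docs, labels)

# Inspect the reduced representation produced by every step before the classifier.
reduced = lsa_pipeline[:-1].transform(docs)
print(reduced.shape)
```

On the full dataset the number of SVD components would need to be much larger and chosen by validation. Whether the reduced representation helps or hurts the accuracy reported above is something to measure, not something this sketch shows.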