@MaartenGr
Created October 15, 2020 11:53
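# Imports (CTFIDFVectorizer is a custom class, not part of scikit-learn; define or import it separately)
import pandas as pd
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer

# Get data
newsgroups = fetch_20newsgroups(subset='train', remove=('headers', 'footers', 'quotes'))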
# Create documents per label
docs = pd.DataFrame({'Document': newsgroups.data, 'Class': newsgroups.target})
docs_per_class = docs.groupby(['Class'], as_index=False).agg({'Document': ' '.join})
# Create c-TF-IDF: count terms per class, then weight the counts using the total number of documents
count = CountVectorizer().fit_transform(docs_per_class.Document)
ctfidf = CTFIDFVectorizer().fit_transform(count, n_samples=len(docs))
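
The snippet above calls `CTFIDFVectorizer`, which is neither defined here nor provided by scikit-learn. Below is a minimal sketch of what such a class could look like, so the example can be followed end to end. It assumes the weighting is each term's count in a class divided by that class's total word count, multiplied by `log(n_samples / total frequency of the term across classes)`. The class body, the `idf_` attribute, and the exact formula are assumptions for illustration and may differ from the author's implementation. It would need to be defined before the `fit_transform` call.

```python
import numpy as np
import scipy.sparse as sp


class CTFIDFVectorizer:
    """Hypothetical stand-in: class-based TF-IDF over a class-by-term count matrix."""

    def fit(self, X, n_samples):
        # Total frequency of each term, summed over all classes.
        term_freq = np.asarray(X.sum(axis=0)).ravel()
        # Assumed IDF form: log(number of documents / total term frequency).
        self.idf_ = np.log(n_samples / term_freq)
        return self

    def transform(self, X):
        X = sp.csr_matrix(X, dtype=np.float64)
        # Divide each row by its total so counts become within-class term frequencies.
        row_sums = np.asarray(X.sum(axis=1)).ravel()
        row_sums[row_sums == 0] = 1.0  # leave empty classes as all zeros
        tf = sp.diags(1.0 / row_sums) @ X
        # Scale every column (term) by its IDF weight.
        return (tf @ sp.diags(self.idf_)).tocsr()

    def fit_transform(self, X, n_samples):
        return self.fit(X, n_samples).transform(X)
```

With this sketch, `ctfidf` is a sparse matrix with one row per class and one column per vocabulary term. Terms that are frequent within a class but rare overall receive the highest weights.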