part one
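# Imports needed by the rest of the script
from sklearn.datasets import fetch_20newsgroups, load_files
from sklearn.feature_extraction.text import CountVectorizer

# Optional: read a local copy of the dataset instead. The path is machine-specific
# and the result is not used below, so the call is left commented out.
# load_files("C://Users/Tathagat Dasgupta/Desktop/ML Project/20news-18828")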
categories=['alt.atheism','soc.religion.christian','comp.graphics','sci.med']
print("hello")
twenty_train=fetch_20newsgroups(subset='train',categories=categories,shuffle=True,random_state=42)
#twenty_train.target_names=['alt.atheism','comp.graphics','sci.med','soc.religion.christian']
print(len(twenty_train.data))
print("\n".join(twenty_train.data[0].split("\n")[:3]))
print(twenty_train.target_names[twenty_train.target[0]])
print(twenty_train.target[:10])
for t in twenty_train.target[:10]:
    print(twenty_train.target_names[t])
#Preprocessing
#Tokenizing text
count_vect=CountVectorizer()
X_train_counts=count_vect.fit_transform(twenty_train.data)
print(X_train_counts.shape)
print(count_vect.vocabulary_.get(u'algorithm'))
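# A minimal sketch of what the tokenizing step above produces, run on a tiny
# made-up corpus instead of the newsgroup data. The names toy_docs, toy_vect
# and toy_counts are hypothetical and not part of the original script; the
# default CountVectorizer settings are assumed.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two short documents stand in for twenty_train.data (illustrative only).
toy_docs = ["the cat sat", "the cat sat on the mat"]

toy_vect = CountVectorizer()
# fit_transform learns the vocabulary and returns a sparse document-term matrix.
toy_counts = toy_vect.fit_transform(toy_docs)

# One row per document, one column per distinct token.
print(toy_counts.shape)
# vocabulary_ maps each token to its column index (tokens are sorted alphabetically).
print(toy_vect.vocabulary_.get("the"))
# Dense view of the counts; "the" appears twice in the second document.
print(toy_counts.toarray())
```

# As with count_vect above, vocabulary_.get() returns None for a token that
# never occurred in the fitted documents.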