@himangSharatun
Created February 9, 2018 07:36
from sklearn.feature_extraction.text import CountVectorizer
import json
import pandas
import numpy
corpus_path = 'data/training/training-data.csv'
# prepare training data for bow (corpus)
dataframe = pandas.read_csv(corpus_path, header=None)
# the first CSV column holds the sentences
X_training = dataframe[0].tolist()
sentences = numpy.array(X_training)
# create bow vocabulary
vectorizer = CountVectorizer()
vectorizer.fit(sentences)  # only the vocabulary is needed, so the count matrix is not built
# save vocabulary to json file
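with open('vocabulary.json', 'w') as vocabFile:
    # cast indices to plain int: they may be NumPy integers, which json cannot serialize
    json.dump({term: int(index) for term, index in vectorizer.vocabulary_.items()}, vocabFile)
print("vocabulary is saved")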
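A saved vocabulary is mostly useful when it is loaded again later. The following is a minimal sketch of that round trip, not part of the original gist: it assumes a toy two-sentence corpus in place of the training CSV and writes to a temporary directory, and the names `restored` and `loaded_vocabulary` are illustrative only. It relies on the `vocabulary` parameter of `CountVectorizer`, which accepts a term-to-index mapping.

```python
import json
import os
import tempfile

from sklearn.feature_extraction.text import CountVectorizer

# toy corpus standing in for the training CSV (hypothetical data)
sentences = ["the cat sat", "the dog barked"]

# create bow vocabulary
vectorizer = CountVectorizer()
vectorizer.fit(sentences)

# save vocabulary to a json file in a temporary directory
vocab_path = os.path.join(tempfile.mkdtemp(), "vocabulary.json")
with open(vocab_path, "w") as vocab_file:
    # int() guards against NumPy integer indices, which json cannot serialize
    json.dump({term: int(index) for term, index in vectorizer.vocabulary_.items()}, vocab_file)

# later: load the vocabulary and rebuild a vectorizer without refitting
with open(vocab_path) as vocab_file:
    loaded_vocabulary = json.load(vocab_file)

restored = CountVectorizer(vocabulary=loaded_vocabulary)

# each column of the result corresponds to one index in the saved vocabulary
bow = restored.transform(["the cat barked"]).toarray()
print(bow)
```

Because the column order comes from the saved mapping, vectors produced by the restored vectorizer line up with those produced at training time.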