@Amber0914
Last active September 22, 2018 10:10
from sklearn.feature_extraction.text import CountVectorizer
train_X = [
    "John likes to watch movies",
    "Mary likes movies too",
    "Joe only likes horror movies and action movies",
]
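# Treat every run of word characters as a token. Unlike the default pattern,
# this also keeps single-character words. Tokens are lowercased by default.
vectorizer = CountVectorizer(token_pattern=r'\b\w+\b')
# Learn the vocabulary from the training text and return the document-term matrix.
train_vector = vectorizer.fit_transform(train_X)
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it.
# The vocabulary, in column order: ['action', 'and', 'horror', 'joe', 'john', 'likes',
# 'mary', 'movies', 'only', 'to', 'too', 'watch']
token_set = vectorizer.get_feature_names_out()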
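test_X = ["Jay likes romantic movies"]
# transform() reuses the fitted vocabulary, so words not seen during fit
# ("jay", "romantic") are ignored.
test_vector = vectorizer.transform(test_X)
# The result is a sparse matrix printed as "(row, column) count": column 5 is
# 'likes' and column 7 is 'movies'. Newer SciPy versions may also print a header.
print(test_vector)
'''
(0, 5) 1
(0, 7) 1
'''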
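
To make the counting step easier to follow, here is a minimal pure-Python sketch of the same bag-of-words idea. It is not scikit-learn's implementation; the helper names `tokenize`, `build_vocabulary`, and `vectorize` are hypothetical and exist only for illustration. It assumes the same token pattern as above, lowercasing, and an alphabetically sorted vocabulary.

```python
import re
from collections import Counter

# Same token pattern as the snippet above: any run of word characters.
TOKEN_RE = re.compile(r"\b\w+\b")


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return TOKEN_RE.findall(text.lower())


def build_vocabulary(docs):
    """Map each distinct token to a column index, in alphabetical order."""
    tokens = sorted({tok for doc in docs for tok in tokenize(doc)})
    return {tok: idx for idx, tok in enumerate(tokens)}


def vectorize(doc, vocabulary):
    """Count known tokens in doc; tokens outside the vocabulary are dropped."""
    vector = [0] * len(vocabulary)
    for tok, count in Counter(tokenize(doc)).items():
        if tok in vocabulary:
            vector[vocabulary[tok]] = count
    return vector


train_X = [
    "John likes to watch movies",
    "Mary likes movies too",
    "Joe only likes horror movies and action movies",
]
vocabulary = build_vocabulary(train_X)
test_vec = vectorize("Jay likes romantic movies", vocabulary)
print(test_vec)  # dense counts; only the 'likes' and 'movies' columns are non-zero
```

The dense list has one slot per vocabulary word, whereas the sparse matrix above stores only the non-zero entries, which matters once the vocabulary grows to thousands of words.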