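The snippet tokenizes a list of raw documents called sentences, which is never defined here; a minimal placeholder (the variable name comes from the loop below, the example strings are purely illustrative):

# Example corpus (illustrative only) expected by the tokenization loop below
sentences = ["I love machine learning",
             "I love coding in Python",
             "I had pizza and pasta for dinner"]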
from nltk.tokenize import word_tokenize

# Tokenization of each document (lower-cased so it matches the inference step below)
tokenized_sent = []
for s in sentences:
    tokenized_sent.append(word_tokenize(s.lower()))
# Import Doc2Vec and wrap each tokenized document in a TaggedDocument with a unique integer tag
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
tagged_data = [TaggedDocument(d, [i]) for i, d in enumerate(tokenized_sent)]
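Each entry pairs a token list with a one-element list holding its integer tag; inspecting the first entry is a quick sanity check (the exact output depends on whatever corpus is used above):

# Peek at one training example: TaggedDocument(words=[...tokens...], tags=[0])
print(tagged_data[0])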
## Train doc2vec model
model = Doc2Vec(tagged_data, vector_size = 20, window = 2, min_count = 1, epochs = 100)
vector_size = Dimensionality of the feature vectors.
window = The maximum distance between the current and predicted word within a sentence.
min_count = Ignores all words with total frequency lower than this.
alpha = The initial learning rate (not passed in the call above, so the default is used; see the two-step sketch below).
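The constructor call above builds the vocabulary and trains in one shot. An equivalent two-step form makes epochs and alpha explicit (a sketch using the same hyperparameters; the alpha value shown is gensim's documented default, not taken from the original snippet):

# Two-step alternative: build the vocabulary first, then train explicitly
model = Doc2Vec(vector_size = 20, window = 2, min_count = 1, alpha = 0.025)
model.build_vocab(tagged_data)
model.train(tagged_data, total_examples = model.corpus_count, epochs = 100)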
## Print model vocabulary
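A minimal way to list the learned vocabulary (model.wv.vocab is the gensim 3.x attribute; in gensim 4+ it is model.wv.key_to_index):

# List every word the model has learned
print(list(model.wv.vocab.keys()))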
## Infer a vector for an unseen document and query for the most similar training documents
test_doc = word_tokenize("I had pizza and pasta".lower())
test_doc_vector = model.infer_vector(test_doc)
similar_docs = model.docvecs.most_similar(positive = [test_doc_vector])  # model.dv in gensim 4+
positive = List of document vectors (or tags) that contribute positively to the similarity query.
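most_similar returns (tag, cosine similarity) pairs, and the tags here are the integer indices assigned by enumerate above, so they can be mapped back to the original documents (a sketch continuing the variables defined earlier):

# Show each match together with the document it refers to
for tag, score in similar_docs:
    print(round(score, 3), sentences[tag])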