@purva91
Last active May 13, 2022 18:01
Doc2Vec
from nltk.tokenize import word_tokenize
# Tokenize each document (`sentences` is assumed to be a list of strings defined earlier)
tokenized_sent = []
for s in sentences:
    tokenized_sent.append(word_tokenize(s.lower()))
tokenized_sent
# Import Doc2Vec and tag each tokenized document with its index
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
tagged_data = [TaggedDocument(d, [i]) for i, d in enumerate(tokenized_sent)]
tagged_data
## Train doc2vec model
model = Doc2Vec(tagged_data, vector_size = 20, window = 2, min_count = 1, epochs = 100)
'''
vector_size = Dimensionality of the feature vectors.
window = The maximum distance between the current and predicted word within a sentence.
min_count = Ignores all words with total frequency lower than this.
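epochs = Number of iterations (passes) over the corpus.
alpha = The initial learning rate (not set above, so the gensim default applies).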
'''
## Print model vocabulary
model.wv.vocab  # gensim < 4.0 only; in gensim 4.0+ use model.wv.key_to_index
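## Infer a vector for a new document and find the most similar training documents
test_doc = word_tokenize("I had pizza and pasta".lower())
test_doc_vector = model.infer_vector(test_doc)
# Note: gensim 4.0+ renames model.docvecs to model.dv (docvecs remains as a deprecated alias)
model.docvecs.most_similar(positive = [test_doc_vector])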
'''
positive = List of document tags or vectors that contribute positively to the similarity.
'''
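The most_similar call above returns a list of (tag, similarity) pairs. Because the tags were assigned with enumerate, each tag is the index of a training sentence, so the results can be mapped back to text. A minimal sketch under that assumption; the helper name `similar_sentences` and the sample values are hypothetical placeholders, not output from the model above:

```python
def similar_sentences(similar, sentences):
    """Pair each (tag, score) result with the sentence that the tag indexes."""
    return [(sentences[tag], score) for tag, score in similar]


# Hypothetical stand-ins for `sentences` and for the list returned by most_similar
example_sentences = ["I ate dinner.", "We had a three-course meal.", "He loves pizza."]
example_similar = [(2, 0.9), (0, 0.5)]

for sentence, score in similar_sentences(example_similar, example_sentences):
    print(f"{score:.2f}  {sentence}")
```

With the real model, the same helper would be called with the result of most_similar and the original `sentences` list.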
@BoianAlexandrov

model.wv.vocab (line 12 of 3_train_save_load.py) has to be replaced by model.wv.key_to_index in gensim 4.0.0.
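For code that has to run on both sides of that change, one option is to check for the newer attribute first and fall back to the older one. A minimal sketch; `vocabulary_words` is a hypothetical helper name, and the SimpleNamespace objects only stand in for `model.wv`:

```python
from types import SimpleNamespace


def vocabulary_words(keyed_vectors):
    """Return the vocabulary words of a KeyedVectors-like object.

    Prefers `key_to_index` (gensim 4.0+) and falls back to `vocab` (older releases).
    """
    mapping = getattr(keyed_vectors, "key_to_index", None)
    if mapping is None:
        mapping = keyed_vectors.vocab
    return list(mapping)


# Stand-ins that mimic the two attribute layouts (not real gensim objects)
new_style_wv = SimpleNamespace(key_to_index={"pizza": 0, "pasta": 1})
old_style_wv = SimpleNamespace(vocab={"pizza": object(), "pasta": object()})

print(vocabulary_words(new_style_wv))
print(vocabulary_words(old_style_wv))
```

In the gist this would be called as vocabulary_words(model.wv).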
