@rloredo
Last active September 26, 2021 11:50
How to use gensim doc2vec models
import gensim
# Split train/test if necessary: here the last 500 docs are held out for testing
end = -500
# docs is a pd.Series of token lists, one list per document.
# Normalize tokens beforehand (lowercase, strip accents, etc.)
# Doc2Vec needs tagged docs for training; each doc is tagged with its integer index
train = [gensim.models.doc2vec.TaggedDocument(d, [i]) for i, d in enumerate(docs.values[:end])]
# Test docs stay as plain token lists (a NumPy array of lists)
test = docs.values[end:]
# Train model
# dm=0 selects the distributed bag-of-words (PV-DBOW) variant; check the gensim docs for the other params
model = gensim.models.doc2vec.Doc2Vec(vector_size=70, min_count=20, epochs=40, dm=0)
model.build_vocab(train)
model.train(train, total_examples=model.corpus_count, epochs=model.epochs)
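# Usage
# Search for similar docs (cosine similarity) using a vector inferred for an unseen doc.
# test is already a NumPy array, so index it directly (no .values)
model.docvecs.most_similar([model.infer_vector(test[0])], topn=100)
# You can use the vectors as classifier inputs
# (model.docvecs is named model.dv in gensim 4.0 and later)
model.docvecs[i]  # where i is the tag (index) of a training doc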
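
The snippet above says to normalize tokens (lowercase, strip accents) before training but does not show how. A minimal sketch of that step follows, using only the standard library. `normalize_tokens` is a hypothetical helper name, not part of gensim, and this is one possible approach rather than the author's own preprocessing.

```python
import unicodedata


def normalize_tokens(tokens):
    """Lowercase each token and strip accents (hypothetical helper, not a gensim API)."""
    normalized = []
    for token in tokens:
        # Decompose accented characters into base character + combining mark
        decomposed = unicodedata.normalize("NFKD", token.lower())
        # Drop the combining marks, keeping only the base characters
        stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
        # Skip tokens that end up empty after stripping
        if stripped:
            normalized.append(stripped)
    return normalized


print(normalize_tokens(["Café", "NIÑO", "model"]))  # ['cafe', 'nino', 'model']
```

Assuming `docs` is the pd.Series of token lists described above, this could be applied with `docs = docs.apply(normalize_tokens)` before building the tagged documents. If the input is raw strings rather than token lists, `gensim.utils.simple_preprocess(text, deacc=True)` may be a simpler alternative.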