Last active
September 26, 2021 11:50
How to use gensim doc2vec models
import gensim

# docs is a pd.Series of token lists, one list per document.
# Normalize the tokens beforehand (lowercase, strip accents, etc.).

# Split train/test if necessary: hold out the last 500 documents
end = -500
# Doc2Vec needs tagged documents for training
train = [gensim.models.doc2vec.TaggedDocument(d, [i]) for i, d in enumerate(docs.values[:end])]
test = docs.values[end:]  # NumPy array of token lists

# Train model
# Check the gensim docs for the meaning of the params (dm=0 selects PV-DBOW)
model = gensim.models.doc2vec.Doc2Vec(vector_size=70, min_count=20, epochs=40, dm=0)
model.build_vocab(train)
model.train(train, total_examples=model.corpus_count, epochs=model.epochs)

# Usage
# Search similar docs (using cosine similarity)
# gensim >= 4.0 exposes the document vectors as model.dv; in gensim 3.x use model.docvecs
model.dv.most_similar([model.infer_vector(test[0])], topn=100)

# You can use the vectors as classifier inputs
model.dv[i]  # where i is the tag (index) of a training document
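
The snippet above assumes the tokens were already normalized (lowercased, accents stripped). Below is a minimal sketch of one way to do that with only the standard library. The helper name `normalize_tokens` is hypothetical and not part of gensim, and dropping tokens that end up empty is an assumption of this sketch.

```python
import unicodedata


def normalize_tokens(tokens):
    """Lowercase each token and strip accents (hypothetical helper)."""
    normalized = []
    for token in tokens:
        # NFKD splits an accented character into a base letter plus combining marks
        decomposed = unicodedata.normalize("NFKD", token.lower())
        # Keep only the characters that are not combining marks
        stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
        if stripped:  # assumption: tokens that end up empty are dropped
            normalized.append(stripped)
    return normalized


print(normalize_tokens(["Café", "NIÑO", "model"]))  # ['cafe', 'nino', 'model']
```

With `docs` as the pd.Series of token lists, this could be applied as `docs = docs.apply(normalize_tokens)` before building `train`. Normalize the documents passed to `infer_vector` the same way.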