@rjurney
Last active October 22, 2019 17:13
Encoding tokenized text with gensim.models.Word2Vec
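import os

from gensim.models import Word2Vec

# Assumed to be defined earlier: `documents` (a list of token lists),
# EMBEDDING_SIZE (int) and NUM_CORES (int).
model_path = 'models/word2vec.model'

# Load the Word2Vec model if it exists; otherwise train and save it
if os.path.exists(model_path):
    w2v_model = Word2Vec.load(model_path)
else:
    w2v_model = Word2Vec(
        documents,
        size=EMBEDDING_SIZE,  # renamed to vector_size in gensim 4.0+
        min_count=1,
        window=5,
        workers=NUM_CORES,
        seed=1337
    )
    # Create the models/ directory if needed so save() does not fail
    os.makedirs(os.path.dirname(model_path), exist_ok=True)
    w2v_model.save(model_path)

# Print the words most similar to 'program' as a sanity check
print(w2v_model.wv.most_similar(positive=['program']))

# Encode the documents using the embedding. Indexing wv raises KeyError for a
# token the model has not seen, which can happen when a saved model is loaded.
encoded_docs = [[w2v_model.wv[word] for word in post] for post in documents]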
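
The encoding step above assumes every token is in the model's vocabulary, which holds when the model was just trained on `documents` with `min_count=1` but not necessarily when it was loaded from disk. A minimal sketch of a lookup that skips unknown tokens follows. The helper name `encode_documents` and the toy data are hypothetical, and a plain dict stands in for `w2v_model.wv`, which supports the same `in` and `[]` operations.

```python
def encode_documents(documents, vectors):
    """Map each token to its vector, skipping tokens missing from `vectors`.

    `vectors` is anything supporting `token in vectors` and `vectors[token]`,
    such as gensim's `w2v_model.wv` or, as here, a plain dict.
    """
    return [
        [vectors[token] for token in doc if token in vectors]
        for doc in documents
    ]


# Toy stand-in for w2v_model.wv (hypothetical 2-dimensional vectors)
toy_vectors = {
    "python": [0.1, 0.2],
    "program": [0.3, 0.4],
}
# The token "unseen" is deliberately absent from toy_vectors
toy_documents = [["python", "program"], ["unseen", "program"]]

encoded = encode_documents(toy_documents, toy_vectors)
print([len(doc) for doc in encoded])
```

Skipping unknown tokens shortens the affected documents; substituting a zero vector instead would preserve their length, if downstream code needs that.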