@dav009
Created February 19, 2015 11:32
import gensim
import codecs
from gensim.models import Word2Vec
import json


def export_to_file(path_to_model, output_file):
    output = codecs.open(output_file, 'w', 'utf-8')
    model = Word2Vec.load_word2vec_format(path_to_model, binary=True)
    vocab = model.vocab
    for mid in vocab:
        #print(model[mid])
        print(mid)
        vector = list()
        for dimension in model[mid]:
            vector.append(str(dimension))
        #line = { "mid": mid, "vector": vector }
        vector_str = ",".join(vector)
        line = mid + "\t" + vector_str
        #line = json.dumps(line)
        output.write(line + "\n")
    output.close()
@Franck-Dernoncourt

Works great, thanks! With GoogleNews-vectors-negative300.bin.gz it requires around 5.1 GB of RAM, the output file is 9.5 GB, and it takes ~30 minutes on a decent CPU.
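
Since the output is that large, it is probably easiest to read it back one line at a time rather than loading it whole. A minimal sketch, assuming the format the script writes (the word, a tab, then comma-separated components); `iter_vectors` is a hypothetical helper name, not part of the gist:

```python
def iter_vectors(lines):
    """Yield (word, [float, ...]) pairs from lines of 'word<TAB>v1,v2,...'."""
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        # Split on the last tab: the vector part never contains one.
        word, vector_str = line.rsplit("\t", 1)
        yield word, [float(x) for x in vector_str.split(",")]
```

Passing an open file object (opened with `encoding="utf-8"`) as `lines` keeps only one line in memory at a time.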

@jkkummerfeld

Note: gensim.models.Word2Vec.load_word2vec_format has been deprecated. Switching "Word2Vec" to "KeyedVectors" works.

(I got here from https://github.com/sriniiyer/nl2sql which references this code).
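
For reference, a rough sketch of the script with that switch applied. Some assumptions here go beyond the gist: gensim is imported inside the function so the helpers can be used without it, the word list is taken from `index_to_key` when the model has it (gensim 4.x) and from `vocab` otherwise, and `format_line` and `write_vectors` are hypothetical helper names.

```python
import io


def format_line(word, vector):
    """Return one output line: the word, a tab, then comma-separated components."""
    return word + "\t" + ",".join(str(x) for x in vector)


def write_vectors(words, lookup, out):
    """Write one line per word to the file-like object `out`."""
    for word in words:
        out.write(format_line(word, lookup(word)) + "\n")


def export_to_file(path_to_model, output_file):
    # Imported here so the helpers above work without gensim installed.
    from gensim.models import KeyedVectors

    model = KeyedVectors.load_word2vec_format(path_to_model, binary=True)
    # Assumption: gensim 4.x exposes index_to_key; older releases expose vocab.
    words = getattr(model, "index_to_key", None)
    if words is None:
        words = model.vocab
    with io.open(output_file, "w", encoding="utf-8") as out:
        write_vectors(words, model.__getitem__, out)
```

The `with` block closes the output file even if the export fails partway through.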
