@erickrf
Last active December 18, 2023 09:13
Read embeddings file in text format and convert to numpy
import argparse

import numpy as np

parser = argparse.ArgumentParser()
parser.add_argument('input', help='Single embedding file')
parser.add_argument('output', help='Output basename without extension')
args = parser.parse_args()

embeddings_file = args.output + '.npy'
vocabulary_file = args.output + '.txt'

words = []
vectors = []

# Each line holds a word followed by its whitespace-separated vector values.
with open(args.input, 'rb') as f:
    for line in f:
        fields = line.split()
        word = fields[0].decode('utf-8')
        # np.float was removed in NumPy 1.24; np.float64 is the same type.
        vector = np.fromiter((float(x) for x in fields[1:]),
                             dtype=np.float64)
        words.append(word)
        vectors.append(vector)

# Row i of the matrix is the vector for words[i].
matrix = np.array(vectors)
np.save(embeddings_file, matrix)

# Write the vocabulary as UTF-8, one word per line.
text = '\n'.join(words)
with open(vocabulary_file, 'wb') as f:
    f.write(text.encode('utf-8'))
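The script writes two files from the output basename: a `.npy` matrix and a `.txt` vocabulary with one word per line, in the same order as the matrix rows. A minimal sketch of reading them back might look like the following. The function name `load_embeddings` and the `word_to_index` dictionary are hypothetical helpers introduced here for illustration; they are not part of the gist.

```python
import numpy as np


def load_embeddings(basename):
    """Load the .npy matrix and .txt vocabulary saved under `basename`.

    Hypothetical helper: assumes the vocabulary file has one word per line
    and that line i corresponds to row i of the matrix.
    """
    matrix = np.load(basename + '.npy')
    with open(basename + '.txt', encoding='utf-8') as f:
        # The vocabulary was written with '\n'.join(words), so splitting
        # on '\n' recovers the original list.
        words = f.read().split('\n')
    if len(words) != matrix.shape[0]:
        raise ValueError('vocabulary size does not match number of vectors')
    # Map each word to its row index for quick vector lookup.
    word_to_index = {word: i for i, word in enumerate(words)}
    return words, matrix, word_to_index
```

With an output basename of, say, `emb`, calling `words, matrix, word_to_index = load_embeddings('emb')` would then let you fetch a vector with `matrix[word_to_index[some_word]]`.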

phze22 commented Dec 18, 2023

np.float was deprecated in NumPy 1.20 (https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations) and removed in 1.24; it can be replaced with float or np.float64.
Otherwise a useful script, thanks.
