@rjurney
Created August 23, 2019 19:53
Pad Word2Vec Posts with the Min/Max of the Entire Corpus
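from math import ceil

import numpy as np

# Assumes encoded_docs (a list of posts, each a list of Word2Vec vectors)
# and MAX_LENGTH (the target number of vectors per post) are defined earlier.
padded_posts = []
for post in encoded_docs:
    # Pad short posts with alternating max/min vectors.
    # Note: the min and max are taken element-wise over this post's own vectors.
    if len(post) < MAX_LENGTH:
        pointwise_min = np.minimum.reduce(post)
        pointwise_max = np.maximum.reduce(post)
        padding = [pointwise_max, pointwise_min]
        # Parenthesize the difference before halving: each repeat adds two vectors
        post += padding * ceil((MAX_LENGTH - len(post)) / 2.0)
    # Shorten long posts, or padded posts that overshot MAX_LENGTH by one
    # vector because padding is added in pairs
    if len(post) > MAX_LENGTH:
        post = post[:MAX_LENGTH]
    padded_posts.append(post)

# Verify that every post now has exactly MAX_LENGTH vectors
assert min(len(post) for post in padded_posts) == MAX_LENGTH
assert max(len(post) for post in padded_posts) == MAX_LENGTH

# Drop the reference to the original list; padded_posts now holds the data
del encoded_docs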
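
To check the padding arithmetic on something small, here is a minimal sketch of the same idea wrapped in a function and run on a toy post. The names `pad_post` and `toy_post` are hypothetical and not part of the original gist, and the 2-dimensional vectors stand in for real Word2Vec embeddings. It assumes every post is non-empty, since `np.minimum.reduce` raises on an empty sequence.

```python
from math import ceil

import numpy as np


def pad_post(post, max_length):
    """Return a copy of `post` with exactly `max_length` vectors.

    Hypothetical helper: short posts are padded with alternating
    element-wise max/min vectors of the post itself, long posts are cut.
    Assumes `post` is a non-empty list of equal-length NumPy vectors.
    """
    post = list(post)  # shallow copy so the caller's list is not extended
    if len(post) < max_length:
        pointwise_min = np.minimum.reduce(post)
        pointwise_max = np.maximum.reduce(post)
        # Each repeat adds two vectors, so round the number of pairs up
        pairs_needed = ceil((max_length - len(post)) / 2.0)
        post += [pointwise_max, pointwise_min] * pairs_needed
    # Trims long posts and the one-vector overshoot from pairwise padding
    return post[:max_length]


# Toy post: three 2-dimensional "word vectors"
toy_post = [np.array([1.0, -2.0]), np.array([0.5, 4.0]), np.array([-3.0, 0.0])]
padded = pad_post(toy_post, 6)

# Three vectors are missing, so two max/min pairs are appended (7 total)
# and the result is trimmed back to 6.
# padded[3] is the element-wise max [1.0, 4.0]
# padded[4] is the element-wise min [-3.0, -2.0]
```

Returning a new list instead of extending the input keeps the original posts intact, which may matter if `encoded_docs` is still needed elsewhere.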