Code for aligning two gensim word2vec models using Procrustes matrix alignment (updated for compatibility with the Gensim 4.0 API). The code is modified from https://gist.github.com/quadrismegistus/09a93e219a6ffc4f216fb85235535faf, which was originally ported from HistWords by William Hamilton: https://github.com/williamleif/histwords
```python
import numpy as np


def smart_procrustes_align_gensim(base_embed, other_embed, words=None):
    """
    Original script: https://gist.github.com/quadrismegistus/09a93e219a6ffc4f216fb85235535faf
    Procrustes-align two gensim word2vec models (to allow comparison of the same word across models).
    Code ported from HistWords <https://github.com/williamleif/histwords> by William Hamilton <wleif@stanford.edu>.

    First, intersect the vocabularies (see `intersection_align_gensim` documentation).
    Then do the alignment on the other_embed model.
    Replace the other_embed model's vectors matrix with the aligned version.
    Return other_embed.
    If `words` is set, intersect the two models' vocabulary with the vocabulary in words
    (see `intersection_align_gensim` documentation).
    """
    # patch by Richard So (https://twitter.com/richardjeanso) (thanks!) to update this code for the new version of gensim
    # base_embed.init_sims(replace=True)
    # other_embed.init_sims(replace=True)

    # make sure vocabulary and indices are aligned
    in_base_embed, in_other_embed = intersection_align_gensim(base_embed, other_embed, words=words)

    # get the (normalized) embedding matrices
    base_vecs = in_base_embed.wv.get_normed_vectors()
    other_vecs = in_other_embed.wv.get_normed_vectors()

    # just a matrix dot product with numpy
    m = other_vecs.T.dot(base_vecs)
    # SVD method from numpy (the third return value is already V transposed)
    u, _, v = np.linalg.svd(m)
    # the orthogonal Procrustes rotation is U.V^T
    ortho = u.dot(v)
    # Replace the original array with the modified one, i.e. multiply the embedding matrix by "ortho"
    other_embed.wv.vectors = (other_embed.wv.vectors).dot(ortho)

    return other_embed


def intersection_align_gensim(m1, m2, words=None):
    """
    Intersect two gensim word2vec models, m1 and m2.
    Only the shared vocabulary between them is kept.
    If 'words' is set (as a list or set), then the vocabulary is intersected with this list as well.
    Indices are re-organized from 0..N in order of descending frequency (= sum of counts from both m1 and m2).
    These indices correspond to the new vectors objects in both gensim models:
        -- so that Row 0 of m1.wv.vectors is for the same word as Row 0 of m2.wv.vectors
        -- you can find the index of any word via the .index_to_key list: model.wv.index_to_key.index(word)
    The key_to_index and index_to_key mappings are also updated for each model.
    """
    # Get the vocab for each model
    vocab_m1 = set(m1.wv.index_to_key)
    vocab_m2 = set(m2.wv.index_to_key)

    # Find the common vocabulary
    common_vocab = vocab_m1 & vocab_m2
    if words:
        common_vocab &= set(words)

    # If no alignment is necessary because the vocabularies are identical...
    if not vocab_m1 - common_vocab and not vocab_m2 - common_vocab:
        return (m1, m2)

    # Otherwise sort by frequency (summed for both models)
    common_vocab = list(common_vocab)
    common_vocab.sort(key=lambda w: m1.wv.get_vecattr(w, "count") + m2.wv.get_vecattr(w, "count"), reverse=True)
    # print(len(common_vocab))

    # Then for each model...
    for m in [m1, m2]:
        # Replace the old vectors array with a new one restricted to the common vocab
        indices = [m.wv.key_to_index[w] for w in common_vocab]
        old_arr = m.wv.vectors
        new_arr = np.array([old_arr[index] for index in indices])
        m.wv.vectors = new_arr

        # Replace the old vocab mappings with new ones built from the common vocab
        new_key_to_index = {}
        new_index_to_key = []
        for new_index, key in enumerate(common_vocab):
            new_key_to_index[key] = new_index
            new_index_to_key.append(key)
        m.wv.key_to_index = new_key_to_index
        m.wv.index_to_key = new_index_to_key

        print(len(m.wv.key_to_index), len(m.wv.vectors))

    return (m1, m2)
```
@shikharsingla

Hi, thanks for the updated code. It aligns two models; could you kindly tell me how to align multiple models so that a word is comparable across the models? So let's say I have five models from 1950-1990 (one per decade), what should I do to make sure one word is comparable across all of them?

@zhicongchen (Author)

> Hi, thanks for the updated code. It aligns two models; could you kindly tell me how to align multiple models so that a word is comparable across the models? So let's say I have five models from 1950-1990 (one per decade), what should I do to make sure one word is comparable across all of them?

Hi, my understanding is to align all five models to the same one, e.g., the model of the earliest decade (the 1950s). Do you think this is a reasonable approach?
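A minimal sketch of that base-model approach, using `smart_procrustes_align_gensim` from above (the `model_19xx` names are hypothetical); note that each call trims the shared vocabulary of both models in place, and chained calls need the `fill_norms` fix discussed further down this thread:

```python
# hypothetical per-decade models, all aligned to the 1950s base
models = {1950: model_1950, 1960: model_1960, 1970: model_1970,
          1980: model_1980, 1990: model_1990}
for year in (1960, 1970, 1980, 1990):
    # after this call, models[year] shares the base model's coordinate system
    models[year] = smart_procrustes_align_gensim(models[1950], models[year])
```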

@shikharsingla

Not sure; then the 1970 model won't be comparable to the 1960 model, right?

@zhicongchen (Author) commented Jun 22, 2021

> Not sure; then the 1970 model won't be comparable to the 1960 model, right?

Probably you are right.

I just noticed that in HistWords the authors did sequential alignment between every pair of consecutive time periods. Here is the code: https://github.com/williamleif/histwords/blob/master/vecanalysis/seq_procrustes.py
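For reference, a sketch of that sequential scheme using the function above (the `models` list is hypothetical, ordered oldest to newest; the `fill_norms` caveat discussed below applies here too, since the calls are chained):

```python
# models: hypothetical list of per-decade Word2Vec models, earliest first
aligned = [models[0]]
for current in models[1:]:
    # align each decade to the already-aligned model of the previous decade
    aligned.append(smart_procrustes_align_gensim(aligned[-1], current))
```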

@zhicongchen (Author)

> Hi @zhicongchen,
>
> Is there a way of verifying that alignment has taken place, as a sanity check? Should one of the vector elements in the gensim model be identical in both models once the alignment has taken place?
>
> Also, does this alignment now allow two of the same words in both models to be compared? I.e., can a cosine similarity be carried out between word1 from model1 and word2 from model2 once the models have been returned?
>
> Thank you

Hi @amjass12,

This is quite a good question indeed. Yet I'm not sure about a general way of verifying it currently. I think it may be more realistic to validate the results case by case, i.e., according to the specific research question of the study.

For the second question, I'm not sure either. However, I do see the authors calculate the cosine distance between word1 from model1 and word1 from model2. Please find the measurement of semantic displacement in section 2.4 of the HistWords paper: https://nlp.stanford.edu/projects/histwords/
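That semantic-displacement measure is just the cosine distance between one word's vectors in two aligned models; a minimal sketch (the helper name is illustrative, not from the paper):

```python
import numpy as np

def semantic_displacement(m1, m2, word):
    # cosine distance of the same word across two *aligned* models
    v1, v2 = m1.wv[word], m2.wv[word]
    cosine = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
    return 1.0 - cosine
```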

@abbassix

> Hi @zhicongchen,
>
> Is there a way of verifying that alignment has taken place, as a sanity check? Should one of the vector elements in the gensim model be identical in both models once the alignment has taken place?
>
> Also, does this alignment now allow two of the same words in both models to be compared? I.e., can a cosine similarity be carried out between word1 from model1 and word2 from model2 once the models have been returned?
>
> Thank you

You can randomly split the corpus into two corpora and train one model on each, i.e., model_1 and model_2. Then, for the most common words, you would expect the top result of `model_1.wv.similar_by_vector(model_2.wv[word_i])` to be word_i itself! That is what I did.
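A sketch of that check (the `sentences` corpus and the probe words are hypothetical):

```python
import random
from gensim.models import Word2Vec

# split one corpus of tokenized sentences in half and train a model on each
random.shuffle(sentences)
half = len(sentences) // 2
model_1 = Word2Vec(sentences[:half])
model_2 = Word2Vec(sentences[half:])
model_2 = smart_procrustes_align_gensim(model_1, model_2)

for word in ["the", "and", "of"]:  # frequent words should be stable
    top_word, sim = model_1.wv.similar_by_vector(model_2.wv[word], topn=1)[0]
    print(word, "->", top_word, round(sim, 3))  # expect the word itself back
```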

@amacanovic commented Jul 30, 2021

Hi @zhicongchen,

Thank you for sharing this code!

Above you mentioned the sequential alignment from the original paper, but I was unable to reproduce it with this script.
Let's assume I have 3 models, from periods 1, 2 and 3.

I start by aligning model 2 with model 1:

`smart_procrustes_align_gensim(model_1, model_2)`

This aligns the vocabularies between these two models and changes the vocab size and the vector dimensions appropriately.

Then, I wanted to align model 3 with model 2:

`smart_procrustes_align_gensim(model_2, model_3)`

But when trying to align these two models, I'd run into an issue with get_normed_vectors:

ValueError: operands could not be broadcast together with shapes (11496,100) (11780,1)

I believe the problem came from the fact that the dimensions of the vectors change after the models go through intersection_align_gensim, but the normed vectors remain in the old dimensions, as before the intersection.

So in my case, the vocab of model 2 was reduced from 11780 to 11496 after intersection with model 3, but the normed vectors that were initialized in the previous round of intersection (with model 1) still matched the original vocab size.

I believe I managed to solve this by clearing the normed vectors before pulling them with get_normed_vectors(), using fill_norms(force=True) as in the code below:

```python
# make sure vocabulary and indices are aligned
in_base_embed, in_other_embed = intersection_align_gensim(base_embed, other_embed, words=words)

# re-filling the normed vectors
in_base_embed.wv.fill_norms(force=True)
in_other_embed.wv.fill_norms(force=True)

# get the (normalized) embedding matrices
base_vecs = in_base_embed.wv.get_normed_vectors()
other_vecs = in_other_embed.wv.get_normed_vectors()
```

Does this sound legitimate? I get meaningful results from models aligned in this manner, but thought I'd check whether this step makes sense.

Thank you!

@verazuo commented Sep 1, 2021

> # re-filling the normed vectors
> in_base_embed.wv.fill_norms(force=True)
> in_other_embed.wv.fill_norms(force=True)

I met the same error. Totally agree with your solution. :D

@ajalvero commented Oct 9, 2021

I'm trying to run the code, but the output from smart_procrustes_align_gensim is the "other_embed" (i.e. "other_embed == aligned_embed" returns True). Is anyone else having this issue?

Here is the code I was using (I'm not super familiar with gensim, so I might have an issue elsewhere):

```python
import pandas as pd
import gensim
from gensim.models import Word2Vec
import numpy as np

base_model = gensim.models.Word2Vec(list_of_tokens_1)
other_model = gensim.models.Word2Vec(list_of_tokens_2)

aligned_mod = smart_procrustes_align_gensim(base_model, other_model)

# Line below returns True
aligned_mod == other_model
```

@krkryger

> I'm trying to run the code, but the output from smart_procrustes_align_gensim is the "other_embed" (i.e. "other_embed == aligned_embed" returns True). Is anyone else having this issue?

Hey @ajalvero, I have the same issue. Did you find a solution?

@lkluo commented Aug 29, 2022

@krkryger `other_embed.wv.vectors = (other_embed.wv.vectors).dot(ortho)` updates other_embed too, since the function modifies the model in place. This works for me:

```python
import copy

# at the end of smart_procrustes_align_gensim, rotate a deep copy instead:
other_embed_copy = copy.deepcopy(other_embed)
other_embed_copy.wv.vectors = (other_embed.wv.vectors).dot(ortho)
return other_embed_copy
```

@shafqatvirk

How can we convert the aligned models to the format required by the visualization scripts? There are scripts like 'closest_over_time_with_anns.py' to plot a given word across different time spans, and they load embeddings that need to be in a specific format, e.g. '1910-w.npy' and '1910-vocab.pkl'. Any suggestions on this?
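A hedged sketch of one way to produce those files, assuming the HistWords loaders expect a numpy embedding matrix in `<year>-w.npy` plus a pickled word list in `<year>-vocab.pkl` (worth verifying against representations/embedding.py in the HistWords repo):

```python
import pickle
import numpy as np

def save_histwords_format(model, year, out_dir="."):
    # row i of the saved matrix corresponds to word i of the saved vocab list
    np.save(f"{out_dir}/{year}-w.npy", model.wv.vectors)
    with open(f"{out_dir}/{year}-vocab.pkl", "wb") as f:
        pickle.dump(list(model.wv.index_to_key), f)
```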

@estebarb

Which gensim version was this designed for? It is not working for me on 4.3, and 4.0 doesn't work with Python 3.10.

@FuzzyAaron

> # re-filling the normed vectors
> in_base_embed.wv.fill_norms(force=True)
> in_other_embed.wv.fill_norms(force=True)
>
> # get the (normalized) embedding matrices
> base_vecs = in_base_embed.wv.get_normed_vectors()
> other_vecs = in_other_embed.wv.get_normed_vectors()

I met the same issue. Any guidance would be appreciated.

@estebarb

I ended up using https://github.com/theochem/procrustes instead. Something like this:

```python
from procrustes import rotational
import numpy as np
from gensim.models import KeyedVectors

# CURRENT_WORDS, base_words, base_embeddings and current_embeddings come
# from the surrounding script (word sets and word -> vector mappings)
common_words = sorted(CURRENT_WORDS.intersection(base_words))
print(f"    Common words: {len(common_words)}")

common_words_embeddings_base = np.array([base_embeddings[word] for word in common_words])
common_words_embeddings_current = np.array([current_embeddings[word] for word in common_words])

# find the rotation matrix using orthogonal procrustes
rotation_matrix = rotational(common_words_embeddings_base, common_words_embeddings_current)

# apply the rotation matrix to the embeddings of the old words
base_words_embeddings_rotated = rotation_matrix.new_a

rotated_model = KeyedVectors(300)
rotated_model.add_vectors(common_words, base_words_embeddings_rotated)
rotated_model.save("aligned.kv")

# Now release the memory and load the aligned vectors again
```
@FuzzyAaron
Copy link

@estebarb Thank you so much for sharing!
