
lgalke/all_but_the_top.py

Last active February 22, 2022 23:46
Word Embedding Postprocessing: All but the top
 """ All-but-the-Top: Simple and Effective Postprocessing for Word Representations Paper: https://arxiv.org/abs/1702.01417 Last Updated: Fri 15 Nov 2019 11:47:00 AM CET **Prior version had serious issues, please excuse any inconveniences.** """ import numpy as np from sklearn.decomposition import PCA def all_but_the_top(v, D): """ Arguments: :v: word vectors of shape (n_words, n_dimensions) :D: number of principal components to subtract """ # 1. Subtract mean vector v_tilde = v - np.mean(v, axis=0) # 2. Compute the first `D` principal components # on centered embedding vectors u = PCA(n_components=D).fit(v_tilde).components_ # [D, emb_size] # Subtract first `D` principal components # [vocab_size, emb_size] @ [emb_size, D] @ [D, emb_size] -> [vocab_size, emb_size] return v_tilde - (v @ u.T @ u)

aerinkim commented Aug 12, 2018

Hey, this implementation is wrong. You shouldn't use `fit_transform`. You need `U`, the singular vectors themselves.
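The distinction, sketched on dummy data: `fit_transform` returns each row's coordinates in principal-component space, whereas `components_` holds the principal axes (the right singular vectors of the centered data) that the subtraction needs.

```python
import numpy as np
from sklearn.decomposition import PCA

x = np.random.randn(1000, 300)
pca = PCA(n_components=3).fit(x)

coords = pca.transform(x)  # shape (1000, 3): coordinates of each point in PC space
axes = pca.components_     # shape (3, 300): the principal axes themselves

# The postprocessing needs `axes`, to remove each vector's projection onto them:
x_deflated = x - x @ axes.T @ axes
```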

gojomo commented Nov 14, 2019

Using this all-but-the-top transformation didn't result in the expected improvement on a word-vector evaluation. (Specifically, `questions-words.txt` analogies-correct on 'GoogleNews' top-200k words.) In fact, it drove accuracy down from 75.37% to 59.45%.

Using instead the all-but-the-top implementation at https://github.com/s1998/All-but-the-top/blob/e0c7d758b495ad55868d9a14ecd31df86b79e4d3/src/embeddings_processor.py#L4 slightly improved accuracy, as would be expected from the paper's claims, to 75.79%.

So: more evidence this implementation is off.
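A sketch of how such an evaluation might be run with gensim; the file paths, the `limit=200000` vocabulary cut, and `D=3` for 300-dimensional vectors are assumptions, not gojomo's exact script:

```python
from gensim.models import KeyedVectors

# Hypothetical local paths; limit=200000 mirrors the top-200k-words setup above.
kv = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True, limit=200_000)

score_before, _ = kv.evaluate_word_analogies("questions-words.txt")

kv.vectors = all_but_the_top(kv.vectors, D=3)  # rule of thumb: 300 dims / 100
kv.fill_norms(force=True)  # recompute cached vector norms after editing in place
score_after, _ = kv.evaluate_word_analogies("questions-words.txt")
print(f"analogy accuracy: {score_before:.2%} -> {score_after:.2%}")
```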

lgalke commented Nov 14, 2019

@gojomo thanks for pointing that out! I will try to fix it as soon as possible. Please excuse the confusion.

lgalke commented Nov 14, 2019 • edited

I adapted the code to match the linked implementation. The changes are also ported back into vec4ir.

gojomo commented Nov 15, 2019 • edited

Thanks - trying this version (that ends `return v - (v_tilde @ u.T @ u)`) has the expected behavior in my evaluations!

lgalke commented Nov 15, 2019 • edited

@gojomo I just double-checked against the paper itself, and it should be `v_tilde - (v @ u.T @ u)` instead of `v - (v_tilde @...`.

Applying PCA to the centered or non-centered versions should not make a difference. The important thing is to subtract the mean from the embeddings to match the paper. `v_tilde` holds the centered version in this gist, so we now return `embeddings - mean - projection onto the first D principal components`.
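A quick numerical check of both points (sklearn's `PCA` centers the data internally, so the fitted axes are identical either way; the two return variants discussed above differ only by a constant offset involving the mean):

```python
import numpy as np
from sklearn.decomposition import PCA

x = np.random.randn(5000, 50)
mu = x.mean(axis=0)
x_tilde = x - mu

# Fitting on raw vs. centered data yields the same axes (PCA centers internally).
u1 = PCA(n_components=2, svd_solver="full").fit(x).components_
u2 = PCA(n_components=2, svd_solver="full").fit(x_tilde).components_
assert np.allclose(u1, u2)

a = x_tilde - x @ u1.T @ u1  # paper / current gist
b = x - x_tilde @ u1.T @ u1  # earlier variant
# The variants differ by a constant row offset: b - a == mu + mu @ u1.T @ u1
assert np.allclose(b - a, mu + mu @ u1.T @ u1)
```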

@s1998: I think this last point is also not considered in your implementation. In line 11 you do not subtract from the centered version. Am I missing something?

s1998 commented Nov 18, 2019

@lgalke You are correct, it should be the mean-centered embeddings that I subtract from. I'll fix that, thanks.