Notes on Predicting Word Vectors

@stephenroller, last active October 13, 2015

For the past few months, I've been trying several methods of directly predicting what word belongs in a given syntactic context. For example, if we have:

dog X man with teeth,

then the syntactic context [1] of the X token is:

nsubj(dog)
dobj(man)
pp:with(teeth)

The question, then, is: what word could X be, such that it has these syntactic contexts?

One natural way to approach this is language modeling, but my research is in Distributional Semantics, so I've focused on approaches in this area.

One simple approach has been to use a Syntactic Distributional space (Pado and Erk, 2007). In these spaces, rather than create a distributional space by counting a word's positional neighbors, we count its syntactic neighbors. A syntactic neighbor consists of the syntactic relation together with the word it appears with. Such a distributional space exhibits different properties than a context window space: namely, words similar in syntactic space tend to share selectional preferences, rather than more topical preferences. For example, doctor and hospital will be highly similar in a window space (and they are clearly related), but will be very dissimilar in syntactic space (since the things doctors do and hospitals do are very different).
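
To make this concrete, here's a minimal sketch of building such a syntactic space in Python. The dependency triples, the inverse-relation convention, and the word "bites" are illustrative assumptions, not the exact scheme from the paper:

    from collections import Counter, defaultdict

    # Illustrative input: dependency triples (head, relation, dependent)
    # as they might come out of a dependency parser.
    triples = [
        ("bites", "nsubj", "dog"),
        ("bites", "dobj", "man"),
        ("bites", "pp:with", "teeth"),
    ]

    # Each word's row in the space counts (relation, neighbor) features.
    space = defaultdict(Counter)
    for head, rel, dep in triples:
        space[head][(rel, dep)] += 1            # head sees the dependent through rel
        space[dep][("inv:" + rel, head)] += 1   # dependent sees the head through the inverse relation

    # space["bites"] now holds the contexts nsubj(dog), dobj(man), pp:with(teeth).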

Using these intuitions, I've developed multiple Deep Learning models for the task, guided by one principle: rather than predict the word that appears in the context, we should predict the vector of the word that appears in that context, and then compare it to our lexicon. In this document, I'll describe some of the interesting patterns I've noticed across different models.
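
To make the "predict the vector, then compare to the lexicon" step concrete, here's a minimal nearest-neighbor lookup against an embedding matrix. The toy vocabulary and random vectors are placeholders; in practice the predicted vector would come from one of the models discussed below:

    import numpy as np

    def nearest_words(predicted, embeddings, vocab, k=5):
        """Rank the lexicon by cosine similarity to a predicted vector."""
        emb = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
        pred = predicted / np.linalg.norm(predicted)
        sims = emb @ pred
        best = np.argsort(-sims)[:k]
        return [(vocab[i], float(sims[i])) for i in best]

    # Toy usage with random vectors standing in for a real lexicon.
    rng = np.random.default_rng(0)
    vocab = ["bites", "licks", "feeds", "sees", "greets"]
    embeddings = rng.normal(size=(len(vocab), 50))
    predicted = embeddings[0] + 0.1 * rng.normal(size=50)
    print(nearest_words(predicted, embeddings, vocab, k=3))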

  • Objective function matters a lot
    • MSE fails hard: word vectors are not sparse in any sense, but usually a couple of principal components account for the vast majority of a word's magnitude. MSE seems to do very well at estimating these principal components, but then does horribly on the longer tail.
    • Dot product works, a bit: This does better than MSE, but ends up with a different issue. Vector magnitudes are roughly proportional to the frequency of the word in the original corpus. [2] Highly frequent words like "do" or "have" will have magnitudes much larger than more informative words like "remedied" and "repaired". The result is that your unigram statistics can overwhelm much of what the system learns.
    • Exponential dot product works amazingly well: Use categorical cross-entropy with a softmax over the dot products against the whole vocabulary. This is the objective function the original Word2Vec paper uses (except with Hierarchical Softmax), and it is commonly used in deep learning. It has the advantage of strongly penalizing incorrect answers and greatly rewarding correct ones, but it is stupidly slow.
    • Cosine does pretty well: It doesn't do nearly as well as the exponential dot product, but it has a compute cost similar to the dot product or MSE. I've found this to be the best tradeoff without resorting to Hierarchical Softmax or Negative Sampling (see the loss sketches at the end of these notes).
  • Orthogonal dimensions help, at least initially:
    • This is pretty well known in the classical neural network literature, but it seems to have been completely ignored since the invention of Word2Vec.

    • This creates a strong degree of independence between dimensions, making them easier to learn over.

    • We can find an orthogonal basis by taking the SVD and dropping the V^T term (a numpy sketch appears at the end of these notes):

        M = U*S*V^T
        M' = U*S
      
    • Note that magnitudes, dot products, cosines, and Euclidean distances are all preserved by this transformation, up to numerical stability. In practice, I've found this procedure can be numerically unstable, but it works well enough. Additionally, since the popular embedding matrices (like Word2Vec) are at most a few gigabytes, this only takes a couple of minutes.

    • I've found things generally converge to around the same spot, but this results in much quicker convergence. In my experiments on Lexical Entailment, this preprocessing dramatically helped things.

    • Just remember: if you also need the context matrix, propagate the V^T term over to it as well.

  • Don't standardize dimensions to zero mean and unit variance:
    • Especially if you don't make the dimensions orthogonal!!!
    • It's not enough to know that a word has more of this dimension than the average word: the proportion of this component relative to the others also matters hugely.
    • This is particularly true if you haven't made sure the dimensions have an orthogonal basis: now you've completely destroyed the vector's ability to encode those relative proportions.
    • What's better: Just divide the entire embedding matrix by a single constant, say the maximum absolute entry in the entire matrix. Now everything has the -1 to 1 scale we wanted, but without destroying the relations between components (see the scaling sketch at the end of these notes).
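
Appendix-style sketches follow. First, the four objective functions discussed above, written as rough numpy losses over a predicted vector and the target word's vector; the names and exact formulations here are mine, not the precise objectives used in my models:

    import numpy as np

    def mse_loss(pred, target):
        # Mean squared error: fits the big principal components, misses the tail.
        return np.mean((pred - target) ** 2)

    def dot_loss(pred, target):
        # Maximize the dot product by minimizing its negation; sensitive to magnitude.
        return -pred @ target

    def softmax_xent_loss(pred, target_index, embeddings):
        # "Exponential dot product": categorical cross-entropy over a softmax of
        # dot products with every word in the vocabulary. Accurate but slow.
        scores = embeddings @ pred
        scores = scores - scores.max()              # for numerical stability
        log_probs = scores - np.log(np.exp(scores).sum())
        return -log_probs[target_index]

    def cosine_loss(pred, target):
        # 1 - cosine similarity: magnitude-invariant, cheap to compute.
        return 1.0 - (pred @ target) / (np.linalg.norm(pred) * np.linalg.norm(target))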
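
Second, the orthogonalization trick. This is only a sketch (the function name is mine), and it assumes the embedding matrix fits in memory, which it does for the popular pretrained spaces:

    import numpy as np

    def orthogonalize(M):
        """Rotate an embedding matrix onto an orthogonal basis: M = U S V^T -> M' = U S.
        Norms, dot products, cosines, and Euclidean distances are preserved up to
        numerical error, since dropping V^T only removes a rotation."""
        U, S, Vt = np.linalg.svd(M, full_matrices=False)
        return U * S, Vt      # U * S broadcasts S along the columns, i.e. U @ diag(S)

    # If you also have a context matrix C (as in Word2Vec), rotate it the same way
    # so cross-space dot products stay consistent:
    #     M_prime, Vt = orthogonalize(M)
    #     C_prime = C @ Vt.T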
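
Finally, the single-constant rescaling suggested in place of per-dimension standardization, again just a sketch:

    import numpy as np

    def scale_embeddings(M):
        """Divide the whole matrix by one constant (its largest absolute entry) so
        entries land in [-1, 1] without distorting the proportions between components."""
        return M / np.abs(M).max()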