@Yevgnen
Created December 13, 2016 10:09

Articles

2014 - Word2vec Explained: Deriving Mikolov et al.'s Negative-Sampling Word-Embedding Method

  • State “DONE” from “TODO” [2016-12-13 Tue 09:34]

cite:goldberg-2014-word2-explain

Notes for page 2

Assumption

  1. The words and the contexts come from distinct vocabularies, so that the vector associated with the word dog is different from the vector associated with the context dog.
  2. Maximizing the objective shown below will result in good embeddings \(v_w\), \(∀ w ∈ V\), in the sense that similar words will have similar vectors.
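
For reference, here is the objective in question as I transcribe it from the paper, maximized over the set \(D\) of observed word-context pairs:

\[
\arg\max_θ \sum_{(w,c) ∈ D} \log \frac{1}{1 + e^{-v_c · v_w}}
\]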

Notes for page 3

  1. To prevent all the vectors from having the same value, one way is to present the model with some \((w, c)\) pairs for which \(p(D=1| w, c; θ)\) must be low, i.e. pairs which are not in the data. This is achieved by negative sampling.
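
For reference, the negative-sampling objective from the paper as I transcribe it, where \(D'\) is the set of randomly generated negative pairs and \(σ(x) = 1/(1 + e^{-x})\):

\[
\arg\max_θ \sum_{(w,c) ∈ D} \log σ(v_c · v_w) + \sum_{(w,c) ∈ D'} \log σ(-v_c · v_w)
\]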

Notes for page 4

  1. Unlike the original skip-gram model, this model does not model \(p(c | w)\) but instead models a quantity related to the joint distribution of \(w\) and \(c\).
  2. The model is non-convex when the word and context representations are learned jointly. If we fix the word representations and learn only the context representations (or the other way around), the model reduces to logistic regression, which is convex.
  3. Sampling details: the negative contexts are drawn at random from the corpus (word2vec uses a smoothed unigram distribution; see the sketch below).
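
A minimal sketch of how the negative contexts could be drawn, assuming word2vec's choice of the unigram distribution raised to the 3/4 power (function and variable names are mine, not from the paper):

    import collections
    import random

    def build_noise_distribution(tokens, power=0.75):
        """Unigram counts raised to `power` (word2vec's 3/4 smoothing)."""
        counts = collections.Counter(tokens)
        vocab = list(counts)
        weights = [counts[w] ** power for w in vocab]
        return vocab, weights

    def sample_negative_contexts(vocab, weights, k=5):
        """Draw k 'negative' contexts for one observed (word, context) pair."""
        return random.choices(vocab, weights=weights, k=k)

    tokens = "the quick brown fox jumps over the lazy dog the fox".split()
    vocab, weights = build_noise_distribution(tokens)
    print(sample_negative_contexts(vocab, weights, k=5))

Raising the counts to the power 0.75 flattens the distribution slightly, so rare words are sampled a bit more often than their raw frequency would suggest.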

Notes for page 5

Peculiarities of the contexts

Dynamic window size

The parameter \(k\) denotes the maximal window size. For each word in the corpus, a window size \(k'\) is sampled uniformly from \(1, …, k\).
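
As a consequence, a context word at distance \(d ≤ k\) from the focus word is included with probability \((k - d + 1)/k\), so nearer words are weighted more. A small sketch of this sampling scheme (the names are mine):

    import random

    def dynamic_window_pairs(tokens, k=5):
        """Generate (word, context) pairs, resampling the window size per focus word."""
        pairs = []
        for i, word in enumerate(tokens):
            k_prime = random.randint(1, k)  # k' ~ Uniform{1, ..., k}
            lo, hi = max(0, i - k_prime), min(len(tokens), i + k_prime + 1)
            for j in range(lo, hi):
                if j != i:
                    pairs.append((word, tokens[j]))
        return pairs

    print(dynamic_window_pairs("the quick brown fox jumps over the lazy dog".split(), k=3))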

Effect of subsampling and rare-word pruning

  1. Words appearing fewer than min-count times are not considered as either words or contexts.
  2. Frequent words are down-sampled, since very frequent words are less informative (see the sketch below).

Importantly, these words are removed from the text before generating the contexts. This has the effect of increasing the effective window size for certain words. Here we see another explanation for its effectiveness: the effective window size grows, including context words which are both content-full and linearly far away from the focus word, thus making the similarities more topical.
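
A sketch of the subsampling heuristic as described in Mikolov et al.'s paper (the released word2vec code uses a slightly different formula; the threshold \(t\) and the names below are my choices): a token with relative frequency \(f(w)\) is discarded with probability \(1 - \sqrt{t / f(w)}\).

    import collections
    import math
    import random

    def subsample(tokens, t=1e-5):
        """Randomly drop frequent tokens before contexts are generated."""
        counts = collections.Counter(tokens)
        total = len(tokens)
        kept = []
        for w in tokens:
            f = counts[w] / total                        # relative frequency of w
            p_discard = max(0.0, 1.0 - math.sqrt(t / f))
            if random.random() >= p_discard:
                kept.append(w)
        return kept

Because dropped tokens are removed before context generation, words that were separated only by very frequent words become adjacent, which is the window-widening effect described above.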

Why does this produce good word representations?

The distributional hypothesis states that words in similar contexts have similar meanings. The objective above clearly tries to increase the quantity \(v_w^\top v_c\) for good word-context pairs, and decrease it for bad ones. Intuitively, this means that words that share many contexts will be similar to each other (and contexts sharing many words will likewise be similar to each other). This is, however, very hand-wavy.
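
To make the intuition concrete, here is a toy numpy sketch of one SGNS update on the objective above (my illustration, not the word2vec implementation; hyperparameters and names are arbitrary):

    import numpy as np

    rng = np.random.default_rng(0)

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def sgns_step(W, C, w, pos_c, neg_cs, lr=0.025):
        """One gradient-ascent step on log σ(v_c·v_w) + Σ log σ(-v_c'·v_w) for a single pair."""
        grad_w = np.zeros_like(W[w])
        for c, label in [(pos_c, 1.0)] + [(c, 0.0) for c in neg_cs]:
            g = label - sigmoid(W[w] @ C[c])   # gradient factor for this log-likelihood term
            grad_w += g * C[c]
            C[c] += lr * g * W[w]              # update the context vector v_c
        W[w] += lr * grad_w                    # update the word vector v_w

    # Toy usage: 10-word vocabulary, 20-dimensional embeddings,
    # separate matrices for word vectors (W) and context vectors (C).
    V, d = 10, 20
    W = rng.normal(scale=0.1, size=(V, d))
    C = rng.normal(scale=0.1, size=(V, d))
    sgns_step(W, C, w=3, pos_c=5, neg_cs=[1, 7])

Repeating this step over many observed pairs and sampled negatives pushes \(v_w^\top v_c\) up for pairs seen in the data and down for the negatives, which is the mechanism the paragraph describes.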

2010 - Word representations: a simple and general method for semi-supervised learning

cite:turian-2010-word-repres

2014 - Dependency-Based Word Embeddings.

cite:levy-2014-depend-based

References

bibliographystyle:unsrt bibliography:references.bib