@mbednarski
Created March 6, 2018 23:34
@i-chaochen commented Jul 22, 2020

> Thank you for sharing. It helped me to understand word2vec.
>
> But here is a problem with this code.
>
> How is it that 'she' is closer to 'king' than to 'queen'?
>
> def similarity(v,u):
>   return torch.dot(v,u)/(torch.norm(v)*torch.norm(u))
>
> similarity(W2[word2idx["she"]], W2[word2idx["king"]]) # = 0.5035
>
> similarity(W2[word2idx["she"]], W2[word2idx["queen"]]) # = 0.1582

I got different results, as you can see below:

In [29]: similarity(W2[word2idx["she"]], W2[word2idx["king"]])
Out[29]: tensor(-0.3742, grad_fn=<DivBackward0>)

In [30]: similarity(W2[word2idx["she"]], W2[word2idx["queen"]])
Out[30]: tensor(0.4153, grad_fn=<DivBackward0>)

FYI: you can increase the embedding dimensions and num_epochs to make it more accurate.
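
If you want to see a full ranking rather than checking individual pairs, here is a minimal sketch (not part of the gist) that reuses the thread's similarity() and word2idx and ranks every other vocabulary word against a query word; pass whichever matrix you treat as the embeddings:

def most_similar(query, emb, topn=5):
    # rank all other vocabulary words by cosine similarity to the query word
    idx2word = {i: w for w, i in word2idx.items()}
    q = emb[word2idx[query]]
    scores = [(idx2word[i], similarity(q, emb[i]).item())
              for i in range(emb.shape[0]) if i != word2idx[query]]
    return sorted(scores, key=lambda s: s[1], reverse=True)[:topn]

most_similar('she', W2)  # e.g. using the rows of W2 as embeddings, as above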

@IlyaStepanov commented Oct 22, 2020

@Hongyanld
please refer to this blog


The last layer must have vocabulary_size neurons, because it generates probabilities for each word.
Therefore, W2 has shape [vocabulary_size, embedding_dims].

The shape is right, but I have the same question as @Hongyanld: aren't all of the parameters already in W1, so that W2 = W1.T?
Yet the optimization runs over both W1 and W2, which means twice as many parameters.
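
For what it's worth, tying the weights is a possible variant (not what the gist does): reuse the transpose of W1 as the output layer, so only one matrix is optimized. A minimal sketch, assuming W1 has shape [embedding_dims, vocabulary_size] as in the thread (the sizes below are hypothetical):

import torch
import torch.nn.functional as F

embedding_dims, vocabulary_size = 5, 15
W1 = torch.randn(embedding_dims, vocabulary_size, requires_grad=True)

def forward_tied(x_onehot):
    # input projection with W1, output projection with its transpose,
    # so a single parameter matrix serves both roles
    z1 = torch.matmul(W1, x_onehot)   # embedding of the center word, [embedding_dims]
    z2 = torch.matmul(W1.t(), z1)     # scores over the vocabulary, [vocabulary_size]
    return F.log_softmax(z2, dim=0)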

The code is super-useful though, thanks for sharing!

@cbarquist5c commented Apr 1, 2021

> Thank you for sharing. It helped me to understand word2vec.
>
> But here is a problem with this code.
>
> How is it that 'she' is closer to 'king' than to 'queen'?
>
> def similarity(v,u):
>   return torch.dot(v,u)/(torch.norm(v)*torch.norm(u))
>
> similarity(W2[word2idx["she"]], W2[word2idx["king"]]) # = 0.5035
>
> similarity(W2[word2idx["she"]], W2[word2idx["queen"]]) # = 0.1582

This is not a problem with the code; it is a problem with the calculation that you are doing. W1 is the matrix that maps the one-hot vectors into the embedding representation. W2 maps the embedded words into a vector that is passed into the softmax function to generate probabilities for the context words.

The point of Word2Vec is to use the embeddings to represent each word, because words that are close in embedding space are close in context. To get the embedding for a word, you take the product of the W1 matrix with the one-hot encoding of the word you're interested in, i.e. W1 * one_hot[word], where one_hot[word] is the one-hot vector for word.

Here is some code you can use to get the embeddings and check the similarity of two words:
word1 = 'he'
word2 = 'king'
w1v = torch.matmul(W1, get_input_layer(word2idx[word1]))
w2v = torch.matmul(W1, get_input_layer(word2idx[word2]))

similarity(w1v, w2v)

Here, w1v and w2v are the embeddings for word1 and word2, respectively. Earlier in the code, get_input_layer() was defined to generate a one-hot vector from the integer index of each unique word, so we use it above to build the one-hot vector, which is then multiplied by W1 via torch.matmul to produce the embedding.

As a note, the corpus in this example is very small, so you can't expect great results for similarity between word pairs.
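
Since multiplying W1 by a one-hot vector just selects one column of W1, the whole embedding table can also be read off at once; a small sketch, assuming the same W1, word2idx, and similarity() as above:

embeddings = W1.t()   # row i is the embedding of the word with index i
similarity(embeddings[word2idx['he']], embeddings[word2idx['king']])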

@sunyong2016

Thank you for sharing. It helped me to understand word2vec.

@Code4luv commented Jan 5, 2022

Thank you for sharing the code.

Could you explain how to save the model for future use or analysis? I have already started training on a very large dataset in PyCharm. Please let me know how I can reuse the trained model in the future without stopping the current training run.
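
For what it's worth, here is a minimal sketch (assuming the gist keeps its parameters in the tensors W1 and W2 and the vocabulary mapping in word2idx; the file name is just an example) of checkpointing with torch.save and restoring with torch.load:

import torch

# save the learned matrices plus the vocabulary mapping
torch.save({'W1': W1, 'W2': W2, 'word2idx': word2idx}, 'word2vec_checkpoint.pt')

# later, in a fresh session
checkpoint = torch.load('word2vec_checkpoint.pt')
W1, W2, word2idx = checkpoint['W1'], checkpoint['W2'], checkpoint['word2idx']

Calling torch.save like this periodically inside the training loop lets you checkpoint progress without interrupting the run.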

@theBull commented Feb 9, 2022

I'm getting the following exception in the final cell when attempting to run this in a fresh Google Colab notebook:

IndexError                                Traceback (most recent call last)
<ipython-input-10-400456173921> in <module>()
     17 
     18         loss = F.nll_loss(log_softmax.view(1,-1), y_true)
---> 19         loss_val += loss.data[0]
     20         loss.backward()
     21         W1.data -= learning_rate * W1.grad.data

IndexError: invalid index of a 0-dim tensor. Use `tensor.item()` in Python or `tensor.item<T>()` in C++ to convert a 0-dim tensor to a number

To fix this, I Googled the above error and followed the workaround described in this GitHub issue:
NVIDIA/flownet2-pytorch#113

The fix is to change line 19 from:
loss_val += loss.data[0]
to
loss_val += loss.data

I believe this is an issue with the version of one of the libraries (PyTorch?) that was used in the code above versus the version I am using now.
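
For reference, the suggestion in the error message itself also works; a sketch of the relevant lines with .item(), which converts the 0-dim loss tensor to a Python number:

loss = F.nll_loss(log_softmax.view(1, -1), y_true)
loss_val += loss.item()   # .item() extracts the scalar value from the 0-dim tensor
loss.backward()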
