Thank you for sharing. It helped me understand word2vec.
Thank you for sharing the code.
Could you mention how to save the model for future use or analysis? I have already started training on very large data in PyCharm. Please let me know how I can reuse the trained model later without stopping the current training run.
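(In case it helps anyone with the same question: one possible approach, sketched under the assumption that the trained parameters are the `W1`/`W2` tensors and the `word2idx` dictionary from this post, is to persist them with `torch.save`; the filename below is just a placeholder.)

```python
import torch

# Placeholder tensors standing in for the trained W1/W2 from the post
# (in the real notebook these already exist after training).
W1 = torch.randn(5, 15)
W2 = torch.randn(15, 5)
word2idx = {'he': 0, 'king': 1}

# Save everything needed to reuse the embeddings later.
torch.save({'W1': W1, 'W2': W2, 'word2idx': word2idx}, 'word2vec_checkpoint.pt')

# ...then, in a later session, load it back without retraining:
checkpoint = torch.load('word2vec_checkpoint.pt')
W1, W2, word2idx = checkpoint['W1'], checkpoint['W2'], checkpoint['word2idx']
```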
I'm getting the following exception in the final cell when attempting to run this in a fresh Google Colab notebook:
```
IndexError                                Traceback (most recent call last)
<ipython-input-10-400456173921> in <module>()
     17
     18     loss = F.nll_loss(log_softmax.view(1,-1), y_true)
---> 19     loss_val += loss.data[0]
     20     loss.backward()
     21     W1.data -= learning_rate * W1.grad.data

IndexError: invalid index of a 0-dim tensor. Use `tensor.item()` in Python or `tensor.item<T>()` in C++ to convert a 0-dim tensor to a number
```
To fix this, I Googled the above error and followed the workaround described in this GitHub issue:
NVIDIA/flownet2-pytorch#113
The fix is to change line 19 from:
`loss_val += loss.data[0]`
to
`loss_val += loss.data`
I believe this is due to a difference between the version of one of the libraries (PyTorch?) used in the code above and the version I am using now.
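For reference, the error message's own suggestion also works: `loss.item()` converts a 0-dim loss tensor into a plain Python number. A minimal standalone sketch of the accumulation pattern (not the notebook's actual training loop):

```python
import torch
import torch.nn.functional as F

# A 0-dim loss tensor, like the one F.nll_loss produces in the notebook.
log_probs = F.log_softmax(torch.randn(1, 5), dim=1)
target = torch.tensor([2])
loss = F.nll_loss(log_probs, target)

loss_val = 0.0
# Older PyTorch allowed loss.data[0]; newer versions cannot index a
# 0-dim tensor, so accumulate with .item() (or keep it as a tensor via loss.data).
loss_val += loss.item()
print(loss_val)
```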
This is not a problem with the code; it is a problem with the calculation you are doing. W1 is the matrix that maps the one-hot vectors into the embedding representation. W2 maps the embedded words into a vector which is passed into the softmax function to generate probabilities for the context words.
The point of Word2Vec is to use the embeddings to represent each word, because words that are close in embedding space are close in context. To get the embedding for a word, you take the product of the W1 matrix with the one-hot encoding of the word you're interested in, i.e. `W1 * one_hot[word]`, where `one_hot[word]` is the one-hot vector for that word.
Here is some code you can use to get the embeddings and check the similarity of two words:
```python
word1 = 'he'
word2 = 'king'
w1v = torch.matmul(W1, get_input_layer(word2idx[word1]))
w2v = torch.matmul(W1, get_input_layer(word2idx[word2]))
similarity(w1v, w2v)
```
Here, `w1v` and `w2v` are the embeddings for `word1` and `word2`, respectively. Earlier in the code, `get_input_layer()` was defined to generate a one-hot vector from the integer encoding of the unique words, so we use it above to get the one-hot vector, which is then multiplied with `W1` via `torch.matmul` to get the embedding.
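One note: `similarity()` isn't shown in this snippet. If it's not defined in your copy of the notebook, a reasonable stand-in (an assumption on my part, not necessarily the original helper) is plain cosine similarity:

```python
import torch

def similarity(a, b):
    # Cosine similarity between two 1-D embedding vectors.
    return (torch.dot(a, b) / (torch.norm(a) * torch.norm(b))).item()

# Quick check with stand-in vectors (in the notebook, pass w1v and w2v).
print(similarity(torch.tensor([1.0, 0.0]), torch.tensor([1.0, 1.0])))
```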
As a note, the corpus in this example is very small, so you can't expect great results for similarity between word pairs.