Understanding word vectors: A tutorial for "Reading and Writing Electronic Text," a class I teach at ITP. (Python 2.7) Code examples released under CC0 https://creativecommons.org/choose/zero/, other text released under CC BY 4.0 https://creativecommons.org/licenses/by/4.0/

@arielgamino arielgamino commented Mar 2, 2018

Very nice tutorial!

@carltoews carltoews commented Mar 2, 2018

Thanks, this is great!

@tomnis tomnis commented Mar 2, 2018

awesome! very intuitive explanations

@marcboeker marcboeker commented Mar 2, 2018

Great tutorial, thanks!

@sebastian-palma sebastian-palma commented Mar 8, 2018

Not sure why I'm getting the following error, working on macOS with Jupyter Lab, Python 2.7 and spaCy 2.0.9:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-2-090b6e832a74> in <module>()
      3 # It creates a list of unique words in the text
      4 tokens = list(set([w.text for w in doc if w.is_alpha]))
----> 5 print nlp.vocab['cheese'].vector

lexeme.pyx in spacy.lexeme.Lexeme.vector.__get__()

ValueError: Word vectors set to length 0. This may be because you don't have a model installed or loaded, or because your model doesn't include word vectors. For more info, see the documentation: 
https://spacy.io/usage/models

@rgibson rgibson commented Mar 12, 2018

@vnhnhm
I was getting the same error. I fixed it by downloading a different spaCy language model than the one the instructions indicate, one that includes word vectors. See the spaCy documentation here: https://spacy.io/usage/models

So instead of running this from the command line:

    python -m spacy download en

…and using this command in Jupyter:

    nlp = spacy.load('en')

I ran this command in bash:

    python -m spacy download en_core_web_md

…and this in Jupyter:

    nlp = spacy.load('en_core_web_md')

Hope this helps!
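
A quick sanity check (a sketch in Python 3, unlike the tutorial's 2.7) that the loaded model actually ships with word vectors:

    import spacy

    nlp = spacy.load('en_core_web_md')
    lex = nlp.vocab['cheese']
    print(lex.has_vector)    # False means a vector-less model (e.g. en_core_web_sm)
    print(lex.vector.shape)  # (300,) for the md model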

@razodactyl razodactyl commented Mar 15, 2018

This write up is amazing, great work!

@VivekParupudi VivekParupudi commented Feb 17, 2019

I get this error when I try to open the colors JSON file:

    UnicodeDecodeError: 'charmap' codec can't decode byte 0x8f in position 32764: character maps to <undefined>
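
The 'charmap' codec is Windows' default cp1252 encoding, while the file is presumably UTF-8. A minimal fix (Python 3) is to pass the encoding explicitly:

    import json

    # Read the file as UTF-8 instead of the platform default encoding.
    color_data = json.loads(open("xkcd.json", encoding="utf-8").read())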

@san-cc san-cc commented Mar 18, 2019

I was looking for word2vec, but yours helped me a lot. Nice work!

@lyons7 lyons7 commented Mar 22, 2019

This is the best tutorial I've ever read! So clear and easy to understand. Thanks for making this and putting it out there!

@suntaorus suntaorus commented Mar 23, 2019

This is a great tutorial. Thank you! Found this one through The Coding Train channel on YouTube.

@codeproy codeproy commented Mar 23, 2019

One of the most intuitive tutorials!

@joseberlines joseberlines commented Apr 7, 2019

One of the best tutorials on word2vec. Nevertheless, there is a "quantum leap" in the explanation when it comes to "Word vectors in spaCy": suddenly we have vectors of a predetermined dimension associated with every word. Why? Where do those vectors come from? How are they calculated? Based on which texts? Since word2vec takes context into account, the vector representations will be very different in technical papers, literature, poetry, Facebook posts, etc. How do you create your own vectors for a particular collection of concepts over a particular set of documents? I have observed this problem in many, many word2vec tutorials: the explanation starts very smoothly and basically, very well explained down to the details, and then suddenly there is a big hole. In any case, this is one of the best explanations of word2vec theory I have found. Thanks!

@cjiras cjiras commented May 29, 2019

Corrections for future readers: for what it's worth, the spreadsheet example containing the sentence "It was the best of times, it was the worst of times." has an incorrect value in the cell at the "times" row and the "the ___ of" column; the value should be 0, not 1. Also, the "times" row and "of __ it" column should have a 1, not a 0. Overall, a nice intro document with fun examples.

@Rajan-sust Rajan-sust commented Jun 3, 2019

The meanv function can be simplified. See my forked gist:

https://gist.github.com/Rajan-sust/c38612d002ee5210e8b9cc4a55fce7de
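
For reference, the simplified version is a one-liner with numpy (a sketch, assuming the vectors are equal-length lists of numbers, as in the tutorial):

    import numpy as np

    def meanv(vecs):
        # Component-wise average of a list of equal-length vectors.
        return np.mean(vecs, axis=0)

    print(meanv([[0, 1], [2, 2], [4, 3]]))  # [2. 2.]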

@zetadaro zetadaro commented Jun 12, 2019

Great tutorial, thanks for sharing!

@zetadaro zetadaro commented Jun 12, 2019

> @joseberlines: One of the best tutorials on word2vec. Nevertheless, there is a "quantum leap" in the explanation when it comes to "Word vectors in spaCy" […] How do you create your own vectors for a particular collection of concepts over a particular set of documents? […]

@joseberlines This is a very good question. Did you find anything related to the context issue?

@joseberlines joseberlines commented Jun 12, 2019

> @zetadaro: This is a very good question. Did you find anything related to the context issue?

Hi,
the only thing I have found so far is creating your own vectors with gensim, a Python library that looks pretty good. I am nevertheless surprised that there is so little out there comparing language models by domain. I don't yet have enough code and tests to be conclusive, but I suspect that if you work in a particular area, your model should work better if the vectors were generated from documents in that area. I hope that helps.
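
A minimal sketch of that gensim route (assuming gensim 4.x, where the dimension parameter is named vector_size, and a toy corpus standing in for your own domain documents):

    from gensim.models import Word2Vec

    # Each document is a list of tokens; substitute your own domain texts.
    corpus = [
        ["cheese", "is", "made", "from", "milk"],
        ["bread", "is", "made", "from", "flour"],
    ]

    model = Word2Vec(corpus, vector_size=50, window=3, min_count=1, epochs=50)
    print(model.wv["cheese"])               # the learned 50-dimensional vector
    print(model.wv.most_similar("cheese"))  # nearest neighbours in this corpus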

@zetadaro zetadaro commented Jun 12, 2019

> @joseberlines: …the only thing I have found so far is creating your own vectors with gensim […]

Hi,
thanks for your reply. What you are saying makes total sense; I will check out gensim and try it.

@NaelsonDouglas NaelsonDouglas commented Jun 15, 2019

You took a heck of a complicated subject and turned it into a few simple paragraphs! Thanks for that, and thanks for this amazing tutorial.

@jordantkohn jordantkohn commented Jul 1, 2019

For the Dracula and Wallpaper examples, isn't the code only checking for one-word colors (e.g. blue, yellow, red) and not for two-word colors as well (e.g. burnt orange, tomato red, sunflower yellow)?
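
If so, a rough sketch (not from the tutorial) of one way to also catch two-word color names, assuming doc is the parsed text and colors is the name-to-RGB dictionary built from the xkcd data:

    # Scan consecutive word pairs and keep the ones that name a color.
    # Note: filtering on is_alpha first can pair words that were not
    # strictly adjacent in the original text; good enough for a sketch.
    words = [w.text.lower() for w in doc if w.is_alpha]
    bigrams = [a + " " + b for a, b in zip(words, words[1:])]
    print([bg for bg in bigrams if bg in colors])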

@girgop girgop commented Jul 20, 2019

Thanks! This helped me get grounded in word vectors.

@aafreenrah aafreenrah commented Aug 31, 2019

Best explanation. Thank you very much.

@lschomp lschomp commented Sep 7, 2019

this looks super exciting and I'm eager to explore, but right off, this line isn't working for me in the notebook:

    color_data = json.loads(open("xkcd.json").read())

    FileNotFoundError                         Traceback (most recent call last)
    ----> 1 color_data = json.loads(open("xkcd.json").read())

    FileNotFoundError: [Errno 2] No such file or directory: 'xkcd.json'

@Mayar2009 Mayar2009 commented Sep 29, 2019

@lschomp you need to put the file in the same directory as the notebook for it to work.
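
A quick way to check where the notebook is actually looking (a minimal Python 3 sketch):

    import os

    # Relative paths like "xkcd.json" are resolved against this directory.
    print(os.getcwd())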

@Mayar2009 Mayar2009 commented Sep 29, 2019

> @joseberlines: One of the best tutorials on word2vec […] How do you create your own vectors for a particular collection of concepts over a particular set of documents? […]

I agree. For the color example: what if there were other colors that are not in the color dictionary; how could you discover them? And how could this technique be used in, for example, recommender systems?

@imkhubaibraza imkhubaibraza commented Dec 3, 2019

Great explanation, thank you for sharing!

@john77eipe john77eipe commented Dec 19, 2019

The whole code has been updated for the latest versions of Python and spaCy.

Available here as a notebook: https://www.kaggle.com/john77eipe/understanding-word-vectors

@DouglasLapsley DouglasLapsley commented Feb 17, 2020

Fantastic article, thank you. How could I scale this to compare a single sentence with around a million other sentences to find the most similar ones, though? I'm thinking that iteration wouldn't be an option? Thanks!

@zetadaro zetadaro commented Feb 17, 2020

> @DouglasLapsley: How could I scale this to compare a single sentence with around a million other sentences to find the most similar ones? […]

I have the same problem, and I was thinking about the poor performance of iterating over so many sentences. If you find anything interesting related to this, it would be great if you shared it; I will do the same.

@DouglasLapsley DouglasLapsley commented Feb 18, 2020

> @zetadaro: I have the same problem […] If you find anything interesting related to this, it would be great if you shared it […]

Will do. I've looked at a number of text-similarity approaches, and they all seem to rely either on iteration or on semantic word graphs with a pre-calculated one-to-one similarity relationship between all the nodes, which means 1M nodes = 1M x 1M relationships; that is also clearly untenable and very slow to process. I'm sure I must be missing something obvious, but I'm not sure what!

The only thing I can think of at the moment is pre-computing the similarity: save the vectors for each sentence against their database record, iterate each sentence against every other sentence in the way described in this article (or with any other similarity distance function), and then save a one-to-one graph similarity relationship only for pairs that are highly similar, to reduce the number of relationships created to just the relevant ones. In other words, don't iterate at run time; iterate once up front and save the high-similarity pairs as node relationships, which would be quick to query at run time.

I'd welcome any guidance from others on this though! Has anyone tried this approach?

@DouglasLapsley DouglasLapsley commented Feb 19, 2020

> @zetadaro: I have the same problem […]

It looks like a pre-built approximate-nearest-neighbour index may be a good option when you have large numbers of vectors. I've not yet tried this, but here is the logic: https://erikbern.com/2015/09/24/nearest-neighbor-methods-vector-models-part-1.html and here is an implementation: https://medium.com/@kevin_yang/simple-approximate-nearest-neighbors-in-python-with-annoy-and-lmdb-e8a701baf905

Using the Annoy library, the approach is essentially to create an LMDB map and an Annoy index from the word embeddings, save both to disk, and then at run time load them, vectorise your query text, and use Annoy to look up the n nearest neighbours and return their IDs.

Anyone have experience of this with sentences rather than just words?
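
A minimal Annoy sketch along those lines (assuming spaCy's en_core_web_md for 300-dimensional sentence vectors; in a real setup the index would be built once offline and only queried at run time):

    import spacy
    from annoy import AnnoyIndex

    nlp = spacy.load("en_core_web_md")
    sentences = ["the cat sat on the mat", "dogs chase cats", "I like cheese"]

    index = AnnoyIndex(300, "angular")  # angular distance tracks cosine similarity
    for i, sent in enumerate(sentences):
        index.add_item(i, nlp(sent).vector)
    index.build(10)          # 10 trees; more trees = better recall, bigger index
    index.save("sents.ann")  # persist so run-time queries skip the build step

    # At query time: vectorise the text, fetch the n nearest sentence IDs.
    ids = index.get_nns_by_vector(nlp("a cat on a rug").vector, 2)
    print([sentences[i] for i in ids])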

@juhanishen juhanishen commented Jun 16, 2020

Awesome, very good!

@my-name-is-kaya my-name-is-kaya commented Jun 24, 2020

Thanks, great tutorial!

@juliansteam juliansteam commented Aug 14, 2020

Very intuitive tutorial. Thank you!

@erkekin erkekin commented Aug 25, 2020

> @sebastian-palma: ValueError: Word vectors set to length 0. This may be because you don't have a model installed or loaded, or because your model doesn't include word vectors. […]

Replace

    nlp.vocab['cheese'].vector

with

    nlp('cheese').vector

and replace

    def vec(s):
        return nlp.vocab[s].vector

with

    def vec(s):
        return nlp(s).vector
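
Putting that workaround together (a sketch assuming the en_core_web_md model recommended earlier; nlp(s).vector averages the token vectors, so it also works for multi-word strings, whereas nlp.vocab[s].vector needs a model that ships with static word vectors):

    import spacy

    nlp = spacy.load("en_core_web_md")

    def vec(s):
        # Works for single words and whole phrases alike.
        return nlp(s).vector

    print(vec("cheese").shape)       # (300,)
    print(vec("blue cheese").shape)  # also (300,)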