Understanding word vectors: a tutorial for "Reading and Writing Electronic Text," a class I teach at ITP. (Python 2.7.) Code examples released under CC0 (https://creativecommons.org/choose/zero/); other text released under CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/).
@arielgamino commented Mar 2, 2018

Very nice tutorial!

@carltoews commented Mar 2, 2018

Thanks, this is great!

@tomnis commented Mar 2, 2018

Awesome! Very intuitive explanations.

@marcboeker commented Mar 2, 2018

Great tutorial, thanks!

@vnhnhm commented Mar 8, 2018

Not sure why I'm getting the following error. Working on macOS with Jupyter Lab, Python 2.7, and spaCy 2.0.9:

```
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-2-090b6e832a74> in <module>()
      3 # It creates a list of unique words in the text
      4 tokens = list(set([w.text for w in doc if w.is_alpha]))
----> 5 print nlp.vocab['cheese'].vector

lexeme.pyx in spacy.lexeme.Lexeme.vector.__get__()

ValueError: Word vectors set to length 0. This may be because you don't have a model installed or loaded, or because your model doesn't include word vectors. For more info, see the documentation:
https://spacy.io/usage/models
```
@rgibson commented Mar 12, 2018

@vnhnhm
I was getting the same error. I fixed it by downloading a different language model for spaCy than what the instructions indicated (one that includes vectors). See spaCy documentation here: https://spacy.io/usage/models

So instead of running this from the command line:

```
python -m spacy download en
```

...and using this command in Jupyter:

```python
nlp = spacy.load('en')
```

I ran this command in bash:

```
python -m spacy download en_core_web_md
```

...and did this in Jupyter:

```python
nlp = spacy.load('en_core_web_md')
```

Hope this helps!

@razodactyl commented Mar 15, 2018

This write up is amazing, great work!

@VivekParupudi commented Feb 17, 2019

I get this error when I try to open the colors JSON file:

```
UnicodeDecodeError: 'charmap' codec can't decode byte 0x8f in position 32764: character maps to
```
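A likely cause, sketched below under the assumption that the file is UTF-8 encoded (as the tutorial's color data is): on Windows, `open()` defaults to the locale codec (often `cp1252`, reported as "charmap"), which fails on multi-byte UTF-8 sequences. Passing `encoding="utf-8"` explicitly avoids the error. The filename and sample data here are stand-ins, not the tutorial's actual file:

```python
import json

# Write a UTF-8 JSON file containing a non-ASCII color name, then read it
# back. The key fix is the explicit encoding="utf-8" argument; without it,
# Windows falls back to the locale codec and raises UnicodeDecodeError on
# multi-byte characters.
sample = {"colors": [{"color": "café au lait", "hex": "#a67b5b"}]}
with open("sample_colors.json", "w", encoding="utf-8") as f:
    json.dump(sample, f, ensure_ascii=False)

with open("sample_colors.json", encoding="utf-8") as f:
    color_data = json.loads(f.read())

print(color_data["colors"][0]["color"])  # → café au lait
```

The same one-argument change applies to the tutorial's own `open("xkcd.json")` call.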

@san-cc commented Mar 18, 2019

I was looking for word2vec material, and yours helped me a lot. Nice work!

@lyons7 commented Mar 22, 2019

This is the best tutorial I've ever read! So clear and easy to understand. Thanks for making this and putting it out there!

@suntaorus commented Mar 23, 2019

This is a great tutorial. Thank you! Found this one through The Coding Train channel on YouTube.

@proy251183 commented Mar 23, 2019

One of the most intuitive tutorials!!

@joseberlines commented Apr 7, 2019

One of the best tutorials on word2vec. Nevertheless, there is a "quantum leap" in the explanation when it comes to "Word vectors in spaCy". Suddenly we have vectors of a predetermined dimension associated with every word. Why? Where do those vectors come from? How are they calculated, and from which texts? Since word2vec takes context into account, the vector representations will be very different in technical papers, literature, poetry, Facebook posts, etc. How do you create your own vectors for a particular collection of concepts over a particular set of documents? I have observed this problem in many word2vec tutorials: the explanation starts very smoothly and is well explained down to the details, and then suddenly there is a big hole. In any case, this is one of the best explanations of word2vec theory I have found. Thanks!

@cjiras commented May 29, 2019

Corrections for future readers: for what it's worth, the spreadsheet example containing the sentence "It was the best of times, it was the worst of times." has an incorrect value in the cell at row "times", column "the ___ of": the value should be 0, not 1. Also, the "times" row and "of __ it" column should have a 1, not a 0. Overall, a nice intro document with fun examples.

@Rajan-sust commented Jun 3, 2019

The meanv function can be simplified; see my forked gist:

https://gist.github.com/Rajan-sust/c38612d002ee5210e8b9cc4a55fce7de
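For readers who don't want to click through: assuming the tutorial's `meanv` averages a list of equal-length vectors component-wise, numpy can collapse the whole function to a single call. This is a sketch of the idea, not the forked gist's exact code:

```python
import numpy as np

# Component-wise average of a list of equal-length vectors.
# np.mean with axis=0 averages down the columns, which is exactly
# what a hand-rolled meanv loop computes.
def meanv(vecs):
    return np.mean(vecs, axis=0)

print(meanv([[0, 1], [4, 3]]))  # → [2. 2.]
```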

@zetadaro commented Jun 12, 2019

Great tutorial, thanks for sharing!

@zetadaro commented Jun 12, 2019

> (quoting @joseberlines's Apr 7 comment above about where spaCy's vectors come from)

@joseberlines This is a very good question. Did you find anything related to the context issue?

@joseberlines commented Jun 12, 2019

> @zetadaro asked whether I had found anything related to the context issue.

Hi,
the only thing I have found so far is creating your own vectors with gensim, a Python library that looks quite good. I am nevertheless surprised that there is so little on comparing language models across contexts. I don't yet have enough code and tests to be conclusive, but I suspect that if you work in a particular area, your model should work better if the vectors were generated from documents in that area. I hope it helps.

@zetadaro commented Jun 12, 2019

> @joseberlines suggested training your own vectors with gensim.

Hi,
thanks for your reply. What you are saying makes total sense; I will check out and try gensim.

@NaelsonDouglas commented Jun 15, 2019

You took a heck of a complicated subject and turned it into a few simple paragraphs! Thanks for that, and thanks for this amazing tutorial.

@jordantkohn commented Jul 1, 2019

For the Dracula and Wallpaper examples, isn't the code only checking for one-word colors (e.g. blue, yellow, red) and not for two-word colors as well (e.g. burnt orange, tomato red, sunflower yellow)?
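One way to catch multi-word color names as well as single words, sketched with a stand-in color dictionary (the tutorial's actual lookup code may differ): slide over the tokens and check 2-grams against the dictionary before falling back to single tokens.

```python
# Stand-in for the tutorial's xkcd color data: name -> hex.
colors = {"red": "#ff0000", "burnt orange": "#c04e01"}

def find_colors(words, colors):
    """Scan a token list, preferring two-word color names over one-word ones."""
    found, i = [], 0
    while i < len(words):
        bigram = " ".join(words[i:i + 2]).lower()
        if bigram in colors:            # longer match wins ("burnt orange")
            found.append(bigram)
            i += 2
        elif words[i].lower() in colors:  # fall back to single words ("red")
            found.append(words[i].lower())
            i += 1
        else:
            i += 1
    return found

print(find_colors("the wall was burnt orange and red".split(), colors))
# → ['burnt orange', 'red']
```

The same greedy longest-match-first idea extends to three-word names by checking trigrams before bigrams.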

@girgop commented Jul 20, 2019

Thanks! This helped me get grounded in word vectors.

@aafreenrah commented Aug 31, 2019

Best explanation. Thank you very much.

@lschomp commented Sep 7, 2019

This looks super exciting and I'm eager to explore, but right off the bat this line isn't working for me in the notebook: `color_data = json.loads(open("xkcd.json").read())`

```
FileNotFoundError                         Traceback (most recent call last)
----> 1 color_data = json.loads(open("xkcd.json").read())

FileNotFoundError: [Errno 2] No such file or directory: 'xkcd.json'
```

@Areejmayar commented Sep 29, 2019

@lschomp you need to put the file in the same folder as the notebook for it to work.
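Since the notebook opens `"xkcd.json"` with a relative path, Python looks in the notebook's current working directory. A small sketch that makes the failure mode explicit (the helper name `load_colors` is my own, not from the tutorial):

```python
import json
from pathlib import Path

def load_colors(filename="xkcd.json"):
    """Load the color data, with a clear error if the file isn't beside the notebook."""
    path = Path(filename)
    if not path.exists():
        # Report where Python actually looked, instead of a bare FileNotFoundError.
        raise FileNotFoundError(
            f"{filename} not found in {Path.cwd()} -- "
            "download it into the same folder as the notebook"
        )
    return json.loads(path.read_text(encoding="utf-8"))
```

Calling `load_colors()` then either returns the parsed data or tells you exactly which directory needs the file.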

@Areejmayar commented Sep 29, 2019

> (quoting @joseberlines's Apr 7 comment above about where spaCy's vectors come from)

I agree. For the color example: what if there were other colors that are not in the colors dictionary? How would you discover them? And how could this technique be used in recommender systems, for example?
