Understanding word vectors: A tutorial for "Reading and Writing Electronic Text," a class I teach at ITP. (Python 2.7) Code examples released under CC0 https://creativecommons.org/choose/zero/, other text released under CC BY 4.0 https://creativecommons.org/licenses/by/4.0/

DouglasLapsley commented Feb 19, 2020

Fantastic article, thank you! How could I scale this to compare a single sentence against around a million other sentences to find the most similar ones, though? I'm thinking that iterating wouldn't be an option. Thanks!

I have the same problem, and I was worried about how badly iterating over so many sentences would perform. If you find anything interesting related to this, it would be great if you shared it; I will do the same.

It looks like a pre-built approximate nearest neighbour index may be a good option where you have large numbers of vectors. I've not yet tried this, but here is the logic https://erikbern.com/2015/09/24/nearest-neighbor-methods-vector-models-part-1.html and here is an implementation https://medium.com/@kevin_yang/simple-approximate-nearest-neighbors-in-python-with-annoy-and-lmdb-e8a701baf905

Using the Annoy library, the approach is essentially to create an LMDB map and an Annoy index from the word embeddings, then save both to disk. At runtime, load them, vectorise your query text, and use Annoy to look up the n nearest neighbours and return their IDs.

Anyone have experience of this with sentences rather than just words?
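A minimal sketch of that pipeline applied to sentences, assuming the annoy package and spaCy's en_core_web_lg model (the toy sentences list below stands in for the real million-sentence corpus):

import spacy
from annoy import AnnoyIndex

nlp = spacy.load('en_core_web_lg')

# Toy corpus; in practice this would be the ~1M sentences.
sentences = ["The cat sat on the mat.",
             "Dogs chase cats around the garden.",
             "Word vectors encode distributional similarity."]

dim = nlp.vocab.vectors_length        # 300 for en_core_web_lg
index = AnnoyIndex(dim, 'angular')    # angular distance tracks cosine similarity
for i, sent in enumerate(sentences):
    # Doc.vector averages the token vectors, so it works for whole sentences.
    index.add_item(i, nlp(sent).vector)
index.build(10)                       # more trees -> better recall, slower build
index.save('sentences.ann')

# At query time: memory-map the saved index and ask for approximate neighbours.
lookup = AnnoyIndex(dim, 'angular')
lookup.load('sentences.ann')
ids = lookup.get_nns_by_vector(nlp("A cat was sitting on a rug.").vector, 2)
print([sentences[i] for i in ids])

Annoy memory-maps the index from disk, so lookups stay fast even with millions of items; the trade-off is that the neighbours returned are approximate rather than exact.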


juhanishen commented Jun 16, 2020

Awesome, good job!


juliansteam commented Aug 14, 2020

Very intuitive tutorial. Thank you!


erkekin commented Aug 25, 2020

Not sure why I'm getting the following error; I'm working on macOS with JupyterLab, Python 2.7, and spaCy 2.0.9:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-2-090b6e832a74> in <module>()
      3 # It creates a list of unique words in the text
      4 tokens = list(set([w.text for w in doc if w.is_alpha]))
----> 5 print nlp.vocab['cheese'].vector

lexeme.pyx in spacy.lexeme.Lexeme.vector.__get__()

ValueError: Word vectors set to length 0. This may be because you don't have a model installed or loaded, or because your model doesn't include word vectors. For more info, see the documentation: 
https://spacy.io/usage/models

Replace nlp.vocab['cheese'].vector with nlp('cheese').vector
and

def vec(s):
    return nlp.vocab[s].vector

with

def vec(s):
    return nlp(s).vector
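
This works because Doc.vector averages the vectors of the tokens in the processed text (falling back, in spaCy 2.x, to the model's context tensors when no word vectors are loaded), so it never raises the length-0 error; note that the resulting vectors are not the pre-trained word vectors the tutorial assumes, so similarities will be rougher. A quick sanity check of the workaround, using scipy's cosine distance here rather than the tutorial's own cosine function:

from scipy.spatial.distance import cosine  # distance = 1 - similarity

def vec(s):
    return nlp(s).vector

# Related words should sit closer together (smaller distance) than unrelated ones.
print(cosine(vec('cheese'), vec('butter')))  # comparatively small
print(cosine(vec('cheese'), vec('galaxy')))  # comparatively large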


motahher commented Dec 10, 2020

Very good explanation.


JenPink25 commented Dec 18, 2020

Enjoyed reading this. Thank you!


lewiuberg commented Jan 4, 2021

One of the best tutorials on word2vec. Nevertheless, there is a quantum leap in the explanation when it comes to "Word vectors in spaCy": suddenly we have vectors of a predetermined dimension associated with every word. Why? Where do those vectors come from? How are they calculated, and based on which texts? Since word2vec takes context into account, the vector representations will be very different for technical papers, literature, poetry, Facebook posts, etc. How do you create your own vectors for a particular collection of concepts over a particular set of documents? I have observed this problem in many, many word2vec tutorials: the explanation starts smoothly, basic and very well explained down to the details, and then suddenly there is a big hole. In any case, this is one of the best explanations of word2vec theory I have found. Thanks!

I agree! I thought I had deleted some cells, and downloaded the notebook again looking for the gap.
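
For what it's worth, the usual answer to "train your own vectors on your own documents" is gensim's Word2Vec. A minimal sketch with a hypothetical hand-tokenised corpus (the vector_size parameter name assumes gensim 4.x; older releases call it size):

import gensim

# Hypothetical corpus: each document is tokenised into a list of lowercase words.
documents = [["cheese", "butter", "milk", "bread"],
             ["galaxy", "star", "nebula", "telescope"],
             ["cheese", "wine", "bread", "grapes"]]

# Train vectors on this corpus alone, so the geometry reflects these documents
# rather than the general-purpose corpus behind spaCy's bundled vectors.
model = gensim.models.Word2Vec(documents, vector_size=50, window=5, min_count=1)
print(model.wv.most_similar('cheese', topn=3))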


jdmedenilla commented Feb 12, 2021

When I ran the snippets of code that read the data files, I got errors like this: "FileNotFoundError: [Errno 2] No such file or directory: 'pg345.txt'". The same thing happens with the color file: "FileNotFoundError: [Errno 2] No such file or directory: 'xkcd.json'". I ran those in a Jupyter notebook. Do you know what's wrong?

Note: I tried it in VS Code, but it gave me the same problem, even after saving the notebook in the same directory. I've also read online that you should use the absolute path, but it still would not work.
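
If anyone else hits this: the usual cause is that the notebook kernel's working directory isn't the folder that holds the files. A minimal diagnostic sketch (data_dir is a hypothetical location; adjust it to wherever you saved the downloads):

import os

# open('pg345.txt') is resolved relative to the current working directory,
# which for a notebook is wherever the kernel was started, not necessarily
# the folder containing the .ipynb file.
print(os.getcwd())
print('pg345.txt' in os.listdir('.'))  # False means the file is not here

# Either move the files into the directory printed above, or use an
# absolute path:
data_dir = os.path.expanduser('~/Downloads')
with open(os.path.join(data_dir, 'pg345.txt')) as f:
    text = f.read()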


Zaravanon commented Mar 4, 2021

Great, thank you!


tugcekiziltepe commented Apr 28, 2021

Great, well-explained tutorial, thank you!


prakashr7d commented May 28, 2021

Quoting @erkekin's error above: "ValueError: Word vectors set to length 0. This may be because you don't have a model installed or loaded, or because your model doesn't include word vectors."

You want to download the 'en_core_web_lg' model, which includes word vectors (the default 'en_core_web_sm' model does not).
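
A minimal sketch of the fix, using spaCy's standard download command:

# Run once in a terminal (or prefix with ! in a notebook cell):
#   python -m spacy download en_core_web_lg

import spacy

nlp = spacy.load('en_core_web_lg')       # this model ships with 300-d word vectors
print(nlp.vocab['cheese'].vector.shape)  # (300,) instead of the length-0 ValueError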


saiankit commented Aug 31, 2021

OMG!! I really had a great time reading this beautiful gist. Very well explained.


DavidHarar commented Sep 5, 2021

Thanks!


mikeolubode commented Feb 4, 2022

I was led here by a tutorial on word vectors from YouTube. Thanks for the simplicity!


yishairasowsky commented Mar 14, 2022

Very good.


robertocsa commented Sep 15, 2022

Thank you for sharing this. Excellent job!
