Thanks, this is great!
Awesome! Very intuitive explanations.
Great tutorial, thanks!
Not sure why I'm getting the following error, working on macOS with Jupyter Lab, Python 2.7, and spaCy 2.0.9:
@vnhnhm So instead of running this from the command line: I ran this command in bash: Hope this helps!
This write-up is amazing, great work!
I get this error when I try to open the colors JSON file: UnicodeDecodeError: 'charmap' codec can't decode byte 0x8f in position 32764: character maps to <undefined>
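That charmap error usually means Python fell back to the platform's default codec (cp1252 on Windows) when opening the file. A minimal sketch of a likely fix, assuming the tutorial's xkcd.json file, is to pass an explicit encoding:

```python
import json

# Force UTF-8 decoding instead of the platform default (cp1252 on
# Windows), which cannot decode byte 0x8f.
color_data = json.loads(open("xkcd.json", encoding="utf-8").read())
```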
I was looking for word2vec, but yours helps me a lot. Nice work!
This is the best tutorial I've ever read! So clear and easy to understand. Thanks for making this and putting it out there!
This is a great tutorial. Thank you! Found this one through The Coding Train channel on YouTube.
One of the best intuitive tutorials!
One of the best tutorials on word2vec. Nevertheless, there is a quantum leap in the explanation when it comes to "Word vectors in spaCy". Suddenly we have vectors of a predetermined dimension associated with any word. Why? Where do those vectors come from? How are they calculated? Based on which texts? Since word2vec takes context into account, the vector representations are going to be very different in technical papers, literature, poetry, Facebook posts, etc. How do you create your own vectors for a particular collection of concepts over a particular set of documents? I have observed this problem in many, many word2vec tutorials: the explanation starts very smoothly, basic and well explained down to the details, and then suddenly there is a big hole. In any case, this is one of the best explanations of word2vec theory I have found. Thanks!
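On the question of building vectors for your own document collection: one common route, not covered in the tutorial itself, is training a word2vec model with gensim on your own corpus. A hedged sketch, assuming gensim 4.x and a hypothetical my_corpus.txt with one whitespace-tokenized sentence per line:

```python
from gensim.models import Word2Vec

# Each sentence is a list of lowercase tokens; my_corpus.txt is a
# hypothetical stand-in for your own document collection.
sentences = [line.lower().split()
             for line in open("my_corpus.txt", encoding="utf-8")]

# Train 100-dimensional vectors from the contexts in *your* texts.
model = Word2Vec(sentences, vector_size=100, window=5,
                 min_count=2, workers=4)

# "cheese" is illustrative; the word must occur at least min_count
# times in your corpus to get a vector.
print(model.wv["cheese"])
print(model.wv.most_similar("cheese", topn=5))
```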
Corrections for future readers: for what it's worth, the spreadsheet example containing the sentence "It was the best of times, it was the worst of times." has an incorrect value in the cell at the "times" row and the "the ___ of" column; the value should be 0, not 1. Also, the "times" row and "of ___ it" column should have a 1, not 0. Overall, a nice intro document with fun examples.
The meanv function can be simplified. See it in my forked gist: https://gist.github.com/Rajan-sust/c38612d002ee5210e8b9cc4a55fce7de
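For readers who don't want to follow the link, the simplification presumably amounts to letting numpy do the averaging in one call. A sketch of that idea:

```python
import numpy as np

def meanv(coords):
    # Componentwise mean of a list of same-length vectors, replacing
    # the manual accumulation loop from the tutorial.
    return np.mean(coords, axis=0)

meanv([[0, 1], [2, 2], [4, 3]])  # -> array([2., 2.])
```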
Great tutorial, thanks for sharing!
@joseberlines
Hi,
Hi,
You took a heck of a complicated subject and made it into some simple paragraphs! Thanks for that, and thanks for this amazing tutorial.
For the Dracula and Wallpaper examples, isn't the code only checking for one-word colors (e.g. blue, yellow, red) and not for two-word colors as well (e.g. burnt orange, tomato red, sunflower yellow)?
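That reading seems right: matching single tokens against the color list misses multi-word names. A hedged sketch of one way to also catch two-word colors, assuming colors is the xkcd name-to-vector dictionary and doc is a spaCy parse of the text (both names from the tutorial; the helper itself is hypothetical):

```python
def color_mentions(doc, colors):
    # Check consecutive token pairs (bigrams) as well as single tokens
    # against the color dictionary, so names like "burnt orange" match.
    tokens = [tok.text.lower() for tok in doc]
    for i, tok in enumerate(tokens):
        if tok in colors:
            yield tok
        if i + 1 < len(tokens):
            bigram = tok + " " + tokens[i + 1]
            if bigram in colors:
                yield bigram

# usage: list(color_mentions(nlp("a burnt orange sky"), colors))
```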
Thanks! Helped me get grounded in word vectors.
Best explanation. Thank you very much.
This looks super exciting and I'm eager to explore, but right off, this line isn't working for me in the notebook:

color_data = json.loads(open("xkcd.json").read())

FileNotFoundError: [Errno 2] No such file or directory: 'xkcd.json'
@lschomp You need to put the file in the same directory as the notebook for it to work.
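If downloading manually is a hassle, here is a sketch that fetches the file into the notebook's directory. The URL is an assumption: the xkcd.json in the dariusk/corpora repository, which appears to be the color data the tutorial uses.

```python
import urllib.request

# Download xkcd.json into the current working directory, next to
# the notebook; URL assumed, see note above.
url = "https://raw.githubusercontent.com/dariusk/corpora/master/data/colors/xkcd.json"
urllib.request.urlretrieve(url, "xkcd.json")
```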
I agree
Great explanation, thank you for sharing.
The whole code has been updated for the latest versions of Python and spaCy. Available here as a notebook: https://www.kaggle.com/john77eipe/understanding-word-vectors
Fantastic article, thank you. How could I scale this to compare a single sentence with around a million other sentences to find the most similar ones, though? I’m thinking that iteration wouldn’t be an option? Thanks!
I have the same problem and was thinking about the poor performance of iterating over so many sentences. If you find something interesting related to this, it would be great if you shared it; I will do the same.
Will do. I've looked at a number of text-similarity approaches, and they all seem to rely either on iteration or on semantic word graphs with a pre-calculated one-to-one similarity relationship between all the nodes, which means 1M nodes = 1M x 1M relationships; that is also clearly untenable and very slow to process. I'm sure I must be missing something obvious, but I'm not sure what!

The only thing I can think of at the moment is pre-processing the similarity: save the vectors for each sentence against their database record, iterate for each sentence against each other sentence in the way described in this article (or with any other similarity distance function), and then save a one-to-one graph similarity relationship only for items that are highly similar (to reduce the number of similarity relationships created to only the relevant ones). In other words, don't do the iteration at run-time; do it once and save the resulting high-similarity pairs as node relationships, which would be quick to query at runtime. I'd welcome any guidance from others on this though! Has anyone tried this approach?
It looks like a pre-trained approximate-nearest-neighbour approach may be a good option where you have large numbers of vectors. I've not yet tried this, but here is the logic: https://erikbern.com/2015/09/24/nearest-neighbor-methods-vector-models-part-1.html and here is an implementation: https://medium.com/@kevin_yang/simple-approximate-nearest-neighbors-in-python-with-annoy-and-lmdb-e8a701baf905

Using the Annoy library, essentially the approach is to create an LMDB map and an Annoy index with the word embeddings, then save those to disk. At runtime, load these, vectorise your query text, and use Annoy to look up the n nearest neighbours and return their IDs. Anyone have experience of this with sentences rather than just words?
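For anyone following this thread later, here is a minimal sketch of the Annoy idea applied to sentences. It is hedged, not the article's method: the tree count, the angular metric, and the use of spaCy doc vectors as sentence embeddings are all assumptions, and the three sentences stand in for the ~1M real ones.

```python
import spacy
from annoy import AnnoyIndex

nlp = spacy.load("en_core_web_md")       # model assumed from the tutorial
sentences = ["the cat sat on the mat",
             "dogs chase cats",
             "stocks fell sharply"]

dim = nlp.vocab.vectors_length           # 300 for en_core_web_md
index = AnnoyIndex(dim, "angular")       # angular distance ~ cosine
for i, sent in enumerate(sentences):
    index.add_item(i, nlp(sent).vector)  # doc.vector = mean of token vectors
index.build(10)                          # 10 trees; more trees, better recall

# At query time: vectorise once, then do an approximate lookup instead
# of iterating over every stored sentence.
query = nlp("a kitten sits on a rug").vector
for i in index.get_nns_by_vector(query, 2):
    print(sentences[i])
```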
Awesome good!
Very intuitive tutorial. Thank you!
Replace

```python
def vec(s):
    return nlp.vocab[s].vector
```

with

```python
def vec(s):
    return nlp(s).vector
```
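If it helps to see why that substitution matters: nlp.vocab[s].vector is a single lexeme lookup, while nlp(s).vector tokenizes the string and averages the token vectors, so multi-word input also works. A small sketch, assuming the en_core_web_md model from the tutorial:

```python
import spacy

nlp = spacy.load("en_core_web_md")  # model assumed from the tutorial

def vec(s):
    # nlp(s) tokenizes the string, and doc.vector is the mean of its
    # token vectors; the plain nlp.vocab[s].vector lookup instead
    # returns an all-zeros vector for any string missing from the
    # model's vectors table.
    return nlp(s).vector

vec("cheese")          # works with either version
vec("cheddar cheese")  # meaningful here; all zeros with the vocab lookup
```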
Very good explanation.
Enjoyed reading this. Thank you!
I agree! I thought I had deleted many cells, so I downloaded it again looking for the gap.
When I ran snippets of code that open a file, I got errors like this: "FileNotFoundError: [Errno 2] No such file or directory: 'pg345.txt'", and the same thing with the color file: "FileNotFoundError: [Errno 2] No such file or directory: 'xkcd.json'". Note: I tried doing it in Visual Studio Code but it gave me the same problem, even after saving the file in the same directory. I've also read online to use the absolute path, but it still would not work.
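A quick way to debug this kind of FileNotFoundError is to print where the kernel is actually running, since the notebook's working directory often isn't the folder the file was saved into. A small sketch:

```python
import os
from pathlib import Path

# The directory Python resolves relative paths against.
print(os.getcwd())

# Is pg345.txt / xkcd.json actually in that directory?
print(sorted(p.name for p in Path.cwd().iterdir()))
```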
Very nice tutorial!