Gist by nulligor (3d04e2abaf33a1b639dd9d9b9220022d), last active August 24, 2019.
PT-BR Word Embedding Visualization Ideas
== Read Comment Section ==
Word2vec
What it is:
Word2Vec was introduced in two papers between September and October 2013 by a team of researchers at Google. Along with the papers, the researchers published their implementation in C.
The underlying assumption of Word2Vec is that two words sharing similar contexts also share a similar meaning and consequently a similar vector representation from the model. For instance: "dog", "puppy" and "pup" are often used in similar situations, with similar surrounding words like "good", "fluffy" or "cute", and according to Word2Vec they will therefore share a similar vector representation.
From this assumption, Word2Vec can be used to find out the relations between words in a dataset, compute the similarity between them, or use the vector representation of those words as input for other applications such as text classification or clustering.
It's essentially a key-value mapping over an N-dimensional coordinate space: each word is the key, and its location in that N-dimensional space is the value.
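To make the key-value picture concrete, here is a minimal sketch with invented toy vectors (real models use hundreds of dimensions). Words used in similar contexts end up close together, which a plain Euclidean distance can verify:

```python
import math

# Toy stand-in for a trained model: each word (key) maps to its
# coordinates in the embedding space (value). These 4-dimensional
# vectors are invented for illustration only.
embeddings = {
    "dog":   (0.91, 0.80, 0.12, 0.05),
    "puppy": (0.88, 0.76, 0.18, 0.07),
    "car":   (0.10, 0.22, 0.90, 0.81),
}

def euclidean(u, v):
    """Straight-line distance between two points in the embedding space."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# "dog" sits much closer to "puppy" than to "car":
print(euclidean(embeddings["dog"], embeddings["puppy"]) <
      euclidean(embeddings["dog"], embeddings["car"]))  # True
```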
Preparing the data
Using the Gensim Library
After we sanitize our dataset, we can mmap what is left of it into memory for fast access and fast indexing.
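The point of the mmap step can be sketched with the standard library alone: the OS pages in only the bytes we touch, so looking up one word's vector never loads the whole file. The file layout and sizes below are invented for illustration; Gensim exposes the same idea through `KeyedVectors.load(path, mmap='r')`.

```python
import mmap
import os
import struct
import tempfile

# Toy setup: write 1000 "vectors" of 300 float32s to disk, standing in
# for the sanitized model file. Word i's vector is filled with float(i)
# so we can recognize it on read-back.
n_words, dims = 1000, 300
path = os.path.join(tempfile.mkdtemp(), "vectors.bin")
with open(path, "wb") as f:
    for i in range(n_words):
        f.write(struct.pack(f"{dims}f", *([float(i)] * dims)))

# Memory-map the file and slice out just word 42's row: only those
# bytes are faulted into memory, not the whole matrix.
row_bytes = dims * 4  # 4 bytes per float32
with open(path, "rb") as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
    start = 42 * row_bytes
    vector = struct.unpack(f"{dims}f", mm[start:start + row_bytes])

print(vector[0])  # 42.0
```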
We can use the model to gain insights such as:
Essentially, a Euclidean distance is computed between "homem" and all other words in the N-dimensional space (a 300-dimension skip-gram model was used for these results).
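A minimal sketch of this nearest-word query over invented toy vectors follows; the PT-BR words and their coordinates are made up for illustration. Note that Gensim's `model.most_similar` actually ranks by cosine similarity rather than raw Euclidean distance, which is what this sketch mirrors:

```python
import math

# Invented toy vectors standing in for the trained PT-BR model.
embeddings = {
    "homem":  (0.90, 0.70, 0.10),
    "mulher": (0.85, 0.75, 0.15),
    "rei":    (0.80, 0.60, 0.30),
    "carro":  (0.10, 0.20, 0.95),
}

def most_similar(word, topn=3):
    """Rank every other vocabulary word by cosine similarity to `word`,
    mirroring the shape of Gensim's model.most_similar output."""
    def cos(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        nu = math.sqrt(sum(a * a for a in u))
        nv = math.sqrt(sum(b * b for b in v))
        return dot / (nu * nv)

    target = embeddings[word]
    ranked = [(w, cos(target, v)) for w, v in embeddings.items() if w != word]
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)[:topn]

# Returns (word, similarity) pairs, closest first:
print(most_similar("homem", topn=2))
```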
MongoDB and Graph Modeling
The idea was to set a threshold of 0.6 and "draw an edge", which in this document-based approach basically means allowing the record to be persisted together with its "related words". We would use Celery to asynchronously process every word in model.vocab; for "homem", we would persist a document holding its related words. Using MongoDB's $graphLookup, we could then perform an in-depth traversal of this "graph document".