PT-BR Word Embedding Visualization Ideas
== Read Comment Section ==
nulligor commented Jul 24, 2019

Word2Vec

What it is:

Word2Vec was introduced in two papers published between September and October 2013 by a team of researchers at Google. Along with the papers, the researchers released their implementation in C.

The underlying assumption of Word2Vec is that two words sharing similar contexts also share a similar meaning and consequently a similar vector representation from the model. For instance: "dog", "puppy" and "pup" are often used in similar situations, with similar surrounding words like "good", "fluffy" or "cute", and according to Word2Vec they will therefore share a similar vector representation.

From this assumption, Word2Vec can be used to discover relations between words in a dataset, compute the similarity between them, or use the vector representations of those words as input for other applications such as text classification or clustering.

It's essentially a key-value store over an N-dimensional coordinate space: each word is a key, and its location in that N-dimensional space is the value.
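
As a rough mental model (a toy sketch; the words and numbers below are made up, and real models use hundreds of dimensions):

import numpy as np

# Each word is a key; its value is a point in N-dimensional space
embeddings = {
    "cão":     np.array([0.21, -0.43, 0.77]),   # "dog"
    "filhote": np.array([0.19, -0.40, 0.81]),   # "puppy"
    "carro":   np.array([-0.65, 0.12, -0.30]),  # "car"
}

# Words used in similar contexts end up with nearby vectors
print(np.linalg.norm(embeddings["cão"] - embeddings["filhote"]))  # small
print(np.linalg.norm(embeddings["cão"] - embeddings["carro"]))    # large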

Preparing the data

import re
from functools import partial

import requests
from django.utils.text import slugify

# https://raw.githubusercontent.com/pythonprobr/palavras/master/palavras.txt
initial_words = open("palavras.txt")

# http://nilc.icmc.usp.br/nilc/index.php/repositorio-de-word-embeddings-do-nilc
# Pre-trained word embeddings in pt-br
initial_skipagram = open("skip_s300.txt")

# Keep only embedding lines whose word token (digits and non-word
# characters stripped) appears in the reference word list
def filter_skipagram(coll, string):
    return re.sub(r"\W+", "", "".join(i for i in string if not i.isdigit())) in coll

# Use a set for O(1) membership tests; skip the header line of the embeddings file
union = list(filter(partial(filter_skipagram, {w.rstrip() for w in initial_words.readlines()}),
                    [w.rstrip() for w in initial_skipagram.readlines()][1:]))

# Filter against an external API: drop words the dictionary returns no results for
def filter_word2vec_external(string):
    r = requests.get(f"https://dicionariocriativo.com.br/{slugify(string.split()[0])}")
    return "Sem resultados" not in str(r.content)

union_2 = list(filter(filter_word2vec_external, union))

We can further sanitize our input by removing words that contain special characters and making other adjustments.
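
A possible sketch of such a pass (the filtering rule here is illustrative, not the one actually used):

import re

# Keep only entries whose word token is purely alphabetic (accents allowed)
word_only = re.compile(r"^[a-záâãàéêíóôõúüç]+$", re.IGNORECASE)

union_3 = [entry for entry in union_2 if word_only.match(entry.split()[0])]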

Using the Gensim Library

After we sanitize our dataset, we can memory-map (mmap) what's left of it for fast access and fast indexing:

from gensim.models import KeyedVectors

# Load the sanitized embeddings (word2vec text format)
model = KeyedVectors.load_word2vec_format('word2vec.txt')
# L2-normalize the vectors in place; similarity queries become simple dot products
model.init_sims(replace=True)
# Persist the model so it can be reloaded (and memory-mapped) later
model.save('model')
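
Reloading the saved model with gensim's mmap support keeps the large vector arrays on disk and lets the OS page them in on demand (a minimal sketch; 'model' is the filename saved above):

from gensim.models import KeyedVectors

# Memory-map the stored arrays read-only instead of loading them eagerly
model = KeyedVectors.load('model', mmap='r')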

We can query the model for insights such as:

print(model.most_similar("homem"))
# [('rapaz', 0.6592525839805603), ('indivíduo', 0.6346408724784851), ('garoto', 0.6111440658569336), ('idoso', 0.5899262428283691), ('bandido', 0.548312783241272), ('forasteiro', 0.542189359664917), ('menino', 0.5391571521759033), ('mendigo', 0.5349756479263306), ('jovem', 0.5300593972206116), ('fugitivo', 0.5288075804710388)]

Essentially, the model computes the cosine similarity between "homem" and every other word vector in the vocabulary (a 300-dimension skip-gram model was used for these results).
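
Roughly, this is what most_similar is doing under the hood (a sketch with numpy; because init_sims normalized every vector to unit length, the denominator is 1 and the similarity reduces to a dot product):

import numpy as np

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

v_homem = model["homem"]  # 300-dimensional vector
v_rapaz = model["rapaz"]
print(cosine_similarity(v_homem, v_rapaz))  # ~0.659, matching most_similar above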

MongoDB and Graph Modeling

The idea was to set a similarity threshold of 0.6 and "draw an edge", which in this document-based approach basically means allowing the record to be persisted along with its "related words".

We would use Celery to asynchronously process every word in model.vocab; for "homem" we would persist the following document (a sketch of such a task comes right after it):

{
  "label": "homem",
  "related": ["rapaz", "indivíduo", "garoto"]
}
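
A minimal sketch of such a Celery task (the broker URL, database and collection names here are assumptions, not part of the original setup):

from celery import Celery
from gensim.models import KeyedVectors
from pymongo import MongoClient

app = Celery("embeddings", broker="redis://localhost:6379/0")  # assumed broker
collection = MongoClient()["embeddings"]["words"]  # assumed db/collection names
model = KeyedVectors.load("model", mmap="r")

@app.task
def build_related(word, threshold=0.6):
    # "Draw an edge" to every neighbour scoring above the threshold
    related = [w for w, score in model.most_similar(word) if score >= threshold]
    collection.update_one({"label": word}, {"$set": {"related": related}}, upsert=True)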

Using MongoDB's $graphLookup we could perform an in-depth traversal of this "graph document".
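
A possible traversal with pymongo (field and collection names follow the example document above; maxDepth is an arbitrary choice):

pipeline = [
    {"$match": {"label": "homem"}},
    {"$graphLookup": {
        "from": "words",
        "startWith": "$related",
        "connectFromField": "related",
        "connectToField": "label",
        "as": "neighbourhood",
        "maxDepth": 2,
    }},
]
for doc in collection.aggregate(pipeline):
    print(doc["label"], [d["label"] for d in doc["neighbourhood"]])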
