TF-Hub text embedding modules for underrepresented languages

Mentors:

  • Morgan Roff
  • Sayak Paul
  • jaeyounkim

This is a summary of my GSoC 2021 project. In this project, I worked on producing text embedding modules trained on underrepresented languages such as Arabic and Swahili and publishing them on tfhub.dev.

My contribution consisted of two main features:

  • Arabic BERT model
  • Swahili word2vec based model

An important part of this project was researching the nature of embeddings, how to evaluate them, and what data are available to train models on. This resulted in a plan for which problems to tackle and in which languages.

Since the BERT architecture requires large computational resources, the choice here was to use an already trained BERT model in order to deepen my understanding of BERT and how to use it for embeddings.

For training from scratch, we chose a smaller model based on the word2vec architecture, which allowed us to train without needing many resources.
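As a rough illustration of how lightweight this kind of training is, here is a hedged sketch using gensim on a toy corpus; the corpus and hyperparameters below are placeholders, not the ones used in the project.

```python
from gensim.models import Word2Vec

# Toy corpus standing in for the tokenized Wikipedia sentences used later.
corpus = [["mfano", "wa", "sentensi"], ["sentensi", "nyingine", "ya", "mfano"]]

# Training runs on CPU; for a corpus this small it finishes in seconds.
model = Word2Vec(sentences=corpus, vector_size=100, window=5, min_count=1, workers=4)

print(model.wv["sentensi"].shape)  # (100,)
```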

As a future step of this project, now that I am familiar with these architectures, I can try to train these models from scratch as well.


AraBERT

The first milestone aims at publishing the open-source AraBERT model on TF Hub with good documentation on how to load and use it.

An overview of the BERT collection on TF Hub can be found here: https://tfhub.dev/google/collections/bert/1

The model output has 3 main components:

1- Last hidden state of the model: a tensor of shape (1, sequence_length, 768), type: tensorflow.python.framework.ops.EagerTensor

2- Pooler output: output of a layer that reduces the Transformer output from [batch_size, sequence_length, hidden_size] to [batch_size, hidden_size] (a dense layer applied to the hidden state of the first token, playing a role similar to a pooling layer such as GlobalMaxPool1D). In our case the shape is (1, 768), type: tensorflow.python.framework.ops.EagerTensor

3- Hidden states: Tuple of tf.Tensor (one for the output of the embeddings + one for the output of each layer) of shape (batch_size, sequence_length, hidden_size).

Here, sequence_length is the number of tokens in the input text.
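These outputs can be inspected directly. Below is a minimal sketch assuming the model is loaded through the transformers library with the aubmindlab/bert-base-arabert checkpoint; the exact handle of the published TF Hub module may differ.

```python
from transformers import AutoTokenizer, TFAutoModel

name = "aubmindlab/bert-base-arabert"  # assumed checkpoint name
tokenizer = AutoTokenizer.from_pretrained(name)
# from_pt=True may be needed if the checkpoint only ships PyTorch weights.
model = TFAutoModel.from_pretrained(name)

# Tokenize an example Arabic sentence and run it through the model.
inputs = tokenizer("مثال على جملة عربية", return_tensors="tf")
outputs = model(inputs, output_hidden_states=True)

print(outputs.last_hidden_state.shape)  # (1, sequence_length, 768)
print(outputs.pooler_output.shape)      # (1, 768)
print(len(outputs.hidden_states))       # embedding output + one per layer
```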

DONE:

  • Published the AraBERT model on TF Hub.
  • Created documentation for using the model with the appropriate tokenizer.
  • Created and merged pull request with the model's license.
  • Fixed the model documentation to fit TF Hub's requirements.

TO DO:

  • Package the tokenizer as a TF model to simplify loading it from TF Hub.
  • Create a tutorial where we:
    • Load the model
    • Load the tokenizer
    • Create an embedding
    • Evaluate it manually using classic examples that assess embeddings, such as computing the cosine distance between the embeddings of "man" and "boy", or checking whether the most similar words to a word like "king" make sense (see the sketch after this list)
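A hedged sketch of what that manual evaluation could look like, reusing the assumed checkpoint from the sketch above; the two Arabic words are illustrative stand-ins for "man" and "boy".

```python
import tensorflow as tf
from transformers import AutoTokenizer, TFAutoModel

name = "aubmindlab/bert-base-arabert"  # assumed checkpoint name
tokenizer = AutoTokenizer.from_pretrained(name)
model = TFAutoModel.from_pretrained(name)

def embed(text):
    # Use the pooler output as a fixed-size (768,) embedding of the input.
    return model(tokenizer(text, return_tensors="tf")).pooler_output[0]

man, boy = embed("رجل"), embed("ولد")

# tf.keras.losses.cosine_similarity returns the *negative* similarity.
similarity = -tf.keras.losses.cosine_similarity(man, boy).numpy()
print(similarity)  # values closer to 1.0 indicate more similar embeddings
```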

Related links:


Swahili word2vec based model

The second milestone consists of creating and training a word2vec-style embedding for the Swahili language using Wikipedia data, then using that embedding to create a single-layer model that takes a string as input, tokenizes it, and outputs a 100-dimensional embedding vector.
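A minimal sketch of what such a single-layer model could look like, assuming the trained 100-dimensional word2vec vectors have been exported as a NumPy matrix aligned with a vocabulary list; the vocabulary and the random matrix below are placeholders.

```python
import numpy as np
import tensorflow as tf

vocab = ["mwanaume", "kijana", "mfalme"]       # placeholder vocabulary
vectors = np.random.rand(len(vocab) + 2, 100)  # +2 rows for padding and OOV;
                                               # real code uses the word2vec matrix

vectorizer = tf.keras.layers.TextVectorization(vocabulary=vocab)
embedding = tf.keras.layers.Embedding(
    input_dim=len(vocab) + 2,
    output_dim=100,
    embeddings_initializer=tf.keras.initializers.Constant(vectors),
    trainable=False,
)

# String in, one 100-dimensional vector per token out.
inputs = tf.keras.Input(shape=(1,), dtype=tf.string)
outputs = embedding(vectorizer(inputs))
model = tf.keras.Model(inputs, outputs)

print(model(tf.constant([["mwanaume"]])).shape)  # (1, 1, 100)
```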

To evaluate this model, since there aren't many ways to automatically evaluate an embedding, we chose to evaluate it manually using simple examples: for instance, creating an embedding out of the word "mwanaume" (which means "man") and measuring the cosine distance between it and the embedding of "kijana" (which means "boy"). We also look at the most similar words in the vocabulary to evaluate how well the model represents words.
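A hedged sketch of this manual evaluation, assuming the Swahili vectors were trained and saved with gensim's Word2Vec; the file name is hypothetical, and "mfalme" ("king") is used only as an illustrative probe word.

```python
from gensim.models import Word2Vec

w2v = Word2Vec.load("swahili_word2vec.model")  # hypothetical file name

# Cosine similarity between "man" and "boy" in Swahili.
print(w2v.wv.similarity("mwanaume", "kijana"))

# Nearest neighbours of "mfalme" ("king") as a sanity check.
print(w2v.wv.most_similar("mfalme", topn=5))
```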

DONE:

  • Created and trained a word2vec embedding for Swahili.
  • Created a Keras single-layer model out of the embeddings.
  • Created documentation for the model.
  • Created a pull request with this model.

TO DO:

  • Package the tokenizer as a TF model to simplify loading it from TF Hub.
  • Merge the pull request:
    • Apply the requested changes to the model files
    • Apply the requested changes to the documentation
  • Add a tutorial notebook for using the model, creating an example use case where we:
    • Load the model
    • Load the tokenizer
    • Create an embedding
    • Evaluate it manually using classic examples that assess embeddings, such as computing the cosine distance between the embeddings of "man" and "boy", or checking whether the most similar words to a word like "king" make sense

Related links:

sayakpaul commented Aug 19, 2021

  • word2vec is still missing a citation.

  • Evaluate it manually using some classic examples that assess embeddings.

    As a reader, it is still not clear to me what you mean by "using some classic examples that assess embeddings."
