
@CallmeMehdi
Last active August 24, 2021 12:13

TF-Hub text embedding modules for underrepresented languages

Mentors:

  • Morgan Roff
  • Sayak Paul
  • jaeyounkim

This is a summary of my GSoC 2021 project. In this project, I worked on producing text embedding modules trained on underrepresented languages, such as Arabic and Swahili, and publishing them on tfhub.dev.

My contribution consisted of two main deliverables:

  • An Arabic BERT model (AraBERT)
  • A Swahili word2vec-based model

An important part of this project was researching the nature of embeddings, how to evaluate them, and what data are available to train models on. This included looking for embedding evaluation tasks that could be used to assess models in underrepresented languages and surveying existing work in this area, and it resulted in a plan for which problems to tackle and in which languages.

Since training the BERT architecture (https://arxiv.org/abs/1810.04805) requires large computational resources, the choice here was to use an already trained BERT model and to deepen my understanding of BERT and of how to use it for embeddings.

For training from scratch, we chose a smaller model based on the word2vec architecture (https://arxiv.org/abs/1301.3781), which allowed us to train from scratch without needing many resources.

As a future step, now that I am familiar with these architectures, I can try to train them from scratch.


AraBERT

The first milestone aims at publishing the open-source AraBERT model on TF Hub with good documentation on how to load and use it.

An overview of the BERT collection on TF Hub is available here: https://tfhub.dev/google/collections/bert/1

The model output has three main components:

1- Last hidden state of the model: a tensor of shape (1, sequence_length, 768), type: tensorflow.python.framework.ops.EagerTensor

2- Pooler output: the output of a layer that reduces the Transformer output from [batch_size, sequence_length, hidden_size] to [batch_size, hidden_size] by passing the hidden state of the first token through a dense layer, collapsing the sequence dimension in a way similar to GlobalMaxPool1D. In our case the shape is (1, 768), type: tensorflow.python.framework.ops.EagerTensor

3- Hidden states: a tuple of tf.Tensor (one for the output of the embeddings + one for the output of each layer), each of shape (batch_size, sequence_length, hidden_size).

Here, sequence_length is the number of tokens in the input text.
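
For reference, here is a minimal sketch of how these three outputs can be inspected. It assumes the model is loaded through the Hugging Face transformers library with the aubmindlab/bert-base-arabert checkpoint; the handle of the model published on TF Hub may differ.

```python
from transformers import AutoTokenizer, TFAutoModel

# Assumption: the aubmindlab/bert-base-arabert checkpoint from the Hugging Face Hub;
# the TF Hub handle of the published model may differ.
MODEL_NAME = "aubmindlab/bert-base-arabert"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = TFAutoModel.from_pretrained(MODEL_NAME)

# Tokenize a single Arabic sentence into input ids, token type ids and attention mask.
inputs = tokenizer("مثال بسيط", return_tensors="tf")

# Ask for the per-layer hidden states in addition to the default outputs.
outputs = model(inputs, output_hidden_states=True)

print(outputs.last_hidden_state.shape)  # (1, sequence_length, 768)
print(outputs.pooler_output.shape)      # (1, 768)
print(len(outputs.hidden_states))       # embedding output + one tensor per layer
```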

DONE:

  • Published the AraBERT model on TF Hub.
  • Created documentation for using the model with the appropriate tokenizer.
  • Created and merged pull request with the model's license.
  • Fixed the model documentation so that it fits TF Hub's format.

TO DO:

  • Package the tokenizer as a TF model to simplify loading it from TF Hub.
  • Create a tutorial where we:
    • Load the model
    • Load the tokenizer
    • Create an embedding
    • Evaluate it manually using some classic checks for embeddings, such as computing the cosine distance between the embeddings of "man" and "boy", or inspecting the most similar words to a word like "king" to test whether they make sense (a sketch follows below)
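
To make this evaluation step more concrete, here is a rough sketch of how it could look, using the pooler output as the embedding and cosine similarity; the checkpoint name and the Arabic word pairs are only illustrative.

```python
import numpy as np
from transformers import AutoTokenizer, TFAutoModel

# Assumption: same illustrative checkpoint as above; any BERT-style encoder works the same way.
MODEL_NAME = "aubmindlab/bert-base-arabert"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = TFAutoModel.from_pretrained(MODEL_NAME)

def embed(text):
    """Return the pooler output as a single embedding vector for `text`."""
    inputs = tokenizer(text, return_tensors="tf")
    return model(inputs).pooler_output[0].numpy()

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# A related pair ("man" / "boy") should score higher than an unrelated pair ("man" / "apple").
print(cosine_similarity(embed("رجل"), embed("ولد")))    # man vs boy
print(cosine_similarity(embed("رجل"), embed("تفاحة")))  # man vs apple
```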

Related links:


Swahili word2vec based model

The second milestone consists of creating and training a word2vec-type embedding for the Swahili language using Wikipedia data, then using that embedding to create a single-layer model that takes a string as input, tokenizes it, and outputs a 100-dimensional embedding vector.
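
To illustrate what such a single-layer model can look like, here is a minimal sketch. It assumes the trained word2vec vectors have already been exported to a vocabulary list and a matching 100-dimensional weight matrix; the tiny vocabulary and random weights below are placeholders, not the real trained values.

```python
import numpy as np
import tensorflow as tf

# Placeholders standing in for the real word2vec export: a Swahili vocabulary list and
# its (vocab_size, 100) weight matrix.
vocab = ["mwanaume", "kijana", "mfalme"]
vectors = np.random.rand(len(vocab), 100).astype("float32")

# Maps raw strings to integer token ids; index 0 is reserved for padding, 1 for OOV tokens.
vectorizer = tf.keras.layers.TextVectorization(vocabulary=vocab)

# Embedding layer initialised with the word2vec weights (two zero rows for padding/OOV).
embedding = tf.keras.layers.Embedding(
    input_dim=len(vocab) + 2,
    output_dim=100,
    embeddings_initializer=tf.keras.initializers.Constant(
        np.vstack([np.zeros((2, 100), dtype="float32"), vectors])
    ),
    trainable=False,
)

# String in, one 100-dimensional vector per token out.
inputs = tf.keras.Input(shape=(), dtype=tf.string)
outputs = embedding(vectorizer(inputs))
model = tf.keras.Model(inputs, outputs)

print(model(tf.constant(["mwanaume kijana"])).shape)  # (1, 2, 100)
```

Because the TextVectorization layer lives inside the Keras model, saving this model as a SavedModel would also be one route to the "package the tokenizer as a TF model" item below.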

To evaluate this model, since there aren't many ways to evaluate an embedding automatically, we chose to evaluate it manually using simple examples: for instance, creating an embedding out of the word "mwanaume" (man) and measuring the cosine distance between it and the embedding of "kijana" (boy). We also look at the most similar words in the vocabulary to judge how well the model represents words.
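
Below is a small sketch of this training and manual-evaluation loop with gensim. It assumes the Swahili Wikipedia dump has already been cleaned and tokenized into sentences; the toy sentences here are placeholders for that corpus.

```python
from gensim.models import Word2Vec

# Placeholder for the tokenized Swahili Wikipedia corpus: an iterable of token lists.
sentences = [
    ["mwanaume", "huyo", "ni", "kijana"],
    ["mfalme", "na", "malkia"],
]

# vector_size=100 matches the 100-dimensional embedding described above.
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, workers=4)

# Manual checks: cosine similarity between related words, and nearest neighbours.
print(model.wv.similarity("mwanaume", "kijana"))  # "man" vs "boy"
print(model.wv.most_similar("mfalme", topn=5))    # words closest to "king"
```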

DONE:

  • Created and trained a word2vec embedding for Swahili.
  • Created a Keras single-layer model out of the embeddings.
  • Created documentation for the model.
  • Created a pull request with this model.

TO DO:

  • Package the tokenizer as a TF model to simplify loading it from TF Hub.
  • Merge the pull request:
    • Address the requested changes in the model files
    • Address the requested changes in the documentation
  • Add a tutorial notebook for using the model, where we:
    • Load the model
    • Load the tokenizer
    • Create an embedding
    • Evaluate it manually using some classic checks for embeddings, such as computing the cosine distance between the embeddings of "man" and "boy", or inspecting the most similar words to a word like "king" to test whether they make sense

Related links:

@MorganR

MorganR commented Aug 17, 2021

Thank you Mehdi! Some notes to improve your report:

  1. Research was an important part of your project early on that took significant time. It's worth mentioning that you started by looking for embedding tasks that could be used to assess models in underrepresented languages, and also that you looked for existing work in this area. This also helps clarify where AraBERT came from, and why you chose a simple word2vec model for Swahili.
  2. Could you clarify what you mean by "Created a Keras model out of the AraBERT open-source model." Wasn't the AraBERT model already in TF format? I don't recall seeing code for this, but if there is some, then that would also be a great thing to add to your repo.
  3. Another TODO that would be nice for the AraBERT model would be to package the tokenizer into a TF model as well. This was done for other BERT models on Hub (eg), and greatly simplifies integration with the model for users.
  4. It would help to include more information about the TODOs. What is your plan for completing them?

@sayakpaul

Package the tokenizer as a TF model to simplying loading it from TF Hub.

Seems like a typo.

Add a tutorial notebook for using the model: Creating an examplary use case where we:

We can further simplify this with something like - Creating a tutorial where we:.

Evaluate it manually

Makes sense to also expand a bit more on this and comment on how you'd like to evaluate it.

TF Hub Modedl

Typo.

Add a tutorial notebook for using the model:

This suffices.

On a related note, you can refer to the BERT collection on TF-Hub to understand how the encoder models have referenced their corresponding preprocessing modules.

Lastly, please provide references to the model architectures you are referring to throughout your report. You can either directly hyperlink them like BERT or write BERT [1] and reference them later under a separate section called "References". Citing these the first time is sufficient. This means that when you mention BERT for the first time, you cite it; for the rest of the mentions you may not need to cite it.

@sayakpaul

sayakpaul commented Aug 19, 2021

  • word2vec is still missing a citation.

  • Evaluate it manually using some classic examples that assess embeddings.

    As a reader, it is still not clear to me what you mean by "using some classic examples that assess embeddings."
