
@CallmeMehdi
Last active August 24, 2021 12:13

TF-Hub text embedding modules for underrepresented languages

Mentors:

  • Morgan Roff
  • Sayak Paul
  • jaeyounkim

This is a summary of my GSoC 2021 project. In this project, I worked on producing text embedding modules trained on underrepresented languages, such as Arabic and Swahili, and publishing them on tfhub.dev.

My contribution consisted of two main deliverables:

  • An Arabic BERT model (AraBERT)
  • A Swahili word2vec-based model

An important part of this project was researching the nature of embeddings, how to evaluate them, and what data are available to train models on. This included looking for embedding evaluation tasks that could be used to assess models in underrepresented languages and surveying existing work in this area, and it resulted in a plan for which problems to tackle and in which languages.

Since training the BERT architecture (https://arxiv.org/abs/1810.04805) requires large computational resources, the choice here was to use an already trained BERT model and to deepen my understanding of BERT and of how to use it for embeddings.

For training from scratch, we chose a smaller model based on the word2vec architecture (https://arxiv.org/abs/1301.3781), which allowed us to train from scratch without needing many resources.

As a future step, now that I am familiar with these architectures, I can try to train them from scratch.


AraBERT

The first milestone aims at publishing the open-source AraBERT model on TF Hub with good documentation on how to load and use it.

An overview of the BERT collection on TF Hub is available here: https://tfhub.dev/google/collections/bert/1

The model output has three main components:

1- Last hidden state of the model: a tensor of shape (1, sequence_length, 768), type: tensorflow.python.framework.ops.EagerTensor

2- Pooler output: the output of a layer that reduces the Transformer output from [batch_size, sequence_length, hidden_size] to [batch_size, hidden_size] by passing the hidden state of the first token through a dense layer, collapsing the sequence dimension in a way similar to GlobalMaxPool1D. In our case the shape is (1, 768), type: tensorflow.python.framework.ops.EagerTensor

3- Hidden states: a tuple of tf.Tensor (one for the output of the embeddings + one for the output of each layer), each of shape (batch_size, sequence_length, hidden_size).

Here, sequence_length is the number of tokens in the input text.
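
For reference, here is a minimal sketch of how these three outputs can be inspected. It assumes the model is loaded through the Hugging Face transformers library with the aubmindlab/bert-base-arabert checkpoint; the handle of the model published on TF Hub may differ.

```python
from transformers import AutoTokenizer, TFAutoModel

# Assumption: the aubmindlab/bert-base-arabert checkpoint from the Hugging Face Hub;
# the TF Hub handle of the published model may differ.
MODEL_NAME = "aubmindlab/bert-base-arabert"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = TFAutoModel.from_pretrained(MODEL_NAME)

# Tokenize a single Arabic sentence into input ids, token type ids and attention mask.
inputs = tokenizer("مثال بسيط", return_tensors="tf")

# Ask for the per-layer hidden states in addition to the default outputs.
outputs = model(inputs, output_hidden_states=True)

print(outputs.last_hidden_state.shape)  # (1, sequence_length, 768)
print(outputs.pooler_output.shape)      # (1, 768)
print(len(outputs.hidden_states))       # embedding output + one tensor per layer
```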

DONE:

  • Published the AraBERT model on TF Hub.
  • Created documentation for using the model with the appropriate tokenizer.
  • Created and merged pull request with the model's license.
  • Fixed the model documentation so that it fits TF Hub's format.

TO DO:

  • Package the tokenizer as a TF model to simplify loading it from TF Hub.
  • Create a tutorial where we:
    • Load the model
    • Load the tokenizer
    • Create an embedding
    • Evaluate it manually using some classic checks for embeddings, such as computing the cosine distance between the embeddings of "man" and "boy", or inspecting the most similar words to a word like "king" to test whether they make sense (a sketch follows below)
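
To make this evaluation step more concrete, here is a rough sketch of how it could look, using the pooler output as the embedding and cosine similarity; the checkpoint name and the Arabic word pairs are only illustrative.

```python
import numpy as np
from transformers import AutoTokenizer, TFAutoModel

# Assumption: same illustrative checkpoint as above; any BERT-style encoder works the same way.
MODEL_NAME = "aubmindlab/bert-base-arabert"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = TFAutoModel.from_pretrained(MODEL_NAME)

def embed(text):
    """Return the pooler output as a single embedding vector for `text`."""
    inputs = tokenizer(text, return_tensors="tf")
    return model(inputs).pooler_output[0].numpy()

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# A related pair ("man" / "boy") should score higher than an unrelated pair ("man" / "apple").
print(cosine_similarity(embed("رجل"), embed("ولد")))    # man vs boy
print(cosine_similarity(embed("رجل"), embed("تفاحة")))  # man vs apple
```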

Related links:


Swahili word2vec based model

The second milestone consists of creating and training a word2vec-type embedding for the Swahili language using Wikipedia data, then using that embedding to create a single-layer model that takes a string as input, tokenizes it, and outputs a 100-dimensional embedding vector.
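
To illustrate what such a single-layer model can look like, here is a minimal sketch. It assumes the trained word2vec vectors have already been exported to a vocabulary list and a matching 100-dimensional weight matrix; the tiny vocabulary and random weights below are placeholders, not the real trained values.

```python
import numpy as np
import tensorflow as tf

# Placeholders standing in for the real word2vec export: a Swahili vocabulary list and
# its (vocab_size, 100) weight matrix.
vocab = ["mwanaume", "kijana", "mfalme"]
vectors = np.random.rand(len(vocab), 100).astype("float32")

# Maps raw strings to integer token ids; index 0 is reserved for padding, 1 for OOV tokens.
vectorizer = tf.keras.layers.TextVectorization(vocabulary=vocab)

# Embedding layer initialised with the word2vec weights (two zero rows for padding/OOV).
embedding = tf.keras.layers.Embedding(
    input_dim=len(vocab) + 2,
    output_dim=100,
    embeddings_initializer=tf.keras.initializers.Constant(
        np.vstack([np.zeros((2, 100), dtype="float32"), vectors])
    ),
    trainable=False,
)

# String in, one 100-dimensional vector per token out.
inputs = tf.keras.Input(shape=(), dtype=tf.string)
outputs = embedding(vectorizer(inputs))
model = tf.keras.Model(inputs, outputs)

print(model(tf.constant(["mwanaume kijana"])).shape)  # (1, 2, 100)
```

Because the TextVectorization layer lives inside the Keras model, saving this model as a SavedModel would also be one route to the "package the tokenizer as a TF model" item below.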

To evaluate this model, since there aren't many ways to evaluate an embedding automatically, we chose to evaluate it manually using simple examples: for instance, creating an embedding out of the word "mwanaume" (man) and measuring the cosine distance between it and the embedding of "kijana" (boy). We also look at the most similar words in the vocabulary to judge how well the model represents words.
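
Below is a small sketch of this training and manual-evaluation loop with gensim. It assumes the Swahili Wikipedia dump has already been cleaned and tokenized into sentences; the toy sentences here are placeholders for that corpus.

```python
from gensim.models import Word2Vec

# Placeholder for the tokenized Swahili Wikipedia corpus: an iterable of token lists.
sentences = [
    ["mwanaume", "huyo", "ni", "kijana"],
    ["mfalme", "na", "malkia"],
]

# vector_size=100 matches the 100-dimensional embedding described above.
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, workers=4)

# Manual checks: cosine similarity between related words, and nearest neighbours.
print(model.wv.similarity("mwanaume", "kijana"))  # "man" vs "boy"
print(model.wv.most_similar("mfalme", topn=5))    # words closest to "king"
```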

DONE:

  • Created and trained a word2vec embedding for Swahili.
  • Created a Keras single-layer model out of the embeddings.
  • Created documentation for the model.
  • Created a pull request with this model.

TO DO:

  • Package the tokenizer as a TF model to simplify loading it from TF Hub.
  • Merge the pull request:
    • Address the requested changes in the model files
    • Address the requested changes in the documentation
  • Add a tutorial notebook for using the model, where we:
    • Load the model
    • Load the tokenizer
    • Create an embedding
    • Evaluate it manually using some classic checks for embeddings, such as computing the cosine distance between the embeddings of "man" and "boy", or inspecting the most similar words to a word like "king" to test whether they make sense

Related links:

@MorganR

MorganR commented Aug 17, 2021

Thank you Mehdi! Some notes to improve your report:

  1. Research was an important part of your project early on that took significant time. It's worth mentioning that you started by looking for embedding tasks that could be used to assess models in underrepresented languages, and also that you looked for existing work in this area. This also helps clarify where AraBERT came from, and why you chose a simple word2vec model for Swahili.
  2. Could you clarify what you mean by "Created a Keras model out of the AraBERT open-source model." Wasn't the AraBERT model already in TF format? I don't recall seeing code for this, but if there is some, then that would also be a great thing to add to your repo.
  3. Another TODO that would be nice for the AraBERT model would be to package the tokenizer into a TF model as well. This was done for other BERT models on Hub (eg), and greatly simplifies integration with the model for users.
  4. It would help to include more information about the TODOs. What is your plan for completing them?

@sayakpaul

Package the tokenizer as a TF model to simplying loading it from TF Hub.

Seems like a typo.

Add a tutorial notebook for using the model: Creating an examplary use case where we:

We can further simplify this with something like - Creating a tutorial where we:.

Evaluate it manually

Makes sense to also expand a bit more on this and comment on how you'd like to evaluate it.

TF Hub Modedl

Typo.

Add a tutorial notebook for using the model:

This suffices.

On a related note, you can refer to the BERT collection on TF-Hub to understand how the encoder models have referenced their corresponding preprocessing modules.

Lastly, please provide references to the model architectures you are referring to throughout your report. You can either directly hyperlink them like BERT or write BERT [1] and reference them later under a separate section called "References". Citing these the first time is sufficient. This means that when you mention BERT for the first time, you cite it; for the rest of the mentions you may not need to cite it.

@sayakpaul

sayakpaul commented Aug 19, 2021

  • word2vec is still missing a citation.

  • Evaluate it manually using some classic examples that assess embeddings.

    As a reader, it is still not clear to me what you mean by "using some classic examples that assess embeddings."
