@ethankoch4
Created October 24, 2018 14:39
Teach Me ELMo Word Embeddings Without Math or Code

Have you ever wanted to learn what an algorithm does at a high level without being walked through its Java implementation? Ever sifted through the web looking for the one post that isn't throwing the complex math equations that define the algorithm at you?

This post will explain ELMo without leaning on math or code (the few short code sketches below are optional and entirely skippable). If you want to know how to implement ELMo, there are plenty of resources out there for you already:

- [Great explanation and use in Keras to train an NER model](https://www.depends-on-the-definition.com/named-entity-recognition-with-residual-lstm-and-elmo/)
- [How-To guide from the authors of the original paper](https://github.com/allenai/allennlp/blob/master/tutorials/how_to/elmo.md)
- [How to use the pre-trained model](https://tfhub.dev/google/elmo/2)
- [You could always read through the source code (TensorFlow version) ;)](https://github.com/allenai/bilm-tf)

What Are ELMo Embeddings?

ELMo embeddings are, in essence, simply word embeddings that are a combination of other word embeddings. The reason you may find ELMo embeddings complicated to understand is that they can be used in several different ways. For the sake of clarity, I will focus on only one way to implement them for now.

The problem ELMo embeddings help solve is that the context of a word has a large effect on its meaning. When word embeddings first came out, there was only a single embedding for each word, or, at best, a single embedding for each word sense. ELMo is a step toward tackling this problem.

The way this works is that the ELMo authors have pre-trained a biLSTM language model, which merges a forward language model and a backward language model. All you need to know about these (a tiny illustration follows the list) is:

1) A forward language model predicts the next word given its previous words
2) A backward language model predicts the previous word given its following words
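
If a concrete picture helps, here is a toy illustration (optional, and the sentence is made up); it only shows what each direction is asked to predict, not ELMo's actual training code:

```python
# Toy illustration only -- this is not ELMo's training code.
sentence = ["The", "cat", "sat", "on", "the", "mat"]
t = 2  # position of the word "sat"

forward_context = sentence[:t]       # ["The", "cat"] -> the forward LM predicts "sat"
backward_context = sentence[t + 1:]  # ["on", "the", "mat"] -> the backward LM predicts "sat"
```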

The biLSTM has multiple layers in it, and each layer contains information about the word's context. This means we should be able to use the information from each of these layers to provide a better word embedding. This is exactly what ELMo embeddings do. An ELMo embedding can be the representation of your word in layer 1 and layer 2 concatenated together, and that concatenated vector can be fed directly into your prediction model.
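
If you are curious what that could look like in practice (feel free to skip this), here is a minimal sketch using the pre-trained module from TensorFlow Hub linked in the resources above. It assumes TensorFlow 1.x with the `tensorflow_hub` package, and the sentence is just a toy example:

```python
import tensorflow as tf
import tensorflow_hub as hub

# Load the pre-trained ELMo biLM from TensorFlow Hub (linked in the resources above).
elmo = hub.Module("https://tfhub.dev/google/elmo/2", trainable=False)

sentences = ["ELMo embeddings are combinations of other word embeddings"]
outputs = elmo(sentences, signature="default", as_dict=True)

# The two biLSTM layers, each of shape [batch, max_tokens, 1024].
layer1 = outputs["lstm_outputs1"]
layer2 = outputs["lstm_outputs2"]

# One possible ELMo embedding: layer 1 and layer 2 concatenated per token,
# giving a [batch, max_tokens, 2048] tensor you could feed to your prediction model.
elmo_embedding = tf.concat([layer1, layer2], axis=-1)

with tf.Session() as sess:
    sess.run([tf.global_variables_initializer(), tf.tables_initializer()])
    print(sess.run(elmo_embedding).shape)  # (1, 8, 2048) for this 8-token sentence
```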

You may have realized how easy it would be to tweak the model's inputs or merge the biLSTM layers in a different way. In fact, a non-exhaustive list of the different ways you could use ELMo embeddings includes:

1) Change the input of the language model to be characters instead of words
2) Use a weighted sum of the layer representations to obtain a word embedding (see the sketch just after this list)
3) Change the input of the language model to be word embeddings from another algorithm
4) Concatenate the layers of the ELMo biLSTM with word embeddings from another algorithm as input to your prediction model
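
As an example of option 2, here is a minimal, framework-agnostic sketch of a weighted sum over the layer representations. In practice the weights would be learned along with your prediction model; here they are just placeholder numbers and the layer contents are random:

```python
import numpy as np

def weighted_layer_sum(layers, weights, gamma=1.0):
    """Combine biLM layer outputs into one word embedding, ELMo-style.

    layers:  list of arrays, one per layer, each of shape [num_tokens, dim]
    weights: one raw (unnormalized) weight per layer
    gamma:   optional overall scale
    """
    w = np.exp(weights) / np.sum(np.exp(weights))  # softmax-normalize the weights
    return gamma * sum(wi * layer for wi, layer in zip(w, layers))

# Toy example: three layers of fake representations for a 5-token sentence.
layers = [np.random.randn(5, 1024) for _ in range(3)]
embedding = weighted_layer_sum(layers, weights=[0.2, 0.5, 0.3])
print(embedding.shape)  # (5, 1024)
```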

When Should I Use ELMo Embeddings?

You should use ELMo embeddings if you are concerned that the context-dependent meanings of words are hurting your prediction model's performance. An example: I am training a topic classification model on a corpus of text that contains 'bat' in the baseball sense and 'bat' in the animal sense. The model would have a harder time classifying a text with the word 'bat' because of the two different meanings.
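
If you would like to see this concretely (again, optional), here is a small sketch using the pre-trained TensorFlow Hub module from the resources above; the two sentences are made up, and it assumes TensorFlow 1.x:

```python
import numpy as np
import tensorflow as tf
import tensorflow_hub as hub

elmo = hub.Module("https://tfhub.dev/google/elmo/2")

# Same surface word, two different senses of "bat".
sentences = ["The bat flew out of the cave",   # "bat" is token index 1
             "He swung the bat at the pitch"]  # "bat" is token index 3
# "elmo" is the weighted sum of the biLM layers: [batch, max_tokens, 1024].
embeddings = elmo(sentences, signature="default", as_dict=True)["elmo"]

with tf.Session() as sess:
    sess.run([tf.global_variables_initializer(), tf.tables_initializer()])
    vecs = sess.run(embeddings)

bat_animal, bat_baseball = vecs[0, 1], vecs[1, 3]
cosine = np.dot(bat_animal, bat_baseball) / (
    np.linalg.norm(bat_animal) * np.linalg.norm(bat_baseball))
print(cosine)  # well below 1.0: same word, different contextual embeddings
```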

You can use ELMo embeddings if you have not trained a word embedding algorithm already. You can use them if you have trained a word embedding algorithm already. You can use ELMo embeddings if you are concerned about out-of-vocabulary words harming your model's prediction accuracy. You can use them if you just want to learn more about Natural Language Processing and Deep Learning.

Plugging in ELMo embeddings to your existing deep learning pipeline is quite simple. In fact, the resources I listed at the top of this article are a great way to get started.
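
For instance, one bare-bones way to try them out (a sketch only, with toy data, assuming TensorFlow 1.x, `tensorflow_hub`, and scikit-learn) is to use the pre-trained module's sentence-level output as features for an ordinary classifier:

```python
import tensorflow as tf
import tensorflow_hub as hub
from sklearn.linear_model import LogisticRegression

elmo = hub.Module("https://tfhub.dev/google/elmo/2")

# Toy topic-classification data: 0 = baseball, 1 = animals.
texts = ["he swung the bat at the pitch",
         "the pitcher threw a fastball",
         "a bat flew out of the cave",
         "the owl hunted at night"]
labels = [0, 0, 1, 1]

# "default" is a fixed-size sentence embedding (a mean of the token-level ELMo vectors).
features_op = elmo(texts, signature="default", as_dict=True)["default"]

with tf.Session() as sess:
    sess.run([tf.global_variables_initializer(), tf.tables_initializer()])
    features = sess.run(features_op)  # shape: (4, 1024)

classifier = LogisticRegression().fit(features, labels)
print(classifier.predict(features))
```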

What Did You Learn?

You learned that ELMo embeddings are useful for context-dependent word representations. You learned that ELMo embeddings can be added easily to your existing NLP/DL pipeline. And you learned how generating the ELMo embeddings can be customized to best fit your use case.
