@ThomasDelteil
Created April 19, 2019 06:02
Supaero
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Gluon-NLP"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"1. Pre-trained word embeddings\n",
"2. Pre-trained language models\n",
"3. Fine-tuning BERT \n",
"\n",
"http://gluon-nlp.mxnet.io/"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## Pre-trained word embeddings"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "notes"
}
},
"source": [
"This section shows how to use pre-trained word embeddings, where each word is represented by a vector. Two popular word embeddings are GloVe and fastText. The pre-trained GloVe and fastText embeddings used here come from the following sources:\n",
"\n",
"* GloVe project website: https://nlp.stanford.edu/projects/glove/\n",
"* fastText project website: https://fasttext.cc/\n",
"\n",
"Let us first import the following packages used in this example."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"![](https://cdn-images-1.medium.com/max/1600/1*2r1yj0zPAuaSGZeQfG6Wtw.png)"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"ExecuteTime": {
"end_time": "2018-06-06T21:56:37.022037Z",
"start_time": "2018-06-06T21:56:35.507701Z"
},
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"import mxnet as mx\n",
"from mxnet import gluon, nd\n",
"import gluonnlp as nlp\n",
"import re"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We pick a specific pre-trained embedding: the 50-dimensional GloVe vectors from the `glove.6B.50d` source."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"ExecuteTime": {
"end_time": "2018-06-06T21:56:42.177372Z",
"start_time": "2018-06-06T21:56:39.926375Z"
},
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"embedding = nlp.embedding.create('glove', source='glove.6B.50d')"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"ExecuteTime": {
"end_time": "2018-06-06T21:56:43.752190Z",
"start_time": "2018-06-06T21:56:42.180013Z"
},
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"vocab = nlp.Vocab(nlp.data.Counter(embedding.idx_to_token))\n",
"vocab.set_embedding(embedding)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "notes"
}
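},
"source": [
"The size of `vocab` is shown below. It covers the 400,000 embedding tokens plus the four special tokens that `nlp.Vocab` adds by default (`<unk>`, `<pad>`, `<bos>`, `<eos>`)."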
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"ExecuteTime": {
"end_time": "2018-06-06T21:56:43.761479Z",
"start_time": "2018-06-06T21:56:43.754645Z"
},
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"text/plain": [
"400004"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"len(vocab.idx_to_token)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "notes"
}
},
"source": [
"We can map a token to its index with `vocab[token]` and map an index back to its token with `vocab.idx_to_token`."
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"ExecuteTime": {
"end_time": "2018-06-06T21:56:43.772797Z",
"start_time": "2018-06-06T21:56:43.764792Z"
},
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"71424\n",
"beautiful\n"
]
}
],
"source": [
"print(vocab['beautiful'])\n",
"print(vocab.idx_to_token[71424])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"![](support/cosinesimilarity.png)"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"ExecuteTime": {
"end_time": "2018-06-06T21:56:43.791341Z",
"start_time": "2018-06-06T21:56:43.776953Z"
},
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [],
"source": [
"def cos_sim(x, y):\n",
"    return nd.dot(x, y) / (nd.norm(x) * nd.norm(y))"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"### Word Similarity"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "notes"
}
},
"source": [
"Given an input word, we can find the $k$ nearest words in the vocabulary (400,000 words, excluding the special tokens) by similarity. The similarity between any pair of words is measured by the cosine similarity of their vectors."
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"ExecuteTime": {
"end_time": "2018-06-06T21:56:44.041379Z",
"start_time": "2018-06-06T21:56:44.016299Z"
},
"slideshow": {
"slide_type": "notes"
}
},
"outputs": [],
"source": [
"def norm_vecs_by_row(x):\n",
"    # Scale each row to unit L2 norm.\n",
"    return x / nd.sqrt(nd.sum(x * x, axis=1)).reshape((-1,1))\n",
"\n",
"def get_knn(vocab, k, word):\n",
"    word_vec = vocab.embedding[word].reshape((-1, 1))\n",
"    vocab_vecs = norm_vecs_by_row(vocab.embedding.idx_to_vec)\n",
"    # Skip the first 4 rows, which belong to the special tokens.\n",
"    dot_prod = nd.dot(vocab_vecs[4:], word_vec)\n",
"    indices = nd.topk(dot_prod.squeeze(), k=k+1, ret_typ='indices')\n",
"    indices = [int(i.asscalar())+4 for i in indices]\n",
"    # Drop the top match, which is expected to be the input word itself.\n",
"    return vocab.to_tokens(indices[1:])"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "notes"
}
},
"source": [
"Let us find the 5 words most similar to 'baby' in the vocabulary (size: 400,000 words)."
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"ExecuteTime": {
"end_time": "2018-06-06T21:56:44.508966Z",
"start_time": "2018-06-06T21:56:44.044708Z"
},
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"text/plain": [
"['babies', 'boy', 'girl', 'newborn', 'pregnant']"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"get_knn(vocab, 5, 'baby')"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"We can verify the cosine similarity of vectors of 'baby' and 'babies'."
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"ExecuteTime": {
"end_time": "2018-06-06T21:56:44.520428Z",
"start_time": "2018-06-06T21:56:44.511361Z"
},
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/plain": [
"\n",
"[0.83871305]\n",
"<NDArray 1 @cpu(0)>"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"cos_sim(vocab.embedding['baby'], vocab.embedding['babies'])"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"Let us find the 5 words most similar to 'research' in the vocabulary."
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {
"ExecuteTime": {
"end_time": "2018-06-06T21:56:45.464662Z",
"start_time": "2018-06-06T21:56:44.957783Z"
},
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/plain": [
"['study', 'institute', 'studies', 'scientific', 'science']"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"get_knn(vocab, 5, 'research')"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"Let us find the 5 words most similar to 'computer' in the vocabulary."
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {
"ExecuteTime": {
"end_time": "2018-06-06T21:56:46.093124Z",
"start_time": "2018-06-06T21:56:45.468022Z"
},
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/plain": [
"['computers', 'software', 'technology', 'electronic', 'internet']"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"get_knn(vocab, 5, 'computer')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Challenge**\n",
"\n",
"Try out the `get_knn` function with a word of your own."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"### Word Analogy"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "notes"
}
},
"source": [
"We can also apply pre-trained word embeddings to the word analogy problem. For instance, \"man : woman :: son : daughter\" is an analogy. The word analogy completion problem is defined as: for the analogy 'a : b :: c : d', given the first three words 'a', 'b', 'c', find 'd'. The idea is to find the word whose vector is most similar to vec('c') + (vec('b') - vec('a')).\n",
"\n",
"In this example, we will find words by analogy from the 400,000 indexed words in `vocab`."
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {
"ExecuteTime": {
"end_time": "2018-06-06T21:56:46.158875Z",
"start_time": "2018-06-06T21:56:46.103712Z"
},
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [],
"source": [
"def get_top_k_by_analogy(vocab, k, word1, word2, word3):\n",
"    word_vecs = vocab.embedding[word1, word2, word3]\n",
"\n",
"    # Target vector: vec(word3) + (vec(word2) - vec(word1)).\n",
"    word_diff = (word_vecs[1] - word_vecs[0] + word_vecs[2])\n",
"\n",
"    vocab_vecs = norm_vecs_by_row(vocab.embedding.idx_to_vec)\n",
"    dot_prod = nd.dot(vocab_vecs[4:], word_diff.squeeze()).squeeze()\n",
"\n",
"    indices = dot_prod.topk(k=k+1, ret_typ='indices')\n",
"    indices = [int(i.asscalar())+4 for i in indices]\n",
"    words = [w for w in vocab.to_tokens(indices) if w != word3]\n",
"    return words[:k]"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"### Semantic Analogy\n",
"\n",
"![analogy](https://user-images.githubusercontent.com/3716307/53924875-a1497880-4032-11e9-847c-2d826d0ee0ee.png)\n"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {
"ExecuteTime": {
"end_time": "2018-06-06T21:56:46.626485Z",
"start_time": "2018-06-06T21:56:46.161792Z"
},
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/plain": [
"['daughter']"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"get_top_k_by_analogy(vocab, 1, 'man', 'woman', 'son')"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"source": [
"Let us verify the cosine similarity between vec('son') + vec('woman') - vec('man') and vec('daughter')."
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {
"ExecuteTime": {
"end_time": "2018-06-06T21:56:46.644371Z",
"start_time": "2018-06-06T21:56:46.629075Z"
},
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/plain": [
"\n",
"[0.9658341]\n",
"<NDArray 1 @cpu(0)>"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"def cos_sim_word_analogy(vocab, word1, word2, word3, word4):\n",
"    words = [word1, word2, word3, word4]\n",
"    vecs = vocab.embedding[words]\n",
"    return cos_sim(vecs[1] - vecs[0] + vecs[2], vecs[3])\n",
"\n",
"cos_sim_word_analogy(vocab, 'man', 'woman', 'son', 'daughter')"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {
"ExecuteTime": {
"end_time": "2018-06-06T21:56:47.104740Z",
"start_time": "2018-06-06T21:56:46.647576Z"
},
"scrolled": true,
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"text/plain": [
"['nfl']"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"get_top_k_by_analogy(vocab, 1, 'celtics', 'nba', 'patriots')"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/plain": [
"['cricket']"
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"get_top_k_by_analogy(vocab, 1, 'france', 'football', 'india')"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/plain": [
"['blue']"
]
},
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"get_top_k_by_analogy(vocab, 1, 'wine', 'red', 'sky')"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"['paris']"
]
},
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"get_top_k_by_analogy(vocab, 1, 'russia', 'moscow', 'france')"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"### Syntactic Analogy"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {
"ExecuteTime": {
"end_time": "2018-06-06T21:56:47.502780Z",
"start_time": "2018-06-06T21:56:47.107420Z"
},
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/plain": [
"['biggest']"
]
},
"execution_count": 19,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"get_top_k_by_analogy(vocab, 1, 'bad', 'worst', 'big')"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {
"ExecuteTime": {
"end_time": "2018-06-06T21:56:47.898396Z",
"start_time": "2018-06-06T21:56:47.505389Z"
},
"slideshow": {
"slide_type": "fragment"
}
},
"outputs": [
{
"data": {
"text/plain": [
"['went']"
]
},
"execution_count": 20,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"get_top_k_by_analogy(vocab, 1, 'do', 'did', 'go')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Challenge**\n",
"\n",
"Write one semantic and one syntactic analogy using `get_top_k_by_analogy`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"### Application\n",
"\n",
"- Language Modelling\n",
"- Neural Machine Translation\n",
"- Text classification"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Language Modelling\n",
"Generating text with a pre-trained language model"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [],
"source": [
"ctx = mx.gpu() if mx.context.num_gpus() > 0 else mx.cpu()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"![language_modelling](https://user-images.githubusercontent.com/3716307/53928083-b4157a80-403d-11e9-950c-670299db4d5f.png)\n"
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {},
"outputs": [],
"source": [
"lm_model, vocab = nlp.model.get_model(name='big_rnn_lm_2048_512',\n",
" dataset_name='gbw',\n",
" pretrained=True,\n",
" ctx=ctx)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We need a decoder function that our sampler can call at every step to get the outputs for the next token and the updated states."
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {},
"outputs": [],
"source": [
"class Decoder:\n",
"    def __init__(self, model):\n",
"        self.model = model\n",
"\n",
"    def __call__(self, inputs, states):\n",
"        # Add a leading axis of length 1 before calling the model.\n",
"        outputs, states = self.model(mx.nd.expand_dims(inputs, axis=0), states)\n",
"        return outputs[0], states\n",
"\n",
"decoder = Decoder(lm_model)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Instead of always taking the most probable next token, we sample the next token from the predicted distribution."
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {},
"outputs": [],
"source": [
"sampler = nlp.model.SequenceSampler(beam_size=1,\n",
" decoder=decoder,\n",
" eos_id=vocab['<eos>'],\n",
" max_length=20,\n",
" temperature=0.30)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Setting up the prompt for our language model"
]
},
{
"cell_type": "code",
"execution_count": 64,
"metadata": {},
"outputs": [],
"source": [
"prompt = \"The engineers have devised a new\"\n",
"prompt_tokenized = prompt.split(' ')\n",
"bos_ids = [vocab[ele] for ele in prompt_tokenized]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Getting the initial states of the model"
]
},
{
"cell_type": "code",
"execution_count": 65,
"metadata": {},
"outputs": [],
"source": [
"init_states = lm_model.begin_state(batch_size=1, ctx=ctx)\n",
"_, sampling_states = lm_model(mx.nd.expand_dims(mx.nd.array(bos_ids[:-1], ctx=ctx), axis=1), init_states)\n",
"\n",
"inputs = mx.nd.full(shape=(1,), ctx=ctx, val=bos_ids[-1])"
]
},
{
"cell_type": "code",
"execution_count": 73,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"The engineers have devised a new way to make the world 's most efficient cars . <eos>\n"
]
}
],
"source": [
"samples, _, valid_lengths = sampler(inputs, sampling_states)\n",
"for sample, valid_length in zip(samples[0].asnumpy(), valid_lengths[0].asnumpy()):\n",
"    sentence = prompt_tokenized[:-1] + [vocab.idx_to_token[ele] for ele in sample[:valid_length]]\n",
"    print(' '.join(sentence))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Challenge**\n",
"\n",
"- Modify the prompt to generate sentences of your own.\n",
"\n",
"- Modify the temperature parameter to evaluate its impact on token sampling"
]
}
],
"metadata": {
"celltoolbar": "Slideshow",
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.5"
}
},
"nbformat": 4,
"nbformat_minor": 2
}