ThomasDelteil/1_gluon_cv.ipynb

## 1_gluon_cv.ipynb

      
Display the source blob

    
Display the rendered blob

    
    Raw
  

              1_gluon_cv.ipynb
            
          
        Loading

      Sorry, something went wrong. Reload?
      Sorry, we cannot display this file.
      Sorry, this file is invalid so it cannot be displayed.
      
          Viewer requires iframe.
      
    
## 2_gluon_nlp_embed_lm.ipynb
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Gluon-NLP"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "1. Pre-trained word embeddings\n",
    "2. Pre-trained language models\n",
    "3. Fine-tuning BERT \n",
    "\n",
    "http://gluon-nlp.mxnet.io/"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "## Pre-trained word embeddings"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "notes"
    }
   },
   "source": [
    "Here we introduce how to use pre-trained word embeddings, where each word is represened by a vector. Two popular word embeddings are GloVe and fastText. The used GloVe and fastText pre-trained word embeddings here are from the following sources:\n",
    "\n",
    "* GloVe project website：https://nlp.stanford.edu/projects/glove/\n",
    "* fastText project website：https://fasttext.cc/\n",
    "\n",
    "Let us first import the following packages used in this example."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "![](https://cdn-images-1.medium.com/max/1600/1*2r1yj0zPAuaSGZeQfG6Wtw.png)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2018-06-06T21:56:37.022037Z",
     "start_time": "2018-06-06T21:56:35.507701Z"
    },
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [],
   "source": [
    "import mxnet as mx\n",
    "from mxnet import gluon, nd\n",
    "import gluonnlp as nlp\n",
    "import re"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We pick a specific pre-trained embedding"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2018-06-06T21:56:42.177372Z",
     "start_time": "2018-06-06T21:56:39.926375Z"
    },
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [],
   "source": [
    "embedding = nlp.embedding.create('glove', source='glove.6B.50d')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2018-06-06T21:56:43.752190Z",
     "start_time": "2018-06-06T21:56:42.180013Z"
    },
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [],
   "source": [
    "vocab = nlp.Vocab(nlp.data.Counter(embedding.idx_to_token))\n",
    "vocab.set_embedding(embedding)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "notes"
    }
   },
   "source": [
    "Below shows the size of `vocab` including a special unknown token."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2018-06-06T21:56:43.761479Z",
     "start_time": "2018-06-06T21:56:43.754645Z"
    },
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "outputs": [],
   "source": [
    "len(vocab.idx_to_token)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "notes"
    }
   },
   "source": [
    "We can access attributes of `vocab`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2018-06-06T21:56:43.772797Z",
     "start_time": "2018-06-06T21:56:43.764792Z"
    },
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [],
   "source": [
    "print(vocab['beautiful'])\n",
    "print(vocab.idx_to_token[71424])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "![](support/cosinesimilarity.png)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2018-06-06T21:56:43.791341Z",
     "start_time": "2018-06-06T21:56:43.776953Z"
    },
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "outputs": [],
   "source": [
    "def cos_sim(x, y):\n",
    "    return nd.dot(x, y) / (nd.norm(x) * nd.norm(y))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "### Word Similarity"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "notes"
    }
   },
   "source": [
    "Given an input word, we can find the nearest $k$ words from the vocabulary (400,000 words excluding the unknown token) by similarity. The similarity between any pair of words can be represented by the cosine similarity of their vectors."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2018-06-06T21:56:44.041379Z",
     "start_time": "2018-06-06T21:56:44.016299Z"
    },
    "slideshow": {
     "slide_type": "notes"
    }
   },
   "outputs": [],
   "source": [
    "def norm_vecs_by_row(x):\n",
    "    return x / nd.sqrt(nd.sum(x * x, axis=1)).reshape((-1,1))\n",
    "\n",
    "def get_knn(vocab, k, word):\n",
    "    word_vec = vocab.embedding[word].reshape((-1, 1))\n",
    "    vocab_vecs = norm_vecs_by_row(vocab.embedding.idx_to_vec)\n",
    "    dot_prod = nd.dot(vocab_vecs[4:], word_vec)\n",
    "    indices = nd.topk(dot_prod.squeeze(), k=k+1, ret_typ='indices')\n",
    "    indices = [int(i.asscalar())+4 for i in indices]\n",
    "    # Remove unknown and input tokens.\n",
    "    return vocab.to_tokens(indices[1:])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "notes"
    }
   },
   "source": [
    "Let us find the 5 most similar words of 'baby' from the vocabulary (size: 400,000 words)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2018-06-06T21:56:44.508966Z",
     "start_time": "2018-06-06T21:56:44.044708Z"
    },
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "outputs": [],
   "source": [
    "get_knn(vocab, 5, 'baby')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "source": [
    "We can verify the cosine similarity of vectors of 'baby' and 'babies'."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2018-06-06T21:56:44.520428Z",
     "start_time": "2018-06-06T21:56:44.511361Z"
    },
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [],
   "source": [
    "cos_sim(vocab.embedding['baby'], vocab.embedding['babies'])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "Let us find the 5 most similar words of 'run' from the vocabulary."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2018-06-06T21:56:45.464662Z",
     "start_time": "2018-06-06T21:56:44.957783Z"
    },
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [],
   "source": [
    "get_knn(vocab, 5, 'research')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "Let us find the 5 most similar words of 'beautiful' from the vocabulary."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2018-06-06T21:56:46.093124Z",
     "start_time": "2018-06-06T21:56:45.468022Z"
    },
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [],
   "source": [
    "get_knn(vocab, 5, 'computer')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Challenge**\n",
    "\n",
    "Try out the `get_knn` function with a word of your own"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "### Word Analogy"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "notes"
    }
   },
   "source": [
    "We can also apply pre-trained word embeddings to the word analogy problem. For instance, \"man : woman :: son : daughter\" is an analogy. The word analogy completion problem is defined as: for analogy 'a : b :: c : d', given teh first three words 'a', 'b', 'c', find 'd'. The idea is to find the most similar word vector for vec('c') + (vec('b')-vec('a')).\n",
    "\n",
    "In this example, we will find words by analogy from the 400,000 indexed words in `vocab`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2018-06-06T21:56:46.158875Z",
     "start_time": "2018-06-06T21:56:46.103712Z"
    },
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [],
   "source": [
    "def get_top_k_by_analogy(vocab, k, word1, word2, word3):\n",
    "    word_vecs = vocab.embedding[word1, word2, word3]\n",
    "    \n",
    "    word_diff = (word_vecs[1] - word_vecs[0] + word_vecs[2])\n",
    "    \n",
    "    vocab_vecs = norm_vecs_by_row(vocab.embedding.idx_to_vec)\n",
    "    dot_prod = nd.dot(vocab_vecs[4:], word_diff.squeeze()).squeeze()\n",
    "    \n",
    "    indices = dot_prod.topk(k=k+1, ret_typ='indices')\n",
    "    indices = [int(i.asscalar())+4 for i in indices]\n",
    "    words = [w for w in vocab.to_tokens(indices) if w != word3]\n",
    "    return words[:k]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "### Semantic Analogy\n",
    "\n",
    "![analogy](https://user-images.githubusercontent.com/3716307/53924875-a1497880-4032-11e9-847c-2d826d0ee0ee.png)\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2018-06-06T21:56:46.626485Z",
     "start_time": "2018-06-06T21:56:46.161792Z"
    },
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [],
   "source": [
    "get_top_k_by_analogy(vocab, 1, 'man', 'woman', 'son')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "source": [
    "Let us verify the cosine similarity between vec('son')+vec('woman')-vec('man') and vec('daughter')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2018-06-06T21:56:46.644371Z",
     "start_time": "2018-06-06T21:56:46.629075Z"
    },
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [],
   "source": [
    "def cos_sim_word_analogy(vocab, word1, word2, word3, word4):\n",
    "    words = [word1, word2, word3, word4]\n",
    "    vecs = vocab.embedding[words]\n",
    "    return cos_sim(vecs[1] - vecs[0] + vecs[2], vecs[3])\n",
    "\n",
    "cos_sim_word_analogy(vocab, 'man', 'woman', 'son', 'daughter')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2018-06-06T21:56:47.104740Z",
     "start_time": "2018-06-06T21:56:46.647576Z"
    },
    "scrolled": true,
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "outputs": [],
   "source": [
    "get_top_k_by_analogy(vocab, 1, 'celtics', 'nba', 'patriots')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [],
   "source": [
    "get_top_k_by_analogy(vocab, 1, 'france', 'football', 'india')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [],
   "source": [
    "get_top_k_by_analogy(vocab, 1, 'wine', 'red', 'sky')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "get_top_k_by_analogy(vocab, 1, 'russia', 'moscow', 'france')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "### Syntactic Analogy"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2018-06-06T21:56:47.502780Z",
     "start_time": "2018-06-06T21:56:47.107420Z"
    },
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [],
   "source": [
    "get_top_k_by_analogy(vocab, 1, 'bad', 'worst', 'big')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2018-06-06T21:56:47.898396Z",
     "start_time": "2018-06-06T21:56:47.505389Z"
    },
    "slideshow": {
     "slide_type": "fragment"
    }
   },
   "outputs": [],
   "source": [
    "get_top_k_by_analogy(vocab, 1, 'do', 'did', 'go')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Challenge**\n",
    "\n",
    "write one semantic and one syntactic analogy using `get_top_k_by_analogy`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "slide"
    }
   },
   "source": [
    "### Application\n",
    "\n",
    "- Language Modelling\n",
    "- Neural Machine Translation\n",
    "- Text classification"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Language Modelling\n",
    "Generating text with a pre-trained language model"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "ctx = mx.gpu() if mx.context.num_gpus() > 0 else mx.cpu()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "![language_modelling](https://user-images.githubusercontent.com/3716307/53928083-b4157a80-403d-11e9-950c-670299db4d5f.png)\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "lm_model, vocab = nlp.model.get_model(name='big_rnn_lm_2048_512',\n",
    "                                      dataset_name='gbw',\n",
    "                                      pretrained=True,\n",
    "                                      ctx=ctx)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We need a decoder function that can be called recursively by our sampler"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "class Decoder:\n",
    "    def __init__(self, model):\n",
    "        self.model = model\n",
    "    \n",
    "    def __call__(self, inputs, states):\n",
    "        outputs, states = self.model(mx.nd.expand_dims(inputs, axis=0), states)\n",
    "        return outputs[0], states\n",
    "\n",
    "decoder = Decoder(lm_model)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Instead of taking the most probable value for the next token, we sample from the distribution"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "sampler = nlp.model.SequenceSampler(beam_size=1,\n",
    "                                    decoder=decoder,\n",
    "                                    eos_id=vocab['<eos>'],\n",
    "                                    max_length=20,\n",
    "                                    temperature=0.30)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Setting up the prompt for our language model"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "prompt = \"This was the day ,\"\n",
    "prompt_tokenized = prompt.split(' ')\n",
    "bos_ids = [vocab[ele] for ele in prompt_tokenized]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Getting the initial states of the model"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "init_states = lm_model.begin_state(batch_size=1, ctx=ctx)\n",
    "_, sampling_states = lm_model(mx.nd.expand_dims(mx.nd.array(bos_ids[:-1], ctx=ctx), axis=1), init_states)\n",
    "\n",
    "inputs = mx.nd.full(shape=(1,), ctx=ctx, val=bos_ids[-1])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "samples, _, valid_lengths = sampler(inputs, sampling_states)\n",
    "for sample, valid_length in zip(samples[0].asnumpy(), valid_lengths[0].asnumpy()):\n",
    "    sentence = prompt_tokenized[:-1] +[vocab.idx_to_token[ele] for ele in sample[:valid_length]]\n",
    "    print(' '.join(sentence))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Challenge**\n",
    "\n",
    "- Modify the prompt to generate sentences of your own.\n",
    "\n",
    "- Modify the temperature parameter to evaluate its impact on token sampling"
   ]
  }
 ],
 "metadata": {
  "celltoolbar": "Slideshow",
  "kernelspec": {
   "display_name": "conda_mxnet_p36",
   "language": "python",
   "name": "conda_mxnet_p36"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.5"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}

## get_data.sh

#!/bin/bash

echo ". /home/ec2-user/anaconda3/etc/profile.d/conda.sh" >> ~/.bashrc
source ~/.bashrc
source activate mxnet_p36
sudo chown -R ec2-user /tmp
pip uninstall mxnet-cu90mkl -y
pip install mxnet-cu90mkl --user --pre --upgrade
pip install gluonnlp --user  --pre --upgrade
pip install gluoncv --user  --pre --upgrade

cd /home/ec2-user/SageMaker
git clone https://gist.github.com/ThomasDelteil/63a37d87bb14c7b98f0b4cd9a4167d32 GRT_gluon_toolkits
wget http://gluon-nlp.mxnet.io/_downloads/sentence_embedding.zip
unzip sentence_embedding.zip -d GRT_gluon_toolkits
rm sentence_embedding.zip
	{
	"cells": [
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"# Gluon-NLP"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"1. Pre-trained word embeddings\n",
	"2. Pre-trained language models\n",
	"3. Fine-tuning BERT \n",
	"\n",
	"http://gluon-nlp.mxnet.io/"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {
	"slideshow": {
	"slide_type": "slide"
	}
	},
	"source": [
	"## Pre-trained word embeddings"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {
	"slideshow": {
	"slide_type": "notes"
	}
	},
	"source": [
	"Here we introduce how to use pre-trained word embeddings, where each word is represened by a vector. Two popular word embeddings are GloVe and fastText. The used GloVe and fastText pre-trained word embeddings here are from the following sources:\n",
	"\n",
	"* GloVe project website：https://nlp.stanford.edu/projects/glove/\n",
	"* fastText project website：https://fasttext.cc/\n",
	"\n",
	"Let us first import the following packages used in this example."
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"![](https://cdn-images-1.medium.com/max/1600/1*2r1yj0zPAuaSGZeQfG6Wtw.png)"
	]
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"metadata": {
	"ExecuteTime": {
	"end_time": "2018-06-06T21:56:37.022037Z",
	"start_time": "2018-06-06T21:56:35.507701Z"
	},
	"slideshow": {
	"slide_type": "fragment"
	}
	},
	"outputs": [],
	"source": [
	"import mxnet as mx\n",
	"from mxnet import gluon, nd\n",
	"import gluonnlp as nlp\n",
	"import re"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"We pick a specific pre-trained embedding"
	]
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"metadata": {
	"ExecuteTime": {
	"end_time": "2018-06-06T21:56:42.177372Z",
	"start_time": "2018-06-06T21:56:39.926375Z"
	},
	"slideshow": {
	"slide_type": "fragment"
	}
	},
	"outputs": [],
	"source": [
	"embedding = nlp.embedding.create('glove', source='glove.6B.50d')"
	]
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"metadata": {
	"ExecuteTime": {
	"end_time": "2018-06-06T21:56:43.752190Z",
	"start_time": "2018-06-06T21:56:42.180013Z"
	},
	"slideshow": {
	"slide_type": "fragment"
	}
	},
	"outputs": [],
	"source": [
	"vocab = nlp.Vocab(nlp.data.Counter(embedding.idx_to_token))\n",
	"vocab.set_embedding(embedding)"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {
	"slideshow": {
	"slide_type": "notes"
	}
	},
	"source": [
	"Below shows the size of `vocab` including a special unknown token."
	]
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"metadata": {
	"ExecuteTime": {
	"end_time": "2018-06-06T21:56:43.761479Z",
	"start_time": "2018-06-06T21:56:43.754645Z"
	},
	"slideshow": {
	"slide_type": "slide"
	}
	},
	"outputs": [],
	"source": [
	"len(vocab.idx_to_token)"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {
	"slideshow": {
	"slide_type": "notes"
	}
	},
	"source": [
	"We can access attributes of `vocab`."
	]
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"metadata": {
	"ExecuteTime": {
	"end_time": "2018-06-06T21:56:43.772797Z",
	"start_time": "2018-06-06T21:56:43.764792Z"
	},
	"slideshow": {
	"slide_type": "fragment"
	}
	},
	"outputs": [],
	"source": [
	"print(vocab['beautiful'])\n",
	"print(vocab.idx_to_token[71424])"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"![](support/cosinesimilarity.png)"
	]
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"metadata": {
	"ExecuteTime": {
	"end_time": "2018-06-06T21:56:43.791341Z",
	"start_time": "2018-06-06T21:56:43.776953Z"
	},
	"slideshow": {
	"slide_type": "slide"
	}
	},
	"outputs": [],
	"source": [
	"def cos_sim(x, y):\n",
	" return nd.dot(x, y) / (nd.norm(x) * nd.norm(y))"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {
	"slideshow": {
	"slide_type": "slide"
	}
	},
	"source": [
	"### Word Similarity"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {
	"slideshow": {
	"slide_type": "notes"
	}
	},
	"source": [
	"Given an input word, we can find the nearest $k$ words from the vocabulary (400,000 words excluding the unknown token) by similarity. The similarity between any pair of words can be represented by the cosine similarity of their vectors."
	]
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"metadata": {
	"ExecuteTime": {
	"end_time": "2018-06-06T21:56:44.041379Z",
	"start_time": "2018-06-06T21:56:44.016299Z"
	},
	"slideshow": {
	"slide_type": "notes"
	}
	},
	"outputs": [],
	"source": [
	"def norm_vecs_by_row(x):\n",
	" return x / nd.sqrt(nd.sum(x * x, axis=1)).reshape((-1,1))\n",
	"\n",
	"def get_knn(vocab, k, word):\n",
	" word_vec = vocab.embedding[word].reshape((-1, 1))\n",
	" vocab_vecs = norm_vecs_by_row(vocab.embedding.idx_to_vec)\n",
	" dot_prod = nd.dot(vocab_vecs[4:], word_vec)\n",
	" indices = nd.topk(dot_prod.squeeze(), k=k+1, ret_typ='indices')\n",
	" indices = [int(i.asscalar())+4 for i in indices]\n",
	" # Remove unknown and input tokens.\n",
	" return vocab.to_tokens(indices[1:])"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {
	"slideshow": {
	"slide_type": "notes"
	}
	},
	"source": [
	"Let us find the 5 most similar words of 'baby' from the vocabulary (size: 400,000 words)."
	]
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"metadata": {
	"ExecuteTime": {
	"end_time": "2018-06-06T21:56:44.508966Z",
	"start_time": "2018-06-06T21:56:44.044708Z"
	},
	"slideshow": {
	"slide_type": "slide"
	}
	},
	"outputs": [],
	"source": [
	"get_knn(vocab, 5, 'baby')"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {
	"slideshow": {
	"slide_type": "fragment"
	}
	},
	"source": [
	"We can verify the cosine similarity of vectors of 'baby' and 'babies'."
	]
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"metadata": {
	"ExecuteTime": {
	"end_time": "2018-06-06T21:56:44.520428Z",
	"start_time": "2018-06-06T21:56:44.511361Z"
	},
	"slideshow": {
	"slide_type": "fragment"
	}
	},
	"outputs": [],
	"source": [
	"cos_sim(vocab.embedding['baby'], vocab.embedding['babies'])"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {
	"slideshow": {
	"slide_type": "slide"
	}
	},
	"source": [
	"Let us find the 5 most similar words of 'run' from the vocabulary."
	]
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"metadata": {
	"ExecuteTime": {
	"end_time": "2018-06-06T21:56:45.464662Z",
	"start_time": "2018-06-06T21:56:44.957783Z"
	},
	"slideshow": {
	"slide_type": "fragment"
	}
	},
	"outputs": [],
	"source": [
	"get_knn(vocab, 5, 'research')"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {
	"slideshow": {
	"slide_type": "slide"
	}
	},
	"source": [
	"Let us find the 5 most similar words of 'beautiful' from the vocabulary."
	]
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"metadata": {
	"ExecuteTime": {
	"end_time": "2018-06-06T21:56:46.093124Z",
	"start_time": "2018-06-06T21:56:45.468022Z"
	},
	"slideshow": {
	"slide_type": "fragment"
	}
	},
	"outputs": [],
	"source": [
	"get_knn(vocab, 5, 'computer')"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"Challenge\n",
	"\n",
	"Try out the `get_knn` function with a word of your own"
	]
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"metadata": {},
	"outputs": [],
	"source": []
	},
	{
	"cell_type": "markdown",
	"metadata": {
	"slideshow": {
	"slide_type": "slide"
	}
	},
	"source": [
	"### Word Analogy"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {
	"slideshow": {
	"slide_type": "notes"
	}
	},
	"source": [
	"We can also apply pre-trained word embeddings to the word analogy problem. For instance, \"man : woman :: son : daughter\" is an analogy. The word analogy completion problem is defined as: for analogy 'a : b :: c : d', given teh first three words 'a', 'b', 'c', find 'd'. The idea is to find the most similar word vector for vec('c') + (vec('b')-vec('a')).\n",
	"\n",
	"In this example, we will find words by analogy from the 400,000 indexed words in `vocab`."
	]
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"metadata": {
	"ExecuteTime": {
	"end_time": "2018-06-06T21:56:46.158875Z",
	"start_time": "2018-06-06T21:56:46.103712Z"
	},
	"slideshow": {
	"slide_type": "fragment"
	}
	},
	"outputs": [],
	"source": [
	"def get_top_k_by_analogy(vocab, k, word1, word2, word3):\n",
	" word_vecs = vocab.embedding[word1, word2, word3]\n",
	" \n",
	" word_diff = (word_vecs[1] - word_vecs[0] + word_vecs[2])\n",
	" \n",
	" vocab_vecs = norm_vecs_by_row(vocab.embedding.idx_to_vec)\n",
	" dot_prod = nd.dot(vocab_vecs[4:], word_diff.squeeze()).squeeze()\n",
	" \n",
	" indices = dot_prod.topk(k=k+1, ret_typ='indices')\n",
	" indices = [int(i.asscalar())+4 for i in indices]\n",
	" words = [w for w in vocab.to_tokens(indices) if w != word3]\n",
	" return words[:k]"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {
	"slideshow": {
	"slide_type": "slide"
	}
	},
	"source": [
	"### Semantic Analogy\n",
	"\n",
	"![analogy](https://user-images.githubusercontent.com/3716307/53924875-a1497880-4032-11e9-847c-2d826d0ee0ee.png)\n"
	]
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"metadata": {
	"ExecuteTime": {
	"end_time": "2018-06-06T21:56:46.626485Z",
	"start_time": "2018-06-06T21:56:46.161792Z"
	},
	"slideshow": {
	"slide_type": "fragment"
	}
	},
	"outputs": [],
	"source": [
	"get_top_k_by_analogy(vocab, 1, 'man', 'woman', 'son')"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {
	"slideshow": {
	"slide_type": "fragment"
	}
	},
	"source": [
	"Let us verify the cosine similarity between vec('son')+vec('woman')-vec('man') and vec('daughter')"
	]
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"metadata": {
	"ExecuteTime": {
	"end_time": "2018-06-06T21:56:46.644371Z",
	"start_time": "2018-06-06T21:56:46.629075Z"
	},
	"slideshow": {
	"slide_type": "fragment"
	}
	},
	"outputs": [],
	"source": [
	"def cos_sim_word_analogy(vocab, word1, word2, word3, word4):\n",
	" words = [word1, word2, word3, word4]\n",
	" vecs = vocab.embedding[words]\n",
	" return cos_sim(vecs[1] - vecs[0] + vecs[2], vecs[3])\n",
	"\n",
	"cos_sim_word_analogy(vocab, 'man', 'woman', 'son', 'daughter')"
	]
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"metadata": {
	"ExecuteTime": {
	"end_time": "2018-06-06T21:56:47.104740Z",
	"start_time": "2018-06-06T21:56:46.647576Z"
	},
	"scrolled": true,
	"slideshow": {
	"slide_type": "slide"
	}
	},
	"outputs": [],
	"source": [
	"get_top_k_by_analogy(vocab, 1, 'celtics', 'nba', 'patriots')"
	]
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"metadata": {
	"slideshow": {
	"slide_type": "fragment"
	}
	},
	"outputs": [],
	"source": [
	"get_top_k_by_analogy(vocab, 1, 'france', 'football', 'india')"
	]
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"metadata": {
	"slideshow": {
	"slide_type": "fragment"
	}
	},
	"outputs": [],
	"source": [
	"get_top_k_by_analogy(vocab, 1, 'wine', 'red', 'sky')"
	]
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"metadata": {},
	"outputs": [],
	"source": [
	"get_top_k_by_analogy(vocab, 1, 'russia', 'moscow', 'france')"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {
	"slideshow": {
	"slide_type": "slide"
	}
	},
	"source": [
	"### Syntactic Analogy"
	]
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"metadata": {
	"ExecuteTime": {
	"end_time": "2018-06-06T21:56:47.502780Z",
	"start_time": "2018-06-06T21:56:47.107420Z"
	},
	"slideshow": {
	"slide_type": "fragment"
	}
	},
	"outputs": [],
	"source": [
	"get_top_k_by_analogy(vocab, 1, 'bad', 'worst', 'big')"
	]
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"metadata": {
	"ExecuteTime": {
	"end_time": "2018-06-06T21:56:47.898396Z",
	"start_time": "2018-06-06T21:56:47.505389Z"
	},
	"slideshow": {
	"slide_type": "fragment"
	}
	},
	"outputs": [],
	"source": [
	"get_top_k_by_analogy(vocab, 1, 'do', 'did', 'go')"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"Challenge\n",
	"\n",
	"write one semantic and one syntactic analogy using `get_top_k_by_analogy`"
	]
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"metadata": {},
	"outputs": [],
	"source": []
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"metadata": {},
	"outputs": [],
	"source": []
	},
	{
	"cell_type": "markdown",
	"metadata": {
	"slideshow": {
	"slide_type": "slide"
	}
	},
	"source": [
	"### Application\n",
	"\n",
	"- Language Modelling\n",
	"- Neural Machine Translation\n",
	"- Text classification"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"# Language Modelling\n",
	"Generating text with a pre-trained language model"
	]
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"metadata": {},
	"outputs": [],
	"source": [
	"ctx = mx.gpu() if mx.context.num_gpus() > 0 else mx.cpu()"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"![language_modelling](https://user-images.githubusercontent.com/3716307/53928083-b4157a80-403d-11e9-950c-670299db4d5f.png)\n"
	]
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"metadata": {},
	"outputs": [],
	"source": [
	"lm_model, vocab = nlp.model.get_model(name='big_rnn_lm_2048_512',\n",
	" dataset_name='gbw',\n",
	" pretrained=True,\n",
	" ctx=ctx)"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"We need a decoder function that can be called recursively by our sampler"
	]
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"metadata": {},
	"outputs": [],
	"source": [
	"class Decoder:\n",
	" def __init__(self, model):\n",
	" self.model = model\n",
	" \n",
	" def __call__(self, inputs, states):\n",
	" outputs, states = self.model(mx.nd.expand_dims(inputs, axis=0), states)\n",
	" return outputs[0], states\n",
	"\n",
	"decoder = Decoder(lm_model)"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"Instead of taking the most probable value for the next token, we sample from the distribution"
	]
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"metadata": {},
	"outputs": [],
	"source": [
	"sampler = nlp.model.SequenceSampler(beam_size=1,\n",
	" decoder=decoder,\n",
	" eos_id=vocab['<eos>'],\n",
	" max_length=20,\n",
	" temperature=0.30)"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"Setting up the prompt for our language model"
	]
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"metadata": {},
	"outputs": [],
	"source": [
	"prompt = \"This was the day ,\"\n",
	"prompt_tokenized = prompt.split(' ')\n",
	"bos_ids = [vocab[ele] for ele in prompt_tokenized]"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"Getting the initial states of the model"
	]
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"metadata": {},
	"outputs": [],
	"source": [
	"init_states = lm_model.begin_state(batch_size=1, ctx=ctx)\n",
	"_, sampling_states = lm_model(mx.nd.expand_dims(mx.nd.array(bos_ids[:-1], ctx=ctx), axis=1), init_states)\n",
	"\n",
	"inputs = mx.nd.full(shape=(1,), ctx=ctx, val=bos_ids[-1])"
	]
	},
	{
	"cell_type": "code",
	"execution_count": null,
	"metadata": {},
	"outputs": [],
	"source": [
	"samples, _, valid_lengths = sampler(inputs, sampling_states)\n",
	"for sample, valid_length in zip(samples[0].asnumpy(), valid_lengths[0].asnumpy()):\n",
	" sentence = prompt_tokenized[:-1] +[vocab.idx_to_token[ele] for ele in sample[:valid_length]]\n",
	" print(' '.join(sentence))"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"Challenge\n",
	"\n",
	"- Modify the prompt to generate sentences of your own.\n",
	"\n",
	"- Modify the temperature parameter to evaluate its impact on token sampling"
	]
	}
	],
	"metadata": {
	"celltoolbar": "Slideshow",
	"kernelspec": {
	"display_name": "conda_mxnet_p36",
	"language": "python",
	"name": "conda_mxnet_p36"
	},
	"language_info": {
	"codemirror_mode": {
	"name": "ipython",
	"version": 3
	},
	"file_extension": ".py",
	"mimetype": "text/x-python",
	"name": "python",
	"nbconvert_exporter": "python",
	"pygments_lexer": "ipython3",
	"version": "3.6.5"
	}
	},
	"nbformat": 4,
	"nbformat_minor": 2
	}

	#!/bin/bash

	echo ". /home/ec2-user/anaconda3/etc/profile.d/conda.sh" >> ~/.bashrc
	source ~/.bashrc
	source activate mxnet_p36
	sudo chown -R ec2-user /tmp
	pip uninstall mxnet-cu90mkl -y
	pip install mxnet-cu90mkl --user --pre --upgrade
	pip install gluonnlp --user --pre --upgrade
	pip install gluoncv --user --pre --upgrade

	cd /home/ec2-user/SageMaker
	git clone https://gist.github.com/ThomasDelteil/63a37d87bb14c7b98f0b4cd9a4167d32 GRT_gluon_toolkits
	wget http://gluon-nlp.mxnet.io/_downloads/sentence_embedding.zip
	unzip sentence_embedding.zip -d GRT_gluon_toolkits
	rm sentence_embedding.zip