{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Embedding and Tokenizer in Keras"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"[Keras](https://keras.io) has some classes targetting NLP and preprocessing text but it's not directly clear from the documentation and samples what they do and how they work. So I looked a bit deeper at the source code and used simple examples to expose what is going on."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Tokenizer"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The `Tokenizer` class in Keras has various methods which help to prepare text so it can be used in neural network models. "
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [],
"source": [
"from keras.preprocessing.text import Tokenizer"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The top-n words `nb_words` will not truncate the words found in the input but it will truncate the usage. Here we take only the top three words:"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [],
"source": [
"nb_words = 3\n",
"tokenizer = Tokenizer(num_words=nb_words)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The training phase is by means of the `fit_on_texts` method and you can see the word index using the `word_index` property:"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"{'is': 1, 'in': 2, 'the': 3, 'sun': 4, 'shining': 5, 'june': 6, 'september': 7, 'grey': 8, 'life': 9, 'beautiful': 10, 'august': 11, 'i': 12, 'like': 13, 'it': 14, 'this': 15, 'and': 16, 'other': 17, 'things': 18}\n"
]
}
],
"source": [
"tokenizer.fit_on_texts([\"The sun is shining in June!\",\"September is grey.\",\"Life is beautiful in August.\",\"I like it\",\"This and other things?\"])\n",
"print(tokenizer.word_index)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Note that there is a basic filtering of text annotations (exclamation marks and such).\n",
"\n",
"You can see that the value 3 is clearly not respected in the sense of limiting the dictionary. It is respected however in the `texts_to_sequences` method which turns input into numerical arrays:"
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[[1]]"
]
},
"execution_count": 22,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"tokenizer.texts_to_sequences([\"June is beautiful and I like it!\"])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You need to read this as: take only words with an index less or equal to 3 (the constructor parameter). A parameter-less constructor yields the full sequences:"
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"{'is': 1, 'in': 2, 'the': 3, 'sun': 4, 'shining': 5, 'june': 6, 'september': 7, 'grey': 8, 'life': 9, 'beautiful': 10, 'august': 11, 'i': 12, 'like': 13, 'it': 14, 'this': 15, 'and': 16, 'other': 17, 'things': 18}\n"
]
},
{
"data": {
"text/plain": [
"[[6, 1, 10, 16, 12, 13, 14]]"
]
},
"execution_count": 23,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"tokenizer = Tokenizer()\n",
"texts = [\"The sun is shining in June!\",\"September is grey.\",\"Life is beautiful in August.\",\"I like it\",\"This and other things?\"]\n",
"tokenizer.fit_on_texts(texts)\n",
"print(tokenizer.word_index)\n",
"tokenizer.texts_to_sequences([\"June is beautiful and I like it!\"])"
]
},
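{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick sanity check of the punctuation filtering and lower-casing mentioned above, differently written variants of the same word end up on the same index (a small sketch reusing the tokenizer fitted in the previous cell):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# both variants should map to the index of 'june' because punctuation is\n",
"# stripped by the default filters and lower=True lower-cases the input\n",
"print(tokenizer.texts_to_sequences([\"JUNE!!!\", \"june\"]))"
]
},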
{
"cell_type": "markdown",
"metadata": {},
"source": [
"There are various properties of the tokenizer which can be helpful during development of a network. For example, the stats of the training:"
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"OrderedDict([('the', 1), ('sun', 1), ('is', 3), ('shining', 1), ('in', 2), ('june', 1), ('september', 1), ('grey', 1), ('life', 1), ('beautiful', 1), ('august', 1), ('i', 1), ('like', 1), ('it', 1), ('this', 1), ('and', 1), ('other', 1), ('things', 1)])\n"
]
}
],
"source": [
"print(tokenizer.word_counts)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"or whether lower-casing was applied and how many sentences were used to train:"
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Was lower-case applied to 5 sentences?: True\n"
]
}
],
"source": [
"print(\"Was lower-case applied to %s sentences?: %s\"%(tokenizer.document_count,tokenizer.lower))"
]
},
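{
"cell_type": "markdown",
"metadata": {},
"source": [
"Another property worth knowing about (assuming the Keras version in use exposes it, as recent ones do) is `word_docs`, which records in how many of the training sentences each word occurred:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# document frequency: in how many of the five sentences does each word appear?\n",
"print(tokenizer.word_docs)"
]
},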
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you want to feed sentences to a network you can't use arrays of variable lengths, corresponding to variable length sentences. So, the trick is to use the `texts_to_matrix` method to convert the sentences directly to equal size arrays:"
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([[0., 1., 0., 0., 0., 0., 1., 0., 0., 0., 1., 0., 1., 1., 1., 0.,\n",
" 1., 0., 0.],\n",
" [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 1., 0., 0.,\n",
" 0., 0., 0.]])"
]
},
"execution_count": 26,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"tokenizer.texts_to_matrix([\"June is beautiful and I like it!\",\"Like August\"])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This creates two rows for two sentences versus the amount of words in the vocabulary."
]
},
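{
"cell_type": "markdown",
"metadata": {},
"source": [
"Besides the default binary mode, `texts_to_matrix` also takes a `mode` argument ('count', 'freq' or 'tfidf') when mere presence is not enough. A small sketch using counts:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# same two sentences, but each cell now holds the number of occurrences\n",
"tokenizer.texts_to_matrix([\"June is beautiful and I like it!\",\"Like August\"], mode='count')"
]
},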
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now you can go ahead and use networks to do stuff. "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Basic network with textual data"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"For example, let's say you want to detect the word 'shining' in the sequences above.\n",
"The most basic way would be to use a layer with some nodes like so:"
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {
"text/plain": [
"<keras.callbacks.History at 0x11a2bc4e0>"
]
},
"execution_count": 27,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from keras.preprocessing import sequence\n",
"from keras.models import Sequential\n",
"from keras.layers.core import Dense, Activation, Flatten\n",
"from keras.layers.wrappers import TimeDistributed\n",
"from keras.layers.embeddings import Embedding\n",
"from keras.layers.recurrent import LSTM\n",
"\n",
"X = tokenizer.texts_to_matrix(texts)\n",
"y = [1,0,0,0,0]\n",
"\n",
"vocab_size = len(tokenizer.word_index) + 1\n",
"\n",
"model = Sequential()\n",
"model.add(Dense(2, input_dim=vocab_size))\n",
"model.add(Dense(1, activation='sigmoid'))\n",
" \n",
"\n",
"model.compile(loss='binary_crossentropy', optimizer='rmsprop', metrics = ['accuracy'])\n",
"\n",
"model.fit(X, y=y, batch_size=200, epochs=700, verbose=0, validation_split=0.2, shuffle=True)\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You can check that this indeed learned the word:"
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([[1.],\n",
" [0.],\n",
" [0.],\n",
" [0.],\n",
" [0.]], dtype=float32)"
]
},
"execution_count": 28,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from keras.utils.np_utils import np as np\n",
"np.round(model.predict(X))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You can also do more sophisticated things. If the vocabulary is very large the numerical sequences turn into sparse arrays and it's more efficient to cast everything to a lower dimension with the `Embedding` layer. \n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Embedding\n",
"\n",
"How does embedding work? An example demonstrates best what is going on.\n",
"\n",
"Assume you have a sparse vector `[0,1,0,1,1,0,0]` of dimension seven. You can turn it into a non-sparse 2d vector like so:"
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([[[0.00012293, 0.00340027],\n",
" [0.02443272, 0.01070259],\n",
" [0.00012293, 0.00340027],\n",
" [0.02443272, 0.01070259],\n",
" [0.02443272, 0.01070259],\n",
" [0.00012293, 0.00340027],\n",
" [0.00012293, 0.00340027]]], dtype=float32)"
]
},
"execution_count": 29,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"model = Sequential()\n",
"model.add(Embedding(2, 2, input_length=7))\n",
"model.compile('rmsprop', 'mse')\n",
"model.predict(np.array([[0,1,0,1,1,0,0]]))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Where do these numbers come from? It's a simple map from the given range to a 2d space:"
]
},
{
"cell_type": "code",
"execution_count": 34,
"metadata": {},
"outputs": [
{
"ename": "AttributeError",
"evalue": "'Embedding' object has no attribute 'W'",
"output_type": "error",
"traceback": [
"\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
"\u001b[0;31mAttributeError\u001b[0m Traceback (most recent call last)",
"\u001b[0;32m<ipython-input-34-a30685f7fbf5>\u001b[0m in \u001b[0;36m<module>\u001b[0;34m()\u001b[0m\n\u001b[0;32m----> 1\u001b[0;31m \u001b[0mmodel\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mlayers\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;36m0\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mW\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mget_value\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m",
"\u001b[0;31mAttributeError\u001b[0m: 'Embedding' object has no attribute 'W'"
]
}
],
"source": [
"model.layers[0].W.get_value()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The 0-value is mapped to the first index and the 1-value to the second as can be seen by comparing the two arrays. The first value of the `Embedding` constructor is the range of values in the input. In the example it's 2 because we give a binary vector as input. The second value is the target dimension. The third is the length of the vectors we give. \n",
"So, there is nothing magical in this, merely a mapping from integers to floats."
]
},
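{
"cell_type": "markdown",
"metadata": {},
"source": [
"The same mechanics apply to a larger input range: if the integers run from 0 to 4, the first parameter has to be 5 and every distinct integer gets its own (randomly initialized) row in the lookup table. A throw-away sketch (the name `demo` is just illustrative):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"demo = Sequential()\n",
"# input values in {0,...,4} mapped to 2d vectors, sequences of length 3\n",
"demo.add(Embedding(5, 2, input_length=3))\n",
"demo.compile('rmsprop', 'mse')\n",
"print(demo.predict(np.array([[0, 2, 4]])))  # three different 2d rows\n",
"print(demo.layers[0].get_weights()[0])      # the full 5x2 lookup table"
]
},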
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now back to our 'shining' detection. The training data looks like a sequences of bits:"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([[ 0., 1., 1., 1., 0., 1., 0., 1., 0., 0., 0., 0., 0.,\n",
" 1., 0., 0., 0., 0., 0.],\n",
" [ 0., 1., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 1.,\n",
" 0., 0., 0., 0., 0., 0.],\n",
" [ 0., 1., 1., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0.,\n",
" 0., 0., 1., 0., 1., 0.],\n",
" [ 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0.,\n",
" 0., 1., 0., 1., 0., 0.],\n",
" [ 0., 0., 0., 0., 0., 0., 1., 0., 1., 0., 1., 0., 0.,\n",
" 0., 0., 0., 0., 0., 1.]])"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"X"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you want to use the embedding it means that the output of the embedding layer will have dimension (5, 19, 10). This works well with LSTM or GRU (see below) but if you want a binary classifier you need to flatten this to (5, 19*10):"
]
},
{
"cell_type": "code",
"execution_count": 36,
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {
"text/plain": [
"<keras.callbacks.History at 0x11aaa4908>"
]
},
"execution_count": 36,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"model = Sequential()\n",
"model.add(Embedding(3, 10, input_length= X.shape[1] ))\n",
"model.add(Flatten())\n",
"model.add(Dense(1, activation='sigmoid'))\n",
"model.compile(loss='binary_crossentropy', optimizer='rmsprop', metrics = ['accuracy'])\n",
"model.fit(X, y=y, batch_size=200, epochs=700, verbose=0, validation_split=0.2, shuffle=True)\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"It detects 'shining' flawlessly:"
]
},
{
"cell_type": "code",
"execution_count": 37,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([[1.0000000e+00],\n",
" [7.1268730e-08],\n",
" [9.9724026e-08],\n",
" [1.1465622e-08],\n",
" [9.7872239e-01]], dtype=float32)"
]
},
"execution_count": 37,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"model.predict(X)"
]
},
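{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make the flattening of (5, 19, 10) into (5, 190) concrete, you can inspect the output shape of each layer of the model that was just trained:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# expected: Embedding (None, 19, 10), Flatten (None, 190), Dense (None, 1)\n",
"for layer in model.layers:\n",
"    print(layer.name, layer.output_shape)"
]
},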
{
"cell_type": "markdown",
"metadata": {},
"source": [
"An LSTM layer has historical memory and so the dimension outputted by the embedding works in this case, no need to flatten things:"
]
},
{
"cell_type": "code",
"execution_count": 38,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<keras.callbacks.History at 0x11aefc4a8>"
]
},
"execution_count": 38,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"model = Sequential()\n",
"\n",
"model.add(Embedding(vocab_size, 10))\n",
"model.add(LSTM(5))\n",
"model.add(Dense(1, activation='sigmoid'))\n",
"model.compile(loss='binary_crossentropy', optimizer='rmsprop', metrics = ['accuracy'])\n",
"model.fit(X, y=y, epochs=500, verbose=0, validation_split=0.2, shuffle=True)\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Obviously, it predicts things as well:"
]
},
{
"cell_type": "code",
"execution_count": 39,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([[0.97908396],\n",
" [0.0102849 ],\n",
" [0.01028524],\n",
" [0.01028495],\n",
" [0.01963747]], dtype=float32)"
]
},
"execution_count": 39,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"model.predict(X)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Using word2vec "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"There is much confusion about whether the `Embedding` in Keras is like word2vec and how word2vec can be used together with Keras.\n",
"I hope that the simple example above has made clear that the `Embedding` class does indeed map discrete labels (i.e. words) into a continuous vector space. It should be just as clear that this embedding **does not** in any way take the semantic similarity of the words into account. Check [the source code](https://github.com/fchollet/keras/blob/master/keras/layers/embeddings.py) if want to see it even more clearly.\n",
"\n",
"So if word2vec does bring along some extra info into the game how can you use it together with Keras?\n",
"\n",
"The idea is that instead of mapping sequences of integer numbers to sequences of floats happens in a way which preserves the semantic affinity. There are various pretrained word2vec datasets on the net, we'll [GloVe](http://nlp.stanford.edu/projects/glove/) since it's small and straightforward but check out [the Google repo](https://code.google.com/archive/p/word2vec/) as well.\n",
"\n",
"Loading the GloVe set is straightforward:"
]
},
{
"cell_type": "code",
"execution_count": 42,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Loaded 400000 word vectors.\n"
]
}
],
"source": [
"embeddings_index = {}\n",
"glove_data = './glove.6B.50d.txt'\n",
"f = open(glove_data)\n",
"for line in f:\n",
" values = line.split()\n",
" word = values[0]\n",
" value = np.asarray(values[1:], dtype='float32')\n",
" embeddings_index[word] = value\n",
"f.close()\n",
"\n",
"print('Loaded %s word vectors.' % len(embeddings_index))"
]
},
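{
"cell_type": "markdown",
"metadata": {},
"source": [
"To convince yourself that these vectors carry semantic information, compare a couple of cosine similarities (a small sketch, assuming 'june', 'august' and 'shining' are all present in the GloVe 6B vocabulary):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def cosine(u, v):\n",
"    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))\n",
"\n",
"# two months should be closer to each other than to an unrelated word\n",
"print(cosine(embeddings_index['june'], embeddings_index['august']))\n",
"print(cosine(embeddings_index['june'], embeddings_index['shining']))"
]
},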
{
"cell_type": "markdown",
"metadata": {},
"source": []
},
{
"cell_type": "code",
"execution_count": 43,
"metadata": {},
"outputs": [],
"source": [
"embedding_dimension = 10\n",
"word_index = tokenizer.word_index"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The `embedding_matrix` matrix maps words to vectors in the specified embedding dimension (here 100):"
]
},
{
"cell_type": "code",
"execution_count": 44,
"metadata": {},
"outputs": [],
"source": [
"embedding_matrix = np.zeros((len(word_index) + 1, embedding_dimension))\n",
"for word, i in word_index.items():\n",
" embedding_vector = embeddings_index.get(word)\n",
" if embedding_vector is not None:\n",
" # words not found in embedding index will be all-zeros.\n",
" embedding_matrix[i] = embedding_vector[:embedding_dimension]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now you have an embedding matrix of 19 words into dimension 10:"
]
},
{
"cell_type": "code",
"execution_count": 45,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(19, 10)"
]
},
"execution_count": 45,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"embedding_matrix.shape"
]
},
{
"cell_type": "code",
"execution_count": 46,
"metadata": {},
"outputs": [],
"source": [
"embedding_layer = Embedding(embedding_matrix.shape[0],\n",
" embedding_matrix.shape[1],\n",
" weights=[embedding_matrix],\n",
" input_length=12)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In order to use this new embedding you need to reshape the training data `X` to the basic word-to-index sequences:"
]
},
{
"cell_type": "code",
"execution_count": 47,
"metadata": {},
"outputs": [],
"source": [
"from keras.preprocessing.sequence import pad_sequences\n",
"X = tokenizer.texts_to_sequences(texts)\n",
"X = pad_sequences(X, maxlen=12)"
]
},
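{
"cell_type": "markdown",
"metadata": {},
"source": [
"Printing the padded sequences shows the zero-padding on the left, up to the fixed length of 12:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# each row: zeros followed by the word indices of one sentence\n",
"print(X.shape)\n",
"print(X)"
]
},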
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We have used a fixed size of 12 here but anything works really. Now the sequences with integers representing word-index are mapped to a 10-dimensional vector space using the wrod2vec embedding and we're good to go:"
]
},
{
"cell_type": "code",
"execution_count": 48,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<keras.callbacks.History at 0x1217f7d30>"
]
},
"execution_count": 48,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"model = Sequential()\n",
"model.add(embedding_layer)\n",
"model.add(Flatten())\n",
"model.add(Dense(1, activation='sigmoid'))\n",
"model.layers[0].trainable=False\n",
"model.compile(loss='binary_crossentropy', optimizer='rmsprop', metrics = ['accuracy'])\n",
"model.fit(X, y=y, batch_size=20, epochs=700, verbose=0, validation_split=0.2, shuffle=True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You get the same razor sharp prediction. I know all of the above networks are overkill for the simple datasets but the intention was to show you the way to use the various NLP functionalities."
]
},
{
"cell_type": "code",
"execution_count": 49,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([[9.9771714e-01],\n",
" [2.9435821e-04],\n",
" [2.2714464e-03],\n",
" [1.3460572e-03],\n",
" [1.2431095e-02]], dtype=float32)"
]
},
"execution_count": 49,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"model.predict(X)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": []
}
],
"metadata": {
"anaconda-cloud": {},
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.4"
}
},
"nbformat": 4,
"nbformat_minor": 1
}