{
"nbformat": 4,
"nbformat_minor": 0,
"metadata": {
"accelerator": "GPU",
"colab": {
"name": "Word_embeddings.ipynb",
"provenance": [],
"private_outputs": true,
"collapsed_sections": [],
"include_colab_link": true
},
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
}
},
"cells": [
{
"cell_type": "markdown",
"metadata": {
"id": "view-in-github",
"colab_type": "text"
},
"source": [
"<a href=\"https://colab.research.google.com/gist/ia35/5ec100de12593cf722a8a416e646e54d/word_embeddings.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "TO6zfmo61PKA",
"colab_type": "text"
},
"source": [
"[![](http://bec552ebfe.url-de-test.ws/ml/buttonBackProp.png)](https://www.backprop.fr)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "9Md6REaL1XxN",
"colab_type": "text"
},
"source": [
"Ce code provient du tutoriel : [Word Embeddings](https://www.tensorflow.org/tutorials/text/word_embeddings) de TensorFlow"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "8uZYS2W71r0x",
"colab_type": "text"
},
"source": [
"[![](https://raw.githubusercontent.com/BackProp-fr/meetup/master/images/LogoBackPropTranspSmall.png)](https://www.backprop.fr)\n",
"Le logo BackProp est présenté chaque fois qu'un ajout, une modification importante est apportée au code ou à chaque fois qu'un commentaire doit être signalé. \n",
"\n",
"Le texte en anglais est soit le texte d'origine soit un extrait de site qui apporte des explications."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "nq01mc191v8h",
"colab_type": "text"
},
"source": [
"## <font color=\"teal\">Références</font>"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "6tVY3TUA11R6",
"colab_type": "text"
},
"source": [
"- [Word Embeddings](https://www.tensorflow.org/tutorials/text/word_embeddings) de TensorFlow"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "GE91qWZkm8ZQ"
},
"source": [
"##### Copyright 2019 The TensorFlow Authors."
]
},
{
"cell_type": "code",
"metadata": {
"cellView": "form",
"colab_type": "code",
"id": "YS3NA-i6nAFC",
"colab": {}
},
"source": [
"#@title Licensed under the Apache License, Version 2.0 (the \"License\");\n",
"# you may not use this file except in compliance with the License.\n",
"# You may obtain a copy of the License at\n",
"#\n",
"# https://www.apache.org/licenses/LICENSE-2.0\n",
"#\n",
"# Unless required by applicable law or agreed to in writing, software\n",
"# distributed under the License is distributed on an \"AS IS\" BASIS,\n",
"# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n",
"# See the License for the specific language governing permissions and\n",
"# limitations under the License."
],
"execution_count": 0,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "7SN5USFEIIK3"
},
"source": [
"# Word embeddings"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "SZUQErGewZxE"
},
"source": [
"## Setup"
]
},
{
"cell_type": "code",
"metadata": {
"colab_type": "code",
"id": "SIXEk5ON5P7h",
"colab": {}
},
"source": [
"import tensorflow as tf"
],
"execution_count": 0,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"id": "7lPqhc2k2S6_",
"colab_type": "code",
"colab": {}
},
"source": [
"print (tf.__version__)"
],
"execution_count": 0,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"colab_type": "code",
"id": "RutaI-Tpev3T",
"colab": {}
},
"source": [
"from tensorflow import keras\n",
"from tensorflow.keras import layers\n",
"\n",
"import tensorflow_datasets as tfds\n",
"tfds.disable_progress_bar()"
],
"execution_count": 0,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "eqBazMiVQkj1"
},
"source": [
"## Using the Embedding layer\n",
"\n",
"Keras makes it easy to use word embeddings. Let's take a look at the [Embedding](https://www.tensorflow.org/api_docs/python/tf/keras/layers/Embedding) layer.\n",
"\n",
"The Embedding layer can be understood as a lookup table that maps from integer indices (which stand for specific words) to dense vectors (their embeddings). The dimensionality (or width) of the embedding is a parameter you can experiment with to see what works well for your problem, much in the same way you would experiment with the number of neurons in a Dense layer.\n"
]
},
{
"cell_type": "code",
"metadata": {
"colab_type": "code",
"id": "-OjxLVrMvWUE",
"colab": {}
},
"source": [
"embedding_layer = layers.Embedding(1000, 5)"
],
"execution_count": 0,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "2dKKV1L2Rk7e"
},
"source": [
"When you create an Embedding layer, the weights for the embedding are randomly initialized (just like any other layer). During training, they are gradually adjusted via backpropagation. Once trained, the learned word embeddings will roughly encode similarities between words (as they were learned for the specific problem your model is trained on).\n",
"\n",
"If you pass an integer to an embedding layer, the result replaces each integer with the vector from the embedding table:"
]
},
{
"cell_type": "code",
"metadata": {
"colab_type": "code",
"id": "0YUjPgP7w0PO",
"colab": {}
},
"source": [
"result = embedding_layer(tf.constant([1,2,3]))\n",
"result.numpy()"
],
"execution_count": 0,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "O4PC4QzsxTGx"
},
"source": [
"For text or sequence problems, the Embedding layer takes a 2D tensor of integers, of shape `(samples, sequence_length)`, where each entry is a sequence of integers. It can embed sequences of variable lengths. You could feed into the embedding layer above batches with shapes `(32, 10)` (batch of 32 sequences of length 10) or `(64, 15)` (batch of 64 sequences of length 15).\n",
"\n",
"The returned tensor has one more axis than the input, the embedding vectors are aligned along the new last axis. Pass it a `(2, 3)` input batch and the output is `(2, 3, N)`\n"
]
},
{
"cell_type": "code",
"metadata": {
"colab_type": "code",
"id": "vwSYepRjyRGy",
"colab": {}
},
"source": [
"result = embedding_layer(tf.constant([[0,1,2],[3,4,5]]))\n",
"result.shape"
],
"execution_count": 0,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "WGQp2N92yOyB"
},
"source": [
"When given a batch of sequences as input, an embedding layer returns a 3D floating point tensor, of shape `(samples, sequence_length, embedding_dimensionality)`. To convert from this sequence of variable length to a fixed representation there are a variety of standard approaches. You could use an RNN, Attention, or pooling layer before passing it to a Dense layer. This tutorial uses pooling because it's simplest. The [Text Classification with an RNN](text_classification_rnn.ipynb) tutorial is a good next step."
]
},
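{
"cell_type": "markdown",
"metadata": {},
"source": [
"[![](https://raw.githubusercontent.com/BackProp-fr/meetup/master/images/LogoBackPropTranspSmall.png)](https://www.backprop.fr)\n",
"\n",
"A minimal sketch (not in the original tutorial): applying `GlobalAveragePooling1D` to the `(2, 3, 5)` embedding output above averages over the sequence axis and returns one fixed-size vector per example."
]
},
{
"cell_type": "code",
"metadata": {},
"source": [
"# Average over the sequence axis: (2, 3, 5) -> (2, 5)\n",
"pooled = tf.keras.layers.GlobalAveragePooling1D()(result)\n",
"pooled.shape"
],
"execution_count": 0,
"outputs": []
},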
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "aGicgV5qT0wh"
},
"source": [
"## Learning embeddings from scratch"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "_Bh8B1TUT6mV"
},
"source": [
"In this tutorial you will train a sentiment classifier on IMDB movie reviews. In the process, the model will learn embeddings from scratch. We will use to a preprocessed dataset.\n",
"\n",
"To load a text dataset from scratch see the [Loading text tutorial](../load_data/text.ipynb)."
]
},
{
"cell_type": "code",
"metadata": {
"colab_type": "code",
"id": "yg6tyxPtp1TE",
"colab": {}
},
"source": [
"(train_data, test_data), info = tfds.load(\n",
" 'imdb_reviews/subwords8k', \n",
" split = (tfds.Split.TRAIN, tfds.Split.TEST), \n",
" with_info=True, as_supervised=True)"
],
"execution_count": 0,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "jjnBsFXaLVPL"
},
"source": [
"Get the encoder (`tfds.features.text.SubwordTextEncoder`), and have a quick look at the vocabulary. \n",
"\n",
"The \"\\_\" in the vocabulary represent spaces. Note how the vocabulary includes whole words (ending with \"\\_\") and partial words which it can use to build larger words:"
]
},
{
"cell_type": "code",
"metadata": {
"id": "cRi0BCuG3dLD",
"colab_type": "code",
"colab": {}
},
"source": [
"info.features"
],
"execution_count": 0,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"colab_type": "code",
"id": "MYrsTgxhLBfl",
"colab": {}
},
"source": [
"encoder = info.features['text'].encoder\n",
"encoder.subwords[:20]"
],
"execution_count": 0,
"outputs": []
},
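{
"cell_type": "markdown",
"metadata": {},
"source": [
"[![](https://raw.githubusercontent.com/BackProp-fr/meetup/master/images/LogoBackPropTranspSmall.png)](https://www.backprop.fr)\n",
"\n",
"A quick sanity check (not in the original tutorial; the sample string is arbitrary): round-trip a string through the encoder with `encode` and `decode`."
]
},
{
"cell_type": "code",
"metadata": {},
"source": [
"sample_string = 'TensorFlow is great.'\n",
"\n",
"encoded = encoder.encode(sample_string)\n",
"print('Encoded:', encoded)\n",
"\n",
"decoded = encoder.decode(encoded)\n",
"print('Decoded:', decoded)"
],
"execution_count": 0,
"outputs": []
},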
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "GwCTfSG63Qth"
},
"source": [
"Movie reviews can be different lengths. We will use the `padded_batch` method to standardize the lengths of the reviews."
]
},
{
"cell_type": "code",
"metadata": {
"colab_type": "code",
"id": "dRSnJkx4cs9P",
"colab": {}
},
"source": [
"tk1 = train_data.take(1)"
],
"execution_count": 0,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"id": "u4t4yP7G4R5Z",
"colab_type": "code",
"colab": {}
},
"source": [
"print(list(tk1.as_numpy_iterator()))"
],
"execution_count": 0,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "x0mA0P8S6bhT",
"colab_type": "text"
},
"source": [
"[![](https://raw.githubusercontent.com/BackProp-fr/meetup/master/images/LogoBackPropTranspSmall.png)](https://www.backprop.fr)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "MD9rZFGr6zJd",
"colab_type": "text"
},
"source": [
"Premère façon"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "39DQ306e6coM",
"colab_type": "text"
},
"source": [
"Il n'y a pas, à ma connaissance de moyen simple pour calculer le nombre d'enregistrements dans un dataset.\n",
"\n",
"Ci-dessous, 2 façons de faire (voir [ici](https://stackoverflow.com/questions/50737192/tf-data-dataset-how-to-get-the-dataset-size-number-of-elements-in-a-epoch))"
]
},
{
"cell_type": "code",
"metadata": {
"id": "2IXEQWjE5x9J",
"colab_type": "code",
"colab": {}
},
"source": [
"len(list(train_data))"
],
"execution_count": 0,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"id": "Xdbfdxo-56UJ",
"colab_type": "code",
"colab": {}
},
"source": [
"len(list(test_data))"
],
"execution_count": 0,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "L9P7F7fc6122",
"colab_type": "text"
},
"source": [
"Deuxième façon"
]
},
{
"cell_type": "code",
"metadata": {
"id": "airrKCWU6MiT",
"colab_type": "code",
"colab": {}
},
"source": [
"for num, _ in enumerate(train_data):\n",
" pass\n",
"\n",
"print(f'Number of elements: {num+1}')"
],
"execution_count": 0,
"outputs": []
},
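{
"cell_type": "markdown",
"metadata": {},
"source": [
"[![](https://raw.githubusercontent.com/BackProp-fr/meetup/master/images/LogoBackPropTranspSmall.png)](https://www.backprop.fr)\n",
"\n",
"An aside (not one of the two ways above): the `DatasetInfo` object returned by `tfds.load` already records the split sizes, so they can be read without iterating over the dataset."
]
},
{
"cell_type": "code",
"metadata": {},
"source": [
"# Split sizes recorded in the dataset metadata\n",
"print(info.splits['train'].num_examples)\n",
"print(info.splits['test'].num_examples)"
],
"execution_count": 0,
"outputs": []
},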
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "upiYr1-Dc7CF"
},
"source": [
"Note: As of **TensorFlow 2.2** the padded_shapes argument is no longer required. The default behavior is to pad all axes to the longest in the batch."
]
},
{
"cell_type": "code",
"metadata": {
"colab_type": "code",
"id": "ZucJ_jzoc6Sv",
"colab": {}
},
"source": [
"train_batches = train_data.shuffle(1000).padded_batch(10)\n",
"test_batches = test_data.shuffle(1000).padded_batch(10)"
],
"execution_count": 0,
"outputs": []
},
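{
"cell_type": "markdown",
"metadata": {},
"source": [
"[![](https://raw.githubusercontent.com/BackProp-fr/meetup/master/images/LogoBackPropTranspSmall.png)](https://www.backprop.fr)\n",
"\n",
"A compatibility sketch (not in the original tutorial): on TensorFlow versions older than 2.2, `padded_batch` still requires `padded_shapes`; the equivalent call looks like this."
]
},
{
"cell_type": "code",
"metadata": {},
"source": [
"# ([None], []) pads each review to the batch maximum and leaves labels scalar.\n",
"train_batches_compat = train_data.shuffle(1000).padded_batch(\n",
"    10, padded_shapes=([None], []))\n",
"test_batches_compat = test_data.shuffle(1000).padded_batch(\n",
"    10, padded_shapes=([None], []))"
],
"execution_count": 0,
"outputs": []
},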
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "dF8ORMt2U9lj"
},
"source": [
"As imported, the text of reviews is integer-encoded (each integer represents a specific word or word-part in the vocabulary).\n",
"\n",
"Note the trailing zeros, because the batch is padded to the longest example."
]
},
{
"cell_type": "code",
"metadata": {
"colab_type": "code",
"id": "Se-phCknsoan",
"colab": {}
},
"source": [
"train_batch, train_labels = next(iter(train_batches))\n",
"train_batch.numpy()"
],
"execution_count": 0,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "dCgfPYd19AQT",
"colab_type": "text"
},
"source": [
"[![](https://raw.githubusercontent.com/BackProp-fr/meetup/master/images/LogoBackPropTranspSmall.png)](https://www.backprop.fr)"
]
},
{
"cell_type": "code",
"metadata": {
"id": "ON6lAqL18XEC",
"colab_type": "code",
"colab": {}
},
"source": [
"len(train_batch[0])"
],
"execution_count": 0,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"id": "7Av_pqxeBEWm",
"colab_type": "code",
"colab": {}
},
"source": [
"train_batch[0]"
],
"execution_count": 0,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "zI9_wLIiWO8Z"
},
"source": [
"### Create a simple model\n",
"\n",
"We will use the [Keras Sequential API](../../guide/keras) to define our model. In this case it is a \"Continuous bag of words\" style model.\n",
"\n",
"* Next the Embedding layer takes the integer-encoded vocabulary and looks up the embedding vector for each word-index. These vectors are learned as the model trains. The vectors add a dimension to the output array. The resulting dimensions are: `(batch, sequence, embedding)`.\n",
"\n",
"* Next, a GlobalAveragePooling1D layer returns a fixed-length output vector for each example by averaging over the sequence dimension. This allows the model to handle input of variable length, in the simplest way possible.\n",
"\n",
"* This fixed-length output vector is piped through a fully-connected (Dense) layer with 16 hidden units.\n",
"\n",
"* The last layer is densely connected with a single output node. Using the sigmoid activation function, this value is a float between 0 and 1, representing a probability (or confidence level) that the review is positive.\n",
"\n",
"Caution: This model doesn't use masking, so the zero-padding is used as part of the input, so the padding length may affect the output. To fix this, see the [masking and padding guide](../../guide/keras/masking_and_padding)."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "tta9J52zBZzA",
"colab_type": "text"
},
"source": [
"[![](https://raw.githubusercontent.com/BackProp-fr/meetup/master/images/LogoBackPropTranspSmall.png)](https://www.backprop.fr)"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "U8GTedmkBSr-",
"colab_type": "text"
},
"source": [
"Le modèle est simple. "
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "tAspat33BdXm",
"colab_type": "text"
},
"source": [
"Pourquoi utilise t-on GlobalAveragePooling1D ?"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "LnOnTYAIEfnQ",
"colab_type": "text"
},
"source": [
"[Global Average Pooling 2D](https://alexisbcook.github.io/2017/global-average-pooling-layers-for-object-localization/) est relativement bien connu en classification d'images mais que fait 1D ? la même chose mais en 1D !"
]
},
{
"cell_type": "code",
"metadata": {
"id": "HGL1VCHzCLGI",
"colab_type": "code",
"colab": {}
},
"source": [
"input_shape = (2, 3, 4)\n",
"x = tf.random.normal(input_shape)\n",
"x"
],
"execution_count": 0,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"id": "ZQjbqDulCUX4",
"colab_type": "code",
"colab": {}
},
"source": [
"y = tf.keras.layers.GlobalAveragePooling1D()(x)\n",
"y"
],
"execution_count": 0,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"colab_type": "code",
"id": "pHLcFtn5Wsqj",
"colab": {}
},
"source": [
"embedding_dim=16\n",
"\n",
"model = keras.Sequential([\n",
" layers.Embedding(encoder.vocab_size, embedding_dim),\n",
" layers.GlobalAveragePooling1D(),\n",
" layers.Dense(16, activation='relu'),\n",
" layers.Dense(1)\n",
"])\n",
"\n",
"model.summary()"
],
"execution_count": 0,
"outputs": []
},
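{
"cell_type": "markdown",
"metadata": {},
"source": [
"[![](https://raw.githubusercontent.com/BackProp-fr/meetup/master/images/LogoBackPropTranspSmall.png)](https://www.backprop.fr)\n",
"\n",
"The caution in the model description mentions masking. A minimal sketch (not part of the tutorial's model): with `mask_zero=True` the Embedding layer emits a mask, and mask-aware downstream layers skip the padded zeros."
]
},
{
"cell_type": "code",
"metadata": {},
"source": [
"# Standalone demo, not used by the model above.\n",
"masked_embedding = layers.Embedding(encoder.vocab_size, 16, mask_zero=True)\n",
"masked_output = masked_embedding(train_batch)\n",
"print(masked_output.shape)\n",
"# The mask is False at zero-padded positions (if any) of the first example:\n",
"print(masked_output._keras_mask[0, -20:])"
],
"execution_count": 0,
"outputs": []
},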
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "JjLNgKO7W2fe"
},
"source": [
"### Compile and train the model"
]
},
{
"cell_type": "code",
"metadata": {
"colab_type": "code",
"id": "lCUgdP69Wzix",
"colab": {}
},
"source": [
"model.compile(optimizer='adam',\n",
" loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),\n",
" metrics=['accuracy'])\n",
"\n",
"history = model.fit(\n",
" train_batches,\n",
" epochs=10,\n",
" validation_data=test_batches, validation_steps=20)"
],
"execution_count": 0,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "LQjpKVYTXU-1"
},
"source": [
"With this approach our model reaches a validation accuracy of around 88% (note the model is overfitting, training accuracy is significantly higher)."
]
},
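{
"cell_type": "markdown",
"metadata": {},
"source": [
"[![](https://raw.githubusercontent.com/BackProp-fr/meetup/master/images/LogoBackPropTranspSmall.png)](https://www.backprop.fr)\n",
"\n",
"A small addition (not in the original tutorial): evaluating on the full test set gives the final loss and accuracy directly."
]
},
{
"cell_type": "code",
"metadata": {},
"source": [
"test_loss, test_acc = model.evaluate(test_batches, verbose=0)\n",
"print(f'Test loss: {test_loss:.3f} - Test accuracy: {test_acc:.3f}')"
],
"execution_count": 0,
"outputs": []
},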
{
"cell_type": "code",
"metadata": {
"colab_type": "code",
"id": "0D3OTmOT1z1O",
"colab": {}
},
"source": [
"import matplotlib.pyplot as plt\n",
"\n",
"history_dict = history.history\n",
"\n",
"acc = history_dict['accuracy']\n",
"val_acc = history_dict['val_accuracy']\n",
"loss=history_dict['loss']\n",
"val_loss=history_dict['val_loss']\n",
"\n",
"epochs = range(1, len(acc) + 1)\n",
"\n",
"plt.figure(figsize=(12,9))\n",
"plt.plot(epochs, loss, 'bo', label='Training loss')\n",
"plt.plot(epochs, val_loss, 'b', label='Validation loss')\n",
"plt.title('Training and validation loss')\n",
"plt.xlabel('Epochs')\n",
"plt.ylabel('Loss')\n",
"plt.legend()\n",
"plt.show()\n",
"\n",
"plt.figure(figsize=(12,9))\n",
"plt.plot(epochs, acc, 'bo', label='Training acc')\n",
"plt.plot(epochs, val_acc, 'b', label='Validation acc')\n",
"plt.title('Training and validation accuracy')\n",
"plt.xlabel('Epochs')\n",
"plt.ylabel('Accuracy')\n",
"plt.legend(loc='lower right')\n",
"plt.ylim((0.5,1))\n",
"plt.show()"
],
"execution_count": 0,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "KCoA6qwqP836"
},
"source": [
"## Retrieve the learned embeddings\n",
"\n",
"Next, let's retrieve the word embeddings learned during training. This will be a matrix of shape `(vocab_size, embedding-dimension)`."
]
},
{
"cell_type": "code",
"metadata": {
"colab_type": "code",
"id": "t8WwbsXCXtpa",
"colab": {}
},
"source": [
"e = model.layers[0]\n",
"weights = e.get_weights()[0]\n",
"print(weights.shape) # shape: (vocab_size, embedding_dim)"
],
"execution_count": 0,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "J8MiCA77X8B8"
},
"source": [
"We will now write the weights to disk. To use the [Embedding Projector](http://projector.tensorflow.org), we will upload two files in tab separated format: a file of vectors (containing the embedding), and a file of meta data (containing the words)."
]
},
{
"cell_type": "code",
"metadata": {
"colab_type": "code",
"id": "GsjempweP9Lq",
"colab": {}
},
"source": [
"import io\n",
"\n",
"encoder = info.features['text'].encoder\n",
"\n",
"out_v = io.open('vecs.tsv', 'w', encoding='utf-8')\n",
"out_m = io.open('meta.tsv', 'w', encoding='utf-8')\n",
"\n",
"for num, word in enumerate(encoder.subwords):\n",
" vec = weights[num+1] # skip 0, it's padding.\n",
" out_m.write(word + \"\\n\")\n",
" out_v.write('\\t'.join([str(x) for x in vec]) + \"\\n\")\n",
"out_v.close()\n",
"out_m.close()"
],
"execution_count": 0,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "JQyMZWyxYjMr"
},
"source": [
"If you are running this tutorial in [Colaboratory](https://colab.research.google.com), you can use the following snippet to download these files to your local machine (or use the file browser, *View -> Table of contents -> File browser*)."
]
},
{
"cell_type": "code",
"metadata": {
"colab_type": "code",
"id": "-gFbbMmvYvhp",
"colab": {}
},
"source": [
"try:\n",
" from google.colab import files\n",
"except ImportError:\n",
" pass\n",
"else:\n",
" files.download('vecs.tsv')\n",
" files.download('meta.tsv')"
],
"execution_count": 0,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "PXLfFA54Yz-o"
},
"source": [
"## Visualize the embeddings\n",
"\n",
"To visualize our embeddings we will upload them to the embedding projector.\n",
"\n",
"Open the [Embedding Projector](http://projector.tensorflow.org/) (this can also run in a local TensorBoard instance).\n",
"\n",
"* Click on \"Load data\".\n",
"\n",
"* Upload the two files we created above: `vecs.tsv` and `meta.tsv`.\n",
"\n",
"The embeddings you have trained will now be displayed. You can search for words to find their closest neighbors. For example, try searching for \"beautiful\". You may see neighbors like \"wonderful\". \n",
"\n",
"Note: your results may be a bit different, depending on how weights were randomly initialized before training the embedding layer.\n",
"\n",
"Note: experimentally, you may be able to produce more interpretable embeddings by using a simpler model. Try deleting the `Dense(16)` layer, retraining the model, and visualizing the embeddings again.\n",
"\n",
"<img src=\"https://github.com/tensorflow/docs/blob/master/site/en/tutorials/text/images/embedding.jpg?raw=1\" alt=\"Screenshot of the embedding projector\" width=\"400\"/>\n"
]
},
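{
"cell_type": "markdown",
"metadata": {},
"source": [
"[![](https://raw.githubusercontent.com/BackProp-fr/meetup/master/images/LogoBackPropTranspSmall.png)](https://www.backprop.fr)\n",
"\n",
"As a local alternative to the projector (a minimal NumPy sketch, not part of the tutorial; `closest_subwords` is a hypothetical helper), you can look up the nearest neighbours of a subword by cosine similarity over the `weights` matrix retrieved above."
]
},
{
"cell_type": "code",
"metadata": {},
"source": [
"import numpy as np\n",
"\n",
"def closest_subwords(query, k=5):\n",
"    # Hypothetical helper: cosine similarity between the query subword's\n",
"    # embedding and every row of the weight matrix (row 0 is padding,\n",
"    # row i corresponds to encoder.subwords[i - 1]).\n",
"    idx = encoder.subwords.index(query) + 1\n",
"    q = weights[idx]\n",
"    sims = weights @ q / (np.linalg.norm(weights, axis=1) * np.linalg.norm(q) + 1e-9)\n",
"    best = np.argsort(-sims)\n",
"    neighbours = [encoder.subwords[i - 1] for i in best\n",
"                  if 0 < i <= len(encoder.subwords) and i != idx]\n",
"    return neighbours[:k]\n",
"\n",
"closest_subwords(encoder.subwords[0])"
],
"execution_count": 0,
"outputs": []
},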
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "iS_uMeMw3Xpj"
},
"source": [
"## Next steps\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "BSgAZpwF5xF_"
},
"source": [
"This tutorial has shown you how to train and visualize word embeddings from scratch on a small dataset.\n",
"\n",
"* To learn about recurrent networks see the [Keras RNN Guide](../../guide/keras/rnn.ipynb).\n",
"\n",
"* To learn more about text classification (including the overall workflow, and if you're curious about when to use embeddings vs one-hot encodings) we recommend this practical text classification [guide](https://developers.google.com/machine-learning/guides/text-classification/step-2-5)."
]
}
]
}