BrikerMan/nlp-preprosssing.ipynb

## nlp-preprosssing.ipynb
{
  "nbformat": 4,
  "nbformat_minor": 0,
  "metadata": {
    "colab": {
      "name": "nlp-preprosssing.ipynb",
      "version": "0.3.2",
      "provenance": [],
      "collapsed_sections": [],
      "include_colab_link": true
    },
    "language_info": {
      "codemirror_mode": {
        "name": "ipython",
        "version": 3
      },
      "file_extension": ".py",
      "mimetype": "text/x-python",
      "name": "python",
      "nbconvert_exporter": "python",
      "pygments_lexer": "ipython3",
      "version": "3.7.3"
    },
    "kernelspec": {
      "display_name": "Python 3",
      "language": "python",
      "name": "python3"
    }
  },
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "view-in-github",
        "colab_type": "text"
      },
      "source": [
        "<a href=\"https://colab.research.google.com/gist/BrikerMan/8212d33e2824a9f8562b59a764f418f6/nlp-preprosssing.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "Dcagip0WIefl",
        "colab_type": "text"
      },
      "source": [
        "# Processing Text\n",
        "## Steps\n",
        "\n",
        "* converting all letters to lower or upper case\n",
        "* removing or replace punctuations, accent marks and other diacritics\n",
        "* removing white spaces\n",
        "* expanding abbreviations\n",
        "\n",
        "### example Tweet:\n",
        "\n",
        "\"When will the Radical Left Wing Media apologize to me for knowingly getting the Russia Collusion Delusion story so wrong? The real story is about to happen! Why is @nytimes, @washingtonpost, @CNN, @MSNBC allowed to be on Twitter & Facebook. Much of what they do is FAKE NEWS!\"\n",
        "\n",
        "### example Weibo:\n",
        "\n",
        "\"巴黎這兩天蠻冷的！但是也不妨礙我到處走走看看的好心情。假期結束準備開工的朋友們，要加油喔～祝大家#立夏#快樂～[米妮爱你] [米妮爱你] [米妮爱你]\""
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "qjMVg8jgIefm",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "tweet = \"When will the Radical Left Wing Media apologize to me for knowingly getting the Russia Collusion Delusion story so wrong? The real story is about to happen! Why is @nytimes, @washingtonpost, @CNN, @MSNBC allowed to be on Twitter & Facebook. Much of what they do is FAKE NEWS!\""
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "kJix67JeIefp",
        "colab_type": "code",
        "outputId": "505f3d77-44bb-43db-cc20-2d932d340cf9",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 71
        }
      },
      "source": [
        "lower_cased_tweet = tweet.lower()\n",
        "print(lower_cased_tweet)"
      ],
      "execution_count": 3,
      "outputs": [
        {
          "output_type": "stream",
          "text": [
            "when will the radical left wing media apologize to me for knowingly getting the russia collusion delusion story so wrong? the real story is about to happen! why is @nytimes, @washingtonpost, @cnn, @msnbc allowed to be on twitter & facebook. much of what they do is fake news!\n",
            "when will the radical left wing media apologize to me for knowingly getting the russia collusion delusion story so wrong? the real story is about to happen! why is @nytimes, @washingtonpost, @cnn, @msnbc allowed to be on twitter & facebook. much of what they do is fake news!\n"
          ],
          "name": "stdout"
        }
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "NLhBoMcYIefv",
        "colab_type": "code",
        "outputId": "4681b096-e135-47a5-9d2d-3937f36f0590",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 54
        }
      },
      "source": [
        "# Replace & with and\n",
        "lower_cased_tweet = lower_cased_tweet.replace('&', 'and')\n",
        "print(lower_cased_tweet)"
      ],
      "execution_count": 4,
      "outputs": [
        {
          "output_type": "stream",
          "text": [
            "when will the radical left wing media apologize to me for knowingly getting the russia collusion delusion story so wrong? the real story is about to happen! why is @nytimes, @washingtonpost, @cnn, @msnbc allowed to be on twitter and facebook. much of what they do is fake news!\n"
          ],
          "name": "stdout"
        }
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "AOQqBDYKIefx",
        "colab_type": "code",
        "outputId": "5f3ec6a0-52c8-44a1-a0bd-792cc156c97d",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 54
        }
      },
      "source": [
        "# Then we use keras's preprocssing function tokenize this sentence\n",
        "import tensorflow.keras.preprocessing.text as kp_text\n",
        "\n",
        "tokenized_tweet = kp_text.text_to_word_sequence(lower_cased_tweet,\n",
        "                                               filters='!\"#$%&()*+,-./:;<=>?[\\\\]^_`{|}~\\t\\n',\n",
        "                                               lower=True, \n",
        "                                               split=\" \")\n",
        "print(tokenized_tweet)"
      ],
      "execution_count": 6,
      "outputs": [
        {
          "output_type": "stream",
          "text": [
            "['when', 'will', 'the', 'radical', 'left', 'wing', 'media', 'apologize', 'to', 'me', 'for', 'knowingly', 'getting', 'the', 'russia', 'collusion', 'delusion', 'story', 'so', 'wrong', 'the', 'real', 'story', 'is', 'about', 'to', 'happen', 'why', 'is', '@nytimes', '@washingtonpost', '@cnn', '@msnbc', 'allowed', 'to', 'be', 'on', 'twitter', 'and', 'facebook', 'much', 'of', 'what', 'they', 'do', 'is', 'fake', 'news']\n"
          ],
          "name": "stdout"
        }
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "yR-o9EyWIefz",
        "colab_type": "text"
      },
      "source": [
        "# Processing Chinese text"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "oNTac2_CIef0",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "weibo = \"巴黎這兩天蠻冷的！但是也不妨礙我到處走走看看的好心情。假期結束準備開工的朋友們，要加油喔～祝大家#立夏#快樂～[米妮爱你] [米妮爱你] [米妮爱你]\""
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "V_0e8ekEIef2",
        "colab_type": "code",
        "outputId": "3ca3b480-ee43-4d9e-8ee7-2086f380e473",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 170
        }
      },
      "source": [
        "# install Simplified and Traditional Chinese convert tool\n",
        "!pip install hanziconv"
      ],
      "execution_count": 9,
      "outputs": [
        {
          "output_type": "stream",
          "text": [
            "Collecting hanziconv\n",
            "\u001b[?25l  Downloading https://files.pythonhosted.org/packages/63/71/b89cb63077fd807fe31cf7c016a06e7e579a289d8a37aa24a30282d02dd2/hanziconv-0.3.2.tar.gz (276kB)\n",
            "\u001b[K     |████████████████████████████████| 286kB 4.8MB/s \n",
            "\u001b[?25hBuilding wheels for collected packages: hanziconv\n",
            "  Building wheel for hanziconv (setup.py) ... \u001b[?25l\u001b[?25hdone\n",
            "  Stored in directory: /root/.cache/pip/wheels/03/d8/3c/c39898fa9c9ce6e34b0ab4c6604892462d440c743715c94054\n",
            "Successfully built hanziconv\n",
            "Installing collected packages: hanziconv\n",
            "Successfully installed hanziconv-0.3.2\n"
          ],
          "name": "stdout"
        }
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "FU4VAYdBIef4",
        "colab_type": "code",
        "outputId": "027da152-4d25-4ab7-956e-4652e0fb0e07",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 34
        }
      },
      "source": [
        "from hanziconv import HanziConv\n",
        "simplified_weibo = HanziConv.toSimplified(weibo)\n",
        "print(simplified_weibo)"
      ],
      "execution_count": 10,
      "outputs": [
        {
          "output_type": "stream",
          "text": [
            "巴黎这两天蛮冷的！但是也不妨碍我到处走走看看的好心情。假期结束准备开工的朋友们，要加油喔～祝大家#立夏#快乐～[米妮爱你] [米妮爱你] [米妮爱你]\n"
          ],
          "name": "stdout"
        }
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "uw3osDWCIef8",
        "colab_type": "code",
        "outputId": "767a65ab-13f0-48ab-e540-f71d58e993dd",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 34
        }
      },
      "source": [
        "# Remove the emoji text and hashtags\n",
        "import re\n",
        "simplified_weibo = re.sub(r'\\[.{4}\\]', ' ', simplified_weibo)\n",
        "simplified_weibo = re.sub(r'#.*#', ' ', simplified_weibo)\n",
        "simplified_weibo"
      ],
      "execution_count": 11,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "'巴黎这两天蛮冷的！但是也不妨碍我到处走走看看的好心情。假期结束准备开工的朋友们，要加油喔～祝大家 快乐～     '"
            ]
          },
          "metadata": {
            "tags": []
          },
          "execution_count": 11
        }
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "8ZLw48LBIegA",
        "colab_type": "code",
        "outputId": "550670fe-3e33-4dda-d4d7-b424e30335d7",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 34
        }
      },
      "source": [
        "# Remove the white space\n",
        "simplified_weibo = simplified_weibo.strip()\n",
        "simplified_weibo"
      ],
      "execution_count": 13,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "'巴黎这两天蛮冷的！但是也不妨碍我到处走走看看的好心情。假期结束准备开工的朋友们，要加油喔～祝大家 快乐～'"
            ]
          },
          "metadata": {
            "tags": []
          },
          "execution_count": 13
        }
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "JnIqHX98IegC",
        "colab_type": "code",
        "outputId": "4d7c4dd9-cc59-4a57-97a8-c4801f383a2a",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 34
        }
      },
      "source": [
        "# Install Chinese text cutter\n",
        "!pip install jieba"
      ],
      "execution_count": 0,
      "outputs": [
        {
          "output_type": "stream",
          "text": [
            "Requirement already satisfied: jieba in /usr/local/lib/python3.6/dist-packages (0.39)\n"
          ],
          "name": "stdout"
        }
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "ZRwdChQZIegG",
        "colab_type": "code",
        "outputId": "cae1d507-b18f-4d02-b3e3-31600b347224",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 122
        }
      },
      "source": [
        "import jieba\n",
        "segmented_weibo = list(jieba.cut(simplified_weibo))\n",
        "print(segmented_weibo)"
      ],
      "execution_count": 0,
      "outputs": [
        {
          "output_type": "stream",
          "text": [
            "Building prefix dict from the default dictionary ...\n",
            "Dumping model to file cache /tmp/jieba.cache\n",
            "Loading model cost 1.083 seconds.\n",
            "Prefix dict has been built succesfully.\n"
          ],
          "name": "stderr"
        },
        {
          "output_type": "stream",
          "text": [
            "['巴黎', '这', '两天', '蛮', '冷', '的', '！', '但是', '也', '不', '妨碍', '我', '到处', '走走看看', '的', '好', '心情', '。', '假期', '结束', '准备', '开工', '的', '朋友', '们', '，', '要', '加油', '喔', '～', '祝', '大家', ' ', '快乐', '～']\n"
          ],
          "name": "stdout"
        }
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "QpfnqpNYIegJ",
        "colab_type": "code",
        "outputId": "d8b4b864-8a05-4a7f-cbcd-d442dacdea59",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 54
        }
      },
      "source": [
        "tokenized_weibo = []\n",
        "for word in segmented_weibo:\n",
        "    if word not in [\"！\", \"。\", \"，\", \"～\", ' ']:\n",
        "        tokenized_weibo.append(word)\n",
        "print(tokenized_weibo)"
      ],
      "execution_count": 0,
      "outputs": [
        {
          "output_type": "stream",
          "text": [
            "['巴黎', '这', '两天', '蛮', '冷', '的', '但是', '也', '不', '妨碍', '我', '到处', '走走看看', '的', '好', '心情', '假期', '结束', '准备', '开工', '的', '朋友', '们', '要', '加油', '喔', '祝', '大家', '快乐']\n"
          ],
          "name": "stdout"
        }
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "YtXqhzLLIegM",
        "colab_type": "code",
        "outputId": "e17fc6a4-3736-40ab-9dc2-8ebd1c1253a1",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 54
        }
      },
      "source": [
        "segmented_weibo_text = ' '.join(jieba.cut(simplified_weibo))\n",
        "print(segmented_weibo_text)"
      ],
      "execution_count": 0,
      "outputs": [
        {
          "output_type": "stream",
          "text": [
            "巴黎 这 两天 蛮 冷 的 ！ 但是 也 不 妨碍 我 到处 走走看看 的 好 心情 。 假期 结束 准备 开工 的 朋友 们 ， 要 加油 喔 ～ 祝 大家   快乐 ～\n"
          ],
          "name": "stdout"
        }
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "h8xucd2oJ2hI",
        "colab_type": "code",
        "outputId": "01fe8cec-e0e8-43a6-c33b-1ecabf0163e4",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 54
        }
      },
      "source": [
        "tokenized_weibo = kp_text.text_to_word_sequence(segmented_weibo_text,\n",
        "                                               filters='!\"#$%&()*+,-./:;<=>?[\\\\]^_`{|}~\\t\\n！。，～',\n",
        "                                               lower=True, \n",
        "                                               split=\" \")\n",
        "print(tokenized_weibo)"
      ],
      "execution_count": 0,
      "outputs": [
        {
          "output_type": "stream",
          "text": [
            "['巴黎', '这', '两天', '蛮', '冷', '的', '但是', '也', '不', '妨碍', '我', '到处', '走走看看', '的', '好', '心情', '假期', '结束', '准备', '开工', '的', '朋友', '们', '要', '加油', '喔', '祝', '大家', '快乐']\n"
          ],
          "name": "stdout"
        }
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "LgfwplDQKkeA",
        "colab_type": "text"
      },
      "source": [
        "## Convert Token to token index"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "tvf3TzGwKqno",
        "colab_type": "code",
        "outputId": "999d72dd-316f-4628-c8a9-25eec41d706c",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 54
        }
      },
      "source": [
        "# Build token dict\n",
        "\n",
        "token_index = {\n",
        "    '<PAD>' : 0,  # Padding placeholder\n",
        "    '<UNK>' : 1   # unknown/ New word placeholder\n",
        "}\n",
        "\n",
        "for token in tokenized_tweet:\n",
        "  if token not in token_index:\n",
        "    token_index[token] =len(token_index)\n",
        "\n",
        "print(token_index)"
      ],
      "execution_count": 0,
      "outputs": [
        {
          "output_type": "stream",
          "text": [
            "{'<PAD>': 0, '<UNK>': 1, 'when': 2, 'will': 3, 'the': 4, 'radical': 5, 'left': 6, 'wing': 7, 'media': 8, 'apologize': 9, 'to': 10, 'me': 11, 'for': 12, 'knowingly': 13, 'getting': 14, 'russia': 15, 'collusion': 16, 'delusion': 17, 'story': 18, 'so': 19, 'wrong': 20, 'real': 21, 'is': 22, 'about': 23, 'happen': 24, 'why': 25, '@nytimes': 26, '@washingtonpost': 27, '@cnn': 28, '@msnbc': 29, 'allowed': 30, 'be': 31, 'on': 32, 'twitter': 33, 'and': 34, 'facebook': 35, 'much': 36, 'of': 37, 'what': 38, 'they': 39, 'do': 40, 'fake': 41, 'news': 42}\n"
          ],
          "name": "stdout"
        }
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "JEFw2FUlLEzt",
        "colab_type": "code",
        "outputId": "3459eccb-283f-47eb-e441-b7b01617efbe",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 51
        }
      },
      "source": [
        "# Tokenize Sentence\n",
        "\n",
        "def tokenize(sentence):\n",
        "  token_ids = []\n",
        "  for token in sentence:\n",
        "    token_ids.append(token_index.get(token, token_index['<UNK>']))\n",
        "  return token_ids\n",
        "\n",
        "print(tokenize('the real story is about to happen'.split(' ')))\n",
        "print(tokenize('please apologize to me'.split(' ')))\n"
      ],
      "execution_count": 0,
      "outputs": [
        {
          "output_type": "stream",
          "text": [
            "[4, 21, 18, 22, 23, 10, 24]\n",
            "[1, 9, 10, 11]\n"
          ],
          "name": "stdout"
        }
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "idFpvGEXMof3",
        "colab_type": "text"
      },
      "source": [
        "### Word Embedding\n",
        "\n",
        "# Example 1\n",
        " \n",
        "* The X is a small carnivorous mammal.\n",
        "* The Y is a small songbird with a roundish body\n",
        "\n",
        "\n",
        "# Example 2\n",
        "\n",
        "Lets assume word with same length is some word.\n",
        "\n",
        "* A ███ ████ ██ █████  100.\n",
        "* B ███ ████ ██ █████  99.\n",
        "* C ███ ████ ██ █████  cat.\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "C-gzf0X_OBH5",
        "colab_type": "text"
      },
      "source": [
        "# The Distributional Hypothesis\n",
        "\n",
        "The Distributional Hypothesis is that words that occur in the same contexts tend to have similar meanings (Harris, 1954). \n",
        "\n",
        "\n",
        "# Word Embedding methods\n",
        "\n",
        "* Word2Vec \n",
        "* Glove"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "w9DjtxO0TYDV",
        "colab_type": "code",
        "outputId": "c1b740c7-53dd-48aa-9e01-7ecfd2ff34d1",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 666
        }
      },
      "source": [
        "# Install gensim\n",
        "!pip install gensim\n",
        "# Download and unzip glove model\n",
        "!wget http://nlp.stanford.edu/data/glove.6B.zip\n",
        "!unzip glove.6B.zip\n"
      ],
      "execution_count": 0,
      "outputs": [
        {
          "output_type": "stream",
          "text": [
            "Requirement already satisfied: gensim in /usr/local/lib/python3.6/dist-packages (3.6.0)\n",
            "Requirement already satisfied: scipy>=0.18.1 in /usr/local/lib/python3.6/dist-packages (from gensim) (1.2.1)\n",
            "Requirement already satisfied: numpy>=1.11.3 in /usr/local/lib/python3.6/dist-packages (from gensim) (1.16.3)\n",
            "Requirement already satisfied: smart-open>=1.2.1 in /usr/local/lib/python3.6/dist-packages (from gensim) (1.8.3)\n",
            "Requirement already satisfied: six>=1.5.0 in /usr/local/lib/python3.6/dist-packages (from gensim) (1.12.0)\n",
            "Requirement already satisfied: boto3 in /usr/local/lib/python3.6/dist-packages (from smart-open>=1.2.1->gensim) (1.9.139)\n",
            "Requirement already satisfied: boto>=2.32 in /usr/local/lib/python3.6/dist-packages (from smart-open>=1.2.1->gensim) (2.49.0)\n",
            "Requirement already satisfied: requests in /usr/local/lib/python3.6/dist-packages (from smart-open>=1.2.1->gensim) (2.21.0)\n",
            "Requirement already satisfied: jmespath<1.0.0,>=0.7.1 in /usr/local/lib/python3.6/dist-packages (from boto3->smart-open>=1.2.1->gensim) (0.9.4)\n",
            "Requirement already satisfied: s3transfer<0.3.0,>=0.2.0 in /usr/local/lib/python3.6/dist-packages (from boto3->smart-open>=1.2.1->gensim) (0.2.0)\n",
            "Requirement already satisfied: botocore<1.13.0,>=1.12.139 in /usr/local/lib/python3.6/dist-packages (from boto3->smart-open>=1.2.1->gensim) (1.12.139)\n",
            "Requirement already satisfied: idna<2.9,>=2.5 in /usr/local/lib/python3.6/dist-packages (from requests->smart-open>=1.2.1->gensim) (2.8)\n",
            "Requirement already satisfied: chardet<3.1.0,>=3.0.2 in /usr/local/lib/python3.6/dist-packages (from requests->smart-open>=1.2.1->gensim) (3.0.4)\n",
            "Requirement already satisfied: urllib3<1.25,>=1.21.1 in /usr/local/lib/python3.6/dist-packages (from requests->smart-open>=1.2.1->gensim) (1.24.2)\n",
            "Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.6/dist-packages (from requests->smart-open>=1.2.1->gensim) (2019.3.9)\n",
            "Requirement already satisfied: python-dateutil<3.0.0,>=2.1; python_version >= \"2.7\" in /usr/local/lib/python3.6/dist-packages (from botocore<1.13.0,>=1.12.139->boto3->smart-open>=1.2.1->gensim) (2.5.3)\n",
            "Requirement already satisfied: docutils>=0.10 in /usr/local/lib/python3.6/dist-packages (from botocore<1.13.0,>=1.12.139->boto3->smart-open>=1.2.1->gensim) (0.14)\n",
            "--2019-05-06 04:02:33--  http://nlp.stanford.edu/data/glove.6B.zip\n",
            "Resolving nlp.stanford.edu (nlp.stanford.edu)... 171.64.67.140\n",
            "Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:80... connected.\n",
            "HTTP request sent, awaiting response... 302 Found\n",
            "Location: https://nlp.stanford.edu/data/glove.6B.zip [following]\n",
            "--2019-05-06 04:02:34--  https://nlp.stanford.edu/data/glove.6B.zip\n",
            "Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:443... connected.\n",
            "HTTP request sent, awaiting response... 200 OK\n",
            "Length: 862182613 (822M) [application/zip]\n",
            "Saving to: ‘glove.6B.zip’\n",
            "\n",
            "glove.6B.zip        100%[===================>] 822.24M  30.9MB/s    in 25s     \n",
            "\n",
            "2019-05-06 04:02:59 (33.0 MB/s) - ‘glove.6B.zip’ saved [862182613/862182613]\n",
            "\n",
            "Archive:  glove.6B.zip\n",
            "  inflating: glove.6B.50d.txt        \n",
            "  inflating: glove.6B.100d.txt       \n",
            "  inflating: glove.6B.200d.txt       \n",
            "  inflating: glove.6B.300d.txt       \n"
          ],
          "name": "stdout"
        }
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "2CFd1uJ4TukJ",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "# Conver gensim embedding format to gensim format\n",
        "from gensim.models import KeyedVectors\n",
        "from gensim.scripts.glove2word2vec import glove2word2vec\n",
        "\n",
        "\n",
        "_ = glove2word2vec('glove.6B.50d.txt', \"gensim.6B.50d.txt\")"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "i9HWjcIJUQsJ",
        "colab_type": "code",
        "outputId": "0d9e4ed2-05ef-40e4-e2c6-b96d8fedc8c2",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 34
        }
      },
      "source": [
        "# Load model\n",
        "w2v = KeyedVectors.load_word2vec_format(\"gensim.6B.50d.txt\")\n",
        "print('Model loaded')"
      ],
      "execution_count": 0,
      "outputs": [
        {
          "output_type": "stream",
          "text": [
            "Model loaded\n"
          ],
          "name": "stdout"
        }
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "c5sPkYd8UuCh",
        "colab_type": "code",
        "outputId": "d426d0b1-5501-4b76-b450-209ffce096df",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 170
        }
      },
      "source": [
        "w2v['police']"
      ],
      "execution_count": 0,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "array([ 0.49725 , -1.1949  ,  0.37137 , -0.081662,  0.69114 , -0.69982 ,\n",
              "       -0.25723 ,  0.5943  ,  0.059978, -1.499   , -0.07122 , -1.0053  ,\n",
              "       -0.73845 , -0.40988 ,  0.43074 , -0.46757 , -0.36498 ,  0.29674 ,\n",
              "       -0.62775 , -0.41573 ,  0.28614 ,  1.1718  , -0.21516 ,  0.62029 ,\n",
              "       -0.85242 , -2.4672  ,  0.14414 ,  0.066415, -0.37916 , -0.65373 ,\n",
              "        2.7482  , -0.28856 , -0.45409 , -1.354   ,  0.58534 ,  1.0112  ,\n",
              "        0.67715 , -1.1708  ,  0.36475 ,  1.1886  , -0.28727 ,  1.2292  ,\n",
              "        0.58489 , -0.20625 ,  0.90859 , -0.88349 , -0.85085 ,  0.12378 ,\n",
              "        0.85397 , -0.65035 ], dtype=float32)"
            ]
          },
          "metadata": {
            "tags": []
          },
          "execution_count": 33
        }
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "AXrAlfodW41o",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "w2v.most_similar('twitter')"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "mvJs1V-vW5sd",
        "colab_type": "code",
        "outputId": "9d5952a3-bcc6-4b09-b9f3-5d36e776e553",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 204
        }
      },
      "source": [
        "vector = w2v['king'] - w2v['man'] + w2v['girl']\n",
        "vector"
      ],
      "execution_count": 0,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "array([ 0.48325002,  1.14838   , -1.17521   , -0.07914096,  0.73416   ,\n",
              "        0.49496996,  0.1458    ,  0.59867   , -0.517686  , -0.580476  ,\n",
              "       -0.07308602,  0.82494   ,  0.137042  , -0.8883    ,  0.81584007,\n",
              "        0.8903699 , -0.5582    , -0.42242998, -0.29918298,  0.80211   ,\n",
              "        0.75101   ,  0.23294997,  0.141601  ,  0.14125001,  0.46431997,\n",
              "       -2.0959    , -1.54637   , -0.87736   , -0.17232001,  0.23888993,\n",
              "        1.9121    ,  0.17994   , -0.59101003,  0.70541   ,  0.54674006,\n",
              "        0.0413    , -0.10025001, -0.20222002, -1.02404   ,  0.00300002,\n",
              "        0.37078002,  0.79142   , -0.32330203, -0.78456   , -0.21645999,\n",
              "        0.51207006, -0.66792   , -1.6194    , -0.58851   , -0.513301  ],\n",
              "      dtype=float32)"
            ]
          },
          "metadata": {
            "tags": []
          },
          "execution_count": 45
        }
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "QNevk531XBv7",
        "colab_type": "code",
        "outputId": "7a3178e0-f13b-4345-c06f-5b4892bf7c81",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 241
        }
      },
      "source": [
        "w2v.similar_by_vector(vector)"
      ],
      "execution_count": 0,
      "outputs": [
        {
          "output_type": "stream",
          "text": [
            "/usr/local/lib/python3.6/dist-packages/gensim/matutils.py:737: FutureWarning: Conversion of the second argument of issubdtype from `int` to `np.signedinteger` is deprecated. In future, it will be treated as `np.int64 == np.dtype(int).type`.\n",
            "  if np.issubdtype(vec.dtype, np.int):\n"
          ],
          "name": "stderr"
        },
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "[('king', 0.9276771545410156),\n",
              " ('queen', 0.871542751789093),\n",
              " ('prince', 0.8473458290100098),\n",
              " ('kingdom', 0.7919527888298035),\n",
              " ('princess', 0.7683680653572083),\n",
              " ('throne', 0.7578790783882141),\n",
              " ('son', 0.7313674092292786),\n",
              " ('ii', 0.7281480431556702),\n",
              " ('elizabeth', 0.7234716415405273),\n",
              " ('coronation', 0.7180668115615845)]"
            ]
          },
          "metadata": {
            "tags": []
          },
          "execution_count": 46
        }
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "Cap5GJP7YHaX",
        "colab_type": "code",
        "outputId": "346463f8-4ee6-4e89-9187-4b53b1cd22c9",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 105
        }
      },
      "source": [
        "print(w2v.similarity('man', 'woman'))\n",
        "print(w2v.similarity('man', 'cup'))"
      ],
      "execution_count": 0,
      "outputs": [
        {
          "output_type": "stream",
          "text": [
            "0.8860338\n",
            "0.33476463\n"
          ],
          "name": "stdout"
        },
        {
          "output_type": "stream",
          "text": [
            "/usr/local/lib/python3.6/dist-packages/gensim/matutils.py:737: FutureWarning: Conversion of the second argument of issubdtype from `int` to `np.signedinteger` is deprecated. In future, it will be treated as `np.int64 == np.dtype(int).type`.\n",
            "  if np.issubdtype(vec.dtype, np.int):\n"
          ],
          "name": "stderr"
        }
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "hYaKYyK4Uzj9",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "def sentence2embendding(sentence):\n",
        "  embedding = []\n",
        "  for token in sentence:\n",
        "    embedding.append(w2v[token])\n",
        "  return embedding"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "4DepM3biVl3m",
        "colab_type": "code",
        "outputId": "ddcf6cc1-10ae-4eab-a43c-4c867880a5d4",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 884
        }
      },
      "source": [
        "sentence2embendding('natural language processing is fun'.split(' '))"
      ],
      "execution_count": 0,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "[array([ 0.44265 ,  0.84765 , -0.4598  ,  0.67993 ,  0.13841 ,  0.39456 ,\n",
              "        -0.17343 , -0.64055 ,  0.86439 ,  0.81624 ,  0.75738 ,  0.41143 ,\n",
              "         1.0935  , -0.30068 , -0.08486 ,  0.51784 ,  1.087   ,  0.45061 ,\n",
              "        -0.49595 , -0.6065  , -0.16749 , -0.28557 , -0.043719, -0.86154 ,\n",
              "         0.3396  , -0.7524  , -0.33206 ,  0.24668 ,  1.0021  ,  0.71923 ,\n",
              "         3.3077  , -0.64755 ,  0.1651  , -0.91651 , -0.035363,  0.21794 ,\n",
              "        -0.87897 ,  0.37801 ,  0.66733 ,  0.42054 , -0.21387 ,  0.15917 ,\n",
              "         0.52245 ,  0.20587 , -0.16714 ,  0.58058 , -0.36828 ,  0.035571,\n",
              "         0.014099, -0.24817 ], dtype=float32),\n",
              " array([-5.7990e-01, -1.1010e-01, -1.1557e+00, -2.9906e-03, -2.0613e-01,\n",
              "         4.5289e-01, -1.6671e-01, -1.0382e+00, -9.9241e-01,  3.9884e-01,\n",
              "         5.9230e-01,  2.2990e-01,  1.5213e+00, -1.7764e-01, -2.9726e-01,\n",
              "        -3.9235e-01, -7.8471e-01,  1.5594e-01,  6.9077e-01,  5.9537e-01,\n",
              "        -4.4340e-01,  5.3514e-01,  3.2853e-01,  1.2437e+00,  1.2972e+00,\n",
              "        -1.3878e+00, -1.0925e+00, -4.0925e-01, -5.6971e-01, -3.4656e-01,\n",
              "         3.7163e+00, -1.0489e+00, -4.6708e-01, -4.4739e-01,  6.2300e-03,\n",
              "         1.9649e-02, -4.0161e-01, -6.2913e-01, -8.2506e-01,  4.5591e-01,\n",
              "         8.2626e-01,  5.7091e-01,  2.1199e-01,  4.6865e-01, -6.0027e-01,\n",
              "         2.9920e-01,  6.7944e-01,  1.4238e+00, -3.2152e-02, -1.2603e-01],\n",
              "       dtype=float32),\n",
              " array([ 1.6092e-01, -9.0221e-01,  1.5797e-01,  1.1776e+00, -6.2201e-04,\n",
              "        -1.9004e-02, -1.5081e-01, -5.8863e-01,  1.5128e+00,  4.2868e-01,\n",
              "         6.6918e-01,  1.9839e-03,  1.2855e+00, -3.5187e-01, -9.9470e-03,\n",
              "        -1.2504e-02, -6.4740e-01,  6.9845e-01,  6.5602e-01, -7.2214e-01,\n",
              "         1.3672e+00, -9.7753e-01, -9.8096e-02, -3.0653e-01,  1.9883e-01,\n",
              "        -3.5172e-01,  4.1837e-01, -3.5796e-01,  9.4309e-01, -4.2809e-01,\n",
              "         3.2094e+00, -9.3491e-01, -3.6937e-02, -4.1309e-01,  2.0524e-01,\n",
              "         9.0929e-01, -3.8058e-01,  8.0895e-01,  6.3268e-01,  5.2462e-01,\n",
              "         4.0734e-01, -2.6902e-01,  2.3058e-01,  1.5873e-01,  2.9422e-01,\n",
              "        -1.4096e-01,  9.0818e-01,  6.9391e-01, -1.2928e-01,  8.2681e-03],\n",
              "       dtype=float32),\n",
              " array([ 6.1850e-01,  6.4254e-01, -4.6552e-01,  3.7570e-01,  7.4838e-01,\n",
              "         5.3739e-01,  2.2239e-03, -6.0577e-01,  2.6408e-01,  1.1703e-01,\n",
              "         4.3722e-01,  2.0092e-01, -5.7859e-02, -3.4589e-01,  2.1664e-01,\n",
              "         5.8573e-01,  5.3919e-01,  6.9490e-01, -1.5618e-01,  5.5830e-02,\n",
              "        -6.0515e-01, -2.8997e-01, -2.5594e-02,  5.5593e-01,  2.5356e-01,\n",
              "        -1.9612e+00, -5.1381e-01,  6.9096e-01,  6.6246e-02, -5.4224e-02,\n",
              "         3.7871e+00, -7.7403e-01, -1.2689e-01, -5.1465e-01,  6.6705e-02,\n",
              "        -3.2933e-01,  1.3483e-01,  1.9049e-01,  1.3812e-01, -2.1503e-01,\n",
              "        -1.6573e-02,  3.1200e-01, -3.3189e-01, -2.6001e-02, -3.8203e-01,\n",
              "         1.9403e-01, -1.2466e-01, -2.7557e-01,  3.0899e-01,  4.8497e-01],\n",
              "       dtype=float32),\n",
              " array([-0.23764 ,  0.43119 , -0.72154 , -0.15513 ,  0.26631 , -0.4445  ,\n",
              "        -0.4452  , -0.66205 , -0.35055 ,  1.0197  , -0.91729 ,  0.20477 ,\n",
              "        -0.44747 ,  0.071965,  0.82335 , -0.023837,  0.76155 ,  0.9766  ,\n",
              "        -0.44837 , -0.7963  ,  0.027471,  1.0583  ,  0.41688 ,  0.73721 ,\n",
              "         1.1753  , -0.84602 , -1.1009  ,  0.68862 ,  1.2378  , -0.98452 ,\n",
              "         2.3631  ,  1.0793  , -0.15267 ,  0.13733 , -0.28105 ,  0.37881 ,\n",
              "        -0.15635 ,  0.47079 , -0.47013 , -0.53729 ,  0.20005 ,  0.16393 ,\n",
              "        -0.53282 ,  0.63922 ,  0.44527 , -0.10678 ,  0.44169 , -0.47774 ,\n",
              "         0.37379 ,  1.0782  ], dtype=float32)]"
            ]
          },
          "metadata": {
            "tags": []
          },
          "execution_count": 37
        }
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "wqXcblyWWW67",
        "colab_type": "text"
      },
      "source": [
        "# Play with visualized Word2vec \n",
        "\n",
        "https://projector.tensorflow.org/"
      ]
    }
  ]
}