avidale/fasttext_similarity_weirdness.ipynb

## fasttext_similarity_weirdness.ipynb
{
  "nbformat": 4,
  "nbformat_minor": 0,
  "metadata": {
    "colab": {
      "name": "fasttext_similarity_weirdness.ipynb",
      "provenance": [],
      "collapsed_sections": [],
      "authorship_tag": "ABX9TyN/LQY3jVFwwrNjNQZUeOc0",
      "include_colab_link": true
    },
    "kernelspec": {
      "name": "python3",
      "display_name": "Python 3"
    }
  },
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "view-in-github",
        "colab_type": "text"
      },
      "source": [
        "<a href=\"https://colab.research.google.com/gist/avidale/c6b1d13b32a36f19750cd01148560561/fasttext_similarity_weirdness.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "GOaB-MD8XHmZ",
        "colab_type": "text"
      },
      "source": [
        "In this stub, I want to demonstrate some weird stuff that happens when we use a gensim fastText model to search for similar words."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "Sb0Xoq8OUrUf",
        "colab_type": "code",
        "outputId": "b6f70c67-7dd8-4ac0-cb4d-4be85014f4ae",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 204
        }
      },
      "source": [
        "!wget http://vectors.nlpl.eu/repository/20/181.zip"
      ],
      "execution_count": 5,
      "outputs": [
        {
          "output_type": "stream",
          "text": [
            "--2020-03-03 19:54:09--  http://vectors.nlpl.eu/repository/20/181.zip\n",
            "Resolving vectors.nlpl.eu (vectors.nlpl.eu)... 129.240.189.225\n",
            "Connecting to vectors.nlpl.eu (vectors.nlpl.eu)|129.240.189.225|:80... connected.\n",
            "HTTP request sent, awaiting response... 200 OK\n",
            "Length: 2622716250 (2.4G) [application/zip]\n",
            "Saving to: ‘181.zip’\n",
            "\n",
            "181.zip             100%[===================>]   2.44G  23.0MB/s    in 1m 54s  \n",
            "\n",
            "2020-03-03 19:56:09 (22.0 MB/s) - ‘181.zip’ saved [2622716250/2622716250]\n",
            "\n"
          ],
          "name": "stdout"
        }
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "RCnoZQ0AUzpB",
        "colab_type": "code",
        "outputId": "8dc1ae30-e184-4dac-81eb-5b0efc79050c",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 136
        }
      },
      "source": [
        "!unzip 181.zip"
      ],
      "execution_count": 6,
      "outputs": [
        {
          "output_type": "stream",
          "text": [
            "Archive:  181.zip\n",
            "  inflating: meta.json               \n",
            "  inflating: model.model             \n",
            "  inflating: model.model.vectors_ngrams.npy  \n",
            "  inflating: model.model.vectors.npy  \n",
            "  inflating: model.model.vectors_vocab.npy  \n",
            "  inflating: README                  \n"
          ],
          "name": "stdout"
        }
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "eqxHi7oUVBrL",
        "colab_type": "code",
        "outputId": "49af8c00-85a8-4f8b-e93a-b22e7b5d3187",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 68
        }
      },
      "source": [
        "!ls"
      ],
      "execution_count": 7,
      "outputs": [
        {
          "output_type": "stream",
          "text": [
            "181.zip      model.model.vectors_ngrams.npy  README\n",
            "meta.json    model.model.vectors.npy\t     sample_data\n",
            "model.model  model.model.vectors_vocab.npy\n"
          ],
          "name": "stdout"
        }
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "a_lX93QSVyCl",
        "colab_type": "code",
        "outputId": "28c19cc7-6acb-467a-bed6-d26c6aaa9840",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 511
        }
      },
      "source": [
        "!pip install gensim==3.8.1"
      ],
      "execution_count": 8,
      "outputs": [
        {
          "output_type": "stream",
          "text": [
            "Collecting gensim==3.8.1\n",
            "\u001b[?25l  Downloading https://files.pythonhosted.org/packages/d1/dd/112bd4258cee11e0baaaba064060eb156475a42362e59e3ff28e7ca2d29d/gensim-3.8.1-cp36-cp36m-manylinux1_x86_64.whl (24.2MB)\n",
            "\u001b[K     |████████████████████████████████| 24.2MB 1.6MB/s \n",
            "\u001b[?25hRequirement already satisfied: scipy>=0.18.1 in /usr/local/lib/python3.6/dist-packages (from gensim==3.8.1) (1.4.1)\n",
            "Requirement already satisfied: six>=1.5.0 in /usr/local/lib/python3.6/dist-packages (from gensim==3.8.1) (1.12.0)\n",
            "Requirement already satisfied: smart-open>=1.8.1 in /usr/local/lib/python3.6/dist-packages (from gensim==3.8.1) (1.9.0)\n",
            "Requirement already satisfied: numpy>=1.11.3 in /usr/local/lib/python3.6/dist-packages (from gensim==3.8.1) (1.17.5)\n",
            "Requirement already satisfied: boto3 in /usr/local/lib/python3.6/dist-packages (from smart-open>=1.8.1->gensim==3.8.1) (1.11.15)\n",
            "Requirement already satisfied: requests in /usr/local/lib/python3.6/dist-packages (from smart-open>=1.8.1->gensim==3.8.1) (2.21.0)\n",
            "Requirement already satisfied: boto>=2.32 in /usr/local/lib/python3.6/dist-packages (from smart-open>=1.8.1->gensim==3.8.1) (2.49.0)\n",
            "Requirement already satisfied: botocore<1.15.0,>=1.14.15 in /usr/local/lib/python3.6/dist-packages (from boto3->smart-open>=1.8.1->gensim==3.8.1) (1.14.15)\n",
            "Requirement already satisfied: jmespath<1.0.0,>=0.7.1 in /usr/local/lib/python3.6/dist-packages (from boto3->smart-open>=1.8.1->gensim==3.8.1) (0.9.4)\n",
            "Requirement already satisfied: s3transfer<0.4.0,>=0.3.0 in /usr/local/lib/python3.6/dist-packages (from boto3->smart-open>=1.8.1->gensim==3.8.1) (0.3.3)\n",
            "Requirement already satisfied: urllib3<1.25,>=1.21.1 in /usr/local/lib/python3.6/dist-packages (from requests->smart-open>=1.8.1->gensim==3.8.1) (1.24.3)\n",
            "Requirement already satisfied: chardet<3.1.0,>=3.0.2 in /usr/local/lib/python3.6/dist-packages (from requests->smart-open>=1.8.1->gensim==3.8.1) (3.0.4)\n",
            "Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.6/dist-packages (from requests->smart-open>=1.8.1->gensim==3.8.1) (2019.11.28)\n",
            "Requirement already satisfied: idna<2.9,>=2.5 in /usr/local/lib/python3.6/dist-packages (from requests->smart-open>=1.8.1->gensim==3.8.1) (2.8)\n",
            "Requirement already satisfied: python-dateutil<3.0.0,>=2.1 in /usr/local/lib/python3.6/dist-packages (from botocore<1.15.0,>=1.14.15->boto3->smart-open>=1.8.1->gensim==3.8.1) (2.6.1)\n",
            "Requirement already satisfied: docutils<0.16,>=0.10 in /usr/local/lib/python3.6/dist-packages (from botocore<1.15.0,>=1.14.15->boto3->smart-open>=1.8.1->gensim==3.8.1) (0.15.2)\n",
            "Installing collected packages: gensim\n",
            "  Found existing installation: gensim 3.6.0\n",
            "    Uninstalling gensim-3.6.0:\n",
            "      Successfully uninstalled gensim-3.6.0\n",
            "Successfully installed gensim-3.8.1\n"
          ],
          "name": "stdout"
        },
        {
          "output_type": "display_data",
          "data": {
            "application/vnd.colab-display-data+json": {
              "pip_warning": {
                "packages": [
                  "gensim"
                ]
              }
            }
          },
          "metadata": {
            "tags": []
          }
        }
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "vtSWKrx1VavY",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "import gensim"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "v_UnFRKZU56Q",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "model = gensim.models.fasttext.FastTextKeyedVectors.load('model.model')"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "2X3prObOVd-Y",
        "colab_type": "code",
        "outputId": "1262c8b9-d409-4e6d-ec07-147489be475f",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 34
        }
      },
      "source": [
        "word = 'покемошечка'\n",
        "word in model.vocab  # we are deliberately taking an OOV word to demonstrate that similarity is incorrect with ngrams"
      ],
      "execution_count": 3,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "False"
            ]
          },
          "metadata": {
            "tags": []
          },
          "execution_count": 3
        }
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "wfZ8mNRPVoop",
        "colab_type": "code",
        "outputId": "0d01c04f-a34f-4226-b2fc-670cccc2feb7",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 187
        }
      },
      "source": [
        "model.most_similar(word)"
      ],
      "execution_count": 4,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "[('юлечка', 0.7381488680839539),\n",
              " ('лялечка', 0.7292031645774841),\n",
              " ('алечка', 0.708588182926178),\n",
              " ('кошечка', 0.7078714370727539),\n",
              " ('илюшечка', 0.7053546905517578),\n",
              " ('лешечка', 0.701703667640686),\n",
              " ('лилечка', 0.7000791430473328),\n",
              " ('сашечка', 0.6995923519134521),\n",
              " ('лёнечка', 0.6978040933609009),\n",
              " ('лелечка', 0.6871213316917419)]"
            ]
          },
          "metadata": {
            "tags": []
          },
          "execution_count": 4
        }
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "aBN2DSmobhEl",
        "colab_type": "text"
      },
      "source": [
        "Result is:\n",
        "```\n",
        "[('юлечка', 0.7381488680839539),\n",
        " ('лялечка', 0.7292031645774841),\n",
        " ('алечка', 0.708588182926178),\n",
        " ('кошечка', 0.7078714370727539),\n",
        " ('илюшечка', 0.7053546905517578),\n",
        " ('лешечка', 0.701703667640686),\n",
        " ('лилечка', 0.7000791430473328),\n",
        " ('сашечка', 0.6995923519134521),\n",
        " ('лёнечка', 0.6978040933609009),\n",
        " ('лелечка', 0.6871213316917419)]\n",
        " ```"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "3lF4YBhNVsFX",
        "colab_type": "code",
        "outputId": "8cfa2de3-e05a-40cc-a816-9439aafb0c5b",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 34
        }
      },
      "source": [
        "model.cosine_similarities(model['юлечка'], model['покемошечка'].reshape(1, -1))"
      ],
      "execution_count": 5,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "array([0.74520236], dtype=float32)"
            ]
          },
          "metadata": {
            "tags": []
          },
          "execution_count": 5
        }
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "prY-cPs5batJ",
        "colab_type": "text"
      },
      "source": [
        "Result is:\n",
        "```\n",
        "array([0.74520236], dtype=float32)\n",
        "```"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "hySf8Kv4YxR0",
        "colab_type": "text"
      },
      "source": [
        "What happens: the cosine similarities that `most_similar` reports for an OOV word differ from the similarities computed directly from the word vectors (0.738 vs. 0.745 for 'юлечка' above).\n",
        "\n",
        "Why it happens:\n",
        "* normally, fastText builds the vector of an OOV word by averaging its n-gram vectors\n",
        "* but if we pass `use_norm=True`, gensim averages the *L2-normalized* n-gram vectors instead ([code](https://github.com/RaRe-Technologies/gensim/blob/develop/gensim/models/keyedvectors.py#L2090)). And that is wrong!\n",
        "* when we look up the most similar words, exactly this option, `use_norm=True`, is used ([code](https://github.com/RaRe-Technologies/gensim/blob/develop/gensim/models/keyedvectors.py#L831)), how unfortunate!\n",
        "* why averaging normalized vectors is wrong: it was never done when the model was trained, and it is normally never done when the model is applied, so such vectors are most probably meaningless.\n",
        "* how to do it right: *first* average the n-gram vectors, and *then* normalize the result.\n",
        "\n",
        "Call to action: rewrite the `word_vec` method of `FastTextKeyedVectors` to apply averaging and normalization in the right order."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "8rkv5ZNtWoxe",
        "colab_type": "text"
      },
      "source": [
        "What we see: the word similarities used during the search do not match the cosine similarity computed directly from the word vectors."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "tk0PkBM1XW5F",
        "colab_type": "text"
      },
      "source": [
        "Now, why this happens:\n",
        "* normally, when fastText computes the vector of an OOV word, the n-gram vectors are averaged\n",
        "* but if `use_norm=True` is set, the L2-normalized n-gram vectors are averaged instead, and this is wrong!\n",
        "* `most_similar` uses exactly `use_norm=True`\n",
        "* how to do it right: first average the vectors, then normalize\n",
        "\n",
        "Here is the code: https://github.com/RaRe-Technologies/gensim/blob/develop/gensim/models/keyedvectors.py#L2090\n",
        "\n",
        "Why the current behaviour is wrong: if the n-gram vectors are normalized before averaging, each one is divided by its own norm (and the norms differ!), so their average is something the model never saw during training or (in the normal scenario) even at inference. Most likely, it is something not very meaningful."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "qHo6OMtI-v0u",
        "colab_type": "code",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 51
        },
        "outputId": "8a6d5cb7-eab0-4f7a-9c47-3f1a23068226"
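      },
      "source": [
        "# Repeat the check with another OOV word: both printed similarities should agree, but they do not.\n",
        "word = 'some_oov_word'\n",
        "pairs = model.most_similar(word)\n",
        "top_neighbor, top_simil = pairs[0]\n",
        "print(top_simil)  # similarity reported by most_similar\n",
        "print(model.cosine_similarities(model[word], model[top_neighbor].reshape(1, -1))[0])  # similarity computed directly from the vectors"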
      ],
      "execution_count": 6,
      "outputs": [
        {
          "output_type": "stream",
          "text": [
            "0.7857677936553955\n",
            "0.81707764\n"
          ],
          "name": "stdout"
        }
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "pQSCKJuT-3zf",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        ""
      ],
      "execution_count": 0,
      "outputs": []
    }
  ]
}
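
The ordering problem described in the notebook can be reproduced without gensim or the 2.4G model. The sketch below is not gensim's code: it uses plain NumPy, two made-up 2-d "n-gram" vectors with very different norms, and hypothetical helper names (`unit`, `normalize_then_average`, `average_then_normalize`). It only illustrates that the two orderings give different cosine similarities to the same neighbor.

```python
import numpy as np


def unit(v):
    """Scale a vector to unit L2 norm."""
    return v / np.linalg.norm(v)


def normalize_then_average(ngram_vectors):
    """Order the notebook reports as wrong: normalize each n-gram vector, then average."""
    norms = np.linalg.norm(ngram_vectors, axis=1, keepdims=True)
    return (ngram_vectors / norms).mean(axis=0)


def average_then_normalize(ngram_vectors):
    """Order the notebook recommends: average the raw n-gram vectors, then normalize once."""
    return unit(ngram_vectors.mean(axis=0))


# Made-up n-gram vectors: the first one is ten times longer than the second.
ngrams = np.array([[10.0, 0.0],
                   [0.0, 1.0]])
neighbor = np.array([1.0, 0.0])  # made-up vector of a candidate neighbor

buggy = normalize_then_average(ngrams)  # the long n-gram loses its extra weight
fixed = average_then_normalize(ngrams)  # the long n-gram dominates, as in a plain average

cos_buggy = float(unit(buggy) @ unit(neighbor))
cos_fixed = float(fixed @ unit(neighbor))
print(cos_buggy, cos_fixed)  # about 0.707 vs. about 0.995
```

With n-gram vectors of equal norm the two orderings would agree, so the size of the mismatch depends on how much the n-gram norms vary in the real model.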