@avidale
Last active May 25, 2023 21:24
fasttext_similarity_weirdness.ipynb
{
"nbformat": 4,
"nbformat_minor": 0,
"metadata": {
"colab": {
"name": "fasttext_similarity_weirdness.ipynb",
"provenance": [],
"collapsed_sections": [],
"authorship_tag": "ABX9TyN/LQY3jVFwwrNjNQZUeOc0",
"include_colab_link": true
},
"kernelspec": {
"name": "python3",
"display_name": "Python 3"
}
},
"cells": [
{
"cell_type": "markdown",
"metadata": {
"id": "view-in-github",
"colab_type": "text"
},
"source": [
"<a href=\"https://colab.research.google.com/gist/avidale/c6b1d13b32a36f19750cd01148560561/fasttext_similarity_weirdness.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "GOaB-MD8XHmZ",
"colab_type": "text"
},
"source": [
"In this stub, I want to demonstrate some weirdness that happens when we use the gensim fasttext model to search for similar words."
]
},
{
"cell_type": "code",
"metadata": {
"id": "Sb0Xoq8OUrUf",
"colab_type": "code",
"outputId": "b6f70c67-7dd8-4ac0-cb4d-4be85014f4ae",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 204
}
},
"source": [
"!wget http://vectors.nlpl.eu/repository/20/181.zip"
],
"execution_count": 5,
"outputs": [
{
"output_type": "stream",
"text": [
"--2020-03-03 19:54:09-- http://vectors.nlpl.eu/repository/20/181.zip\n",
"Resolving vectors.nlpl.eu (vectors.nlpl.eu)... 129.240.189.225\n",
"Connecting to vectors.nlpl.eu (vectors.nlpl.eu)|129.240.189.225|:80... connected.\n",
"HTTP request sent, awaiting response... 200 OK\n",
"Length: 2622716250 (2.4G) [application/zip]\n",
"Saving to: ‘181.zip’\n",
"\n",
"181.zip 100%[===================>] 2.44G 23.0MB/s in 1m 54s \n",
"\n",
"2020-03-03 19:56:09 (22.0 MB/s) - ‘181.zip’ saved [2622716250/2622716250]\n",
"\n"
],
"name": "stdout"
}
]
},
{
"cell_type": "code",
"metadata": {
"id": "RCnoZQ0AUzpB",
"colab_type": "code",
"outputId": "8dc1ae30-e184-4dac-81eb-5b0efc79050c",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 136
}
},
"source": [
"!unzip 181.zip"
],
"execution_count": 6,
"outputs": [
{
"output_type": "stream",
"text": [
"Archive: 181.zip\n",
" inflating: meta.json \n",
" inflating: model.model \n",
" inflating: model.model.vectors_ngrams.npy \n",
" inflating: model.model.vectors.npy \n",
" inflating: model.model.vectors_vocab.npy \n",
" inflating: README \n"
],
"name": "stdout"
}
]
},
{
"cell_type": "code",
"metadata": {
"id": "eqxHi7oUVBrL",
"colab_type": "code",
"outputId": "49af8c00-85a8-4f8b-e93a-b22e7b5d3187",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 68
}
},
"source": [
"!ls"
],
"execution_count": 7,
"outputs": [
{
"output_type": "stream",
"text": [
"181.zip model.model.vectors_ngrams.npy README\n",
"meta.json model.model.vectors.npy\t sample_data\n",
"model.model model.model.vectors_vocab.npy\n"
],
"name": "stdout"
}
]
},
{
"cell_type": "code",
"metadata": {
"id": "a_lX93QSVyCl",
"colab_type": "code",
"outputId": "28c19cc7-6acb-467a-bed6-d26c6aaa9840",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 511
}
},
"source": [
"!pip install gensim==3.8.1"
],
"execution_count": 8,
"outputs": [
{
"output_type": "stream",
"text": [
"Collecting gensim==3.8.1\n",
"\u001b[?25l Downloading https://files.pythonhosted.org/packages/d1/dd/112bd4258cee11e0baaaba064060eb156475a42362e59e3ff28e7ca2d29d/gensim-3.8.1-cp36-cp36m-manylinux1_x86_64.whl (24.2MB)\n",
"\u001b[K |████████████████████████████████| 24.2MB 1.6MB/s \n",
"\u001b[?25hRequirement already satisfied: scipy>=0.18.1 in /usr/local/lib/python3.6/dist-packages (from gensim==3.8.1) (1.4.1)\n",
"Requirement already satisfied: six>=1.5.0 in /usr/local/lib/python3.6/dist-packages (from gensim==3.8.1) (1.12.0)\n",
"Requirement already satisfied: smart-open>=1.8.1 in /usr/local/lib/python3.6/dist-packages (from gensim==3.8.1) (1.9.0)\n",
"Requirement already satisfied: numpy>=1.11.3 in /usr/local/lib/python3.6/dist-packages (from gensim==3.8.1) (1.17.5)\n",
"Requirement already satisfied: boto3 in /usr/local/lib/python3.6/dist-packages (from smart-open>=1.8.1->gensim==3.8.1) (1.11.15)\n",
"Requirement already satisfied: requests in /usr/local/lib/python3.6/dist-packages (from smart-open>=1.8.1->gensim==3.8.1) (2.21.0)\n",
"Requirement already satisfied: boto>=2.32 in /usr/local/lib/python3.6/dist-packages (from smart-open>=1.8.1->gensim==3.8.1) (2.49.0)\n",
"Requirement already satisfied: botocore<1.15.0,>=1.14.15 in /usr/local/lib/python3.6/dist-packages (from boto3->smart-open>=1.8.1->gensim==3.8.1) (1.14.15)\n",
"Requirement already satisfied: jmespath<1.0.0,>=0.7.1 in /usr/local/lib/python3.6/dist-packages (from boto3->smart-open>=1.8.1->gensim==3.8.1) (0.9.4)\n",
"Requirement already satisfied: s3transfer<0.4.0,>=0.3.0 in /usr/local/lib/python3.6/dist-packages (from boto3->smart-open>=1.8.1->gensim==3.8.1) (0.3.3)\n",
"Requirement already satisfied: urllib3<1.25,>=1.21.1 in /usr/local/lib/python3.6/dist-packages (from requests->smart-open>=1.8.1->gensim==3.8.1) (1.24.3)\n",
"Requirement already satisfied: chardet<3.1.0,>=3.0.2 in /usr/local/lib/python3.6/dist-packages (from requests->smart-open>=1.8.1->gensim==3.8.1) (3.0.4)\n",
"Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.6/dist-packages (from requests->smart-open>=1.8.1->gensim==3.8.1) (2019.11.28)\n",
"Requirement already satisfied: idna<2.9,>=2.5 in /usr/local/lib/python3.6/dist-packages (from requests->smart-open>=1.8.1->gensim==3.8.1) (2.8)\n",
"Requirement already satisfied: python-dateutil<3.0.0,>=2.1 in /usr/local/lib/python3.6/dist-packages (from botocore<1.15.0,>=1.14.15->boto3->smart-open>=1.8.1->gensim==3.8.1) (2.6.1)\n",
"Requirement already satisfied: docutils<0.16,>=0.10 in /usr/local/lib/python3.6/dist-packages (from botocore<1.15.0,>=1.14.15->boto3->smart-open>=1.8.1->gensim==3.8.1) (0.15.2)\n",
"Installing collected packages: gensim\n",
" Found existing installation: gensim 3.6.0\n",
" Uninstalling gensim-3.6.0:\n",
" Successfully uninstalled gensim-3.6.0\n",
"Successfully installed gensim-3.8.1\n"
],
"name": "stdout"
},
{
"output_type": "display_data",
"data": {
"application/vnd.colab-display-data+json": {
"pip_warning": {
"packages": [
"gensim"
]
}
}
},
"metadata": {
"tags": []
}
}
]
},
{
"cell_type": "code",
"metadata": {
"id": "vtSWKrx1VavY",
"colab_type": "code",
"colab": {}
},
"source": [
"import gensim"
],
"execution_count": 0,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"id": "v_UnFRKZU56Q",
"colab_type": "code",
"colab": {}
},
"source": [
"model = gensim.models.fasttext.FastTextKeyedVectors.load('model.model')"
],
"execution_count": 0,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"id": "2X3prObOVd-Y",
"colab_type": "code",
"outputId": "1262c8b9-d409-4e6d-ec07-147489be475f",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 34
}
},
"source": [
"word = 'покемошечка'\n",
"word in model.vocab # we are deliberately taking an OOV word to demonstrate that similarity is incorrect with ngrams"
],
"execution_count": 3,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"False"
]
},
"metadata": {
"tags": []
},
"execution_count": 3
}
]
},
{
"cell_type": "code",
"metadata": {
"id": "wfZ8mNRPVoop",
"colab_type": "code",
"outputId": "0d01c04f-a34f-4226-b2fc-670cccc2feb7",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 187
}
},
"source": [
"model.most_similar(word)"
],
"execution_count": 4,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"[('юлечка', 0.7381488680839539),\n",
" ('лялечка', 0.7292031645774841),\n",
" ('алечка', 0.708588182926178),\n",
" ('кошечка', 0.7078714370727539),\n",
" ('илюшечка', 0.7053546905517578),\n",
" ('лешечка', 0.701703667640686),\n",
" ('лилечка', 0.7000791430473328),\n",
" ('сашечка', 0.6995923519134521),\n",
" ('лёнечка', 0.6978040933609009),\n",
" ('лелечка', 0.6871213316917419)]"
]
},
"metadata": {
"tags": []
},
"execution_count": 4
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "aBN2DSmobhEl",
"colab_type": "text"
},
"source": [
"Result is:\n",
"```\n",
"[('юлечка', 0.7381488680839539),\n",
" ('лялечка', 0.7292031645774841),\n",
" ('алечка', 0.708588182926178),\n",
" ('кошечка', 0.7078714370727539),\n",
" ('илюшечка', 0.7053546905517578),\n",
" ('лешечка', 0.701703667640686),\n",
" ('лилечка', 0.7000791430473328),\n",
" ('сашечка', 0.6995923519134521),\n",
" ('лёнечка', 0.6978040933609009),\n",
" ('лелечка', 0.6871213316917419)]\n",
" ```"
]
},
{
"cell_type": "code",
"metadata": {
"id": "3lF4YBhNVsFX",
"colab_type": "code",
"outputId": "8cfa2de3-e05a-40cc-a816-9439aafb0c5b",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 34
}
},
"source": [
"model.cosine_similarities(model['юлечка'], model['покемошечка'].reshape(1, -1))"
],
"execution_count": 5,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"array([0.74520236], dtype=float32)"
]
},
"metadata": {
"tags": []
},
"execution_count": 5
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "prY-cPs5batJ",
"colab_type": "text"
},
"source": [
"Result is:\n",
"```\n",
"array([0.74520236], dtype=float32)\n",
"```"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "hySf8Kv4YxR0",
"colab_type": "text"
},
"source": [
"What happens: the cosine similarities used for neighbor retrieval differ from the similarities computed directly from the word vectors. \n",
"\n",
"Why it happens:\n",
"* normally, when computing the vector for an OOV word, fasttext averages the word's n-gram vectors\n",
"* but if we pass `use_norm=True`, the *L2-normalized* n-gram vectors are averaged instead ([code](https://github.com/RaRe-Technologies/gensim/blob/develop/gensim/models/keyedvectors.py#L2090)). And this is wrong!\n",
"* when we look up the most similar words, exactly this option, `use_norm=True`, is used ([code](https://github.com/RaRe-Technologies/gensim/blob/develop/gensim/models/keyedvectors.py#L831)), how unfortunate!\n",
"* why averaging normalized vectors is wrong: it was never done when the model was trained and is normally never done when the model is applied, so such vectors are most probably meaningless.\n",
"* how to do it right: *first* average the n-gram vectors, and *then* normalize the result. \n",
"\n",
"Call to action: rewrite the `word_vec` method of `FastTextKeyedVectors` to apply averaging and normalization in the right order. "
]
},
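{
"cell_type": "markdown",
"metadata": {
"colab_type": "text"
},
"source": [
"The order of the two operations matters whenever the n-gram vectors have different norms. Below is a minimal sketch with made-up two-dimensional vectors (they are not taken from the model); the helper `unit` is hypothetical and exists only for this illustration:\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"def unit(v):\n",
"    # Scale a vector to unit L2 norm.\n",
"    return v / np.linalg.norm(v)\n",
"\n",
"# Two toy n-gram vectors with very different norms (made-up numbers).\n",
"ngrams = np.array([[10.0, 0.0],\n",
"                   [0.0, 1.0]])\n",
"\n",
"# Average first, then normalize: the long vector dominates the direction.\n",
"avg_then_norm = unit(ngrams.mean(axis=0))\n",
"\n",
"# Normalize each n-gram first, then average: every n-gram gets equal weight.\n",
"norm_then_avg = unit(np.array([unit(v) for v in ngrams]).mean(axis=0))\n",
"\n",
"print(avg_then_norm)\n",
"print(norm_then_avg)\n",
"# Cosine similarity between the two results (both are unit vectors).\n",
"print(float(avg_then_norm @ norm_then_avg))\n",
"```\n",
"\n",
"With these toy numbers the two results point in clearly different directions, so the cosine printed on the last line is well below 1."
]
},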
{
"cell_type": "markdown",
"metadata": {
"id": "8rkv5ZNtWoxe",
"colab_type": "text"
},
"source": [
"What we see: the word similarities used during the search do not match the cosine similarity computed directly from the word vectors. "
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "tk0PkBM1XW5F",
"colab_type": "text"
},
"source": [
"Now, why this happens:\n",
"* normally, when computing the vector of an OOV word, fasttext averages the n-gram vectors\n",
"* but if `use_norm=True` is specified, the L2-normalized n-gram vectors are averaged instead, and this is wrong!\n",
"* `most_similar` uses exactly this option, `use_norm=True`\n",
"* how to do it right: first average the vectors, then normalize\n",
"\n",
"Here is the code: https://github.com/RaRe-Technologies/gensim/blob/develop/gensim/models/keyedvectors.py#L2090\n",
"\n",
"Why the current behavior is wrong: if the n-gram vectors are normalized before averaging, each one is divided by its own norm (and the norms differ!), so their average is something the model has seen neither during training nor (in the normal scenario) at inference time. Most likely it is also not very meaningful. "
]
},
{
"cell_type": "code",
"metadata": {
"id": "qHo6OMtI-v0u",
"colab_type": "code",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 51
},
"outputId": "8a6d5cb7-eab0-4f7a-9c47-3f1a23068226"
},
"source": [
"word = 'some_oov_word'\n",
"pairs = model.most_similar(word)\n",
"top_neighbor, top_simil = pairs[0]\n",
"print(top_simil)\n",
"print(model.cosine_similarities(model[word], model[top_neighbor].reshape(1, -1))[0])"
],
"execution_count": 6,
"outputs": [
{
"output_type": "stream",
"text": [
"0.7857677936553955\n",
"0.81707764\n"
],
"name": "stdout"
}
]
},
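{
"cell_type": "markdown",
"metadata": {
"colab_type": "text"
},
"source": [
"Until `word_vec` is fixed, one possible workaround is to build the query vector yourself and search by vector. This is only a sketch: the function name `most_similar_avg_first` is hypothetical, and it assumes the gensim 3.8.x behavior in which `kv[word]` averages the raw n-gram vectors and `similar_by_vector` uses the passed vector as given, normalizing it only once before ranking:\n",
"\n",
"```python\n",
"def most_similar_avg_first(kv, word, topn=10):\n",
"    # kv[word] looks the word up without use_norm, so for an OOV word\n",
"    # the raw n-gram vectors are averaged (assumed gensim 3.8.x behavior).\n",
"    vector = kv[word]\n",
"    # Search by the raw vector instead of by the word itself.\n",
"    return kv.similar_by_vector(vector, topn=topn)\n",
"```\n",
"\n",
"It could be called as `most_similar_avg_first(model, 'покемошечка')`. Note that when searching by vector, an in-vocabulary query word is not excluded from the results, which does not matter for OOV words."
]
},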
{
"cell_type": "code",
"metadata": {
"id": "pQSCKJuT-3zf",
"colab_type": "code",
"colab": {}
},
"source": [
""
],
"execution_count": 0,
"outputs": []
}
]
}