asehmi/topic-models-v4.ipynb

## topic-models-v4.ipynb
{
  "nbformat": 4,
  "nbformat_minor": 0,
  "metadata": {
    "accelerator": "TPU",
    "colab": {
      "name": "Topic Models - v4.ipynb",
      "provenance": [],
      "collapsed_sections": [
        "kXV0n8HsBbSj",
        "qsA7QfydBCBf",
        "vfKOji3nQXGd",
        "foq0OUD4Vf_p"
      ],
      "toc_visible": true,
      "machine_shape": "hm",
      "include_colab_link": true
    },
    "kernelspec": {
      "display_name": "Python 3",
      "name": "python3"
    },
    "language_info": {
      "name": "python"
    },
    "widgets": {
      "application/vnd.jupyter.widget-state+json": {
        "87abe54fb5fd4b159ce9b60e1713db71": {
          "model_module": "@jupyter-widgets/controls",
          "model_name": "HBoxModel",
          "state": {
            "_view_name": "HBoxView",
            "_dom_classes": [],
            "_model_name": "HBoxModel",
            "_view_module": "@jupyter-widgets/controls",
            "_model_module_version": "1.5.0",
            "_view_count": null,
            "_view_module_version": "1.5.0",
            "box_style": "",
            "layout": "IPY_MODEL_337b0665c9f844be89cfac7fd6037392",
            "_model_module": "@jupyter-widgets/controls",
            "children": [
              "IPY_MODEL_2f71d5a82fb54fd5ab44f990b58d1594",
              "IPY_MODEL_0fd1aab5dde0417381d2f3d8a299abe4"
            ]
          },
          "model_module_version": "1.5.0"
        },
        "337b0665c9f844be89cfac7fd6037392": {
          "model_module": "@jupyter-widgets/base",
          "model_name": "LayoutModel",
          "state": {
            "_view_name": "LayoutView",
            "grid_template_rows": null,
            "right": null,
            "justify_content": null,
            "_view_module": "@jupyter-widgets/base",
            "overflow": null,
            "_model_module_version": "1.2.0",
            "_view_count": null,
            "flex_flow": null,
            "width": null,
            "min_width": null,
            "border": null,
            "align_items": null,
            "bottom": null,
            "_model_module": "@jupyter-widgets/base",
            "top": null,
            "grid_column": null,
            "overflow_y": null,
            "overflow_x": null,
            "grid_auto_flow": null,
            "grid_area": null,
            "grid_template_columns": null,
            "flex": null,
            "_model_name": "LayoutModel",
            "justify_items": null,
            "grid_row": null,
            "max_height": null,
            "align_content": null,
            "visibility": null,
            "align_self": null,
            "height": null,
            "min_height": null,
            "padding": null,
            "grid_auto_rows": null,
            "grid_gap": null,
            "max_width": null,
            "order": null,
            "_view_module_version": "1.2.0",
            "grid_template_areas": null,
            "object_position": null,
            "object_fit": null,
            "grid_auto_columns": null,
            "margin": null,
            "display": null,
            "left": null
          },
          "model_module_version": "1.2.0"
        },
        "2f71d5a82fb54fd5ab44f990b58d1594": {
          "model_module": "@jupyter-widgets/controls",
          "model_name": "FloatProgressModel",
          "state": {
            "_view_name": "ProgressView",
            "style": "IPY_MODEL_a6f01478747c448ea50bef18d5e10376",
            "_dom_classes": [],
            "description": "overall hyperparam opt progress:  42%",
            "_model_name": "FloatProgressModel",
            "bar_style": "",
            "max": 200,
            "_view_module": "@jupyter-widgets/controls",
            "_model_module_version": "1.5.0",
            "value": 83,
            "_view_count": null,
            "_view_module_version": "1.5.0",
            "orientation": "horizontal",
            "min": 0,
            "description_tooltip": null,
            "_model_module": "@jupyter-widgets/controls",
            "layout": "IPY_MODEL_3bc5275189a448ab96b3c6b89a73f650"
          },
          "model_module_version": "1.5.0"
        },
        "0fd1aab5dde0417381d2f3d8a299abe4": {
          "model_module": "@jupyter-widgets/controls",
          "model_name": "HTMLModel",
          "state": {
            "_view_name": "HTMLView",
            "style": "IPY_MODEL_018ea19333c34a858be68be671450c15",
            "_dom_classes": [],
            "description": "",
            "_model_name": "HTMLModel",
            "placeholder": "",
            "_view_module": "@jupyter-widgets/controls",
            "_model_module_version": "1.5.0",
            "value": " 83/200 [1:39:21&lt;2:28:23, 76.10s/it]",
            "_view_count": null,
            "_view_module_version": "1.5.0",
            "description_tooltip": null,
            "_model_module": "@jupyter-widgets/controls",
            "layout": "IPY_MODEL_ed125ff112dd4621a3f01c021758fc2a"
          },
          "model_module_version": "1.5.0"
        },
        "a6f01478747c448ea50bef18d5e10376": {
          "model_module": "@jupyter-widgets/controls",
          "model_name": "ProgressStyleModel",
          "state": {
            "_view_name": "StyleView",
            "_model_name": "ProgressStyleModel",
            "description_width": "initial",
            "_view_module": "@jupyter-widgets/base",
            "_model_module_version": "1.5.0",
            "_view_count": null,
            "_view_module_version": "1.2.0",
            "bar_color": null,
            "_model_module": "@jupyter-widgets/controls"
          },
          "model_module_version": "1.5.0"
        },
        "3bc5275189a448ab96b3c6b89a73f650": {
          "model_module": "@jupyter-widgets/base",
          "model_name": "LayoutModel",
          "state": {
            "_view_name": "LayoutView",
            "grid_template_rows": null,
            "right": null,
            "justify_content": null,
            "_view_module": "@jupyter-widgets/base",
            "overflow": null,
            "_model_module_version": "1.2.0",
            "_view_count": null,
            "flex_flow": null,
            "width": null,
            "min_width": null,
            "border": null,
            "align_items": null,
            "bottom": null,
            "_model_module": "@jupyter-widgets/base",
            "top": null,
            "grid_column": null,
            "overflow_y": null,
            "overflow_x": null,
            "grid_auto_flow": null,
            "grid_area": null,
            "grid_template_columns": null,
            "flex": null,
            "_model_name": "LayoutModel",
            "justify_items": null,
            "grid_row": null,
            "max_height": null,
            "align_content": null,
            "visibility": null,
            "align_self": null,
            "height": null,
            "min_height": null,
            "padding": null,
            "grid_auto_rows": null,
            "grid_gap": null,
            "max_width": null,
            "order": null,
            "_view_module_version": "1.2.0",
            "grid_template_areas": null,
            "object_position": null,
            "object_fit": null,
            "grid_auto_columns": null,
            "margin": null,
            "display": null,
            "left": null
          },
          "model_module_version": "1.2.0"
        },
        "018ea19333c34a858be68be671450c15": {
          "model_module": "@jupyter-widgets/controls",
          "model_name": "DescriptionStyleModel",
          "state": {
            "_view_name": "StyleView",
            "_model_name": "DescriptionStyleModel",
            "description_width": "",
            "_view_module": "@jupyter-widgets/base",
            "_model_module_version": "1.5.0",
            "_view_count": null,
            "_view_module_version": "1.2.0",
            "_model_module": "@jupyter-widgets/controls"
          },
          "model_module_version": "1.5.0"
        },
        "ed125ff112dd4621a3f01c021758fc2a": {
          "model_module": "@jupyter-widgets/base",
          "model_name": "LayoutModel",
          "state": {
            "_view_name": "LayoutView",
            "grid_template_rows": null,
            "right": null,
            "justify_content": null,
            "_view_module": "@jupyter-widgets/base",
            "overflow": null,
            "_model_module_version": "1.2.0",
            "_view_count": null,
            "flex_flow": null,
            "width": null,
            "min_width": null,
            "border": null,
            "align_items": null,
            "bottom": null,
            "_model_module": "@jupyter-widgets/base",
            "top": null,
            "grid_column": null,
            "overflow_y": null,
            "overflow_x": null,
            "grid_auto_flow": null,
            "grid_area": null,
            "grid_template_columns": null,
            "flex": null,
            "_model_name": "LayoutModel",
            "justify_items": null,
            "grid_row": null,
            "max_height": null,
            "align_content": null,
            "visibility": null,
            "align_self": null,
            "height": null,
            "min_height": null,
            "padding": null,
            "grid_auto_rows": null,
            "grid_gap": null,
            "max_width": null,
            "order": null,
            "_view_module_version": "1.2.0",
            "grid_template_areas": null,
            "object_position": null,
            "object_fit": null,
            "grid_auto_columns": null,
            "margin": null,
            "display": null,
            "left": null
          },
          "model_module_version": "1.2.0"
        },
        "da4a0d7763fd4bfbb645f9869e5eb4bd": {
          "model_module": "@jupyter-widgets/controls",
          "model_name": "HBoxModel",
          "state": {
            "_view_name": "HBoxView",
            "_dom_classes": [],
            "_model_name": "HBoxModel",
            "_view_module": "@jupyter-widgets/controls",
            "_model_module_version": "1.5.0",
            "_view_count": null,
            "_view_module_version": "1.5.0",
            "box_style": "",
            "layout": "IPY_MODEL_b1d1848653dd4507ba0f8dd043fe56d3",
            "_model_module": "@jupyter-widgets/controls",
            "children": [
              "IPY_MODEL_38d72a88a18b4f5e874b38762b0ec5c8",
              "IPY_MODEL_36d37021dc624db1971f0ec8d84d82d6"
            ]
          },
          "model_module_version": "1.5.0"
        },
        "b1d1848653dd4507ba0f8dd043fe56d3": {
          "model_module": "@jupyter-widgets/base",
          "model_name": "LayoutModel",
          "state": {
            "_view_name": "LayoutView",
            "grid_template_rows": null,
            "right": null,
            "justify_content": null,
            "_view_module": "@jupyter-widgets/base",
            "overflow": null,
            "_model_module_version": "1.2.0",
            "_view_count": null,
            "flex_flow": null,
            "width": null,
            "min_width": null,
            "border": null,
            "align_items": null,
            "bottom": null,
            "_model_module": "@jupyter-widgets/base",
            "top": null,
            "grid_column": null,
            "overflow_y": null,
            "overflow_x": null,
            "grid_auto_flow": null,
            "grid_area": null,
            "grid_template_columns": null,
            "flex": null,
            "_model_name": "LayoutModel",
            "justify_items": null,
            "grid_row": null,
            "max_height": null,
            "align_content": null,
            "visibility": null,
            "align_self": null,
            "height": null,
            "min_height": null,
            "padding": null,
            "grid_auto_rows": null,
            "grid_gap": null,
            "max_width": null,
            "order": null,
            "_view_module_version": "1.2.0",
            "grid_template_areas": null,
            "object_position": null,
            "object_fit": null,
            "grid_auto_columns": null,
            "margin": null,
            "display": null,
            "left": null
          },
          "model_module_version": "1.2.0"
        },
        "38d72a88a18b4f5e874b38762b0ec5c8": {
          "model_module": "@jupyter-widgets/controls",
          "model_name": "FloatProgressModel",
          "state": {
            "_view_name": "ProgressView",
            "style": "IPY_MODEL_0b6923cdf6664b4da6cd30f791954a75",
            "_dom_classes": [],
            "description": " 41%",
            "_model_name": "FloatProgressModel",
            "bar_style": "",
            "max": 200,
            "_view_module": "@jupyter-widgets/controls",
            "_model_module_version": "1.5.0",
            "value": 82,
            "_view_count": null,
            "_view_module_version": "1.5.0",
            "orientation": "horizontal",
            "min": 0,
            "description_tooltip": null,
            "_model_module": "@jupyter-widgets/controls",
            "layout": "IPY_MODEL_aba682afae59467b9ceed3b4d4850049"
          },
          "model_module_version": "1.5.0"
        },
        "36d37021dc624db1971f0ec8d84d82d6": {
          "model_module": "@jupyter-widgets/controls",
          "model_name": "HTMLModel",
          "state": {
            "_view_name": "HTMLView",
            "style": "IPY_MODEL_94abba656356419ba228c7f0ce2cd8f8",
            "_dom_classes": [],
            "description": "",
            "_model_name": "HTMLModel",
            "placeholder": "",
            "_view_module": "@jupyter-widgets/controls",
            "_model_module_version": "1.5.0",
            "value": " 82/200 [1:39:21&lt;2:29:39, 76.10s/it, 5961.59/18000 seconds]",
            "_view_count": null,
            "_view_module_version": "1.5.0",
            "description_tooltip": null,
            "_model_module": "@jupyter-widgets/controls",
            "layout": "IPY_MODEL_9804e36448a742bd8be143f664d6943a"
          },
          "model_module_version": "1.5.0"
        },
        "0b6923cdf6664b4da6cd30f791954a75": {
          "model_module": "@jupyter-widgets/controls",
          "model_name": "ProgressStyleModel",
          "state": {
            "_view_name": "StyleView",
            "_model_name": "ProgressStyleModel",
            "description_width": "initial",
            "_view_module": "@jupyter-widgets/base",
            "_model_module_version": "1.5.0",
            "_view_count": null,
            "_view_module_version": "1.2.0",
            "bar_color": null,
            "_model_module": "@jupyter-widgets/controls"
          },
          "model_module_version": "1.5.0"
        },
        "aba682afae59467b9ceed3b4d4850049": {
          "model_module": "@jupyter-widgets/base",
          "model_name": "LayoutModel",
          "state": {
            "_view_name": "LayoutView",
            "grid_template_rows": null,
            "right": null,
            "justify_content": null,
            "_view_module": "@jupyter-widgets/base",
            "overflow": null,
            "_model_module_version": "1.2.0",
            "_view_count": null,
            "flex_flow": null,
            "width": null,
            "min_width": null,
            "border": null,
            "align_items": null,
            "bottom": null,
            "_model_module": "@jupyter-widgets/base",
            "top": null,
            "grid_column": null,
            "overflow_y": null,
            "overflow_x": null,
            "grid_auto_flow": null,
            "grid_area": null,
            "grid_template_columns": null,
            "flex": null,
            "_model_name": "LayoutModel",
            "justify_items": null,
            "grid_row": null,
            "max_height": null,
            "align_content": null,
            "visibility": null,
            "align_self": null,
            "height": null,
            "min_height": null,
            "padding": null,
            "grid_auto_rows": null,
            "grid_gap": null,
            "max_width": null,
            "order": null,
            "_view_module_version": "1.2.0",
            "grid_template_areas": null,
            "object_position": null,
            "object_fit": null,
            "grid_auto_columns": null,
            "margin": null,
            "display": null,
            "left": null
          },
          "model_module_version": "1.2.0"
        },
        "94abba656356419ba228c7f0ce2cd8f8": {
          "model_module": "@jupyter-widgets/controls",
          "model_name": "DescriptionStyleModel",
          "state": {
            "_view_name": "StyleView",
            "_model_name": "DescriptionStyleModel",
            "description_width": "",
            "_view_module": "@jupyter-widgets/base",
            "_model_module_version": "1.5.0",
            "_view_count": null,
            "_view_module_version": "1.2.0",
            "_model_module": "@jupyter-widgets/controls"
          },
          "model_module_version": "1.5.0"
        },
        "9804e36448a742bd8be143f664d6943a": {
          "model_module": "@jupyter-widgets/base",
          "model_name": "LayoutModel",
          "state": {
            "_view_name": "LayoutView",
            "grid_template_rows": null,
            "right": null,
            "justify_content": null,
            "_view_module": "@jupyter-widgets/base",
            "overflow": null,
            "_model_module_version": "1.2.0",
            "_view_count": null,
            "flex_flow": null,
            "width": null,
            "min_width": null,
            "border": null,
            "align_items": null,
            "bottom": null,
            "_model_module": "@jupyter-widgets/base",
            "top": null,
            "grid_column": null,
            "overflow_y": null,
            "overflow_x": null,
            "grid_auto_flow": null,
            "grid_area": null,
            "grid_template_columns": null,
            "flex": null,
            "_model_name": "LayoutModel",
            "justify_items": null,
            "grid_row": null,
            "max_height": null,
            "align_content": null,
            "visibility": null,
            "align_self": null,
            "height": null,
            "min_height": null,
            "padding": null,
            "grid_auto_rows": null,
            "grid_gap": null,
            "max_width": null,
            "order": null,
            "_view_module_version": "1.2.0",
            "grid_template_areas": null,
            "object_position": null,
            "object_fit": null,
            "grid_auto_columns": null,
            "margin": null,
            "display": null,
            "left": null
          },
          "model_module_version": "1.2.0"
        }
      }
    }
  },
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "view-in-github",
        "colab_type": "text"
      },
      "source": [
        "<a href=\"https://colab.research.google.com/gist/pszemraj/d3f39284b53ce8b322bb9aa0e4bbd2f8/topic-models-v4.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "mkR0sDci3hpT"
      },
      "source": [
        "# Topic Models - Optimize and Plot\n",
        "\n",
        "Now evaluate the best topic model. Source code (that I edited) also taken from [medium](https://towardsdatascience.com/evaluate-topic-model-in-python-latent-dirichlet-allocation-lda-7d57484bb5d0). "
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "oX1xqld-cCLn"
      },
      "source": [
        "\n",
        "\n",
        "#### What is topic coherence?\n",
        "Topic Coherence measures score a single topic by measuring the degree of semantic similarity between high scoring words in the topic. These measurements help distinguish between topics that are semantically interpretable topics and topics that are artifacts of statistical inference.\n",
        "\n",
        "\n",
        "**Coherence:**\n",
        "A set of statements or facts is said to be coherent, if they support each other. Thus, a coherent fact set can be interpreted in a context that covers all or most of the facts. An example of a coherent fact set is \"the game is a team sport\", \"the game is played with a ball\", \"the game demands great physical efforts\"\n",
        "\n",
        "\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "hFVpSiLsyTx0"
      },
      "source": [
        "** **\n",
        "# Loading Data\n",
        "** **"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "kXV0n8HsBbSj"
      },
      "source": [
        "### Setup display"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "oellEGZOBdEX"
      },
      "source": [
        "from IPython.display import HTML, display\n",
        "\n",
        "def set_css():\n",
        "  display(HTML('''\n",
        "  <style>\n",
        "    pre {\n",
        "        white-space: pre-wrap;\n",
        "    }\n",
        "  </style>\n",
        "  '''))\n",
        "get_ipython().events.register('pre_run_cell', set_css)"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "qsA7QfydBCBf"
      },
      "source": [
        "### mount google drive"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "Ht-RfLy3A6r8",
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "outputId": "9a380d91-f2d1-4590-dfa7-5e8e6df666c7"
      },
      "source": [
        "# create interface to upload / interact with google drive and video files\n",
        "from google.colab import files\n",
        "from google.colab import drive\n",
        "drive.mount('/content/drive')"
      ],
      "execution_count": null,
      "outputs": [
        {
          "output_type": "display_data",
          "data": {
            "text/html": [
              "\n",
              "  <style>\n",
              "    pre {\n",
              "        white-space: pre-wrap;\n",
              "    }\n",
              "  </style>\n",
              "  "
            ],
            "text/plain": [
              "<IPython.core.display.HTML object>"
            ]
          },
          "metadata": {
            "tags": []
          }
        },
        {
          "output_type": "stream",
          "text": [
            "Mounted at /content/drive\n"
          ],
          "name": "stdout"
        }
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "QMSrIAI6BD7g"
      },
      "source": [
        "### install packages\n"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "QBThSIrycCLt",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 17
        },
        "outputId": "c79e3dbe-5fa5-43b9-a3e5-f91cb4f329be"
      },
      "source": [
        "%%capture\n",
        "# Importing base modules\n",
        "!pip install -U spacy\n",
        "!pip install -U natsort\n",
        "!pip install -U texthero\n",
        "!pip install -U wordninja\n",
        "\n",
        "import os\n",
        "import pprint as pp\n",
        "from datetime import datetime\n",
        "from os import listdir\n",
        "from os.path import isfile, join\n",
        "import time\n",
        "from datetime import datetime\n",
        "import pandas as pd\n",
        "import spacy\n",
        "import re\n",
        "import texthero as hero\n",
        "from texthero import preprocessing\n",
        "import wordninja\n",
        "\n",
        "from natsort import natsorted\n",
        "from natsort import *\n",
        "from google.colab import data_table\n",
        "\n",
        "\n",
        "run_date = datetime.now()\n",
        "day_suffix = run_date.strftime(\"_%d%m%Y_\")\n"
      ],
      "execution_count": null,
      "outputs": [
        {
          "output_type": "display_data",
          "data": {
            "text/html": [
              "\n",
              "  <style>\n",
              "    pre {\n",
              "        white-space: pre-wrap;\n",
              "    }\n",
              "  </style>\n",
              "  "
            ],
            "text/plain": [
              "<IPython.core.display.HTML object>"
            ]
          },
          "metadata": {
            "tags": []
          }
        }
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "HRUIqJgU7N56"
      },
      "source": [
        "### define filename + other key params\n",
        "\n",
        "read in data as well"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "WIC8T0-j8j7y",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 17
        },
        "outputId": "6db366f1-bafd-451a-db7c-dd71c75d231e"
      },
      "source": [
        "# KEY USER ENTERED PARAMETERS\n",
        "dataset_name = \"NLP - completed course topics\" + day_suffix\n",
        "desired_trials = 200 # th\n",
        "file_directory = \"/content/drive/My Drive/Programming/topic_models\"\n",
        "subfolder = \"nlp_2021\"\n",
        "# filename = 'nlp Raw Text database (before NLP and Cleaning).ftr'\n",
        "filename = 'nlp multi-directory compilation.ftr'\n",
        "# Read data into papers (need to rename later)\n",
        "out_directory = join(file_directory, subfolder)\n",
        "papers = pd.read_feather(os.path.join(file_directory, subfolder, filename))\n",
        "# papers_sorted = papers.sort_values(key=natsort_keygen(), by=\"doc name\")"
      ],
      "execution_count": null,
      "outputs": [
        {
          "output_type": "display_data",
          "data": {
            "text/html": [
              "\n",
              "  <style>\n",
              "    pre {\n",
              "        white-space: pre-wrap;\n",
              "    }\n",
              "  </style>\n",
              "  "
            ],
            "text/plain": [
              "<IPython.core.display.HTML object>"
            ]
          },
          "metadata": {
            "tags": []
          }
        }
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "GtYq2_wmcCLv"
      },
      "source": [
        "** **\n",
        "## Data Cleaning\n",
        "** **\n",
        "\n",
        "drop unneeded columns, get rid of punctuation, etc"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "FprB-OzLcCLv",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 423
        },
        "outputId": "11365619-89b8-4fed-e80c-689a836ef18b"
      },
      "source": [
        "%load_ext google.colab.data_table\n",
        "\n",
        "# get basic description of dataset\n",
        "papers = papers.convert_dtypes() # convert data types after loading\n",
        "pp.pprint(papers.info())\n"
      ],
      "execution_count": null,
      "outputs": [
        {
          "output_type": "display_data",
          "data": {
            "text/html": [
              "\n",
              "  <style>\n",
              "    pre {\n",
              "        white-space: pre-wrap;\n",
              "    }\n",
              "  </style>\n",
              "  "
            ],
            "text/plain": [
              "<IPython.core.display.HTML object>"
            ]
          },
          "metadata": {
            "tags": []
          }
        },
        {
          "output_type": "stream",
          "text": [
            "<class 'pandas.core.frame.DataFrame'>\n",
            "RangeIndex: 263 entries, 0 to 262\n",
            "Data columns (total 15 columns):\n",
            " #   Column              Non-Null Count  Dtype \n",
            "---  ------              --------------  ----- \n",
            " 0   text_in_doc         263 non-null    string\n",
            " 1   doc name            263 non-null    string\n",
            " 2   Directory Name      263 non-null    string\n",
            " 3   Directory Address   263 non-null    string\n",
            " 4   Directory Grouping  263 non-null    string\n",
            " 5   pca                 263 non-null    object\n",
            " 6   cleaner             263 non-null    string\n",
            " 7   cleanest            263 non-null    string\n",
            " 8   pca - sw removed    263 non-null    object\n",
            " 9   tfidf               263 non-null    object\n",
            " 10  kmeans_labels       263 non-null    string\n",
            " 11  TSNE_points         263 non-null    object\n",
            " 12  use                 263 non-null    object\n",
            " 13  TSNE_USE            263 non-null    object\n",
            " 14  kmeans_USE          263 non-null    string\n",
            "dtypes: object(6), string(9)\n",
            "memory usage: 30.9+ KB\n",
            "None\n"
          ],
          "name": "stdout"
        }
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "MJAnzoAt9B3e",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 335
        },
        "outputId": "51335a34-aebd-4ae5-a46e-204556dfc62b"
      },
      "source": [
        "# Remove the columns\n",
        "\n",
        "papers.drop(columns=['pca', 'cleaner', 'tfidf', 'TSNE_points'], \n",
        "            axis=1, inplace=True)\n",
        "# papers.drop(columns='Attachment', inplace=True)\n",
        "# papers['cleanest'] = papers['clean_text'].copy()\n",
        "# papers['doc name'] = papers['Sender Name'].copy()\n",
        "# papers['text_in_doc'] = papers['Text'].copy()\n",
        "# Print out the first rows of papers after making changes\n",
        "papers.info()"
      ],
      "execution_count": null,
      "outputs": [
        {
          "output_type": "display_data",
          "data": {
            "text/html": [
              "\n",
              "  <style>\n",
              "    pre {\n",
              "        white-space: pre-wrap;\n",
              "    }\n",
              "  </style>\n",
              "  "
            ],
            "text/plain": [
              "<IPython.core.display.HTML object>"
            ]
          },
          "metadata": {
            "tags": []
          }
        },
        {
          "output_type": "stream",
          "text": [
            "<class 'pandas.core.frame.DataFrame'>\n",
            "RangeIndex: 263 entries, 0 to 262\n",
            "Data columns (total 11 columns):\n",
            " #   Column              Non-Null Count  Dtype \n",
            "---  ------              --------------  ----- \n",
            " 0   text_in_doc         263 non-null    string\n",
            " 1   doc name            263 non-null    string\n",
            " 2   Directory Name      263 non-null    string\n",
            " 3   Directory Address   263 non-null    string\n",
            " 4   Directory Grouping  263 non-null    string\n",
            " 5   cleanest            263 non-null    string\n",
            " 6   pca - sw removed    263 non-null    object\n",
            " 7   kmeans_labels       263 non-null    string\n",
            " 8   use                 263 non-null    object\n",
            " 9   TSNE_USE            263 non-null    object\n",
            " 10  kmeans_USE          263 non-null    string\n",
            "dtypes: object(3), string(8)\n",
            "memory usage: 22.7+ KB\n"
          ],
          "name": "stdout"
        }
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "wvebDCkKcCLw"
      },
      "source": [
        "### Remove punctuation/lower casing\n",
        "\n",
        "Next, let’s perform a simple preprocessing on the content of paper_text column to make them more amenable for analysis, and reliable results. To do that, we’ll use a regular expression to remove any punctuation, and then lowercase the text"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "YJ45p-DNcCLx",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 211
        },
        "outputId": "4524adc0-2093-41bd-a080-23200af075ed"
      },
      "source": [
        "# Load the regular expression library\n",
        "import re\n",
        "\n",
        "# Remove punctuation\n",
        "papers['paper_text_processed'] = papers['text_in_doc'].map(lambda x: re.sub('[,\\.!?]', '', x))\n",
        "\n",
        "# Convert the titles to lowercase\n",
        "papers['paper_text_processed'] = papers['text_in_doc'].map(lambda x: x.lower())\n",
        "\n",
        "# Print out the first rows of papers\n",
        "papers['paper_text_processed'].head(10)"
      ],
      "execution_count": null,
      "outputs": [
        {
          "output_type": "display_data",
          "data": {
            "text/html": [
              "\n",
              "  <style>\n",
              "    pre {\n",
              "        white-space: pre-wrap;\n",
              "    }\n",
              "  </style>\n",
              "  "
            ],
            "text/plain": [
              "<IPython.core.display.HTML object>"
            ]
          },
          "metadata": {
            "tags": []
          }
        },
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "0    page 1 discrete event systems - verification o...\n",
              "1    page 1 discrete event systems - verification o...\n",
              "2    page 1 discrete event systems - petri nets lot...\n",
              "3    page 1 discrete event systems - petri nets lot...\n",
              "4    page 1 discrete event systems laurent vanbever...\n",
              "5    page 1 automata languages laurent vanbever nsg...\n",
              "6    page 1 automata languages laurent vanbever nsg...\n",
              "7    page 1 automata languages roland schmid nsg. e...\n",
              "8    page 1 discrete event systems discrete event s...\n",
              "9    page 1 automata languages a aq r aa i a a a a ...\n",
              "Name: paper_text_processed, dtype: object"
            ]
          },
          "metadata": {
            "tags": []
          },
          "execution_count": 7
        }
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "OKYzjRi7ZtKT"
      },
      "source": [
        "# Prepare and Optimize Model"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "wm5ZXtb1cCLx"
      },
      "source": [
        "##### Tokenize words \n",
        "\n",
        "Let’s tokenize each sentence into a list of words, removing punctuations and unnecessary characters altogether."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "fMhensIZcCLy",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 70
        },
        "outputId": "aaa5d5b9-7154-42c7-9eb0-415a8680a8d3"
      },
      "source": [
        "import gensim\n",
        "from gensim.utils import simple_preprocess\n",
        "\n",
        "def sent_to_words(sentences):\n",
        "    for sentence in sentences:\n",
        "        yield(gensim.utils.simple_preprocess(str(sentence), deacc=True))  # deacc=True removes punctuations\n",
        "\n",
        "unused_data = papers.paper_text_processed.tolist()\n",
        "unused_data_words = list(sent_to_words(unused_data))\n",
        "data = papers.cleanest.tolist() # use cleanest instead\n",
        "data_words = list(sent_to_words(data))\n",
        "\n",
        "print(data_words[:1][0][:30])"
      ],
      "execution_count": null,
      "outputs": [
        {
          "output_type": "display_data",
          "data": {
            "text/html": [
              "\n",
              "  <style>\n",
              "    pre {\n",
              "        white-space: pre-wrap;\n",
              "    }\n",
              "  </style>\n",
              "  "
            ],
            "text/plain": [
              "<IPython.core.display.HTML object>"
            ]
          },
          "metadata": {
            "tags": []
          }
        },
        {
          "output_type": "stream",
          "text": [
            "['discrete', 'systems', 'verification', 'finite', 'automata', 'lothar', 'thiele', 'techni', 'seo', 'nc', 'une', 'er', 'engineering', 'networks', 'overview', 'binary', 'decision', 'diagrams', 'representation', 'boolean', 'functions', 'comparing', 'circuits', 'representation', 'sets', 'finite', 'automata', 'reachability', 'states', 'comparing']\n"
          ],
          "name": "stdout"
        }
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "kUGEOtmUyT0d",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 547
        },
        "outputId": "1f278a8f-77d7-4b8e-aaaf-b85e9e7c90e1"
      },
      "source": [
        "# create dataframe for word counting\n",
        "\n",
        "DW_df = pd.Series(data_words, name=\"data_words_in_doc\")\n",
        "\n",
        "def join_df_items(textlist):\n",
        "    big_string = \" \".join(textlist)\n",
        "\n",
        "    return big_string\n",
        "\n",
        "papers['data_words'] = DW_df.apply(join_df_items)\n",
        "\n",
        "papers = papers.convert_dtypes()\n",
        "print(\"\\n papers dataframe as follows: \\n\")\n",
        "pp.pprint(papers.info())\n",
        "pp.pprint(papers[\"data_words\"].head())"
      ],
      "execution_count": null,
      "outputs": [
        {
          "output_type": "display_data",
          "data": {
            "text/html": [
              "\n",
              "  <style>\n",
              "    pre {\n",
              "        white-space: pre-wrap;\n",
              "    }\n",
              "  </style>\n",
              "  "
            ],
            "text/plain": [
              "<IPython.core.display.HTML object>"
            ]
          },
          "metadata": {
            "tags": []
          }
        },
        {
          "output_type": "stream",
          "text": [
            "\n",
            " papers dataframe as follows: \n",
            "\n",
            "<class 'pandas.core.frame.DataFrame'>\n",
            "RangeIndex: 263 entries, 0 to 262\n",
            "Data columns (total 13 columns):\n",
            " #   Column                Non-Null Count  Dtype \n",
            "---  ------                --------------  ----- \n",
            " 0   text_in_doc           263 non-null    string\n",
            " 1   doc name              263 non-null    string\n",
            " 2   Directory Name        263 non-null    string\n",
            " 3   Directory Address     263 non-null    string\n",
            " 4   Directory Grouping    263 non-null    string\n",
            " 5   cleanest              263 non-null    string\n",
            " 6   pca - sw removed      263 non-null    object\n",
            " 7   kmeans_labels         263 non-null    string\n",
            " 8   use                   263 non-null    object\n",
            " 9   TSNE_USE              263 non-null    object\n",
            " 10  kmeans_USE            263 non-null    string\n",
            " 11  paper_text_processed  263 non-null    string\n",
            " 12  data_words            263 non-null    string\n",
            "dtypes: object(3), string(10)\n",
            "memory usage: 26.8+ KB\n",
            "None\n",
            "0    discrete systems verification finite automata ...\n",
            "1    discrete systems verification finite automata ...\n",
            "2    discrete systems petri nets lothar thiele eldg...\n",
            "3    discrete systems petri nets lothar thiele eldg...\n",
            "4    discrete systems laurent vanbever nsg ethz ite...\n",
            "Name: data_words, dtype: string\n"
          ],
          "name": "stdout"
        }
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "2UFSxDnXcCLy"
      },
      "source": [
        "** **\n",
        "#### Phrases: Bigram and Trigram\n",
        "** **\n",
        "\n",
        "Bigrams are two words frequently occurring together in the document. Trigrams are 3 words frequently occurring. Some examples in our example are: 'back_bumper', 'oil_leakage', 'maryland_college_park' etc.\n",
        "\n",
        "Gensim's Phrases model can build and implement the bigrams, trigrams, quadgrams and more. The two important arguments to Phrases are min_count and threshold."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "JnHNA45fcCLy",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 17
        },
        "outputId": "6ab9f507-cc93-45fc-b62e-da807bb1492d"
      },
      "source": [
        "# Build the bigram and trigram models\n",
        "bigram = gensim.models.Phrases(data_words, min_count=5, threshold=100) # higher threshold fewer phrases.\n",
        "trigram = gensim.models.Phrases(bigram[data_words], threshold=100)  \n",
        "\n",
        "# Faster way to get a sentence clubbed as a trigram/bigram\n",
        "bigram_mod = gensim.models.phrases.Phraser(bigram)\n",
        "trigram_mod = gensim.models.phrases.Phraser(trigram)"
      ],
      "execution_count": null,
      "outputs": [
        {
          "output_type": "display_data",
          "data": {
            "text/html": [
              "\n",
              "  <style>\n",
              "    pre {\n",
              "        white-space: pre-wrap;\n",
              "    }\n",
              "  </style>\n",
              "  "
            ],
            "text/plain": [
              "<IPython.core.display.HTML object>"
            ]
          },
          "metadata": {
            "tags": []
          }
        }
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "ut0KrHAwyZeH"
      },
      "source": [
        "### Viz Most Frequent Words\n",
        "\n",
        "how to make bar chart:\n",
        "* guide [here](https://plotly.com/python/bar-charts/)\n",
        "* documentation [here](https://plotly.com/python-api-reference/generated/plotly.express.bar)"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "Yxic5Jouyat8",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 648
        },
        "outputId": "e5832c4b-12df-4ed4-c075-8611d965e7d4"
      },
      "source": [
        "# -------------------------------------------------------------------------------------------------------\n",
        "# User enters\n",
        "desired_top_words = 40\n",
        "\n",
        "\n",
        "# -------------------------------------------------------------------------------------------------------\n",
        "# admin stuff \n",
        "print(\"\\n top {0:3d} words (after generic stopword removal) are as follows \\n\".format(desired_top_words))\n",
        "np_top_words = hero.top_words(hero.remove_stopwords(papers[\"data_words\"]))\n",
        "the_top_words = pd.DataFrame(np_top_words)\n",
        "# pp.pprint(the_top_words.iloc[0:desired_top_words])\n",
        "the_top_words[\"Word\"] = the_top_words.index\n",
        "total_word_count = the_top_words[\"data_words\"].sum()\n",
        "total_unique_words = len(the_top_words[\"Word\"])\n",
        "exp_freq = 100 / total_unique_words # if evenly dist\n",
        "\n",
        "print(\"baseline percentage for {0:5d} unique words is: \".format(total_unique_words),\n",
        "      \"{0:6.4f} Percent of corpus\\n\\n\".format(exp_freq))\n",
        "\n",
        "def compute_perc_rep(w_count):\n",
        "    return (w_count / total_word_count) * 100\n",
        "the_top_words[\"perc_corpus\"] = the_top_words[\"data_words\"].apply(compute_perc_rep)\n",
        "the_top_words[\"rep_vs_baseline_ABS\"] = the_top_words[\"perc_corpus\"] - exp_freq # absolute percentage\n",
        "the_top_words[\"rep_vs_baseline\"] = the_top_words[\"rep_vs_baseline_ABS\"] / exp_freq # shown as multiple of what the baseline is\n",
        "\n",
        "# round\n",
        "the_top_words[\"perc_corpus\"] = the_top_words[\"perc_corpus\"].apply(round, ndigits=4)\n",
        "the_top_words[\"rep_vs_baseline_ABS\"] = the_top_words[\"rep_vs_baseline_ABS\"].apply(round, ndigits=3)\n",
        "the_top_words[\"rep_vs_baseline\"] = the_top_words[\"rep_vs_baseline\"].apply(round, ndigits=0)\n",
        "import plotly.express as px\n",
        "\n",
        "labels_for_chart = {\n",
        "    \"rep_vs_baseline\": \"Comparison vs. Avg. Word Freq (Multiplier)\",\n",
        "    \"perc_corpus\": \"Percentage of Total Corpus\",\n",
        "    \"rep_vs_baseline_ABS\": \"Diff vs. Avg. Word Freq %\",\n",
        "    \"data_words\": \"Count of Word\",\n",
        "} # usage - labels=\n",
        "fig_wc = None\n",
        "fig_wc = px.bar(the_top_words.iloc[0:desired_top_words,], x=\"Word\", color=\"rep_vs_baseline\",\n",
        "                y='data_words', title=\"top {0:3d} words in \".format(desired_top_words) + dataset_name,\n",
        "                hover_data = [\"Word\", \"data_words\", \"perc_corpus\", \"rep_vs_baseline_ABS\"],\n",
        "                text=\"rep_vs_baseline\", labels=labels_for_chart, template=\"seaborn\")\n",
        "fig_wc.show()"
      ],
      "execution_count": null,
      "outputs": [
        {
          "output_type": "display_data",
          "data": {
            "text/html": [
              "\n",
              "  <style>\n",
              "    pre {\n",
              "        white-space: pre-wrap;\n",
              "    }\n",
              "  </style>\n",
              "  "
            ],
            "text/plain": [
              "<IPython.core.display.HTML object>"
            ]
          },
          "metadata": {
            "tags": []
          }
        },
        {
          "output_type": "stream",
          "text": [
            "\n",
            " top  40 words (after generic stopword removal) are as follows \n",
            "\n",
            "baseline percentage for 36397 unique words is:  0.0027 Percent of corpus\n",
            "\n",
            "\n"
          ],
          "name": "stdout"
        },
        {
          "output_type": "display_data",
          "data": {
            "text/html": [
              "<html>\n",
              "<head><meta charset=\"utf-8\" /></head>\n",
              "<body>\n",
              "    <div>\n",
              "            <script src=\"https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.5/MathJax.js?config=TeX-AMS-MML_SVG\"></script><script type=\"text/javascript\">if (window.MathJax) {MathJax.Hub.Config({SVG: {font: \"STIX-Web\"}});}</script>\n",
              "                <script type=\"text/javascript\">window.PlotlyConfig = {MathJaxConfig: 'local'};</script>\n",
              "        <script src=\"https://cdn.plot.ly/plotly-latest.min.js\"></script>    \n",
              "            <div id=\"5c79d034-2d82-412e-99a6-a2d6873f69f2\" class=\"plotly-graph-div\" style=\"height:525px; width:100%;\"></div>\n",
              "            <script type=\"text/javascript\">\n",
              "                \n",
              "                    window.PLOTLYENV=window.PLOTLYENV || {};\n",
              "                    \n",
              "                if (document.getElementById(\"5c79d034-2d82-412e-99a6-a2d6873f69f2\")) {\n",
              "                    Plotly.newPlot(\n",
              "                        '5c79d034-2d82-412e-99a6-a2d6873f69f2',\n",
              "                        [{\"alignmentgroup\": \"True\", \"customdata\": [[\"set\", 4809, 0.5672, 0.564], [\"function\", 4716, 0.5562, 0.553], [\"distribution\", 4573, 0.5394, 0.537], [\"learning\", 4118, 0.4857, 0.483], [\"algorithm\", 3981, 0.4696, 0.467], [\"network\", 3852, 0.4543, 0.452], [\"models\", 3414, 0.4027, 0.4], [\"variables\", 3385, 0.3993, 0.397], [\"language\", 2995, 0.3533, 0.351], [\"pdf\", 2973, 0.3507, 0.348], [\"variable\", 2661, 0.3139, 0.311], [\"networks\", 2575, 0.3037, 0.301], [\"word\", 2518, 0.297, 0.294], [\"graph\", 2448, 0.2887, 0.286], [\"structure\", 2349, 0.2771, 0.274], [\"state\", 2285, 0.2695, 0.267], [\"tree\", 2248, 0.2651, 0.262], [\"np\", 2239, 0.2641, 0.261], [\"training\", 2191, 0.2584, 0.256], [\"parameters\", 2110, 0.2489, 0.246], [\"input\", 2086, 0.246, 0.243], [\"linear\", 2056, 0.2425, 0.24], [\"machine\", 2009, 0.237, 0.234], [\"words\", 1985, 0.2341, 0.231], [\"form\", 1961, 0.2313, 0.229], [\"score\", 1835, 0.2164, 0.214], [\"information\", 1760, 0.2076, 0.205], [\"neural\", 1736, 0.2048, 0.202], [\"likelihood\", 1685, 0.1987, 0.196], [\"inference\", 1665, 0.1964, 0.194], [\"section\", 1641, 0.1936, 0.191], [\"values\", 1531, 0.1806, 0.178], [\"sequence\", 1483, 0.1749, 0.172], [\"output\", 1471, 0.1735, 0.171], [\"gradient\", 1458, 0.172, 0.169], [\"algorithms\", 1447, 0.1707, 0.168], [\"single\", 1427, 0.1683, 0.166], [\"bayesian\", 1425, 0.1681, 0.165], [\"pp\", 1388, 0.1637, 0.161], [\"space\", 1374, 0.1621, 0.159]], \"hoverlabel\": {\"namelength\": 0}, \"hovertemplate\": \"Word=%{customdata[0]}<br>Count of Word=%{customdata[1]}<br>Percentage of Total Corpus=%{customdata[2]}<br>Diff vs. Avg. Word Freq %=%{customdata[3]}<br>Comparison vs. Avg. Word Freq (Multiplier)=%{marker.color}\", \"legendgroup\": \"\", \"marker\": {\"color\": [205.0, 201.0, 195.0, 176.0, 170.0, 164.0, 146.0, 144.0, 128.0, 127.0, 113.0, 110.0, 107.0, 104.0, 100.0, 97.0, 96.0, 95.0, 93.0, 90.0, 89.0, 87.0, 85.0, 84.0, 83.0, 78.0, 75.0, 74.0, 71.0, 70.0, 69.0, 65.0, 63.0, 62.0, 62.0, 61.0, 60.0, 60.0, 59.0, 58.0], \"coloraxis\": \"coloraxis\"}, \"name\": \"\", \"offsetgroup\": \"\", \"orientation\": \"v\", \"showlegend\": false, \"text\": [205.0, 201.0, 195.0, 176.0, 170.0, 164.0, 146.0, 144.0, 128.0, 127.0, 113.0, 110.0, 107.0, 104.0, 100.0, 97.0, 96.0, 95.0, 93.0, 90.0, 89.0, 87.0, 85.0, 84.0, 83.0, 78.0, 75.0, 74.0, 71.0, 70.0, 69.0, 65.0, 63.0, 62.0, 62.0, 61.0, 60.0, 60.0, 59.0, 58.0], \"textposition\": \"auto\", \"type\": \"bar\", \"x\": [\"set\", \"function\", \"distribution\", \"learning\", \"algorithm\", \"network\", \"models\", \"variables\", \"language\", \"pdf\", \"variable\", \"networks\", \"word\", \"graph\", \"structure\", \"state\", \"tree\", \"np\", \"training\", \"parameters\", \"input\", \"linear\", \"machine\", \"words\", \"form\", \"score\", \"information\", \"neural\", \"likelihood\", \"inference\", \"section\", \"values\", \"sequence\", \"output\", \"gradient\", \"algorithms\", \"single\", \"bayesian\", \"pp\", \"space\"], \"xaxis\": \"x\", \"y\": [4809, 4716, 4573, 4118, 3981, 3852, 3414, 3385, 2995, 2973, 2661, 2575, 2518, 2448, 2349, 2285, 2248, 2239, 2191, 2110, 2086, 2056, 2009, 1985, 1961, 1835, 1760, 1736, 1685, 1665, 1641, 1531, 1483, 1471, 1458, 1447, 1427, 1425, 1388, 1374], \"yaxis\": \"y\"}],\n",
              "                        {\"barmode\": \"relative\", \"coloraxis\": {\"colorbar\": {\"title\": {\"text\": \"Comparison vs. Avg. Word Freq (Multiplier)\"}}, \"colorscale\": [[0.0, \"rgb(2,4,25)\"], [0.0625, \"rgb(24,15,41)\"], [0.125, \"rgb(47,23,57)\"], [0.1875, \"rgb(71,28,72)\"], [0.25, \"rgb(97,30,82)\"], [0.3125, \"rgb(123,30,89)\"], [0.375, \"rgb(150,27,91)\"], [0.4375, \"rgb(177,22,88)\"], [0.5, \"rgb(203,26,79)\"], [0.5625, \"rgb(223,47,67)\"], [0.625, \"rgb(236,76,61)\"], [0.6875, \"rgb(242,107,73)\"], [0.75, \"rgb(244,135,95)\"], [0.8125, \"rgb(245,162,122)\"], [0.875, \"rgb(246,188,153)\"], [0.9375, \"rgb(247,212,187)\"], [1.0, \"rgb(250,234,220)\"]]}, \"legend\": {\"tracegroupgap\": 0}, \"template\": {\"data\": {\"bar\": [{\"error_x\": {\"color\": \"rgb(36,36,36)\"}, \"error_y\": {\"color\": \"rgb(36,36,36)\"}, \"marker\": {\"line\": {\"color\": \"rgb(234,234,242)\", \"width\": 0.5}}, \"type\": \"bar\"}], \"barpolar\": [{\"marker\": {\"line\": {\"color\": \"rgb(234,234,242)\", \"width\": 0.5}}, \"type\": \"barpolar\"}], \"carpet\": [{\"aaxis\": {\"endlinecolor\": \"rgb(36,36,36)\", \"gridcolor\": \"white\", \"linecolor\": \"white\", \"minorgridcolor\": \"white\", \"startlinecolor\": \"rgb(36,36,36)\"}, \"baxis\": {\"endlinecolor\": \"rgb(36,36,36)\", \"gridcolor\": \"white\", \"linecolor\": \"white\", \"minorgridcolor\": \"white\", \"startlinecolor\": \"rgb(36,36,36)\"}, \"type\": \"carpet\"}], \"choropleth\": [{\"colorbar\": {\"outlinewidth\": 0, \"tickcolor\": \"rgb(36,36,36)\", \"ticklen\": 8, \"ticks\": \"outside\", \"tickwidth\": 2}, \"type\": \"choropleth\"}], \"contour\": [{\"colorbar\": {\"outlinewidth\": 0, \"tickcolor\": \"rgb(36,36,36)\", \"ticklen\": 8, \"ticks\": \"outside\", \"tickwidth\": 2}, \"colorscale\": [[0.0, \"rgb(2,4,25)\"], [0.06274509803921569, \"rgb(24,15,41)\"], [0.12549019607843137, \"rgb(47,23,57)\"], [0.18823529411764706, \"rgb(71,28,72)\"], [0.25098039215686274, \"rgb(97,30,82)\"], [0.3137254901960784, \"rgb(123,30,89)\"], [0.3764705882352941, \"rgb(150,27,91)\"], [0.4392156862745098, \"rgb(177,22,88)\"], [0.5019607843137255, \"rgb(203,26,79)\"], [0.5647058823529412, \"rgb(223,47,67)\"], [0.6274509803921569, \"rgb(236,76,61)\"], [0.6901960784313725, \"rgb(242,107,73)\"], [0.7529411764705882, \"rgb(244,135,95)\"], [0.8156862745098039, \"rgb(245,162,122)\"], [0.8784313725490196, \"rgb(246,188,153)\"], [0.9411764705882353, \"rgb(247,212,187)\"], [1.0, \"rgb(250,234,220)\"]], \"type\": \"contour\"}], \"contourcarpet\": [{\"colorbar\": {\"outlinewidth\": 0, \"tickcolor\": \"rgb(36,36,36)\", \"ticklen\": 8, \"ticks\": \"outside\", \"tickwidth\": 2}, \"type\": \"contourcarpet\"}], \"heatmap\": [{\"colorbar\": {\"outlinewidth\": 0, \"tickcolor\": \"rgb(36,36,36)\", \"ticklen\": 8, \"ticks\": \"outside\", \"tickwidth\": 2}, \"colorscale\": [[0.0, \"rgb(2,4,25)\"], [0.06274509803921569, \"rgb(24,15,41)\"], [0.12549019607843137, \"rgb(47,23,57)\"], [0.18823529411764706, \"rgb(71,28,72)\"], [0.25098039215686274, \"rgb(97,30,82)\"], [0.3137254901960784, \"rgb(123,30,89)\"], [0.3764705882352941, \"rgb(150,27,91)\"], [0.4392156862745098, \"rgb(177,22,88)\"], [0.5019607843137255, \"rgb(203,26,79)\"], [0.5647058823529412, \"rgb(223,47,67)\"], [0.6274509803921569, \"rgb(236,76,61)\"], [0.6901960784313725, \"rgb(242,107,73)\"], [0.7529411764705882, \"rgb(244,135,95)\"], [0.8156862745098039, \"rgb(245,162,122)\"], [0.8784313725490196, \"rgb(246,188,153)\"], [0.9411764705882353, \"rgb(247,212,187)\"], [1.0, \"rgb(250,234,220)\"]], \"type\": \"heatmap\"}], \"heatmapgl\": [{\"colorbar\": {\"outlinewidth\": 0, \"tickcolor\": \"rgb(36,36,36)\", \"ticklen\": 8, \"ticks\": \"outside\", \"tickwidth\": 2}, \"colorscale\": [[0.0, \"rgb(2,4,25)\"], [0.06274509803921569, \"rgb(24,15,41)\"], [0.12549019607843137, \"rgb(47,23,57)\"], [0.18823529411764706, \"rgb(71,28,72)\"], [0.25098039215686274, \"rgb(97,30,82)\"], [0.3137254901960784, \"rgb(123,30,89)\"], [0.3764705882352941, \"rgb(150,27,91)\"], [0.4392156862745098, \"rgb(177,22,88)\"], [0.5019607843137255, \"rgb(203,26,79)\"], [0.5647058823529412, \"rgb(223,47,67)\"], [0.6274509803921569, \"rgb(236,76,61)\"], [0.6901960784313725, \"rgb(242,107,73)\"], [0.7529411764705882, \"rgb(244,135,95)\"], [0.8156862745098039, \"rgb(245,162,122)\"], [0.8784313725490196, \"rgb(246,188,153)\"], [0.9411764705882353, \"rgb(247,212,187)\"], [1.0, \"rgb(250,234,220)\"]], \"type\": \"heatmapgl\"}], \"histogram\": [{\"marker\": {\"colorbar\": {\"outlinewidth\": 0, \"tickcolor\": \"rgb(36,36,36)\", \"ticklen\": 8, \"ticks\": \"outside\", \"tickwidth\": 2}}, \"type\": \"histogram\"}], \"histogram2d\": [{\"colorbar\": {\"outlinewidth\": 0, \"tickcolor\": \"rgb(36,36,36)\", \"ticklen\": 8, \"ticks\": \"outside\", \"tickwidth\": 2}, \"colorscale\": [[0.0, \"rgb(2,4,25)\"], [0.06274509803921569, \"rgb(24,15,41)\"], [0.12549019607843137, \"rgb(47,23,57)\"], [0.18823529411764706, \"rgb(71,28,72)\"], [0.25098039215686274, \"rgb(97,30,82)\"], [0.3137254901960784, \"rgb(123,30,89)\"], [0.3764705882352941, \"rgb(150,27,91)\"], [0.4392156862745098, \"rgb(177,22,88)\"], [0.5019607843137255, \"rgb(203,26,79)\"], [0.5647058823529412, \"rgb(223,47,67)\"], [0.6274509803921569, \"rgb(236,76,61)\"], [0.6901960784313725, \"rgb(242,107,73)\"], [0.7529411764705882, \"rgb(244,135,95)\"], [0.8156862745098039, \"rgb(245,162,122)\"], [0.8784313725490196, \"rgb(246,188,153)\"], [0.9411764705882353, \"rgb(247,212,187)\"], [1.0, \"rgb(250,234,220)\"]], \"type\": \"histogram2d\"}], \"histogram2dcontour\": [{\"colorbar\": {\"outlinewidth\": 0, \"tickcolor\": \"rgb(36,36,36)\", \"ticklen\": 8, \"ticks\": \"outside\", \"tickwidth\": 2}, \"colorscale\": [[0.0, \"rgb(2,4,25)\"], [0.06274509803921569, \"rgb(24,15,41)\"], [0.12549019607843137, \"rgb(47,23,57)\"], [0.18823529411764706, \"rgb(71,28,72)\"], [0.25098039215686274, \"rgb(97,30,82)\"], [0.3137254901960784, \"rgb(123,30,89)\"], [0.3764705882352941, \"rgb(150,27,91)\"], [0.4392156862745098, \"rgb(177,22,88)\"], [0.5019607843137255, \"rgb(203,26,79)\"], [0.5647058823529412, \"rgb(223,47,67)\"], [0.6274509803921569, \"rgb(236,76,61)\"], [0.6901960784313725, \"rgb(242,107,73)\"], [0.7529411764705882, \"rgb(244,135,95)\"], [0.8156862745098039, \"rgb(245,162,122)\"], [0.8784313725490196, \"rgb(246,188,153)\"], [0.9411764705882353, \"rgb(247,212,187)\"], [1.0, \"rgb(250,234,220)\"]], \"type\": \"histogram2dcontour\"}], \"mesh3d\": [{\"colorbar\": {\"outlinewidth\": 0, \"tickcolor\": \"rgb(36,36,36)\", \"ticklen\": 8, \"ticks\": \"outside\", \"tickwidth\": 2}, \"type\": \"mesh3d\"}], \"parcoords\": [{\"line\": {\"colorbar\": {\"outlinewidth\": 0, \"tickcolor\": \"rgb(36,36,36)\", \"ticklen\": 8, \"ticks\": \"outside\", \"tickwidth\": 2}}, \"type\": \"parcoords\"}], \"pie\": [{\"automargin\": true, \"type\": \"pie\"}], \"scatter\": [{\"marker\": {\"colorbar\": {\"outlinewidth\": 0, \"tickcolor\": \"rgb(36,36,36)\", \"ticklen\": 8, \"ticks\": \"outside\", \"tickwidth\": 2}}, \"type\": \"scatter\"}], \"scatter3d\": [{\"line\": {\"colorbar\": {\"outlinewidth\": 0, \"tickcolor\": \"rgb(36,36,36)\", \"ticklen\": 8, \"ticks\": \"outside\", \"tickwidth\": 2}}, \"marker\": {\"colorbar\": {\"outlinewidth\": 0, \"tickcolor\": \"rgb(36,36,36)\", \"ticklen\": 8, \"ticks\": \"outside\", \"tickwidth\": 2}}, \"type\": \"scatter3d\"}], \"scattercarpet\": [{\"marker\": {\"colorbar\": {\"outlinewidth\": 0, \"tickcolor\": \"rgb(36,36,36)\", \"ticklen\": 8, \"ticks\": \"outside\", \"tickwidth\": 2}}, \"type\": \"scattercarpet\"}], \"scattergeo\": [{\"marker\": {\"colorbar\": {\"outlinewidth\": 0, \"tickcolor\": \"rgb(36,36,36)\", \"ticklen\": 8, \"ticks\": \"outside\", \"tickwidth\": 2}}, \"type\": \"scattergeo\"}], \"scattergl\": [{\"marker\": {\"colorbar\": {\"outlinewidth\": 0, \"tickcolor\": \"rgb(36,36,36)\", \"ticklen\": 8, \"ticks\": \"outside\", \"tickwidth\": 2}}, \"type\": \"scattergl\"}], \"scattermapbox\": [{\"marker\": {\"colorbar\": {\"outlinewidth\": 0, \"tickcolor\": \"rgb(36,36,36)\", \"ticklen\": 8, \"ticks\": \"outside\", \"tickwidth\": 2}}, \"type\": \"scattermapbox\"}], \"scatterpolar\": [{\"marker\": {\"colorbar\": {\"outlinewidth\": 0, \"tickcolor\": \"rgb(36,36,36)\", \"ticklen\": 8, \"ticks\": \"outside\", \"tickwidth\": 2}}, \"type\": \"scatterpolar\"}], \"scatterpolargl\": [{\"marker\": {\"colorbar\": {\"outlinewidth\": 0, \"tickcolor\": \"rgb(36,36,36)\", \"ticklen\": 8, \"ticks\": \"outside\", \"tickwidth\": 2}}, \"type\": \"scatterpolargl\"}], \"scatterternary\": [{\"marker\": {\"colorbar\": {\"outlinewidth\": 0, \"tickcolor\": \"rgb(36,36,36)\", \"ticklen\": 8, \"ticks\": \"outside\", \"tickwidth\": 2}}, \"type\": \"scatterternary\"}], \"surface\": [{\"colorbar\": {\"outlinewidth\": 0, \"tickcolor\": \"rgb(36,36,36)\", \"ticklen\": 8, \"ticks\": \"outside\", \"tickwidth\": 2}, \"colorscale\": [[0.0, \"rgb(2,4,25)\"], [0.06274509803921569, \"rgb(24,15,41)\"], [0.12549019607843137, \"rgb(47,23,57)\"], [0.18823529411764706, \"rgb(71,28,72)\"], [0.25098039215686274, \"rgb(97,30,82)\"], [0.3137254901960784, \"rgb(123,30,89)\"], [0.3764705882352941, \"rgb(150,27,91)\"], [0.4392156862745098, \"rgb(177,22,88)\"], [0.5019607843137255, \"rgb(203,26,79)\"], [0.5647058823529412, \"rgb(223,47,67)\"], [0.6274509803921569, \"rgb(236,76,61)\"], [0.6901960784313725, \"rgb(242,107,73)\"], [0.7529411764705882, \"rgb(244,135,95)\"], [0.8156862745098039, \"rgb(245,162,122)\"], [0.8784313725490196, \"rgb(246,188,153)\"], [0.9411764705882353, \"rgb(247,212,187)\"], [1.0, \"rgb(250,234,220)\"]], \"type\": \"surface\"}], \"table\": [{\"cells\": {\"fill\": {\"color\": \"rgb(231,231,240)\"}, \"line\": {\"color\": \"white\"}}, \"header\": {\"fill\": {\"color\": \"rgb(183,183,191)\"}, \"line\": {\"color\": \"white\"}}, \"type\": \"table\"}]}, \"layout\": {\"annotationdefaults\": {\"arrowcolor\": \"rgb(67,103,167)\"}, \"coloraxis\": {\"colorbar\": {\"outlinewidth\": 0, \"tickcolor\": \"rgb(36,36,36)\", \"ticklen\": 8, \"ticks\": \"outside\", \"tickwidth\": 2}}, \"colorscale\": {\"sequential\": [[0.0, \"rgb(2,4,25)\"], [0.06274509803921569, \"rgb(24,15,41)\"], [0.12549019607843137, \"rgb(47,23,57)\"], [0.18823529411764706, \"rgb(71,28,72)\"], [0.25098039215686274, \"rgb(97,30,82)\"], [0.3137254901960784, \"rgb(123,30,89)\"], [0.3764705882352941, \"rgb(150,27,91)\"], [0.4392156862745098, \"rgb(177,22,88)\"], [0.5019607843137255, \"rgb(203,26,79)\"], [0.5647058823529412, \"rgb(223,47,67)\"], [0.6274509803921569, \"rgb(236,76,61)\"], [0.6901960784313725, \"rgb(242,107,73)\"], [0.7529411764705882, \"rgb(244,135,95)\"], [0.8156862745098039, \"rgb(245,162,122)\"], [0.8784313725490196, \"rgb(246,188,153)\"], [0.9411764705882353, \"rgb(247,212,187)\"], [1.0, \"rgb(250,234,220)\"]], \"sequentialminus\": [[0.0, \"rgb(2,4,25)\"], [0.06274509803921569, \"rgb(24,15,41)\"], [0.12549019607843137, \"rgb(47,23,57)\"], [0.18823529411764706, \"rgb(71,28,72)\"], [0.25098039215686274, \"rgb(97,30,82)\"], [0.3137254901960784, \"rgb(123,30,89)\"], [0.3764705882352941, \"rgb(150,27,91)\"], [0.4392156862745098, \"rgb(177,22,88)\"], [0.5019607843137255, \"rgb(203,26,79)\"], [0.5647058823529412, \"rgb(223,47,67)\"], [0.6274509803921569, \"rgb(236,76,61)\"], [0.6901960784313725, \"rgb(242,107,73)\"], [0.7529411764705882, \"rgb(244,135,95)\"], [0.8156862745098039, \"rgb(245,162,122)\"], [0.8784313725490196, \"rgb(246,188,153)\"], [0.9411764705882353, \"rgb(247,212,187)\"], [1.0, \"rgb(250,234,220)\"]]}, \"colorway\": [\"rgb(76,114,176)\", \"rgb(221,132,82)\", \"rgb(85,168,104)\", \"rgb(196,78,82)\", \"rgb(129,114,179)\", \"rgb(147,120,96)\", \"rgb(218,139,195)\", \"rgb(140,140,140)\", \"rgb(204,185,116)\", \"rgb(100,181,205)\"], \"font\": {\"color\": \"rgb(36,36,36)\"}, \"geo\": {\"bgcolor\": \"white\", \"lakecolor\": \"white\", \"landcolor\": \"rgb(234,234,242)\", \"showlakes\": true, \"showland\": true, \"subunitcolor\": \"white\"}, \"hoverlabel\": {\"align\": \"left\"}, \"hovermode\": \"closest\", \"paper_bgcolor\": \"white\", \"plot_bgcolor\": \"rgb(234,234,242)\", \"polar\": {\"angularaxis\": {\"gridcolor\": \"white\", \"linecolor\": \"white\", \"showgrid\": true, \"ticks\": \"\"}, \"bgcolor\": \"rgb(234,234,242)\", \"radialaxis\": {\"gridcolor\": \"white\", \"linecolor\": \"white\", \"showgrid\": true, \"ticks\": \"\"}}, \"scene\": {\"xaxis\": {\"backgroundcolor\": \"rgb(234,234,242)\", \"gridcolor\": \"white\", \"gridwidth\": 2, \"linecolor\": \"white\", \"showbackground\": true, \"showgrid\": true, \"ticks\": \"\", \"zerolinecolor\": \"white\"}, \"yaxis\": {\"backgroundcolor\": \"rgb(234,234,242)\", \"gridcolor\": \"white\", \"gridwidth\": 2, \"linecolor\": \"white\", \"showbackground\": true, \"showgrid\": true, \"ticks\": \"\", \"zerolinecolor\": \"white\"}, \"zaxis\": {\"backgroundcolor\": \"rgb(234,234,242)\", \"gridcolor\": \"white\", \"gridwidth\": 2, \"linecolor\": \"white\", \"showbackground\": true, \"showgrid\": true, \"ticks\": \"\", \"zerolinecolor\": \"white\"}}, \"shapedefaults\": {\"fillcolor\": \"rgb(67,103,167)\", \"line\": {\"width\": 0}, \"opacity\": 0.5}, \"ternary\": {\"aaxis\": {\"gridcolor\": \"white\", \"linecolor\": \"white\", \"showgrid\": true, \"ticks\": \"\"}, \"baxis\": {\"gridcolor\": \"white\", \"linecolor\": \"white\", \"showgrid\": true, \"ticks\": \"\"}, \"bgcolor\": \"rgb(234,234,242)\", \"caxis\": {\"gridcolor\": \"white\", \"linecolor\": \"white\", \"showgrid\": true, \"ticks\": \"\"}}, \"xaxis\": {\"automargin\": true, \"gridcolor\": \"white\", \"linecolor\": \"white\", \"showgrid\": true, \"ticks\": \"\", \"title\": {\"standoff\": 15}, \"zerolinecolor\": \"white\"}, \"yaxis\": {\"automargin\": true, \"gridcolor\": \"white\", \"linecolor\": \"white\", \"showgrid\": true, \"ticks\": \"\", \"title\": {\"standoff\": 15}, \"zerolinecolor\": \"white\"}}}, \"title\": {\"text\": \"top  40 words in NLP - completed course topics_13062021_\"}, \"xaxis\": {\"anchor\": \"y\", \"domain\": [0.0, 1.0], \"title\": {\"text\": \"Word\"}}, \"yaxis\": {\"anchor\": \"x\", \"domain\": [0.0, 1.0], \"title\": {\"text\": \"Count of Word\"}}},\n",
              "                        {\"responsive\": true}\n",
              "                    ).then(function(){\n",
              "                            \n",
              "var gd = document.getElementById('5c79d034-2d82-412e-99a6-a2d6873f69f2');\n",
              "var x = new MutationObserver(function (mutations, observer) {{\n",
              "        var display = window.getComputedStyle(gd).display;\n",
              "        if (!display || display === 'none') {{\n",
              "            console.log([gd, 'removed!']);\n",
              "            Plotly.purge(gd);\n",
              "            observer.disconnect();\n",
              "        }}\n",
              "}});\n",
              "\n",
              "// Listen for the removal of the full notebook cells\n",
              "var notebookContainer = gd.closest('#notebook-container');\n",
              "if (notebookContainer) {{\n",
              "    x.observe(notebookContainer, {childList: true});\n",
              "}}\n",
              "\n",
              "// Listen for the clearing of the current output cell\n",
              "var outputEl = gd.closest('.output');\n",
              "if (outputEl) {{\n",
              "    x.observe(outputEl, {childList: true});\n",
              "}}\n",
              "\n",
              "                        })\n",
              "                };\n",
              "                \n",
              "            </script>\n",
              "        </div>\n",
              "</body>\n",
              "</html>"
            ]
          },
          "metadata": {
            "tags": []
          }
        }
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "8jOCHaRgcCL0"
      },
      "source": [
        "###  Stopwords, Bigrams Lemmatize\n",
        "\n",
        "The phrase models are ready. Let’s define the functions to remove the stopwords, make trigrams and lemmatization and call them sequentially."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "46rmsj-qcCL0",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 17
        },
        "outputId": "294750cb-ef36-407d-d2d7-87bd85b2665e"
      },
      "source": [
        "# NLTK Stop words\n",
        "import nltk\n",
        "nltk.download('stopwords')\n",
        "from nltk.corpus import stopwords\n",
        "\n",
        "stop_words = stopwords.words('english')\n",
        "IML_stop_words = ['one', 'actually', 'want', 'see', 'right', 'like', 'spring', 'example', 'really', 'going',\n",
        "                   'yes', 'look', 'well', 'intro', 'let', 'introduction',\n",
        "                  'ml_tutorial','also', 'think', 'write', 'case', 'question']\n",
        "stop_words.extend(['from', 'subject', 're', 'edu', 'use', 'page', 'eth', 'ethz'\n",
        "                   'zurich', 'spring', 'spring 21', '2021', '2020'])\n",
        "stop_words.extend(IML_stop_words)"
      ],
      "execution_count": null,
      "outputs": [
        {
          "output_type": "display_data",
          "data": {
            "text/html": [
              "\n",
              "  <style>\n",
              "    pre {\n",
              "        white-space: pre-wrap;\n",
              "    }\n",
              "  </style>\n",
              "  "
            ],
            "text/plain": [
              "<IPython.core.display.HTML object>"
            ]
          },
          "metadata": {
            "tags": []
          }
        }
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "Kta75TuccCL0",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 17
        },
        "outputId": "90a02b50-19dd-455e-b408-3bc3215f7446"
      },
      "source": [
        "# Define functions for stopwords, bigrams, trigrams and lemmatization\n",
        "def remove_stopwords(texts):\n",
        "    return [[word for word in simple_preprocess(str(doc)) if word not in stop_words] for doc in texts]\n",
        "\n",
        "def make_bigrams(texts):\n",
        "    return [bigram_mod[doc] for doc in texts]\n",
        "\n",
        "def make_trigrams(texts):\n",
        "    return [trigram_mod[bigram_mod[doc]] for doc in texts]\n",
        "\n",
        "def lemmatization(texts, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV']):\n",
        "    \"\"\"https://spacy.io/api/annotation\"\"\"\n",
        "    texts_out = []\n",
        "    for sent in texts:\n",
        "        doc = nlp(\" \".join(sent)) \n",
        "        texts_out.append([token.lemma_ for token in doc if token.pos_ in allowed_postags])\n",
        "    return texts_out"
      ],
      "execution_count": null,
      "outputs": [
        {
          "output_type": "display_data",
          "data": {
            "text/html": [
              "\n",
              "  <style>\n",
              "    pre {\n",
              "        white-space: pre-wrap;\n",
              "    }\n",
              "  </style>\n",
              "  "
            ],
            "text/plain": [
              "<IPython.core.display.HTML object>"
            ]
          },
          "metadata": {
            "tags": []
          }
        }
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "9XXcL1UFcCL0"
      },
      "source": [
        "Let's call the functions in order."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "iDzmzbZhcCL1",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 17
        },
        "outputId": "039ac623-e78c-4e7b-f1df-3e3f7a7af4af"
      },
      "source": [
        "%%capture\n",
        "!pip install --upgrade spacy\n",
        "!python -m spacy download en_core_web_lg"
      ],
      "execution_count": null,
      "outputs": [
        {
          "output_type": "display_data",
          "data": {
            "text/html": [
              "\n",
              "  <style>\n",
              "    pre {\n",
              "        white-space: pre-wrap;\n",
              "    }\n",
              "  </style>\n",
              "  "
            ],
            "text/plain": [
              "<IPython.core.display.HTML object>"
            ]
          },
          "metadata": {
            "tags": []
          }
        }
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "owrGc1RmcCL1",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 88
        },
        "outputId": "bbdf500e-69f7-4ac3-e9be-9380ce90aa4f"
      },
      "source": [
        "import spacy\n",
        "\n",
        "# Remove Stop Words\n",
        "data_words_nostops = remove_stopwords(data_words)\n",
        "\n",
        "# Form Bigrams\n",
        "data_words_bigrams = make_bigrams(data_words_nostops)\n",
        "\n",
        "# Initialize spacy 'en' model, keeping only tagger component (for efficiency)\n",
        "nlp = spacy.load(\"en_core_web_lg\", disable=['parser', 'ner'])\n",
        "\n",
        "# Do lemmatization keeping only noun, adj, vb, adv\n",
        "data_lemmatized = lemmatization(data_words_bigrams, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV'])\n",
        "\n",
        "print(data_lemmatized[:1][0][:30])"
      ],
      "execution_count": null,
      "outputs": [
        {
          "output_type": "display_data",
          "data": {
            "text/html": [
              "\n",
              "  <style>\n",
              "    pre {\n",
              "        white-space: pre-wrap;\n",
              "    }\n",
              "  </style>\n",
              "  "
            ],
            "text/plain": [
              "<IPython.core.display.HTML object>"
            ]
          },
          "metadata": {
            "tags": []
          }
        },
        {
          "output_type": "stream",
          "text": [
            "['discrete', 'system', 'automata', 'techni', 'seo', 'engineering', 'network', 'overview', 'binary', 'representation', 'boolean', 'function', 'comparing_circuit', 'representation', 'set', 'reachability_state', 'compare', 'computation', 'tree', 'automata', 'specification', 'automatic', 'generation', 'software', 'hardware', 'simulation', 'automata', 'specification', 'automatic', 'generation']\n"
          ],
          "name": "stdout"
        }
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "LmXq3lXCcCL2"
      },
      "source": [
        "** **\n",
        "#### Data transformation: Corpus and Dictionary\n",
        "** **\n",
        "\n",
        "The two main inputs to the LDA topic model are the dictionary(id2word) and the corpus. Let’s create them."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "LQZ490PjcCL2",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 52
        },
        "outputId": "6feb8018-1a27-4128-ec76-427cb2fab296"
      },
      "source": [
        "import gensim.corpora as corpora\n",
        "\n",
        "# Create Dictionary\n",
        "id2word = corpora.Dictionary(data_lemmatized)\n",
        "\n",
        "# Create Corpus\n",
        "texts = data_lemmatized\n",
        "\n",
        "# Term Document Frequency\n",
        "corpus = [id2word.doc2bow(text) for text in texts]\n",
        "\n",
        "# View\n",
        "print(corpus[:1][0][:30])"
      ],
      "execution_count": null,
      "outputs": [
        {
          "output_type": "display_data",
          "data": {
            "text/html": [
              "\n",
              "  <style>\n",
              "    pre {\n",
              "        white-space: pre-wrap;\n",
              "    }\n",
              "  </style>\n",
              "  "
            ],
            "text/plain": [
              "<IPython.core.display.HTML object>"
            ]
          },
          "metadata": {
            "tags": []
          }
        },
        {
          "output_type": "stream",
          "text": [
            "[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 2), (6, 1), (7, 1), (8, 1), (9, 1), (10, 1), (11, 1), (12, 2), (13, 6), (14, 1), (15, 3), (16, 2), (17, 1), (18, 1), (19, 1), (20, 7), (21, 2), (22, 1), (23, 1), (24, 20), (25, 1), (26, 2), (27, 11), (28, 1), (29, 1)]\n"
          ],
          "name": "stdout"
        }
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "SgOQScOUcCL2"
      },
      "source": [
        "** **\n",
        "### 'Base' Model \n",
        "** **\n",
        "\n",
        "We have everything required to train the base LDA model. In addition to the corpus and dictionary, you need to provide the number of topics as well. Apart from that, alpha and eta are hyperparameters that affect sparsity of the topics. According to the Gensim docs, both defaults to 1.0/num_topics prior (we'll use default for the base model).\n",
        "\n",
        "chunksize controls how many documents are processed at a time in the training algorithm. Increasing chunksize will speed up training, at least as long as the chunk of documents easily fit into memory.\n",
        "\n",
        "passes controls how often we train the model on the entire corpus (set to 10). Another word for passes might be \"epochs\". iterations is somewhat technical, but essentially it controls how often we repeat a particular loop over each document. It is important to set the number of \"passes\" and \"iterations\" high enough.\n",
        "\n",
        "[gensim official model doc](https://radimrehurek.com/gensim/models/ldamulticore.html)\n"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "rC58GSMPcCL2",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 35
        },
        "outputId": "f3ee3e7b-9c39-4948-905b-b2998a781957"
      },
      "source": [
        "# Build LDA model\n",
        "\n",
        "# start with recommended parameters at end of last notebook\n",
        "chunksize = 2000\n",
        "passes = 20\n",
        "iterations = 400 \n",
        "st = time.time()\n",
        "optimized_model = gensim.models.LdaMulticore(corpus=corpus,\n",
        "                                       id2word=id2word,\n",
        "                                       num_topics=10, \n",
        "                                       random_state=42,\n",
        "                                       chunksize=chunksize,\n",
        "                                       passes=passes,\n",
        "                                       iterations=iterations,\n",
        "                                       per_word_topics=True)\n",
        "base_rt = (time.time() - st) / 60\n",
        "print(\"took {} minutes to run ONE lda.multicore model\".format(round(\n",
        "    base_rt,2)))"
      ],
      "execution_count": null,
      "outputs": [
        {
          "output_type": "display_data",
          "data": {
            "text/html": [
              "\n",
              "  <style>\n",
              "    pre {\n",
              "        white-space: pre-wrap;\n",
              "    }\n",
              "  </style>\n",
              "  "
            ],
            "text/plain": [
              "<IPython.core.display.HTML object>"
            ]
          },
          "metadata": {
            "tags": []
          }
        },
        {
          "output_type": "stream",
          "text": [
            "took 1.53 minutes to run ONE lda.multicore model\n"
          ],
          "name": "stdout"
        }
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "unRipsIccCL3"
      },
      "source": [
        "** **\n",
        "The above LDA model is built with 10 different topics where each topic is a combination of keywords and each keyword contributes a certain weightage to the topic.\n",
        "\n",
        "You can see the keywords for each topic and the weightage(importance) of each keyword using `optimized_model.print_topics()`"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "uOkUqm7QcCL3",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 723
        },
        "outputId": "7ad38cef-2c1f-47d6-a115-9c1b4d4a0c7f"
      },
      "source": [
        "from pprint import pprint\n",
        "\n",
        "# Print the Keyword in the 10 topics\n",
        "pprint(optimized_model.print_topics())\n",
        "doc_lda = optimized_model[corpus]"
      ],
      "execution_count": null,
      "outputs": [
        {
          "output_type": "display_data",
          "data": {
            "text/html": [
              "\n",
              "  <style>\n",
              "    pre {\n",
              "        white-space: pre-wrap;\n",
              "    }\n",
              "  </style>\n",
              "  "
            ],
            "text/plain": [
              "<IPython.core.display.HTML object>"
            ]
          },
          "metadata": {
            "tags": []
          }
        },
        {
          "output_type": "stream",
          "text": [
            "[(0,\n",
            "  '0.019*\"state\" + 0.011*\"function\" + 0.011*\"set\" + 0.009*\"word\" + '\n",
            "  '0.009*\"linear\" + 0.008*\"weight\" + 0.008*\"algorithm\" + 0.007*\"transition\" + '\n",
            "  '0.007*\"label\" + 0.007*\"model\"'),\n",
            " (1,\n",
            "  '0.016*\"function\" + 0.011*\"mean\" + 0.011*\"semantic\" + 0.009*\"apply\" + '\n",
            "  '0.008*\"term\" + 0.007*\"variable\" + 0.007*\"sentence\" + 0.006*\"expression\" + '\n",
            "  '0.006*\"order\" + 0.006*\"input\"'),\n",
            " (2,\n",
            "  '0.009*\"language\" + 0.009*\"pdf\" + 0.008*\"variable\" + 0.008*\"statement\" + '\n",
            "  '0.008*\"set\" + 0.008*\"string\" + 0.007*\"table\" + 0.007*\"compiler\" + '\n",
            "  '0.007*\"item\" + 0.007*\"form\"'),\n",
            " (3,\n",
            "  '0.019*\"graph\" + 0.018*\"algorithm\" + 0.017*\"network\" + 0.015*\"factor\" + '\n",
            "  '0.014*\"variable\" + 0.009*\"set\" + 0.009*\"function\" + 0.009*\"distribution\" + '\n",
            "  '0.008*\"clique\" + 0.008*\"message\"'),\n",
            " (4,\n",
            "  '0.008*\"sentence\" + 0.008*\"word\" + 0.007*\"feature\" + 0.006*\"label\" + '\n",
            "  '0.006*\"review\" + 0.005*\"entity\" + 0.005*\"document\" + 0.005*\"language\" + '\n",
            "  '0.005*\"sentiment\" + 0.004*\"classification\"'),\n",
            " (5,\n",
            "  '0.040*\"state\" + 0.026*\"language\" + 0.025*\"regular\" + 0.021*\"accept\" + '\n",
            "  '0.016*\"string\" + 0.012*\"automaton\" + 0.010*\"point\" + 0.010*\"finite\" + '\n",
            "  '0.008*\"alphabet\" + 0.008*\"regular_language\"'),\n",
            " (6,\n",
            "  '0.021*\"distribution\" + 0.020*\"variable\" + 0.015*\"network\" + 0.012*\"set\" + '\n",
            "  '0.011*\"parameter\" + 0.009*\"structure\" + 0.008*\"function\" + 0.007*\"bayesian\" '\n",
            "  '+ 0.007*\"pdf\" + 0.006*\"define\"'),\n",
            " (7,\n",
            "  '0.016*\"tree\" + 0.015*\"grammar\" + 0.014*\"algorithm\" + 0.013*\"word\" + '\n",
            "  '0.010*\"parse\" + 0.008*\"rule\" + 0.008*\"symbol\" + 0.008*\"edge\" + 0.007*\"set\" '\n",
            "  '+ 0.006*\"root\"'),\n",
            " (8,\n",
            "  '0.015*\"network\" + 0.014*\"learn\" + 0.012*\"model\" + 0.012*\"function\" + '\n",
            "  '0.011*\"learning\" + 0.010*\"training\" + 0.009*\"distribution\" + '\n",
            "  '0.009*\"algorithm\" + 0.008*\"deep\" + 0.007*\"neural\"'),\n",
            " (9,\n",
            "  '0.018*\"word\" + 0.011*\"function\" + 0.010*\"language\" + 0.007*\"sequence\" + '\n",
            "  '0.006*\"output\" + 0.006*\"input\" + 0.006*\"mean\" + 0.006*\"set\" + '\n",
            "  '0.005*\"weight\" + 0.005*\"sentence\"')]\n"
          ],
          "name": "stdout"
        }
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "gCiLdGGLcCL3"
      },
      "source": [
        "#### Compute Model Perplexity and Coherence Score\n",
        "\n",
        "Let's calculate the baseline coherence score"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "T6DEIVhjcCL3",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 35
        },
        "outputId": "f3971b66-d940-4dbd-d033-44f0aa773ce4"
      },
      "source": [
        "from gensim.models import CoherenceModel\n",
        "\n",
        "# Compute Coherence Score\n",
        "coherence_model_lda = CoherenceModel(model=optimized_model, texts=data_lemmatized, dictionary=id2word, coherence='c_v')\n",
        "coherence_lda = coherence_model_lda.get_coherence()\n",
        "print('Coherence Score: ', coherence_lda)"
      ],
      "execution_count": null,
      "outputs": [
        {
          "output_type": "display_data",
          "data": {
            "text/html": [
              "\n",
              "  <style>\n",
              "    pre {\n",
              "        white-space: pre-wrap;\n",
              "    }\n",
              "  </style>\n",
              "  "
            ],
            "text/plain": [
              "<IPython.core.display.HTML object>"
            ]
          },
          "metadata": {
            "tags": []
          }
        },
        {
          "output_type": "stream",
          "text": [
            "Coherence Score:  0.4634461858377061\n"
          ],
          "name": "stdout"
        }
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "w_6l7kx9cCMD"
      },
      "source": [
        "** **\n",
        "## Hyperparameter tuning\n",
        "** **\n",
        "First, let's differentiate between model hyperparameters and model parameters :\n",
        "\n",
        "- `Model hyperparameters` can be thought of as settings for a machine learning algorithm that are tuned by the data scientist before training. Examples would be the number of trees in the random forest, or in our case, number of topics K\n",
        "\n",
        "- `Model parameters` can be thought of as what the model learns during training, such as the weights for each word in a given topic.\n",
        "\n",
        "Now that we have the baseline coherence score for the default LDA model, let's perform a series of sensitivity tests to help determine the following model hyperparameters: \n",
        "- Number of Topics (K)\n",
        "- Dirichlet hyperparameter alpha: Document-Topic Density\n",
        "- Dirichlet hyperparameter beta: Word-Topic Density\n",
        "\n",
        "We'll perform these tests in sequence, one parameter at a time by keeping others constant and run them over the two difference validation corpus sets. We'll use `C_v` as our choice of metric for performance comparison \n",
        "\n",
        "---\n",
        "**<font color=\"orange\">Note:**\n",
        "\n",
        "may switch to [hyperopt](http://hyperopt.github.io/hyperopt/) because optuna is being trash"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "qa9m-mvSJ6z6"
      },
      "source": [
        "### install and import"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "hQyD2JJALsHD",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 17
        },
        "outputId": "766b563a-6ea1-4990-86d9-fbc1dd355352"
      },
      "source": [
        "%%capture\n",
        "!pip install -U optuna\n",
        "!pip install -U pandas \n",
        "!pip install -U openpyxl\n",
        "!pip install -U tqdm\n",
        "from tqdm.auto import tqdm\n",
        "import optuna"
      ],
      "execution_count": null,
      "outputs": [
        {
          "output_type": "display_data",
          "data": {
            "text/html": [
              "\n",
              "  <style>\n",
              "    pre {\n",
              "        white-space: pre-wrap;\n",
              "    }\n",
              "  </style>\n",
              "  "
            ],
            "text/plain": [
              "<IPython.core.display.HTML object>"
            ]
          },
          "metadata": {
            "tags": []
          }
        }
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "NTRvnZyqPF3s"
      },
      "source": [
        "Optuna details and links:\n",
        "\n",
        "1. Their [quickstart guide](https://optuna.readthedocs.io/en/stable/tutorial/10_key_features/001_first.html#sphx-glr-tutorial-10-key-features-001-first-py) covers most things relevant here\n",
        "2. [suggest_categorical](https://optuna.readthedocs.io/en/stable/reference/generated/optuna.trial.Trial.html#optuna.trial.Trial.suggest_categorical)\n",
        "\n",
        "\n",
        "```\n",
        "def objective(trial):\n",
        "    kernel = trial.suggest_categorical(\"kernel\", [\"linear\", \"poly\", \"rbf\"])\n",
        "    clf = SVC(kernel=kernel, gamma=\"scale\", random_state=0)\n",
        "    clf.fit(X_train, y_train)\n",
        "    return clf.score(X_valid, y_valid)\n",
        "\n",
        "\n",
        "study = optuna.create_study(direction=\"maximize\")\n",
        "study.optimize(objective, n_trials=3)\n",
        "```\n",
        "---"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "0ZB4D_GLcCME"
      },
      "source": [
        "### Optuna Study\n",
        "\n",
        "Let's call the function, and iterate it over the range of topics, alpha, and beta parameter values\n",
        "\n",
        "full documentation on the study object [here](https://optuna.readthedocs.io/en/v1.4.0/reference/study.html)\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "2QAqVn6AAWGB"
      },
      "source": [
        "---\n",
        "CPU Info:"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "JYqoycSZAY_b",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 105
        },
        "outputId": "1b4a99a8-228a-4db5-a6dc-c41ba7ff859d"
      },
      "source": [
        "want_CPU_details = False\n",
        "\n",
        "if want_CPU_details:\n",
        "    # # disk info\n",
        "    !df -h\n",
        "    # # CPU Info\n",
        "    !cat /proc/cpuinfo \n",
        "    # # memory info\n",
        "    !cat /proc/meminfo\n",
        "\n",
        "# number of CPUs\n",
        "import multiprocessing\n",
        "num_cpus = multiprocessing.cpu_count()\n",
        "\n",
        "desired_num_jobs = int(round(num_cpus / 8)) # for optuna parallel processing- \n",
        "# be conservative so you don't get disconnected\n",
        "# can manually reset as needed:\n",
        "# desired_num_jobs = 6 \n",
        "w_per_job = int(round(num_cpus / desired_num_jobs)-1)\n",
        "if w_per_job < 1:\n",
        "    w_per_job = 1\n",
        "print(\"\\n\\nTotal CPUs on this runtime is: \", num_cpus)\n",
        "print(\"will attempt optimization with {} parallel jobs\".format(desired_num_jobs))\n",
        "print(\"there will be {} workers per job\".format(w_per_job))"
      ],
      "execution_count": null,
      "outputs": [
        {
          "output_type": "display_data",
          "data": {
            "text/html": [
              "\n",
              "  <style>\n",
              "    pre {\n",
              "        white-space: pre-wrap;\n",
              "    }\n",
              "  </style>\n",
              "  "
            ],
            "text/plain": [
              "<IPython.core.display.HTML object>"
            ]
          },
          "metadata": {
            "tags": []
          }
        },
        {
          "output_type": "stream",
          "text": [
            "\n",
            "\n",
            "Total CPUs on this runtime is:  40\n",
            "will attempt optimization with 5 parallel jobs\n",
            "there will be 7 workers per job\n"
          ],
          "name": "stdout"
        }
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "otATvYnsgQp4"
      },
      "source": [
        "#### LDA Model + Coherence fn()\n",
        "\n"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "D_cqdQXZcCMD",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 17
        },
        "outputId": "fbc99ce0-8c87-4614-dd39-eeb7a390a678"
      },
      "source": [
        "# supporting function\n",
        "\n",
        "def compute_coherence_values(corpus, dictionary, k, a, b):\n",
        "    # start with recommended parameters at end of last notebook\n",
        "    chunksize = 2000\n",
        "    passes = 20\n",
        "    iterations = 400 \n",
        "    lda_model = gensim.models.LdaMulticore(corpus=corpus,\n",
        "                                           id2word=dictionary,\n",
        "                                           num_topics=k, \n",
        "                                           random_state=42,\n",
        "                                           iterations=iterations,\n",
        "                                           chunksize=chunksize,\n",
        "                                           passes=passes,\n",
        "                                           alpha=a,\n",
        "                                           eta=b)\n",
        "                                        #    workers=w_per_job)\n",
        "    # workers=w_per_job used as defined above to ensure parallelization works \n",
        "    coherence_model_lda = CoherenceModel(model=lda_model, texts=data_lemmatized, \n",
        "                                         dictionary=id2word, coherence='c_v')\n",
        "    c_score = coherence_model_lda.get_coherence()\n",
        "\n",
        "    # cleanup\n",
        "    del coherence_model_lda\n",
        "    del lda_model\n",
        "    return c_score\n"
      ],
      "execution_count": null,
      "outputs": [
        {
          "output_type": "display_data",
          "data": {
            "text/html": [
              "\n",
              "  <style>\n",
              "    pre {\n",
              "        white-space: pre-wrap;\n",
              "    }\n",
              "  </style>\n",
              "  "
            ],
            "text/plain": [
              "<IPython.core.display.HTML object>"
            ]
          },
          "metadata": {
            "tags": []
          }
        }
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "6ZFRR0xvAke6"
      },
      "source": [
        "---"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "nLO3SUhXI_TP"
      },
      "source": [
        "#### optimization function"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "lj5QkNeTcCME",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 66,
          "referenced_widgets": [
            "87abe54fb5fd4b159ce9b60e1713db71",
            "337b0665c9f844be89cfac7fd6037392",
            "2f71d5a82fb54fd5ab44f990b58d1594",
            "0fd1aab5dde0417381d2f3d8a299abe4",
            "a6f01478747c448ea50bef18d5e10376",
            "3bc5275189a448ab96b3c6b89a73f650",
            "018ea19333c34a858be68be671450c15",
            "ed125ff112dd4621a3f01c021758fc2a"
          ]
        },
        "outputId": "4ee4e379-8fb8-4398-a4db-bd4b1a7179a5"
      },
      "source": [
        " \n",
        "import pandas as pd\n",
        "\n",
        "run_date = datetime.now()\n",
        "id = run_date.strftime(\"_%d-%m-%Y_\")\n",
        "\n",
        "# Alpha parameter\n",
        "# alpha.append('symmetric')\n",
        "# alpha.append('asymmetric')\n",
        "\n",
        "# Beta parameter\n",
        "# beta.append('symmetric')\n",
        "\n",
        "# Training and Test sets\n",
        "num_of_docs = len(corpus)\n",
        "corpus_sets = [gensim.utils.ClippedCorpus(corpus, int(num_of_docs*0.8)), \n",
        "               corpus]\n",
        "\n",
        "corpus_title = ['80% Corpus', '100% Corpus']\n",
        "\n",
        "corpi_dict = {\n",
        "    \"75p\":0,\n",
        "    \"100p\":1\n",
        "}\n",
        "c_names = [\"75p\", \"100p\"]\n",
        "# Can take a long time to run\n",
        "count = 0\n",
        "pbar_opt = tqdm(total=desired_trials, desc=\"overall hyperparam opt progress\")\n",
        "\n",
        "def objective(trial):\n",
        "    # adjust step size. Unless Optuna study graphs indicate the need, don't change\n",
        "    pbar_opt.update(1)\n",
        "    ab_step_size = 0.01\n",
        "    a = trial.suggest_float(\"Alpha\", 0.01, .99, step=ab_step_size)\n",
        "    b = trial.suggest_float(\"Beta\", 0.01, .99, step=ab_step_size)\n",
        "    num_docs = len(papers)\n",
        "\n",
        "    # set max possible number of topics based on how many inputs documents you have\n",
        "    # otherwise too hard to interpret tbh\n",
        "    if num_docs < 10:\n",
        "        param1 = 5\n",
        "    elif num_docs < 20:\n",
        "        param1 = 10\n",
        "    elif num_docs < 100:\n",
        "        param1 = 20\n",
        "    elif num_docs < 300:\n",
        "        param1 = 30\n",
        "    elif num_docs < 500:\n",
        "        param1 = 40\n",
        "    else:\n",
        "        param1 = 60\n",
        "    \n",
        "    # upper_bound = min(param1, len(papers)) # if # of docs is less than max # \n",
        "    upper_bound = param1\n",
        "    # of topics, then set the max number of topics = # of docs\n",
        "\n",
        "    k = trial.suggest_int(\"Topics\", 2, upper_bound)\n",
        "    # corpus = trial.suggest_categorical(\"corpus_size\",c_names)\n",
        "    # corpus_loc = corpi_dict.get(corpus)\n",
        "    return compute_coherence_values(corpus_sets[0], \n",
        "                                    id2word, k, a, b)\n",
        "\n",
        "feather_ext = \".ftr\"\n",
        "excel_ext = \".xlsx\"\n",
        "outputname = dataset_name + \"_OPTUNIZED_hyperparams_\" # this is basename then u add \n",
        "os.chdir(out_directory)\n",
        "\n",
        "print(\"finished completing pre-work for hyperparameter opt \", datetime.now())"
      ],
      "execution_count": null,
      "outputs": [
        {
          "output_type": "display_data",
          "data": {
            "text/html": [
              "\n",
              "  <style>\n",
              "    pre {\n",
              "        white-space: pre-wrap;\n",
              "    }\n",
              "  </style>\n",
              "  "
            ],
            "text/plain": [
              "<IPython.core.display.HTML object>"
            ]
          },
          "metadata": {
            "tags": []
          }
        },
        {
          "output_type": "display_data",
          "data": {
            "application/vnd.jupyter.widget-view+json": {
              "model_id": "87abe54fb5fd4b159ce9b60e1713db71",
              "version_minor": 0,
              "version_major": 2
            },
            "text/plain": [
              "HBox(children=(FloatProgress(value=0.0, description='overall hyperparam opt progress', max=200.0, style=Progre…"
            ]
          },
          "metadata": {
            "tags": []
          }
        },
        {
          "output_type": "stream",
          "text": [
            "finished completing pre-work for hyperparameter opt  2021-06-13 21:39:49.861880\n"
          ],
          "name": "stdout"
        }
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "RCRWSIxYJFoo"
      },
      "source": [
        "#### check if existing study obj"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "Ohvn3VyTEE78",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 158
        },
        "outputId": "581319b7-e9d9-4e32-eaa8-36365a4de3bc"
      },
      "source": [
        "# check folder for existing database of runs\n",
        "print(\"checking the folder: \\n\", out_directory, \"\\n for a prior database named: \",\n",
        "      outputname + feather_ext)\n",
        "list_of_files = natsorted(\n",
        "    [f for f in listdir(out_directory) if isfile(join(out_directory, f)) and f.endswith(\".ftr\")])\n",
        "print(\"\\nFound the following files in directory: \\n\")\n",
        "pp.pprint(list_of_files, compact=True, indent=10)"
      ],
      "execution_count": null,
      "outputs": [
        {
          "output_type": "display_data",
          "data": {
            "text/html": [
              "\n",
              "  <style>\n",
              "    pre {\n",
              "        white-space: pre-wrap;\n",
              "    }\n",
              "  </style>\n",
              "  "
            ],
            "text/plain": [
              "<IPython.core.display.HTML object>"
            ]
          },
          "metadata": {
            "tags": []
          }
        },
        {
          "output_type": "stream",
          "text": [
            "checking the folder: \n",
            " /content/drive/My Drive/Programming/topic_models/nlp_2021 \n",
            " for a prior database named:  NLP - completed course topics_13062021__OPTUNIZED_hyperparams_.ftr\n",
            "\n",
            "Found the following files in directory: \n",
            "\n",
            "[         'nlp Raw Text database (before NLP and Cleaning).ftr',\n",
            "          'nlp multi-directory compilation.ftr']\n"
          ],
          "name": "stdout"
        }
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "7QKLH19JJKcb"
      },
      "source": [
        "#### run or load study\n",
        "\n",
        "```\n",
        "study.optimize(func, n_trials=None, timeout=None, n_jobs=1, catch=(), \n",
        "    callbacks=None, gc_after_trial=False, show_progress_bar=False)\n",
        "```\n",
        "- [link](https://optuna.readthedocs.io/en/stable/reference/generated/optuna.study.Study.html#optuna.study.Study.optimize) to study.optimize() docs\n",
        "- official [gensim LDA tutorial](https://radimrehurek.com/gensim/auto_examples/tutorials/run_lda.html#sphx-glr-auto-examples-tutorials-run-lda-py)"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 1000,
          "referenced_widgets": [
            "da4a0d7763fd4bfbb645f9869e5eb4bd",
            "b1d1848653dd4507ba0f8dd043fe56d3",
            "38d72a88a18b4f5e874b38762b0ec5c8",
            "36d37021dc624db1971f0ec8d84d82d6",
            "0b6923cdf6664b4da6cd30f791954a75",
            "aba682afae59467b9ceed3b4d4850049",
            "94abba656356419ba228c7f0ce2cd8f8",
            "9804e36448a742bd8be143f664d6943a"
          ]
        },
        "id": "USqRJVXdwB_X",
        "outputId": "d59fb105-7e93-49b7-e02d-e24044b9d5bd"
      },
      "source": [
        " ran_study = False # for graph section \n",
        "\n",
        "if (outputname + feather_ext) in list_of_files:\n",
        "    print(\"\\n Success - Found a file with run results, loading that\")\n",
        "    best_TM_df = pd.read_feather(join(out_directory,\n",
        "                                      outputname + feather_ext)).convert_dtypes()\n",
        "else:\n",
        "    print(\"\\n Did not find a database - starting new study\")\n",
        "    optuna_start_time = time.time()\n",
        "    name_stu = \"topic_model_gensim_\" + dataset_name\n",
        "    study = optuna.create_study(study_name=name_stu, direction=\"maximize\")\n",
        "    study.optimize(objective, n_trials=desired_trials, timeout=(5*3600),\n",
        "                   show_progress_bar=True, gc_after_trial=True)\n",
        "                #    n_jobs=desired_num_jobs)\n",
        "    # set to timeout per run in case gets stuck\n",
        "    optuna_end_time = time.time()\n",
        "    optuna_rt = (optuna_end_time - optuna_start_time) / 60\n",
        "    print(\"\\n\\nFinished, took a total of {} minutes. Time\".format(round(optuna_rt)),\n",
        "          datetime.now())\n",
        "    # report best parameters\n",
        "    best_params_tm = study.best_params\n",
        "    print(\"best parameters are: \\n\", best_params_tm)\n",
        "    print(\"\\nassociated max coherence score: \", study.best_value)\n",
        "    # get df of all runs and download\n",
        "    best_TM_df = study.trials_dataframe()\n",
        "    best_TM_df = best_TM_df.convert_dtypes()\n",
        "    best_TM_df.to_feather(os.path.join(out_directory,outputname + feather_ext))\n",
        "    best_TM_df.to_excel(os.path.join(out_directory,outputname + excel_ext))\n",
        "    files.download(os.path.join(out_directory,outputname + excel_ext))\n",
        "\n",
        "    ran_study=True\n",
        "pbar_opt.close() # close bar (for tracking multiprocessing )\n",
        "\n",
        "# eventually need to add some way of comparing vs asymmetric and symmetric values"
      ],
      "execution_count": null,
      "outputs": [
        {
          "output_type": "display_data",
          "data": {
            "text/html": [
              "\n",
              "  <style>\n",
              "    pre {\n",
              "        white-space: pre-wrap;\n",
              "    }\n",
              "  </style>\n",
              "  "
            ],
            "text/plain": [
              "<IPython.core.display.HTML object>"
            ]
          },
          "metadata": {
            "tags": []
          }
        },
        {
          "output_type": "stream",
          "text": [
            "\n",
            " Did not find a database - starting new study\n"
          ],
          "name": "stdout"
        },
        {
          "output_type": "stream",
          "text": [
            "/usr/local/lib/python3.7/dist-packages/optuna/progress_bar.py:47: ExperimentalWarning:\n",
            "\n",
            "Progress bar is experimental (supported from v1.2.0). The interface can change in the future.\n",
            "\n"
          ],
          "name": "stderr"
        },
        {
          "output_type": "display_data",
          "data": {
            "application/vnd.jupyter.widget-view+json": {
              "model_id": "da4a0d7763fd4bfbb645f9869e5eb4bd",
              "version_minor": 0,
              "version_major": 2
            },
            "text/plain": [
              "HBox(children=(FloatProgress(value=0.0, max=200.0), HTML(value='')))"
            ]
          },
          "metadata": {
            "tags": []
          }
        },
        {
          "output_type": "stream",
          "text": [
            "\u001b[32m[I 2021-06-13 21:41:07,562]\u001b[0m Trial 0 finished with value: 0.4698522179498537 and parameters: {'Alpha': 0.14, 'Beta': 0.86, 'Topics': 25}. Best is trial 0 with value: 0.4698522179498537.\u001b[0m\n",
            "\u001b[32m[I 2021-06-13 21:42:44,227]\u001b[0m Trial 1 finished with value: 0.4823710376030164 and parameters: {'Alpha': 0.03, 'Beta': 0.25, 'Topics': 27}. Best is trial 1 with value: 0.4823710376030164.\u001b[0m\n",
            "\u001b[32m[I 2021-06-13 21:44:10,070]\u001b[0m Trial 2 finished with value: 0.5125114477830792 and parameters: {'Alpha': 0.5800000000000001, 'Beta': 0.5700000000000001, 'Topics': 20}. Best is trial 2 with value: 0.5125114477830792.\u001b[0m\n",
            "\u001b[32m[I 2021-06-13 21:45:14,074]\u001b[0m Trial 3 finished with value: 0.44628035801138044 and parameters: {'Alpha': 0.24000000000000002, 'Beta': 0.52, 'Topics': 9}. Best is trial 2 with value: 0.5125114477830792.\u001b[0m\n",
            "\u001b[32m[I 2021-06-13 21:46:33,307]\u001b[0m Trial 4 finished with value: 0.43205053331824694 and parameters: {'Alpha': 0.67, 'Beta': 0.8300000000000001, 'Topics': 27}. Best is trial 2 with value: 0.5125114477830792.\u001b[0m\n",
            "\u001b[32m[I 2021-06-13 21:47:35,494]\u001b[0m Trial 5 finished with value: 0.5130213862666723 and parameters: {'Alpha': 0.16, 'Beta': 0.81, 'Topics': 10}. Best is trial 5 with value: 0.5130213862666723.\u001b[0m\n",
            "\u001b[32m[I 2021-06-13 21:48:35,456]\u001b[0m Trial 6 finished with value: 0.4576366215769906 and parameters: {'Alpha': 0.03, 'Beta': 0.65, 'Topics': 9}. Best is trial 5 with value: 0.5130213862666723.\u001b[0m\n",
            "\u001b[32m[I 2021-06-13 21:49:53,827]\u001b[0m Trial 7 finished with value: 0.46896193192069646 and parameters: {'Alpha': 0.52, 'Beta': 0.37, 'Topics': 15}. Best is trial 5 with value: 0.5130213862666723.\u001b[0m\n",
            "\u001b[32m[I 2021-06-13 21:51:08,619]\u001b[0m Trial 8 finished with value: 0.4521368167490359 and parameters: {'Alpha': 0.6, 'Beta': 0.33, 'Topics': 13}. Best is trial 5 with value: 0.5130213862666723.\u001b[0m\n",
            "\u001b[32m[I 2021-06-13 21:51:51,693]\u001b[0m Trial 9 finished with value: 0.43452685720263984 and parameters: {'Alpha': 0.34, 'Beta': 0.79, 'Topics': 4}. Best is trial 5 with value: 0.5130213862666723.\u001b[0m\n",
            "\u001b[32m[I 2021-06-13 21:52:31,116]\u001b[0m Trial 10 finished with value: 0.40131578153590447 and parameters: {'Alpha': 0.97, 'Beta': 0.11, 'Topics': 3}. Best is trial 5 with value: 0.5130213862666723.\u001b[0m\n",
            "\u001b[32m[I 2021-06-13 21:53:41,607]\u001b[0m Trial 11 finished with value: 0.39809740180840253 and parameters: {'Alpha': 0.78, 'Beta': 0.98, 'Topics': 21}. Best is trial 5 with value: 0.5130213862666723.\u001b[0m\n",
            "\u001b[32m[I 2021-06-13 21:55:08,805]\u001b[0m Trial 12 finished with value: 0.4982515869212801 and parameters: {'Alpha': 0.4, 'Beta': 0.61, 'Topics': 20}. Best is trial 5 with value: 0.5130213862666723.\u001b[0m\n",
            "\u001b[32m[I 2021-06-13 21:56:26,336]\u001b[0m Trial 13 finished with value: 0.44265809833049036 and parameters: {'Alpha': 0.8400000000000001, 'Beta': 0.7000000000000001, 'Topics': 19}. Best is trial 5 with value: 0.5130213862666723.\u001b[0m\n",
            "\u001b[32m[I 2021-06-13 21:57:37,473]\u001b[0m Trial 14 finished with value: 0.4290256203482091 and parameters: {'Alpha': 0.44, 'Beta': 0.5, 'Topics': 11}. Best is trial 5 with value: 0.5130213862666723.\u001b[0m\n",
            "\u001b[32m[I 2021-06-13 21:58:24,791]\u001b[0m Trial 15 finished with value: 0.3760337879412314 and parameters: {'Alpha': 0.26, 'Beta': 0.9400000000000001, 'Topics': 6}. Best is trial 5 with value: 0.5130213862666723.\u001b[0m\n",
            "\u001b[32m[I 2021-06-13 21:59:43,817]\u001b[0m Trial 16 finished with value: 0.4523239245810328 and parameters: {'Alpha': 0.5700000000000001, 'Beta': 0.72, 'Topics': 16}. Best is trial 5 with value: 0.5130213862666723.\u001b[0m\n",
            "\u001b[32m[I 2021-06-13 22:01:07,181]\u001b[0m Trial 17 finished with value: 0.4640475983478964 and parameters: {'Alpha': 0.74, 'Beta': 0.5700000000000001, 'Topics': 23}. Best is trial 5 with value: 0.5130213862666723.\u001b[0m\n",
            "\u001b[32m[I 2021-06-13 22:02:25,373]\u001b[0m Trial 18 finished with value: 0.44899524962435766 and parameters: {'Alpha': 0.17, 'Beta': 0.46, 'Topics': 17}. Best is trial 5 with value: 0.5130213862666723.\u001b[0m\n",
            "\u001b[32m[I 2021-06-13 22:03:46,044]\u001b[0m Trial 19 finished with value: 0.3445269782489518 and parameters: {'Alpha': 0.92, 'Beta': 0.9, 'Topics': 30}. Best is trial 5 with value: 0.5130213862666723.\u001b[0m\n",
            "\u001b[32m[I 2021-06-13 22:04:44,560]\u001b[0m Trial 20 finished with value: 0.46479983857833024 and parameters: {'Alpha': 0.65, 'Beta': 0.76, 'Topics': 7}. Best is trial 5 with value: 0.5130213862666723.\u001b[0m\n",
            "\u001b[32m[I 2021-06-13 22:06:10,050]\u001b[0m Trial 21 finished with value: 0.498872181663243 and parameters: {'Alpha': 0.43, 'Beta': 0.63, 'Topics': 20}. Best is trial 5 with value: 0.5130213862666723.\u001b[0m\n",
            "\u001b[32m[I 2021-06-13 22:07:23,597]\u001b[0m Trial 22 finished with value: 0.5218655294366282 and parameters: {'Alpha': 0.47000000000000003, 'Beta': 0.65, 'Topics': 13}. Best is trial 22 with value: 0.5218655294366282.\u001b[0m\n",
            "\u001b[32m[I 2021-06-13 22:08:37,322]\u001b[0m Trial 23 finished with value: 0.45290792516918416 and parameters: {'Alpha': 0.51, 'Beta': 0.46, 'Topics': 12}. Best is trial 22 with value: 0.5218655294366282.\u001b[0m\n",
            "\u001b[32m[I 2021-06-13 22:09:52,269]\u001b[0m Trial 24 finished with value: 0.501787847753949 and parameters: {'Alpha': 0.3, 'Beta': 0.7100000000000001, 'Topics': 14}. Best is trial 22 with value: 0.5218655294366282.\u001b[0m\n",
            "\u001b[32m[I 2021-06-13 22:11:11,575]\u001b[0m Trial 25 finished with value: 0.5102429212701031 and parameters: {'Alpha': 0.16, 'Beta': 0.56, 'Topics': 18}. Best is trial 22 with value: 0.5218655294366282.\u001b[0m\n",
            "\u001b[32m[I 2021-06-13 22:12:18,038]\u001b[0m Trial 26 finished with value: 0.4842010156709075 and parameters: {'Alpha': 0.37, 'Beta': 0.67, 'Topics': 10}. Best is trial 22 with value: 0.5218655294366282.\u001b[0m\n",
            "\u001b[32m[I 2021-06-13 22:13:07,210]\u001b[0m Trial 27 finished with value: 0.4405173236738258 and parameters: {'Alpha': 0.68, 'Beta': 0.4, 'Topics': 6}. Best is trial 22 with value: 0.5218655294366282.\u001b[0m\n",
            "\u001b[32m[I 2021-06-13 22:14:26,959]\u001b[0m Trial 28 finished with value: 0.48080887628994345 and parameters: {'Alpha': 0.46, 'Beta': 0.78, 'Topics': 14}. Best is trial 22 with value: 0.5218655294366282.\u001b[0m\n",
            "\u001b[32m[I 2021-06-13 22:15:42,205]\u001b[0m Trial 29 finished with value: 0.47005855463290597 and parameters: {'Alpha': 0.5800000000000001, 'Beta': 0.87, 'Topics': 23}. Best is trial 22 with value: 0.5218655294366282.\u001b[0m\n",
            "\u001b[32m[I 2021-06-13 22:16:59,674]\u001b[0m Trial 30 finished with value: 0.44903149239930795 and parameters: {'Alpha': 0.06999999999999999, 'Beta': 0.28, 'Topics': 16}. Best is trial 22 with value: 0.5218655294366282.\u001b[0m\n",
            "\u001b[32m[I 2021-06-13 22:18:18,924]\u001b[0m Trial 31 finished with value: 0.5095088793291623 and parameters: {'Alpha': 0.09999999999999999, 'Beta': 0.55, 'Topics': 18}. Best is trial 22 with value: 0.5218655294366282.\u001b[0m\n",
            "\u001b[32m[I 2021-06-13 22:19:41,912]\u001b[0m Trial 32 finished with value: 0.45470344156277387 and parameters: {'Alpha': 0.19, 'Beta': 0.61, 'Topics': 22}. Best is trial 22 with value: 0.5218655294366282.\u001b[0m\n",
            "\u001b[32m[I 2021-06-13 22:21:00,044]\u001b[0m Trial 33 finished with value: 0.5072999011650773 and parameters: {'Alpha': 0.01, 'Beta': 0.56, 'Topics': 18}. Best is trial 22 with value: 0.5218655294366282.\u001b[0m\n",
            "\u001b[32m[I 2021-06-13 22:22:28,135]\u001b[0m Trial 34 finished with value: 0.49482752281583564 and parameters: {'Alpha': 0.21000000000000002, 'Beta': 0.45, 'Topics': 25}. Best is trial 22 with value: 0.5218655294366282.\u001b[0m\n",
            "\u001b[32m[I 2021-06-13 22:23:36,622]\u001b[0m Trial 35 finished with value: 0.48054815855691607 and parameters: {'Alpha': 0.15000000000000002, 'Beta': 0.8400000000000001, 'Topics': 12}. Best is trial 22 with value: 0.5218655294366282.\u001b[0m\n",
            "\u001b[32m[I 2021-06-13 22:24:38,748]\u001b[0m Trial 36 finished with value: 0.4319782329260091 and parameters: {'Alpha': 0.28, 'Beta': 0.76, 'Topics': 8}. Best is trial 22 with value: 0.5218655294366282.\u001b[0m\n",
            "\u001b[32m[I 2021-06-13 22:25:56,536]\u001b[0m Trial 37 finished with value: 0.5157054096420252 and parameters: {'Alpha': 0.09, 'Beta': 0.53, 'Topics': 15}. Best is trial 22 with value: 0.5218655294366282.\u001b[0m\n",
            "\u001b[32m[I 2021-06-13 22:27:11,551]\u001b[0m Trial 38 finished with value: 0.42860273544195465 and parameters: {'Alpha': 0.06999999999999999, 'Beta': 0.4, 'Topics': 14}. Best is trial 22 with value: 0.5218655294366282.\u001b[0m\n",
            "\u001b[32m[I 2021-06-13 22:28:18,561]\u001b[0m Trial 39 finished with value: 0.48612432139505246 and parameters: {'Alpha': 0.32, 'Beta': 0.67, 'Topics': 10}. Best is trial 22 with value: 0.5218655294366282.\u001b[0m\n",
            "\u001b[32m[I 2021-06-13 22:29:40,054]\u001b[0m Trial 40 finished with value: 0.5172201643493645 and parameters: {'Alpha': 0.54, 'Beta': 0.51, 'Topics': 15}. Best is trial 22 with value: 0.5218655294366282.\u001b[0m\n",
            "\u001b[32m[I 2021-06-13 22:30:59,598]\u001b[0m Trial 41 finished with value: 0.5077807852481343 and parameters: {'Alpha': 0.63, 'Beta': 0.5, 'Topics': 15}. Best is trial 22 with value: 0.5218655294366282.\u001b[0m\n",
            "\u001b[32m[I 2021-06-13 22:32:16,661]\u001b[0m Trial 42 finished with value: 0.46608012743166066 and parameters: {'Alpha': 0.55, 'Beta': 0.34, 'Topics': 12}. Best is trial 22 with value: 0.5218655294366282.\u001b[0m\n",
            "\u001b[32m[I 2021-06-13 22:33:31,173]\u001b[0m Trial 43 finished with value: 0.5253502208423165 and parameters: {'Alpha': 0.53, 'Beta': 0.6, 'Topics': 13}. Best is trial 43 with value: 0.5253502208423165.\u001b[0m\n",
            "\u001b[32m[I 2021-06-13 22:34:45,924]\u001b[0m Trial 44 finished with value: 0.523582742201423 and parameters: {'Alpha': 0.51, 'Beta': 0.63, 'Topics': 13}. Best is trial 43 with value: 0.5253502208423165.\u001b[0m\n",
            "\u001b[32m[I 2021-06-13 22:36:00,119]\u001b[0m Trial 45 finished with value: 0.5258442668163738 and parameters: {'Alpha': 0.5, 'Beta': 0.6, 'Topics': 13}. Best is trial 45 with value: 0.5258442668163738.\u001b[0m\n",
            "\u001b[32m[I 2021-06-13 22:37:14,362]\u001b[0m Trial 46 finished with value: 0.5197096681704839 and parameters: {'Alpha': 0.48000000000000004, 'Beta': 0.61, 'Topics': 13}. Best is trial 45 with value: 0.5258442668163738.\u001b[0m\n",
            "\u001b[32m[I 2021-06-13 22:38:25,164]\u001b[0m Trial 47 finished with value: 0.5197096681704839 and parameters: {'Alpha': 0.38, 'Beta': 0.61, 'Topics': 13}. Best is trial 45 with value: 0.5258442668163738.\u001b[0m\n",
            "\u001b[32m[I 2021-06-13 22:39:29,481]\u001b[0m Trial 48 finished with value: 0.4323106676823192 and parameters: {'Alpha': 0.38, 'Beta': 0.72, 'Topics': 9}. Best is trial 45 with value: 0.5258442668163738.\u001b[0m\n",
            "\u001b[32m[I 2021-06-13 22:40:40,780]\u001b[0m Trial 49 finished with value: 0.48545226214708365 and parameters: {'Alpha': 0.5, 'Beta': 0.65, 'Topics': 11}. Best is trial 45 with value: 0.5258442668163738.\u001b[0m\n",
            "\u001b[32m[I 2021-06-13 22:41:54,652]\u001b[0m Trial 50 finished with value: 0.519778771065034 and parameters: {'Alpha': 0.42000000000000004, 'Beta': 0.59, 'Topics': 13}. Best is trial 45 with value: 0.5258442668163738.\u001b[0m\n",
            "\u001b[32m[I 2021-06-13 22:43:07,715]\u001b[0m Trial 51 finished with value: 0.5170510979306682 and parameters: {'Alpha': 0.41000000000000003, 'Beta': 0.62, 'Topics': 13}. Best is trial 45 with value: 0.5258442668163738.\u001b[0m\n",
            "\u001b[32m[I 2021-06-13 22:44:18,444]\u001b[0m Trial 52 finished with value: 0.524375178248986 and parameters: {'Alpha': 0.34, 'Beta': 0.59, 'Topics': 12}. Best is trial 45 with value: 0.5258442668163738.\u001b[0m\n",
            "\u001b[32m[I 2021-06-13 22:45:27,989]\u001b[0m Trial 53 finished with value: 0.49739548851154347 and parameters: {'Alpha': 0.44, 'Beta': 0.6900000000000001, 'Topics': 11}. Best is trial 45 with value: 0.5258442668163738.\u001b[0m\n",
            "\u001b[32m[I 2021-06-13 22:46:36,951]\u001b[0m Trial 54 finished with value: 0.45220360517788816 and parameters: {'Alpha': 0.49, 'Beta': 0.5800000000000001, 'Topics': 9}. Best is trial 45 with value: 0.5258442668163738.\u001b[0m\n",
            "\u001b[32m[I 2021-06-13 22:47:47,583]\u001b[0m Trial 55 finished with value: 0.4258444398132998 and parameters: {'Alpha': 0.34, 'Beta': 0.47000000000000003, 'Topics': 11}. Best is trial 45 with value: 0.5258442668163738.\u001b[0m\n",
            "\u001b[32m[I 2021-06-13 22:49:13,896]\u001b[0m Trial 56 finished with value: 0.4629940825540151 and parameters: {'Alpha': 0.6, 'Beta': 0.75, 'Topics': 17}. Best is trial 45 with value: 0.5258442668163738.\u001b[0m\n",
            "\u001b[32m[I 2021-06-13 22:50:29,401]\u001b[0m Trial 57 finished with value: 0.44646433062712304 and parameters: {'Alpha': 0.6900000000000001, 'Beta': 0.05, 'Topics': 14}. Best is trial 45 with value: 0.5258442668163738.\u001b[0m\n",
            "\u001b[32m[I 2021-06-13 22:51:41,085]\u001b[0m Trial 58 finished with value: 0.5419743976008068 and parameters: {'Alpha': 0.52, 'Beta': 0.66, 'Topics': 12}. Best is trial 58 with value: 0.5419743976008068.\u001b[0m\n",
            "\u001b[32m[I 2021-06-13 22:52:52,751]\u001b[0m Trial 59 finished with value: 0.5401555401012103 and parameters: {'Alpha': 0.53, 'Beta': 0.65, 'Topics': 12}. Best is trial 58 with value: 0.5419743976008068.\u001b[0m\n",
            "\u001b[32m[I 2021-06-13 22:53:58,008]\u001b[0m Trial 60 finished with value: 0.4360396970223409 and parameters: {'Alpha': 0.53, 'Beta': 0.73, 'Topics': 8}. Best is trial 58 with value: 0.5419743976008068.\u001b[0m\n",
            "\u001b[32m[I 2021-06-13 22:55:06,702]\u001b[0m Trial 61 finished with value: 0.5425616829113199 and parameters: {'Alpha': 0.63, 'Beta': 0.66, 'Topics': 12}. Best is trial 61 with value: 0.5425616829113199.\u001b[0m\n",
            "\u001b[32m[I 2021-06-13 22:56:12,796]\u001b[0m Trial 62 finished with value: 0.49463431767583027 and parameters: {'Alpha': 0.62, 'Beta': 0.68, 'Topics': 10}. Best is trial 61 with value: 0.5425616829113199.\u001b[0m\n",
            "\u001b[32m[I 2021-06-13 22:57:23,305]\u001b[0m Trial 63 finished with value: 0.5422667465472705 and parameters: {'Alpha': 0.73, 'Beta': 0.64, 'Topics': 12}. Best is trial 61 with value: 0.5425616829113199.\u001b[0m\n",
            "\u001b[32m[I 2021-06-13 22:58:31,462]\u001b[0m Trial 64 finished with value: 0.4723174698269472 and parameters: {'Alpha': 0.78, 'Beta': 0.53, 'Topics': 12}. Best is trial 61 with value: 0.5425616829113199.\u001b[0m\n",
            "\u001b[32m[I 2021-06-13 22:59:35,882]\u001b[0m Trial 65 finished with value: 0.4376410584182087 and parameters: {'Alpha': 0.75, 'Beta': 0.8, 'Topics': 8}. Best is trial 61 with value: 0.5425616829113199.\u001b[0m\n",
            "\u001b[32m[I 2021-06-13 23:00:44,856]\u001b[0m Trial 66 finished with value: 0.4885764507792992 and parameters: {'Alpha': 0.56, 'Beta': 0.65, 'Topics': 11}. Best is trial 61 with value: 0.5425616829113199.\u001b[0m\n",
            "\u001b[32m[I 2021-06-13 23:01:51,173]\u001b[0m Trial 67 finished with value: 0.44790174791728327 and parameters: {'Alpha': 0.7100000000000001, 'Beta': 0.5800000000000001, 'Topics': 10}. Best is trial 61 with value: 0.5425616829113199.\u001b[0m\n",
            "\u001b[32m[I 2021-06-13 23:03:10,760]\u001b[0m Trial 68 finished with value: 0.46268790446516667 and parameters: {'Alpha': 0.65, 'Beta': 0.7100000000000001, 'Topics': 16}. Best is trial 61 with value: 0.5425616829113199.\u001b[0m\n",
            "\u001b[32m[I 2021-06-13 23:04:00,256]\u001b[0m Trial 69 finished with value: 0.4620717508074665 and parameters: {'Alpha': 0.8300000000000001, 'Beta': 0.54, 'Topics': 6}. Best is trial 61 with value: 0.5425616829113199.\u001b[0m\n",
            "\u001b[32m[I 2021-06-13 23:05:20,181]\u001b[0m Trial 70 finished with value: 0.4913352866158677 and parameters: {'Alpha': 0.59, 'Beta': 0.65, 'Topics': 14}. Best is trial 61 with value: 0.5425616829113199.\u001b[0m\n",
            "\u001b[32m[I 2021-06-13 23:06:34,055]\u001b[0m Trial 71 finished with value: 0.5302545268276198 and parameters: {'Alpha': 0.52, 'Beta': 0.63, 'Topics': 12}. Best is trial 61 with value: 0.5425616829113199.\u001b[0m\n",
            "\u001b[32m[I 2021-06-13 23:07:49,052]\u001b[0m Trial 72 finished with value: 0.49591721171600694 and parameters: {'Alpha': 0.47000000000000003, 'Beta': 0.68, 'Topics': 11}. Best is trial 61 with value: 0.5425616829113199.\u001b[0m\n",
            "\u001b[32m[I 2021-06-13 23:09:02,764]\u001b[0m Trial 73 finished with value: 0.5183035649512162 and parameters: {'Alpha': 0.62, 'Beta': 0.5800000000000001, 'Topics': 12}. Best is trial 61 with value: 0.5425616829113199.\u001b[0m\n",
            "\u001b[32m[I 2021-06-13 23:10:17,047]\u001b[0m Trial 74 finished with value: 0.5406311617295704 and parameters: {'Alpha': 0.5700000000000001, 'Beta': 0.74, 'Topics': 12}. Best is trial 61 with value: 0.5425616829113199.\u001b[0m\n",
            "\u001b[32m[I 2021-06-13 23:11:35,585]\u001b[0m Trial 75 finished with value: 0.4744843815027741 and parameters: {'Alpha': 0.5700000000000001, 'Beta': 0.8200000000000001, 'Topics': 14}. Best is trial 61 with value: 0.5425616829113199.\u001b[0m\n",
            "\u001b[32m[I 2021-06-13 23:12:42,241]\u001b[0m Trial 76 finished with value: 0.4409147487407976 and parameters: {'Alpha': 0.65, 'Beta': 0.75, 'Topics': 9}. Best is trial 61 with value: 0.5425616829113199.\u001b[0m\n",
            "\u001b[32m[I 2021-06-13 23:14:04,536]\u001b[0m Trial 77 finished with value: 0.524232619004465 and parameters: {'Alpha': 0.51, 'Beta': 0.7000000000000001, 'Topics': 15}. Best is trial 61 with value: 0.5425616829113199.\u001b[0m\n",
            "\u001b[32m[I 2021-06-13 23:15:13,379]\u001b[0m Trial 78 finished with value: 0.512981436946705 and parameters: {'Alpha': 0.73, 'Beta': 0.78, 'Topics': 10}. Best is trial 61 with value: 0.5425616829113199.\u001b[0m\n",
            "\u001b[32m[I 2021-06-13 23:16:43,028]\u001b[0m Trial 79 finished with value: 0.45474473500660906 and parameters: {'Alpha': 0.53, 'Beta': 0.73, 'Topics': 17}. Best is trial 61 with value: 0.5425616829113199.\u001b[0m\n",
            "\u001b[32m[I 2021-06-13 23:17:58,219]\u001b[0m Trial 80 finished with value: 0.5325757755092664 and parameters: {'Alpha': 0.45, 'Beta': 0.65, 'Topics': 12}. Best is trial 61 with value: 0.5425616829113199.\u001b[0m\n",
            "\u001b[32m[I 2021-06-13 23:19:11,475]\u001b[0m Trial 81 finished with value: 0.5285770663648043 and parameters: {'Alpha': 0.46, 'Beta': 0.64, 'Topics': 12}. Best is trial 61 with value: 0.5425616829113199.\u001b[0m\n"
          ],
          "name": "stdout"
        }
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "4UCPfF80qLjc"
      },
      "source": [
        "#### save\n",
        "\n",
        "can potentially re-load later with the following code:\n",
        "\n",
        "```\n",
        "study = joblib.load(\"study.pkl\")\n",
        "print(\"Best trial until now:\")\n",
        "print(\" Value: \", study.best_trial.value)\n",
        "```\n",
        "\n",
        "-<font color=\"yellow\"> need to eventually incorporate with the above directory checking etc. </font>\n",
        "\n"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "Ykx76G4Ypbe3"
      },
      "source": [
        "import joblib\n",
        "study_name = dataset_name + \"_optuna_study.pkl\"\n",
        "joblib.dump(study, \"study.pkl\")\n",
        "print(\"saved optuna study to directory {} - \", datetime.now())"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "9KAlfiZPmHsV"
      },
      "source": [
        "*Note - Typical Output DF structure (for later usage)*\n",
        "\n",
        "\n",
        "---\n",
        "\n",
        " Num   | Column  | Non-Null Count  |Dtype         \n",
        "----- | --------- | -------- | ----      \n",
        " 0  | number       |      1000 non-null |  Int64         \n",
        " 1  | Value    |      1000 non-null   |  float64       \n",
        " 2   |  datetime_start   |    1000 non-null    | datetime64[ns]\n",
        " 3   |  datetime_complete   | 1000 non-null    | datetime64[ns]\n",
        " 4  |   duration         |    1000 non-null  |   Int64         \n",
        " 5   |  Alpha        |        1000 non-null   |  float64       \n",
        " 6   |  Beta       |          1000 non-null   |  float64       \n",
        " 7  |   Topics       |        1000 non-null   |  Int64         \n",
        " 8  |   state        |        1000 non-null  |   string  \n",
        "\n",
        "*Note: Alpha/Beta/Topics will typically have \"parameter\" in front of them or something like that*\n",
        "\n",
        "---"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "hOZUm9jvXtD2"
      },
      "source": [
        "### Graphs - Optuna\n",
        "\n",
        "See [this tutorial link](https://optuna.readthedocs.io/en/stable/tutorial/10_key_features/005_visualization.html) for details"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "EZnAefmFV-Ww"
      },
      "source": [
        "from optuna.visualization import plot_contour\n",
        "from optuna.visualization import plot_edf\n",
        "from optuna.visualization import plot_intermediate_values\n",
        "from optuna.visualization import plot_optimization_history\n",
        "from optuna.visualization import plot_parallel_coordinate\n",
        "from optuna.visualization import plot_param_importances\n",
        "from optuna.visualization import plot_slice\n"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "c3vV27yhmal-"
      },
      "source": [
        "spacer = \"\\n\\n\"\n",
        "try:\n",
        "    plot_optimization_history(study)\n",
        "    print(spacer)\n",
        "    plot_param_importances(study)\n",
        "    print(spacer)\n",
        "    plot_contour(study)\n",
        "    print(spacer)\n",
        "    plot_edf(study)\n",
        "    print(spacer)\n",
        "except:\n",
        "    print(\"unable to plot - check whether study object exists - \", \n",
        "          datetime.now())\n"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "7EGNXgSsv0X3"
      },
      "source": [
        "## Numerically Compute Best Results\n",
        "\n",
        "\n",
        "if weird stuff happens you can manually read in data in a scratch cell from the excel or .ftr \n",
        "\n",
        "```\n",
        "best_TM_df = pd.read_excel(os.path.join(out_directory,\"heather_hp.xlsx\"))\n",
        "\n",
        "best_TM_df.info()\n",
        "```\n",
        "\n",
        "---"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "-BpCENc_v2q3"
      },
      "source": [
        "%%capture\n",
        "!pip install -U natsort\n",
        "!pip install -U plotly\n",
        "from natsort import natsorted\n",
        "from natsort import natsort_keygen"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "_xZdHy8ov5Fq"
      },
      "source": [
        "opt_df = best_TM_df.sort_values(key=natsort_keygen(), by=\"value\", \n",
        "                                ascending=False, axis=0, ignore_index=True)\n",
        "\n",
        "print(\"the best parameters are as follows: \\n\")\n",
        "pp.pprint(opt_df.loc[0,:])\n"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "7csEvLnlv9LL"
      },
      "source": [
        "%load_ext google.colab.data_table\n",
        "from google.colab import data_table\n",
        "\n",
        "print(\"the top 20 parameter sets are as follows: \\n\")\n",
        "# pp.pprint(opt_df.loc[:10,:])\n",
        "\n",
        "\n",
        "opt_alpha = opt_df.loc[0,\"params_Alpha\"]\n",
        "opt_beta = opt_df.loc[0,\"params_Beta\"]\n",
        "opt_topics = opt_df.loc[0,\"params_Topics\"]\n",
        "\n",
        "data_table.DataTable(opt_df, include_index=True, num_rows_per_page=20, min_width=\"40\")\n",
        "# opt_df.head()"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "a0brGrV9DZ_q"
      },
      "source": [
        "---\n",
        "\n",
        "### Validate Training on 100% Corpus"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "t2Hy2ewL3YxQ"
      },
      "source": [
        "# validate on 100% corpus - from earlier now use corpus_sets[1]\n",
        "\n",
        "val_df = opt_df.loc[:20,:].copy()\n",
        "val_df[\"Coherence_ValSet\"] = 0.000001\n",
        "val_df = val_df.convert_dtypes()\n",
        "pp.pprint(val_df.info())\n",
        "\n",
        "print(\"starting validation on top 20\", datetime.now())\n",
        "for index, row in val_df.iterrows():\n",
        "    this_a = row[\"params_Alpha\"]\n",
        "    this_b = row[\"params_Beta\"]\n",
        "    this_k = row[\"params_Topics\"]\n",
        "    this_coherence = compute_coherence_values(corpus_sets[1], \n",
        "                                    id2word, this_k, this_a, this_b)\n",
        "    val_df.loc[index, \"Coherence_ValSet\"] = this_coherence\n",
        "print(\"finished validation on top 20\", datetime.now(),\"\\n\")\n",
        "\n",
        "val_df[\"Train_v_Test_Diff\"] = val_df[\"value\"] - val_df[\"Coherence_ValSet\"]\n",
        "labels_fig_val = {\n",
        "    \"Coherence Score - Differential\":\"Train_v_Test_Diff\",\n",
        "    \"Top HyperParameter Config #\":\"index\"\n",
        "}\n",
        "fig_validation = px.line(val_df, x=val_df.index, y='Train_v_Test_Diff',\n",
        "                         title='Validation Differential [Train - Test] for top 20 scores',\n",
        "                         labels=labels_fig_val)\n",
        "fig_validation.show()"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "ZliuAdNZcCME"
      },
      "source": [
        "** **\n",
        "# Optimized Model\n",
        "** **\n",
        "\n",
        "Based on external evaluation (Code to be added from Excel based analysis), let's train the final model with parameters yielding highest coherence score"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "_JHGqjN0prBb"
      },
      "source": [
        "## create"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "PTyI0lcpcCME"
      },
      "source": [
        "%%capture\n",
        "# using recommended parameters for these 3\n",
        "chunksize = 2000\n",
        "passes = 20\n",
        "iterations = 400 \n",
        "\n",
        "# create model\n",
        "optimized_model = gensim.models.LdaMulticore(corpus=corpus,\n",
        "                                        id2word=id2word,\n",
        "                                        num_topics=opt_topics, \n",
        "                                        random_state=42,\n",
        "                                        iterations=iterations,\n",
        "                                        chunksize=chunksize,\n",
        "                                        passes=passes,\n",
        "                                        alpha=opt_alpha,\n",
        "                                        eta=opt_beta)\n",
        "\n",
        "fname = \"optimized_model_for_\" + dataset_name\n",
        "optimized_model.save(fname, ignore=('state', 'dispatcher'))\n"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "5DU62Jz7cCME"
      },
      "source": [
        "from pprint import pprint\n",
        "\n",
        "pprint(optimized_model.print_topics())\n",
        "doc_lda = optimized_model[corpus]"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "st0OYnpscCME"
      },
      "source": [
        "** **\n",
        "## Visualize Results\n",
        "** **"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "hAzX_AW_-P1S"
      },
      "source": [
        "%%capture\n",
        "!pip install -U pyLDAvis\n"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "sWU91vGCcCMF"
      },
      "source": [
        "import pyLDAvis\n",
        "import pyLDAvis.gensim_models as gensimvis\n",
        "import pickle\n",
        "import pandas as pd\n",
        "from datetime import datetime\n",
        "\n",
        "# Visualize the topics\n",
        "pyLDAvis.enable_notebook()\n",
        "\n",
        "LDAvis_prepared = gensimvis.prepare(optimized_model, corpus, id2word)\n",
        "# ^ apparently time consuming \n",
        "LDA_visname = \"pyLDAvis for \" + dataset_name + \".html\"\n",
        "# pyLDAvis.save_html(LDAvis_prepared, os.path.join(out_directory, LDA_visname))\n",
        "# files.download(os.path.join(out_directory, LDA_visname))\n",
        "print(\"\\n\\ncompleted vis and saved at: \", datetime.now())\n",
        "LDAvis_prepared"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "klQpzsEISfZo"
      },
      "source": [
        "## basic interpretation\n",
        "\n",
        "\n",
        "define function"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "Jl_HG4oeShBb"
      },
      "source": [
        "def format_topics_sentences(ldamodel=None, corpus=corpus, texts=data):\n",
        "    # Init output\n",
        "    sent_topics_df = pd.DataFrame()\n",
        "\n",
        "    # Get main topic in each document\n",
        "    for i, row_list in enumerate(ldamodel[corpus]):\n",
        "        row = row_list[0] if ldamodel.per_word_topics else row_list            \n",
        "        # print(row)\n",
        "        row = sorted(row, key=lambda x: (x[1]), reverse=True)\n",
        "        # Get the Dominant topic, Perc Contribution and Keywords for each document\n",
        "        for j, (topic_num, prop_topic) in enumerate(row):\n",
        "            if j == 0:  # => dominant topic\n",
        "                wp = ldamodel.show_topic(topic_num)\n",
        "                topic_keywords = \", \".join([word for word, prop in wp])\n",
        "                sent_topics_df = sent_topics_df.append(pd.Series([int(topic_num), round(prop_topic,4), topic_keywords]), ignore_index=True)\n",
        "            else:\n",
        "                break\n",
        "    sent_topics_df.columns = ['Dominant_Topic', 'Perc_Contribution', 'Topic_Keywords']\n",
        "\n",
        "    # Add original text to the end of the output\n",
        "    contents = pd.Series(texts)\n",
        "    sent_topics_df = pd.concat([sent_topics_df, contents], axis=1)\n",
        "    return(sent_topics_df)"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "5t9r95QvUChb"
      },
      "source": [
        "generate things"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "Zz0sEYdnSlKv"
      },
      "source": [
        "\n",
        "# ------------------------------------------------------------------------\n",
        "# Part I: for each document, assign a topic to it\n",
        "# ------------------------------------------------------------------------\n",
        "# run function w variables defined above\n",
        "topic_sen_auto = format_topics_sentences(ldamodel=optimized_model, corpus=corpus, \n",
        "                                                  texts=id2word)\n",
        "num_input_docs = len(papers)\n",
        "# Format\n",
        "auto_dom_df = topic_sen_auto.reset_index()\n",
        "auto_dom_df.columns = ['Doc_Num.', 'Dominant_Topic', 'Topic_Perc_Contrib', 'Keywords', 'Text']\n",
        "auto_dom_df['Doc_Name'] = papers[\"doc name\"]\n",
        "# print(\"\\nPart I: for each document, assign a topic to it\")\n",
        "# pp.pprint(auto_dom_df.loc[:, ['Doc_Num.', 'Doc_Name', 'Dominant_Topic']].head(10))\n",
        "\n",
        "# ------------------------------------------------------------------------\n",
        "# Part II: for each topic, show representative words\n",
        "# ------------------------------------------------------------------------\n",
        "auto_dom_df_mallet = pd.DataFrame()\n",
        "topic_sen_auto_grpd = topic_sen_auto.groupby('Dominant_Topic')\n",
        "\n",
        "for i, grp in topic_sen_auto_grpd:\n",
        "    auto_dom_df_mallet = pd.concat([auto_dom_df_mallet, \n",
        "                                             grp.sort_values(['Perc_Contribution'], ascending=False).head(1)], \n",
        "                                            axis=0)\n",
        "\n",
        "# Reset Index    \n",
        "auto_dom_df_mallet.reset_index(drop=True, inplace=True)\n",
        "\n",
        "# Format\n",
        "auto_dom_df_mallet.columns = ['Topic_Num', \"Topic_Perc_Contrib\", \"Keywords\", \"Representative_Text\"]\n",
        "# auto_dom_df_mallet[\"num_rep_words\"] = auto_dom_df_mallet[\"Representative_Text\"].apply(len)\n",
        "# Show\n",
        "# print(\"\\nPart II: for each topic, show representative words\")\n",
        "# pp.pprint(auto_dom_df_mallet.loc[:, ['Topic_Num', \"Topic_Perc_Contrib\", \"Keywords\"]].head(10))\n",
        "\n",
        "print(\"completed basic analytics table gen at \", datetime.now())"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "RUdn0Cd4dqdZ"
      },
      "source": [
        "Part I: for each document, assign a topic to it"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "lm0T_K87dqia"
      },
      "source": [
        "from google.colab import data_table\n",
        "data_table.DataTable(auto_dom_df.loc[:len(papers), ['Doc_Num.', 'Doc_Name', 'Dominant_Topic']], \n",
        "                     include_index=False, num_rows_per_page=20, min_width=\"40\")\n",
        "# auto_dom_df.loc[:len(papers), ['Doc_Num.', 'Doc_Name', 'Dominant_Topic']].head()"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "OOzi6oTCdkAl"
      },
      "source": [
        "Part II: for each topic, show representative words"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "jZUlEiQVdiI8"
      },
      "source": [
        "data_table.DataTable(auto_dom_df_mallet.loc[:, ['Topic_Num', \"Topic_Perc_Contrib\", \"Keywords\"]], \n",
        "                     include_index=True, num_rows_per_page=20, min_width=\"40\")"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "-OGk3Dy9eYFe"
      },
      "source": [
        "### Download Data as Excel"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "VSbZf7zWdcA5"
      },
      "source": [
        "# part 1\n",
        "doc_cat_name = dataset_name + \"_assigned_LDA_topics_to_docs.xlsx\"\n",
        "# don't export more rows than you have docs\n",
        "auto_dom_df.iloc[:len(papers),:].to_excel(os.path.join(out_directory, doc_cat_name), index=False)\n",
        "files.download(os.path.join(out_directory, doc_cat_name))\n",
        "# part 2\n",
        "export_name_auto = dataset_name + \"_auto-optimized LDA topics.xlsx\"\n",
        "auto_dom_df_mallet.to_excel(os.path.join(out_directory, export_name_auto), index=False)\n",
        "files.download(os.path.join(out_directory, export_name_auto))"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "URFd_5IoswNM"
      },
      "source": [
        "---\n",
        "## Wordcloud Viz"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "9c4RI4TxtOtP"
      },
      "source": [
        "get string from gensim. get words out of string. create weighted wordcloud for each topic. Merge all wordclouds to one image. \n",
        "\n",
        "* geeksforgeeks on string extraction [here](https://www.geeksforgeeks.org/python-extract-words-from-given-string/)\n",
        "* function from stack overflow [here](https://stackoverflow.com/questions/7633274/extracting-words-from-a-string-removing-punctuation-and-returning-a-list-with-s)\n",
        "\n",
        "\n",
        "cool stuff on strings [here](https://jakevdp.github.io/WhirlwindTourOfPython/14-strings-and-regular-expressions.html)\n",
        "\n",
        "\n",
        "\n",
        "\n",
        "Character\t| Description\t|\tCharacter\t|Description \n",
        "-------- | ------ | ------- | -----\n",
        "\"\\d\"\t| Match any digit\t|\t\"\\D\" |\tMatch any non-digit\n",
        "\"\\s\" |\tMatch any whitespace\t|\t\"\\S\"\t|Match any non-whitespace\n",
        "\"\\w\" |\tMatch any alphanumeric char\t |\t\"\\W\"\t|Match any non-alphanumeric char"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "0uKTA0avs0Y1"
      },
      "source": [
        "%%capture\n",
        "import re\n",
        "def getWords(text):\n",
        "    # replace \\w with [A-Za-z] for just letters, or [A-Za-z'] to include contractions\n",
        "    return re.compile('[A-Za-z\\']+').findall(text)\n",
        "def getNumbers(text):\n",
        "    # replace \\w with [A-Za-z] for just letters, or [A-Za-z'] to include contractions\n",
        "    return re.compile('[\\d]+').findall(text)\n",
        "\n",
        "topics = optimized_model.show_topics(num_topics=opt_topics, num_words=15)\n",
        "words_per_topic = []\n",
        "numbers_per_topic = []\n",
        "for top in topics:\n",
        "    # print(type(top[1]))\n",
        "    topic_words = getWords(top[1])\n",
        "    topic_nums = getNumbers(top[1])\n",
        "    topic_nums_conv = []\n",
        "    for item in topic_nums:\n",
        "        topic_nums_conv.append(int(item))\n",
        "        # topic_nums_conv.append(item.strip('0'))\n",
        "    words_per_topic.append(topic_words)\n",
        "    numbers_per_topic.append(topic_nums_conv)\n",
        "\n",
        "# remove extra zeros \n",
        "\n",
        "topic_numbers = []\n",
        "for topic in numbers_per_topic:\n",
        "    heavy_weights = []\n",
        "    for weight in topic:\n",
        "        if int(weight) > 0:\n",
        "            heavy_weights.append(weight)\n",
        "    topic_numbers.append(heavy_weights)\n",
        "\n"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "vfKOji3nQXGd"
      },
      "source": [
        "### print out topic model \n",
        "\n",
        "manual prinout"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "zBEoR-3dQVw-"
      },
      "source": [
        "%%capture\n",
        "\n",
        "# remove capture if you want to view and re-run cell\n",
        "print(\"\\n\\ntopics from optimal model: \")\n",
        "pp.pprint(topics)\n",
        "print(\"\\n\\nwords in each topic from optimal model: \")\n",
        "print(words_per_topic)\n",
        "print(\"\\n\\nweights for these words in each topic from optimal model: \")\n",
        "print(numbers_per_topic)\n"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "fRl4wqy71nwa"
      },
      "source": [
        "### prepare data & functions for wordclouds\n",
        "\n",
        "create dataframe with relative importance"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "5PB5WqE81mvZ"
      },
      "source": [
        "topic_store=[]\n",
        "word_store=[]\n",
        "importance_store=[]\n",
        "for topic_words in words_per_topic:\n",
        "    # because iterating through 'words_per_topic' the topic is a list\n",
        "    # contain words in the topic\n",
        "\n",
        "    # this is the topic number\n",
        "    location = words_per_topic.index(topic_words)\n",
        "    topic_word_weights = topic_numbers[location]\n",
        "\n",
        "    if len(topic_word_weights) > 0:\n",
        "        # some of the weights are NOT null\n",
        "\n",
        "        len_diff = len(topic_word_weights) - len(topic_words)\n",
        "\n",
        "        if len_diff < 0:\n",
        "            # apparently this happens.. no idea why\n",
        "            num_to_add = abs(len_diff)\n",
        "\n",
        "            for adder in range(num_to_add):\n",
        "                # add weight of one till matches\n",
        "                topic_word_weights.append(1)\n",
        "\n",
        "\n",
        "        for i in range(len(topic_words)):\n",
        "            topic_store.append(location)\n",
        "            word_store.append(topic_words[i])\n",
        "            importance_store.append(topic_word_weights[i])\n",
        "    else:\n",
        "        # all of the weights are null. interpret as they are all equal\n",
        "        for i in range(len(topic_words)):\n",
        "            topic_store.append(location)\n",
        "            word_store.append(topic_words[i])\n",
        "            importance_store.append(1)\n",
        "\n",
        "tm_dict = {\n",
        "    \"topic_ID\":topic_store,\n",
        "    \"word\":word_store,\n",
        "    \"weight\":importance_store\n",
        "}\n",
        "\n",
        "tm_df = pd.DataFrame(tm_dict).convert_dtypes()\n",
        "pp.pprint(tm_df.info())"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "x9zIcbKt7iBQ"
      },
      "source": [
        "\n",
        "\n",
        "---\n",
        "\n",
        "\n",
        "**function definitions**\n",
        "\n",
        "documentation from MatPlotLib on how to do things:\n",
        "* [link](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.html) to matplotlib full documentation\n",
        "* [on subplots](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.subplots.html#matplotlib.pyplot.subplots)\n",
        "* [saving images](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.savefig.html#matplotlib.pyplot.savefig)"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "Vj3XB7Zj6FGR"
      },
      "source": [
        "%%capture\n",
        "!pip install -U wordcloud\n",
        "import wordcloud\n",
        "import matplotlib.pyplot as plt\n",
        "import math\n",
        "\n",
        "def plot_cloud(wordcloud, wtitle=None):\n",
        "    # Set figure size\n",
        "    plt.figure(figsize=(4, 3), dpi=200)\n",
        "    plt.title(wtitle)\n",
        "    # Display image\n",
        "    plt.imshow(wordcloud)\n",
        "    plt.tight_layout()\n",
        "    # No axis details\n",
        "    plt.axis(\"off\")\n",
        "\n",
        "def plot_several_clouds(wordcloud_list, group_title=None, verbose=False):\n",
        "    total_no = len(wordcloud_list)\n",
        "\n",
        "    golden_ratio = 1.61803398875\n",
        "    num_vertical = int(math.floor((total_no / golden_ratio)) ** (1/2))\n",
        "    num_across =  int(math.ceil(num_vertical * golden_ratio))\n",
        "\n",
        "    if num_vertical * num_across < total_no:\n",
        "        num_vertical = int(math.ceil(total_no / num_across))\n",
        "    if verbose:\n",
        "        print(\"Total number of clouds to be created is: \", total_no)\n",
        "        print(\"The plot will be {} wordclouds across, \".format(num_across),\n",
        "              \"and {} wordclouds long\".format(num_vertical))\n",
        "    # Set figure size\n",
        "    fig, axs = plt.subplots(num_across, num_vertical, sharex=True, sharey=True,\n",
        "                            figsize=(40, 30), dpi=200, facecolor='k', \n",
        "                            squeeze=True)\n",
        "\n",
        "    count=0\n",
        "\n",
        "    # decide font size based on how many vertical rows there are\n",
        "\n",
        "    desired_font = int(144 / num_vertical)\n",
        "    subtitles_dict = {'fontsize': desired_font,\n",
        "                      'color': 'w'\n",
        "    }\n",
        "    for i in range(num_across):\n",
        "        for j in range(num_vertical):\n",
        "            if verbose:\n",
        "                print(\"starting sub-plot run \", count)\n",
        "            axs[i, j].imshow(wordcloud_list[count], interpolation='nearest')\n",
        "            axs[i, j].set_title(label=group_title + \" \" + str(count),\n",
        "                                fontdict=subtitles_dict)\n",
        "            axs[i, j].set_axis_off()\n",
        "            count += 1\n",
        "\n",
        "            if count >= len(wordcloud_list):\n",
        "                break\n",
        "        if count >= len(wordcloud_list):\n",
        "                break\n",
        "        \n",
        "\n",
        "    plt.tight_layout(pad=0)\n",
        "    # No axis details\n",
        "    plt.axis(\"off\")\n",
        "    fig.suptitle(\"Wordclouds Illustrating Topic Model Results for: \" + dataset_name)\n",
        "\n",
        "    # save figure\n",
        "\n",
        "    tmchart_filename = \"topic_model_vizWC_\" + dataset_name + \".png\"\n",
        "    plt.savefig(join(out_directory, tmchart_filename), dpi=300, facecolor='k',\n",
        "                edgecolor='k', transparent=False, pad_inches=0.05)\n",
        "    \n",
        "    # download file from google drive\n",
        "\n",
        "    files.download(join(out_directory, tmchart_filename))"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "1Ag4KXERJynj"
      },
      "source": [
        "---\n",
        "### Plot the cloud\n",
        "\n",
        "You can change the colormap to any of the [matplotlib colormaps](https://matplotlib.org/stable/tutorials/colors/colormaps.html)\n",
        "\n",
        "Ones that seem to not look terrible are: Set3, Set2,  Pastel2, terrain, cividis\n",
        "\n",
        "<font color='orange'>*TO-DO: Check why wordcloud is not showing entire list of topics*\n"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "eb6_mcrH3-rB"
      },
      "source": [
        "wordcloud_store = []\n",
        "\n",
        "for j in tqdm(range(len(words_per_topic), total=len(words_per_topic),\n",
        "                    desc=\"generating word clouds\"):\n",
        "    wc_text = \"\"\n",
        "    this_df = tm_df[tm_df[\"topic_ID\"] == j]\n",
        "    for index, row in this_df.iterrows():\n",
        "        weighted_word = (row['word'] + \" \") * row['weight']\n",
        "        wc_text = wc_text + weighted_word\n",
        "    wordcloud_s = None\n",
        "    this_title = \"Topic # \" + str(j)\n",
        "    # change color map to whatever you want\n",
        "    wordcloud_s = wordcloud.WordCloud(random_state=69, \n",
        "                                      background_color='black', colormap='Set3',\n",
        "                                      collocations=False).generate(wc_text)\n",
        "                                    \n",
        "    # uncomment below to make individual plots\n",
        "    # plot_cloud(wordcloud_s, wtitle=this_title)\n",
        "    wordcloud_store.append(wordcloud_s)\n",
        "\n",
        "plot_several_clouds(wordcloud_store, \"Topic\", verbose=True)\n",
        "print(\"\\n\\n finished plotting clouds at: \", datetime.now())"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "foq0OUD4Vf_p"
      },
      "source": [
        "## Predict Topics\n",
        "\n",
        "predicts a topic for a given document"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "6kW2rsPPVigO"
      },
      "source": [
        "def topic_prediction(my_document):\n",
        "    string_input = [my_document]\n",
        "    X = vect.transform(string_input)\n",
        "    # Convert sparse matrix to gensim corpus.\n",
        "    corpus = gensim.matutils.Sparse2Corpus(X, documents_columns=False)\n",
        "    output = list(ldamodel[corpus])[0]\n",
        "    topics = sorted(output,key=lambda x:x[1],reverse=True)\n",
        "    return topics[0][0]\n",
        " \n",
        "# topic_prediction(my_document)"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "Om3Xv3Y_TYdH"
      },
      "source": [
        "# other model abilities\n",
        "\n",
        "- [link](https://radimrehurek.com/gensim/models/ldamulticore.html) to LDA multicore docs\n",
        "\n",
        "```get_document_topics(bow, minimum_probability=None, minimum_phi_value=None, per_word_topics=False)```"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "n_XR5oSWSkCU"
      },
      "source": [
        "# Ideas for future\n",
        "\n",
        "1. additional viz methods like [wordcloud](https://www.machinelearningplus.com/nlp/topic-modeling-visualization-how-to-present-results-lda-models/#13.-t-SNE-Clustering-Chart)\n",
        "2. better viz, from ted underwood [link here](https://tedunderwood.com/2012/11/11/visualizing-topic-models/)\n",
        "3. further documentation from gensim\n",
        "    * [wikipedia experiments](https://radimrehurek.com/gensim/wiki.html#latent-dirichlet-allocation)\n",
        "    * [LDA tutorial](https://radimrehurek.com/gensim/auto_examples/tutorials/run_lda.html)\n",
        "4. link 4 [here](http://ethen8181.github.io/machine-learning/clustering/topic_model/LDA.html)\n",
        "5. Guide to Build Best LDA model using Gensim Python [here](https://thinkinfi.com/guide-to-build-best-lda-model-using-gensim-python/)\n",
        "6. Gensim LDA: Tips and Tricks mining the details [here](https://miningthedetails.com/blog/python/lda/GensimLDA/)\n"
      ]
    }
  ]
}