ryderwishart/grampiece.ipynb

## grampiece.ipynb
{
  "nbformat": 4,
  "nbformat_minor": 0,
  "metadata": {
    "colab": {
      "provenance": [],
      "authorship_tag": "ABX9TyO0ZRkN/sHYoZFKHWqutc9G",
      "include_colab_link": true
    },
    "kernelspec": {
      "name": "python3",
      "display_name": "Python 3"
    },
    "language_info": {
      "name": "python"
    }
  },
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "view-in-github",
        "colab_type": "text"
      },
      "source": [
        "<a href=\"https://colab.research.google.com/gist/ryderwishart/ae1b98eeb859bed918743881cac10faa/grampiece.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "Create a toy corpus (Alice in Wonderland) for testing purposes to ensure the code can run through properly."
      ],
      "metadata": {
        "id": "3lzoGMeoaUyB"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "!wget https://www.gutenberg.org/files/11/11-0.txt -O corpus.txt"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "PbcHwC3waYYN",
        "outputId": "2abc31eb-40dd-41e8-b2ad-4a0738e44606"
      },
      "execution_count": 4,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "--2023-03-23 23:30:04--  https://www.gutenberg.org/files/11/11-0.txt\n",
            "Resolving www.gutenberg.org (www.gutenberg.org)... 152.19.134.47, 2610:28:3090:3000:0:bad:cafe:47\n",
            "Connecting to www.gutenberg.org (www.gutenberg.org)|152.19.134.47|:443... connected.\n",
            "HTTP request sent, awaiting response... 200 OK\n",
            "Length: 174313 (170K) [text/plain]\n",
            "Saving to: ‘corpus.txt’\n",
            "\n",
            "corpus.txt          100%[===================>] 170.23K  --.-KB/s    in 0.1s    \n",
            "\n",
            "2023-03-23 23:30:05 (1.62 MB/s) - ‘corpus.txt’ saved [174313/174313]\n",
            "\n"
          ]
        }
      ]
    },
    {
      "cell_type": "code",
      "source": [
        "# Sanity check\n",
        "with open('corpus.txt', 'r') as f:\n",
        "    lines = f.readlines()\n",
        "    for line in lines[0:10]:\n",
        "        print(line.strip())"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "afWR3Uk5b7uc",
        "outputId": "f4ae3360-03ed-49b1-c79c-1fae5350ae65"
      },
      "execution_count": 5,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "The Project Gutenberg eBook of Alice’s Adventures in Wonderland, by Lewis Carroll\n",
            "\n",
            "This eBook is for the use of anyone anywhere in the United States and\n",
            "most other parts of the world at no cost and with almost no restrictions\n",
            "whatsoever. You may copy it, give it away or re-use it under the terms\n",
            "of the Project Gutenberg License included with this eBook or online at\n",
            "www.gutenberg.org. If you are not located in the United States, you\n",
            "will have to check the laws of the country where you are located before\n",
            "using this eBook.\n",
            "\n"
          ]
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "1.  Unsupervised chunking using SentencePiece and clustering:\n",
        "    \n",
        "    -   Tokenize the corpus using SentencePiece.\n",
        "    -   Generate word embeddings for the tokens (e.g., using Word2Vec, GloVe, or FastText).\n",
        "    -   Apply a clustering algorithm (e.g., k-means, DBSCAN, or hierarchical clustering) to group similar tokens.\n",
        "    -   Create new \"chunk\" tokens representing each cluster.\n",
        "    -   Replace the original tokens in the corpus with the corresponding chunk tokens.\n",
        "\n",
        "\n",
        "*Note: you will need to tweak the SentencePiece vocab size and the number of clusters you are trying to find from those tokens.*"
      ],
      "metadata": {
        "id": "gn_EBFGaWVPn"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "!pip install sentencepiece gensim scikit-learn numpy"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "JvQRZ8E4ZfU8",
        "outputId": "fb519808-c636-4da1-8623-2fc76f072c72"
      },
      "execution_count": 6,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/\n",
            "Requirement already satisfied: sentencepiece in /usr/local/lib/python3.9/dist-packages (0.1.97)\n",
            "Requirement already satisfied: gensim in /usr/local/lib/python3.9/dist-packages (4.3.1)\n",
            "Requirement already satisfied: scikit-learn in /usr/local/lib/python3.9/dist-packages (1.2.2)\n",
            "Requirement already satisfied: numpy in /usr/local/lib/python3.9/dist-packages (1.22.4)\n",
            "Requirement already satisfied: smart-open>=1.8.1 in /usr/local/lib/python3.9/dist-packages (from gensim) (6.3.0)\n",
            "Requirement already satisfied: scipy>=1.7.0 in /usr/local/lib/python3.9/dist-packages (from gensim) (1.10.1)\n",
            "Requirement already satisfied: threadpoolctl>=2.0.0 in /usr/local/lib/python3.9/dist-packages (from scikit-learn) (3.1.0)\n",
            "Requirement already satisfied: joblib>=1.1.1 in /usr/local/lib/python3.9/dist-packages (from scikit-learn) (1.1.1)\n"
          ]
        }
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 7,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "q9ukVqUnV6jD",
        "outputId": "6c3fa3f3-6053-4682-abe5-9357356c95d9"
      },
      "outputs": [
        {
          "output_type": "stream",
          "name": "stderr",
          "text": [
            "/usr/local/lib/python3.9/dist-packages/sklearn/cluster/_kmeans.py:870: FutureWarning: The default value of `n_init` will change from 10 to 'auto' in 1.4. Set the value of `n_init` explicitly to suppress the warning\n",
            "  warnings.warn(\n"
          ]
        },
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "First 10 chunked sentences:\n",
            "1. <chunk_118> <chunk_91> <chunk_28> <chunk_493> <chunk_142> <chunk_49> <chunk_55> <chunk_113> <chunk_204> <chunk_193> <chunk_169> <chunk_137> <chunk_355> <chunk_280> <chunk_122> <chunk_131> <chunk_298> <chunk_439> <chunk_246> <chunk_257>\n",
            "2. \n",
            "3. <chunk_31> <chunk_493> <chunk_252> <chunk_158> <chunk_115> <chunk_259> <chunk_142> <chunk_239> <chunk_149> <chunk_239> <chunk_302> <chunk_193> <chunk_115> <chunk_190> <chunk_379> <chunk_221>\n",
            "4. <chunk_266> <chunk_347> <chunk_35> <chunk_113> <chunk_142> <chunk_115> <chunk_289> <chunk_92> <chunk_417> <chunk_283> <chunk_339> <chunk_221> <chunk_196> <chunk_362> <chunk_487> <chunk_417> <chunk_23> <chunk_16> <chunk_388> <chunk_186> <chunk_499> <chunk_113>\n",
            "5. <chunk_240> <chunk_113> <chunk_150> <chunk_282> <chunk_125> <chunk_334> <chunk_324> <chunk_34> <chunk_225> <chunk_137> <chunk_89> <chunk_225> <chunk_325> <chunk_9> <chunk_210> <chunk_107> <chunk_119> <chunk_122> <chunk_225> <chunk_262> <chunk_115> <chunk_211>\n",
            "6. <chunk_142> <chunk_115> <chunk_91> <chunk_28> <chunk_485> <chunk_96> <chunk_157> <chunk_173> <chunk_196> <chunk_209> <chunk_493> <chunk_9> <chunk_182> <chunk_29> <chunk_333> <chunk_92>\n",
            "7. <chunk_8> <chunk_131> <chunk_131> <chunk_125> <chunk_136> <chunk_364> <chunk_125> <chunk_99> <chunk_136> <chunk_125> <chunk_86> <chunk_98> <chunk_241> <chunk_140> <chunk_279> <chunk_199> <chunk_193> <chunk_115> <chunk_190> <chunk_379> <chunk_137> <chunk_98>\n",
            "8. <chunk_457> <chunk_441> <chunk_114> <chunk_94> <chunk_115> <chunk_211> <chunk_113> <chunk_142> <chunk_115> <chunk_426> <chunk_134> <chunk_2> <chunk_98> <chunk_241> <chunk_279> <chunk_199> <chunk_303>\n",
            "9. <chunk_315> <chunk_27> <chunk_209> <chunk_493> <chunk_125>\n",
            "10. \n",
            "\n",
            "Clusters:\n",
            "Cluster 0: ▁tak\n",
            "Cluster 1: ▁only\n",
            "Cluster 2: ▁How, ▁where, ssion\n",
            "Cluster 3: ▁were\n",
            "Cluster 4: ▁ear, ▁anxiously, ▁answer\n",
            "Cluster 5: ▁two, ▁minute\n",
            "Cluster 6: ▁uncomfortabl, 64, 596\n",
            "Cluster 7: ▁*\n",
            "Cluster 8: ▁w\n",
            "Cluster 9: ▁or\n",
            "Cluster 10: ▁still, ▁run, ▁few, ▁each\n",
            "Cluster 11: ▁they\n",
            "Cluster 12: ▁side\n",
            "Cluster 13: ▁now, ▁first\n",
            "Cluster 14: ▁history, ▁accept\n",
            "Cluster 15: ▁de, ight, led\n",
            "Cluster 16: r\n",
            "Cluster 17: ▁here\n",
            "Cluster 18: ▁fan, ▁child, ▁grin, ▁animal\n",
            "Cluster 19: R\n",
            "Cluster 20: it\n",
            "Cluster 21: ▁mouse, ▁life, Off, ▁kind, way\n",
            "Cluster 22: se, ▁an\n",
            "Cluster 23: ▁feet, ▁air, ▁own, ▁indeed, ▁itself\n",
            "Cluster 24: ▁“\n",
            "Cluster 25: E\n",
            "Cluster 26: ▁must\n",
            "Cluster 27: ing\n",
            "Cluster 28: ▁Gutenberg\n",
            "Cluster 29: l\n",
            "Cluster 30: ▁D\n",
            "Cluster 31: ▁ga, ▁This, ▁people\n",
            "Cluster 32: ▁across\n",
            "Cluster 33: ▁little\n",
            "Cluster 34: ▁getting, ▁open, ▁gra, Who, ▁copy\n",
            "Cluster 35: cause, ▁tea, ▁part, ▁copyright, ▁deal\n",
            "Cluster 36: I\n",
            "Cluster 37: ▁when\n",
            "Cluster 38: ▁than\n",
            "Cluster 39: ▁she\n",
            "Cluster 40: ▁girl, ▁altogether\n",
            "Cluster 41: ate, ld, ▁near, ies\n",
            "Cluster 42: ▁replied\n",
            "Cluster 43: *\n",
            "Cluster 44: ▁time, ▁say, ▁quite\n",
            "Cluster 45: fully\n",
            "Cluster 46: ▁mean, ▁change, ▁yet, ▁ever, ook\n",
            "Cluster 47: ▁said\n",
            "Cluster 48: ▁I\n",
            "Cluster 49: ▁Alice\n",
            "Cluster 50: ▁King, ▁make\n",
            "Cluster 51: ▁sitting, ▁rate, ▁middle, ▁forgot, ▁understand\n",
            "Cluster 52: ▁such\n",
            "Cluster 53: ▁finished, ▁has\n",
            "Cluster 54: T\n",
            "Cluster 55: ’\n",
            "Cluster 56: ▁sneeze\n",
            "Cluster 57: ▁set, ▁till\n",
            "Cluster 58: ▁his\n",
            "Cluster 59: al\n",
            "Cluster 60: _\n",
            "Cluster 61: tm\n",
            "Cluster 62: ▁perform, ▁pepper\n",
            "Cluster 63: ▁sure, ▁believe\n",
            "Cluster 64: ▁pa\n",
            "Cluster 65: ▁would\n",
            "Cluster 66: $\n",
            "Cluster 67: ▁glad, ▁birds, ▁mushroom, wards, ▁throw\n",
            "Cluster 68: :\n",
            "Cluster 69: Would, 9, ▁twice\n",
            "Cluster 70: ▁came, ▁g\n",
            "Cluster 71: And, ▁live, ▁done, ▁learn, ▁place\n",
            "Cluster 72: ▁trying, ▁watch, ▁stoo, ▁distribute, ▁new\n",
            "Cluster 73: if, ten\n",
            "Cluster 74: ▁book, ▁direction, xactly, ▁cross, ▁promi\n",
            "Cluster 75: ▁O\n",
            "Cluster 76: b\n",
            "Cluster 77: er\n",
            "Cluster 78: ▁ran, ▁let, ▁Soup, ▁gone, ▁Literary\n",
            "Cluster 79: ▁that\n",
            "Cluster 80: ,”\n",
            "Cluster 81: .”\n",
            "Cluster 82: ▁one\n",
            "Cluster 83: ▁my\n",
            "Cluster 84: ND\n",
            "Cluster 85: ▁_\n",
            "Cluster 86: ▁arm, ▁If\n",
            "Cluster 87: ▁flamingo, ▁tarts, ▁effort, ▁defect, ▁plate\n",
            "Cluster 88: ▁c\n",
            "Cluster 89: om, ▁give, ▁di, ▁sleep, ail\n",
            "Cluster 90: ▁f\n",
            "Cluster 91: ▁Project\n",
            "Cluster 92: ▁at\n",
            "Cluster 93: el\n",
            "Cluster 94: ▁children, ▁check, ▁bla\n",
            "Cluster 95: ▁another\n",
            "Cluster 96: ▁\n",
            "Cluster 97: Then, ▁order, ▁together\n",
            "Cluster 98: ▁you\n",
            "Cluster 99: or\n",
            "Cluster 100: —\n",
            "Cluster 101: ▁if\n",
            "Cluster 102: D\n",
            "Cluster 103: en\n",
            "Cluster 104: ?”\n",
            "Cluster 105: ▁conversation, ude, ▁exce, ▁neck\n",
            "Cluster 106: !\n",
            "Cluster 107: -\n",
            "Cluster 108: our\n",
            "Cluster 109: ▁fact, ▁mine, ▁taste, ▁sudden, ▁wrong\n",
            "Cluster 110: ▁Mock\n",
            "Cluster 111: ▁b\n",
            "Cluster 112: ive\n",
            "Cluster 113: s\n",
            "Cluster 114: ▁to\n",
            "Cluster 115: ▁the\n",
            "Cluster 116: ▁sh\n",
            "Cluster 117: ▁face, ▁mouth, ▁beginning, ▁room, ▁reason\n",
            "Cluster 118: ▁The\n",
            "Cluster 119: us\n",
            "Cluster 120: ▁does, ▁size\n",
            "Cluster 121: m\n",
            "Cluster 122: e\n",
            "Cluster 123: ▁over\n",
            "Cluster 124: ▁everything, ▁sometimes\n",
            "Cluster 125: .\n",
            "Cluster 126: ▁wood, ▁asleep, ught, Which, ▁tax\n",
            "Cluster 127: ▁them\n",
            "Cluster 128: H\n",
            "Cluster 129: ▁off\n",
            "Cluster 130: If, There, ▁cur, ▁rule\n",
            "Cluster 131: w\n",
            "Cluster 132: ll\n",
            "Cluster 133: ▁serpent, ▁sorrow, ▁carrie\n",
            "Cluster 134: y\n",
            "Cluster 135: ▁M\n",
            "Cluster 136: g\n",
            "Cluster 137: ,\n",
            "Cluster 138: ▁lobster, ▁either, ▁gui, ▁particular, ▁Lizard\n",
            "Cluster 139: ▁up, ▁about\n",
            "Cluster 140: ▁not\n",
            "Cluster 141: a\n",
            "Cluster 142: ▁of\n",
            "Cluster 143: ▁thought\n",
            "Cluster 144: !”\n",
            "Cluster 145: ▁A\n",
            "Cluster 146: ▁more\n",
            "Cluster 147: ▁he\n",
            "Cluster 148: ▁be\n",
            "Cluster 149: one, oo, ▁He\n",
            "Cluster 150: o\n",
            "Cluster 151: le\n",
            "Cluster 152: ▁afraid, ▁nice, ', ▁decided, ▁English\n",
            "Cluster 153: ▁Gryphon, ▁thing\n",
            "Cluster 154: ence, ▁Dodo, ▁always, ▁different, ▁waiting\n",
            "Cluster 155: C, G\n",
            "Cluster 156: ▁‘\n",
            "Cluster 157: i\n",
            "Cluster 158: ▁for\n",
            "Cluster 159: ck\n",
            "Cluster 160: Come, ▁curious\n",
            "Cluster 161: li\n",
            "Cluster 162: ▁know\n",
            "Cluster 163: ▁remark, ▁seemed\n",
            "Cluster 164: O\n",
            "Cluster 165: ▁smil, ▁home, clock, 7\n",
            "Cluster 166: No, ▁shall, P\n",
            "Cluster 167: ie\n",
            "Cluster 168: ▁tears, ▁verse\n",
            "Cluster 169: ▁hedgehog, ▁distance, Just, ▁cut, ▁hot\n",
            "Cluster 170: ▁Turtle\n",
            "Cluster 171: ce\n",
            "Cluster 172: u\n",
            "Cluster 173: ▁frightened, ncluded, (\n",
            "Cluster 174: ▁don\n",
            "Cluster 175: ▁But, ▁For\n",
            "Cluster 176: he\n",
            "Cluster 177: ▁venture, ▁golden, ▁distribution, ▁roof, 5\n",
            "Cluster 178: ▁lo, ▁puzzl, “\n",
            "Cluster 179: ▁had\n",
            "Cluster 180: S\n",
            "Cluster 181: x\n",
            "Cluster 182: ▁on\n",
            "Cluster 183: ▁put, ▁once\n",
            "Cluster 184: ▁back, ▁without\n",
            "Cluster 185: ▁as\n",
            "Cluster 186: t\n",
            "Cluster 187: ▁game, ju, ▁jump, ▁creatures, ?_”\n",
            "Cluster 188: p\n",
            "Cluster 189: ches, X, ▁gardeners\n",
            "Cluster 190: ▁ask, ▁United, ▁hold, ▁ye, 3\n",
            "Cluster 191: ▁small, ▁called, ▁name\n",
            "Cluster 192: ▁can\n",
            "Cluster 193: ▁in\n",
            "Cluster 194: ▁work\n",
            "Cluster 195: ▁out\n",
            "Cluster 196: ▁with\n",
            "Cluster 197: ve\n",
            "Cluster 198: c\n",
            "Cluster 199: ed\n",
            "Cluster 200: ▁looking, ▁heard\n",
            "Cluster 201: ▁used, ▁donations, ▁Majesty, ▁state, ▁person\n",
            "Cluster 202: ▁manage, ▁exclaimed, ▁clear, ▁procession, ▁plan\n",
            "Cluster 203: at\n",
            "Cluster 204: ▁Adventures, ▁argument, ▁expect, ▁escape, civil\n",
            "Cluster 205: ▁do\n",
            "Cluster 206: ▁s\n",
            "Cluster 207: A\n",
            "Cluster 208: ▁1.\n",
            "Cluster 209: ▁this\n",
            "Cluster 210: ▁re\n",
            "Cluster 211: ▁terms, ▁walk, ▁court, ▁law, ▁repeat\n",
            "Cluster 212: ▁all\n",
            "Cluster 213: ▁Cat, ▁jury, ▁anything, ▁hastily, ▁want\n",
            "Cluster 214: ▁works\n",
            "Cluster 215: What\n",
            "Cluster 216: L\n",
            "Cluster 217: ▁a\n",
            "Cluster 218: CHA\n",
            "Cluster 219: ly\n",
            "Cluster 220: the\n",
            "Cluster 221: ▁and\n",
            "Cluster 222: an\n",
            "Cluster 223: ▁Duchess, ▁next\n",
            "Cluster 224: ▁its\n",
            "Cluster 225: ▁it\n",
            "Cluster 226: ▁nothing, ▁going, ▁same, ▁sea, ▁high\n",
            "Cluster 227: h\n",
            "Cluster 228: in\n",
            "Cluster 229: ▁upon\n",
            "Cluster 230: ▁trembl, ▁fancy, ▁melancholy, ▁knee\n",
            "Cluster 231: ▁find, ▁hard, ▁far\n",
            "Cluster 232: ▁low, ▁free, ▁slate, ▁join, ▁try\n",
            "Cluster 233: n\n",
            "Cluster 234: ▁her\n",
            "Cluster 235: ▁so\n",
            "Cluster 236: ▁agree, ▁pool, ▁plea, ▁Section, ▁shut\n",
            "Cluster 237: ▁Hare\n",
            "Cluster 238: ▁Rabbit\n",
            "Cluster 239: ▁any\n",
            "Cluster 240: ▁what\n",
            "Cluster 241: ▁are, ▁your\n",
            "Cluster 242: ▁into\n",
            "Cluster 243: ▁W\n",
            "Cluster 244: and\n",
            "Cluster 245: ▁queer, ▁fall, ▁hurry, ▁sister, ▁glass\n",
            "Cluster 246: ar\n",
            "Cluster 247: ▁There, ▁hear, ▁White\n",
            "Cluster 248: ▁com, ▁fl\n",
            "Cluster 249: ow\n",
            "Cluster 250: ▁keep, ▁appear, ▁many, Here, ▁wor\n",
            "Cluster 251: ER\n",
            "Cluster 252: ▁is\n",
            "Cluster 253: ▁p\n",
            "Cluster 254: id\n",
            "Cluster 255: ▁friend, ▁quietly, ▁stick\n",
            "Cluster 256: ▁think\n",
            "Cluster 257: roll, Alice, ▁everybody, ▁confused\n",
            "Cluster 258: ▁limit, ▁deriv, ▁proper, ▁judg\n",
            "Cluster 259: ▁door, ▁use\n",
            "Cluster 260: ▁pardon, ▁trouble, ▁picture, Q\n",
            "Cluster 261: ▁bottom\n",
            "Cluster 262: ▁under, ▁good\n",
            "Cluster 263: ’”, not, less, ▁agreement, ity\n",
            "Cluster 264: ▁very\n",
            "Cluster 265: PT\n",
            "Cluster 266: ▁spoke, ▁play, ▁most, ▁leave, nea\n",
            "Cluster 267: ok\n",
            "Cluster 268: op\n",
            "Cluster 269: ▁listen, ▁matter, ▁timidly, ▁shouted, ▁eye\n",
            "Cluster 270: ▁could\n",
            "Cluster 271: ?\n",
            "Cluster 272: ling, ▁br\n",
            "Cluster 273: ▁see\n",
            "Cluster 274: ▁cook, ▁crowd, ▁second\n",
            "Cluster 275: ▁turn, ▁found, ▁while\n",
            "Cluster 276: ▁st\n",
            "Cluster 277: ch\n",
            "Cluster 278: ▁head\n",
            "Cluster 279: ▁whisper, ▁business, ▁locat, ▁begun, ▁collect\n",
            "Cluster 280: ▁L\n",
            "Cluster 281: ▁Dormouse\n",
            "Cluster 282: ever, —”\n",
            "Cluster 283: ble, am, ▁co\n",
            "Cluster 284: ▁well\n",
            "Cluster 285: ▁too, ▁writ, ently\n",
            "Cluster 286: es\n",
            "Cluster 287: ▁never\n",
            "Cluster 288: ▁got\n",
            "Cluster 289: ▁world, out\n",
            "Cluster 290: ▁great\n",
            "Cluster 291: q\n",
            "Cluster 292: You\n",
            "Cluster 293: d\n",
            "Cluster 294: able\n",
            "Cluster 295: ▁So\n",
            "Cluster 296: ng, est\n",
            "Cluster 297: ▁dis\n",
            "Cluster 298: is\n",
            "Cluster 299: ▁voice\n",
            "Cluster 300: ▁behind, ▁permi\n",
            "Cluster 301: ▁suddenly\n",
            "Cluster 302: Let, ‘, ▁bottle, where\n",
            "Cluster 303: ▁before\n",
            "Cluster 304: mp\n",
            "Cluster 305: ▁began\n",
            "Cluster 306: v\n",
            "Cluster 307: ▁did\n",
            "Cluster 308: ▁subject, ▁silent, ehead, ▁shoes, ▁finish\n",
            "Cluster 309: \", read\n",
            "Cluster 310: ▁long, ▁three\n",
            "Cluster 311: ▁electronic\n",
            "Cluster 312: How, ▁asked, ▁talking, even, ▁baby\n",
            "Cluster 313: ▁herself\n",
            "Cluster 314: ▁tone\n",
            "Cluster 315: ▁us, ept\n",
            "Cluster 316: Why, ▁words, ▁bit, ▁tried\n",
            "Cluster 317: ▁number, ▁paper, ▁reach\n",
            "Cluster 318: ▁moment, ▁read\n",
            "Cluster 319: Oh\n",
            "Cluster 320: ▁enough, ▁garden\n",
            "Cluster 321: ▁sa\n",
            "Cluster 322: ▁sort\n",
            "Cluster 323: ▁went\n",
            "Cluster 324: ▁may\n",
            "Cluster 325: ig, ink, ▁away, ▁Foundation\n",
            "Cluster 326: ▁water, ▁teacup, Project, ▁shrink, ▁obtain\n",
            "Cluster 327: F\n",
            "Cluster 328: on\n",
            "Cluster 329: ▁talk, ”, They, 6, ▁GUTENBERG\n",
            "Cluster 330: ▁Knave, ▁execute, ▁school, ▁eggs, ▁WARRANT\n",
            "Cluster 331: th\n",
            "Cluster 332: ▁notice, ▁fur, ▁else, ▁nearly, ▁Pigeon\n",
            "Cluster 333: ine\n",
            "Cluster 334: ▁You\n",
            "Cluster 335: ▁She\n",
            "Cluster 336: ▁there\n",
            "Cluster 337: ish\n",
            "Cluster 338: ▁even, ▁something, ▁seen\n",
            "Cluster 339: st\n",
            "Cluster 340: ▁but\n",
            "Cluster 341: ther\n",
            "Cluster 342: N\n",
            "Cluster 343: ▁S, ▁P\n",
            "Cluster 344: ▁was\n",
            "Cluster 345: app, round, ▁angrily, ▁window\n",
            "Cluster 346: ▁provide, ▁editions, ▁dropped\n",
            "Cluster 347: ▁other\n",
            "Cluster 348: ▁Hatter\n",
            "Cluster 349: explanation, ▁Beau\n",
            "Cluster 350: ▁table, ▁hurried, ▁dance, ▁interrupt, ▁suppose\n",
            "Cluster 351: ▁me, ▁which\n",
            "Cluster 352: ;\n",
            "Cluster 353: ▁added\n",
            "Cluster 354: ▁gloves, ▁dream\n",
            "Cluster 355: ▁by\n",
            "Cluster 356: Please, ▁shrill, ▁knock\n",
            "Cluster 357: re\n",
            "Cluster 358: but, ▁things, ▁cat\n",
            "Cluster 359: ▁usual, ▁rabbit, ▁First, ▁nervous, ▁strange\n",
            "Cluster 360: ▁(\n",
            "Cluster 361: 0\n",
            "Cluster 362: ▁al\n",
            "Cluster 363: j, ▁mind, ▁those, come, ▁twinkle\n",
            "Cluster 364: utenberg, ▁explain, ▁hun, ▁rose, ▁difficult\n",
            "Cluster 365: ▁last\n",
            "Cluster 366: ▁go\n",
            "Cluster 367: ▁made\n",
            "Cluster 368: ▁un\n",
            "Cluster 369: ▁hand, ▁cr\n",
            "Cluster 370: It\n",
            "Cluster 371: ment\n",
            "Cluster 372: ▁look\n",
            "Cluster 373: That, you\n",
            "Cluster 374: B\n",
            "Cluster 375: ▁tail, ▁trademark, ▁croquet, ▁outside\n",
            "Cluster 376: k\n",
            "Cluster 377: ▁ought, ▁every, ▁sound, ▁word\n",
            "Cluster 378: butter, box, ▁yourself, ▁distributing, ▁hair\n",
            "Cluster 379: ▁Mouse, ▁begin, ▁through, ▁States, ▁certainly\n",
            "Cluster 380: ▁T\n",
            "Cluster 381: she, ▁U\n",
            "Cluster 382: ▁ma\n",
            "Cluster 383: ▁why, ▁chimney, some\n",
            "Cluster 384: ▁aloud, ▁Mabel, 887\n",
            "Cluster 385: W\n",
            "Cluster 386: ation\n",
            "Cluster 387: ▁some\n",
            "Cluster 388: ic\n",
            "Cluster 389: ▁just\n",
            "Cluster 390: ▁Queen\n",
            "Cluster 391: ▁round\n",
            "Cluster 392: ▁hall, ▁protect, ▁slowly, ▁license, ectly\n",
            "Cluster 393: ▁tell\n",
            "Cluster 394: ▁March\n",
            "Cluster 395: as\n",
            "Cluster 396: ▁saying, what, ▁charge\n",
            "Cluster 397: Y\n",
            "Cluster 398: ting, ir\n",
            "Cluster 399: ▁might\n",
            "Cluster 400: ▁day, ▁right, ▁dr\n",
            "Cluster 401: z\n",
            "Cluster 402: ▁ba\n",
            "Cluster 403: ul\n",
            "Cluster 404: ▁down\n",
            "Cluster 405: Well\n",
            "Cluster 406: ▁looked\n",
            "Cluster 407: ▁old\n",
            "Cluster 408: ▁large\n",
            "Cluster 409: ù, %, Z\n",
            "Cluster 410: ▁k\n",
            "Cluster 411: ▁much\n",
            "Cluster 412: fter, ▁mile\n",
            "Cluster 413: ▁F\n",
            "Cluster 414: ▁course\n",
            "Cluster 415: ▁moral, ▁volunteer, ▁certain\n",
            "Cluster 416: ▁speak, ▁access\n",
            "Cluster 417: ▁no\n",
            "Cluster 418: The\n",
            "Cluster 419: ▁surprise, ▁chin, ▁hope, ▁treacle, verdict\n",
            "Cluster 420: ▁remember\n",
            "Cluster 421: ▁though, ▁key, ▁sentence, ▁Hearts, ▁wide\n",
            "Cluster 422: ▁fee, ance\n",
            "Cluster 423: ▁ca\n",
            "Cluster 424: ▁please\n",
            "Cluster 425: ▁we\n",
            "Cluster 426: ▁countr, ▁swam, would\n",
            "Cluster 427: for, Do, ▁Bill\n",
            "Cluster 428: ▁happen, ▁close, ▁both\n",
            "Cluster 429: up\n",
            "Cluster 430: ▁eyes, ▁being\n",
            "Cluster 431: ▁copie, ▁including, ▁delight, ▁solemn, [\n",
            "Cluster 432: ▁from\n",
            "Cluster 433: ▁generally, Hol\n",
            "Cluster 434: ▁dear\n",
            "Cluster 435: ▁their\n",
            "Cluster 436: ▁after\n",
            "Cluster 437: V\n",
            "Cluster 438: ▁should, ful\n",
            "Cluster 439: ▁C, ▁con\n",
            "Cluster 440: ▁Lory, ▁nurs, ▁guess, Beautiful, ▁angry\n",
            "Cluster 441: ▁have\n",
            "Cluster 442: ▁won\n",
            "Cluster 443: 1\n",
            "Cluster 444: ▁E\n",
            "Cluster 445: ▁rather, ▁cried\n",
            "Cluster 446: ake, full\n",
            "Cluster 447: ), ▁who\n",
            "Cluster 448: ▁him\n",
            "Cluster 449: ▁poo\n",
            "Cluster 450: ▁like\n",
            "Cluster 451: ▁soon, ▁follow, ▁better\n",
            "Cluster 452: ▁scream, ▁format, ▁fun, ▁between, ▁interesting\n",
            "Cluster 453: ▁way\n",
            "Cluster 454: ▁dare, 4, ▁passed, ▁morning, ▁figure\n",
            "Cluster 455: K\n",
            "Cluster 456: ▁It, ▁half\n",
            "Cluster 457: ▁will\n",
            "Cluster 458: ▁whole\n",
            "Cluster 459: ▁present, ▁grunt, ▁pick\n",
            "Cluster 460: ▁loud, ▁receiv, ▁comply\n",
            "Cluster 461: that\n",
            "Cluster 462: ▁been\n",
            "Cluster 463: prove, ▁curiosity, 54\n",
            "Cluster 464: ▁mo\n",
            "Cluster 465: very, ▁associated\n",
            "Cluster 466: ▁knew, ▁deep\n",
            "Cluster 467: ur\n",
            "Cluster 468: ent\n",
            "Cluster 469: ▁take, ▁brea\n",
            "Cluster 470: ▁end, ▁foot, ough, 2, J\n",
            "Cluster 471: rch\n",
            "Cluster 472: ▁Caterpillar\n",
            "Cluster 473: ▁how\n",
            "Cluster 474: ▁get\n",
            "Cluster 475: ▁again\n",
            "Cluster 476: f\n",
            "Cluster 477: ▁help, ▁execution, air, ▁bright, ▁really\n",
            "Cluster 478: ous\n",
            "Cluster 479: ▁question\n",
            "Cluster 480: ▁And\n",
            "Cluster 481: ▁l\n",
            "Cluster 482: ▁pro\n",
            "Cluster 483: ▁sigh, ▁house\n",
            "Cluster 484: ▁felt\n",
            "Cluster 485: ▁License\n",
            "Cluster 486: ▁come\n",
            "Cluster 487: ▁seem, Of, most, ▁appl\n",
            "Cluster 488: ▁fell\n",
            "Cluster 489: #\n",
            "Cluster 490: U\n",
            "Cluster 491: ▁address, 84, 99, $5\n",
            "Cluster 492: But\n",
            "Cluster 493: ▁eBook, ▁left\n",
            "Cluster 494: ▁then\n",
            "Cluster 495: ▁wonder\n",
            "Cluster 496: M\n",
            "Cluster 497: ▁grow\n",
            "Cluster 498: ▁eat\n",
            "Cluster 499: ion\n"
          ]
        }
      ],
      "source": [
        "import sentencepiece as spm\n",
        "import gensim\n",
        "import numpy as np\n",
        "from sklearn.cluster import KMeans\n",
        "\n",
        "# Train a SentencePiece model\n",
        "spm.SentencePieceTrainer.train(input='corpus.txt', model_prefix='spm_model', vocab_size=1000)\n",
        "sp = spm.SentencePieceProcessor()\n",
        "sp.load('spm_model.model')\n",
        "\n",
        "# Tokenize the corpus using SentencePiece\n",
        "with open('corpus.txt', 'r', encoding='utf-8') as f:\n",
        "    corpus = f.read()\n",
        "tokenized_corpus = [sp.encode_as_pieces(line) for line in corpus.split('\\n')]\n",
        "\n",
        "# Generate word embeddings for the tokens using Word2Vec\n",
        "model = gensim.models.Word2Vec(sentences=tokenized_corpus, vector_size=100, window=5, min_count=1, workers=4)\n",
        "\n",
        "# Prepare token vectors for clustering\n",
        "unique_tokens = list(model.wv.key_to_index.keys())\n",
        "token_vectors = np.array([model.wv[token] for token in unique_tokens])\n",
        "\n",
        "# Apply a clustering algorithm (e.g., k-means) to group similar tokens\n",
        "n_clusters = 500 # e.g., 1000?\n",
        "kmeans = KMeans(n_clusters=n_clusters, random_state=42)\n",
        "kmeans.fit(token_vectors)\n",
        "\n",
        "# Create new \"chunk\" tokens representing each cluster\n",
        "cluster_to_chunk = {i: f'<chunk_{i}>' for i in range(n_clusters)}\n",
        "token_to_chunk = {token: cluster_to_chunk[cluster_id] for token, cluster_id in zip(unique_tokens, kmeans.labels_)}\n",
        "\n",
        "# # Replace the original tokens in the corpus with the corresponding chunk tokens\n",
        "# chunked_corpus = []\n",
        "# for line in tokenized_corpus:\n",
        "#     chunked_line = [token_to_chunk.get(token, token) for token in line]\n",
        "#     chunked_corpus.append(chunked_line)\n",
        "\n",
        "# # Example output\n",
        "# print(chunked_corpus[:5])\n",
        "\n",
        "# Replace the original tokens in the corpus with the corresponding chunk tokens\n",
        "chunked_corpus = []\n",
        "for line in tokenized_corpus:\n",
        "    chunked_line = [token_to_chunk.get(token, token) for token in line]\n",
        "    chunked_corpus.append(chunked_line)\n",
        "\n",
        "# Print the first 10 sentences in a human-readable format\n",
        "print(\"First 10 chunked sentences:\")\n",
        "for i, line in enumerate(chunked_corpus[:10]):\n",
        "    print(f\"{i + 1}. {' '.join(line)}\")\n",
        "\n",
        "# Create a mapping from each cluster to a list of tokens that belong to it\n",
        "cluster_to_tokens = {i: [] for i in range(n_clusters)}\n",
        "for token, cluster_id in zip(unique_tokens, kmeans.labels_):\n",
        "    cluster_to_tokens[cluster_id].append(token)\n",
        "\n",
        "# Print the clusters with the first few tokens in each cluster\n",
        "print(\"\\nClusters:\")\n",
        "for i in range(n_clusters):\n",
        "    print(f\"Cluster {i}: {', '.join(cluster_to_tokens[i][:5])}\")\n"
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "2.  Unsupervised approach using attention and global vector model:\n",
        "    \n",
        "    -   Tokenize the corpus using SentencePiece.\n",
        "    -   Generate a global vector model for the tokens (e.g., using Word2Vec, GloVe, or FastText).\n",
        "    -   Train an attention mechanism that identifies sets of highly related tokens based on the global vector model.\n",
        "    -   Combine the related tokens into chunks.\n",
        "    -   Replace the original tokens in the corpus with the corresponding chunk tokens."
      ],
      "metadata": {
        "id": "yvRFwx2tWQTt"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "!pip install sentencepiece gensim numpy torch"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "Mgw3mcF1Zgmj",
        "outputId": "8ca1ca6b-d6ff-4a06-8336-a5ea4c9bd636"
      },
      "execution_count": 8,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/\n",
            "Requirement already satisfied: sentencepiece in /usr/local/lib/python3.9/dist-packages (0.1.97)\n",
            "Requirement already satisfied: gensim in /usr/local/lib/python3.9/dist-packages (4.3.1)\n",
            "Requirement already satisfied: numpy in /usr/local/lib/python3.9/dist-packages (1.22.4)\n",
            "Requirement already satisfied: torch in /usr/local/lib/python3.9/dist-packages (1.13.1+cu116)\n",
            "Requirement already satisfied: smart-open>=1.8.1 in /usr/local/lib/python3.9/dist-packages (from gensim) (6.3.0)\n",
            "Requirement already satisfied: scipy>=1.7.0 in /usr/local/lib/python3.9/dist-packages (from gensim) (1.10.1)\n",
            "Requirement already satisfied: typing-extensions in /usr/local/lib/python3.9/dist-packages (from torch) (4.5.0)\n"
          ]
        }
      ]
    },
    {
      "cell_type": "code",
      "source": [
        "import sentencepiece as spm\n",
        "import gensim\n",
        "import numpy as np\n",
        "import torch\n",
        "import torch.nn as nn\n",
        "from torch.optim import Adam\n",
        "\n",
        "# Train a SentencePiece model\n",
        "spm.SentencePieceTrainer.train(input='corpus.txt', model_prefix='spm_model', vocab_size=1000)\n",
        "sp = spm.SentencePieceProcessor()\n",
        "sp.load('spm_model.model')\n",
        "\n",
        "# Tokenize the corpus using SentencePiece\n",
        "with open('corpus.txt', 'r', encoding='utf-8') as f:\n",
        "    corpus = f.read()\n",
        "tokenized_corpus = [sp.encode_as_pieces(line) for line in corpus.split('\\n')]\n",
        "\n",
        "# Generate a global vector model for the tokens using Word2Vec\n",
        "model = gensim.models.Word2Vec(sentences=tokenized_corpus, vector_size=100, window=5, min_count=1, workers=4)\n",
        "\n",
        "# Convert token vectors to PyTorch tensor\n",
        "unique_tokens = list(model.wv.key_to_index.keys())\n",
        "token_vectors = torch.tensor([model.wv[token] for token in unique_tokens])\n",
        "\n",
        "# Define attention mechanism\n",
        "class Attention(nn.Module):\n",
        "    def __init__(self, input_dim):\n",
        "        super(Attention, self).__init__()\n",
        "        self.query = nn.Linear(input_dim, input_dim)\n",
        "        self.key = nn.Linear(input_dim, input_dim)\n",
        "        self.value = nn.Linear(input_dim, input_dim)\n",
        "        self.softmax = nn.Softmax(dim=-1)\n",
        "\n",
        "    def forward(self, x):\n",
        "        q = self.query(x)\n",
        "        k = self.key(x)\n",
        "        v = self.value(x)\n",
        "        attention_scores = torch.matmul(q, k.transpose(-2, -1)) / np.sqrt(x.size(-1))\n",
        "        attention_probs = self.softmax(attention_scores)\n",
        "        return torch.matmul(attention_probs, v)\n",
        "\n",
        "# Train the attention mechanism\n",
        "attention = Attention(token_vectors.size(-1))\n",
        "optimizer = Adam(attention.parameters(), lr=0.001)\n",
        "criterion = nn.MSELoss()\n",
        "\n",
        "num_epochs = 10\n",
        "for epoch in range(num_epochs):\n",
        "    optimizer.zero_grad()\n",
        "    output = attention(token_vectors)\n",
        "    loss = criterion(output, token_vectors)\n",
        "    loss.backward()\n",
        "    optimizer.step()\n",
        "    print(f'Epoch {epoch + 1}/{num_epochs}, Loss: {loss.item()}')\n",
        "\n",
        "# Identify highly related tokens based on the attention mechanism\n",
        "attention_output = attention(token_vectors).detach().numpy()\n",
        "similarity_threshold = 0.95\n",
        "related_token_pairs = []\n",
        "for i in range(len(unique_tokens)):\n",
        "    for j in range(i + 1, len(unique_tokens)):\n",
        "        similarity = np.dot(attention_output[i], attention_output[j]) / (np.linalg.norm(attention_output[i]) * np.linalg.norm(attention_output[j]))\n",
        "        if similarity > similarity_threshold:\n",
        "            related_token_pairs.append((unique_tokens[i], unique_tokens[j]))\n",
        "\n",
        "# Combine the related tokens into chunks\n",
        "chunk_counter = 0\n",
        "token_to_chunk = {}\n",
        "for token1, token2 in related_token_pairs:\n",
        "    if token1 not in token_to_chunk and token2 not in token_to_chunk:\n",
        "        chunk = f'<chunk_{chunk_counter}>'\n",
        "        token_to_chunk[token1] = chunk\n",
        "        token_to_chunk[token2] = chunk\n",
        "        chunk_counter += 1\n",
        "    elif token1 in token_to_chunk and token2 not in token_to_chunk:\n",
        "        token_to_chunk[token2] = token_to_chunk[token1]\n",
        "    elif token1 not in token_to_chunk and token2 in token_to_chunk:\n",
        "        token_to_chunk[token1] = token_to_chunk[token2]\n",
        "\n",
        "# # Replace the original tokens in the corpus with the corresponding chunk tokens\n",
        "# chunked_corpus = []\n",
        "# for line in tokenized_corpus:\n",
        "#     chunked_line = []\n",
        "#     for token in line:\n",
        "#         if token in token_to_chunk:\n",
        "#             chunked_line.append(token_to_chunk[token])\n",
        "#         else:\n",
        "#             chunked_line.append(token)\n",
        "#     chunked_corpus.append(chunked_line)\n",
        "\n",
        "# # Example output\n",
        "# print(chunked_corpus[:5])\n",
        "\n",
        "# Replace the original tokens in the corpus with the corresponding chunk tokens\n",
        "chunked_corpus = []\n",
        "for line in tokenized_corpus:\n",
        "    chunked_line = []\n",
        "    for token in line:\n",
        "        if token in token_to_chunk:\n",
        "            chunked_line.append(token_to_chunk[token])\n",
        "        else:\n",
        "            chunked_line.append(token)\n",
        "    chunked_corpus.append(chunked_line)\n",
        "\n",
        "# Print the first 10 sentences in a human-readable format\n",
        "print(\"First 10 chunked sentences:\")\n",
        "for i, line in enumerate(chunked_corpus[:10]):\n",
        "    print(f\"{i + 1}. {' '.join(line)}\")\n",
        "\n",
        "# Create a mapping from each cluster to a list of tokens that belong to it\n",
        "cluster_to_tokens = {}\n",
        "for token, chunk in token_to_chunk.items():\n",
        "    if chunk not in cluster_to_tokens:\n",
        "        cluster_to_tokens[chunk] = []\n",
        "    cluster_to_tokens[chunk].append(token)\n",
        "\n",
        "# Print the clusters\n",
        "print(\"\\nClusters:\")\n",
        "for chunk, tokens in cluster_to_tokens.items():\n",
        "    print(f\"{chunk}: {', '.join(tokens[:5])}\")\n"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "f2C9roqtXWXQ",
        "outputId": "45fdc193-307e-4127-d4dc-28a5c798050e"
      },
      "execution_count": 9,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stderr",
          "text": [
            "<ipython-input-9-ea2328d2c79e>:23: UserWarning: Creating a tensor from a list of numpy.ndarrays is extremely slow. Please consider converting the list to a single numpy.ndarray with numpy.array() before converting to a tensor. (Triggered internally at ../torch/csrc/utils/tensor_new.cpp:230.)\n",
            "  token_vectors = torch.tensor([model.wv[token] for token in unique_tokens])\n"
          ]
        },
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "Epoch 1/10, Loss: 0.03309468552470207\n",
            "Epoch 2/10, Loss: 0.030443085357546806\n",
            "Epoch 3/10, Loss: 0.02805281989276409\n",
            "Epoch 4/10, Loss: 0.025900823995471\n",
            "Epoch 5/10, Loss: 0.023966899141669273\n",
            "Epoch 6/10, Loss: 0.022229302674531937\n",
            "Epoch 7/10, Loss: 0.02066372148692608\n",
            "Epoch 8/10, Loss: 0.019245551899075508\n",
            "Epoch 9/10, Loss: 0.017954334616661072\n",
            "Epoch 10/10, Loss: 0.016775138676166534\n",
            "First 10 chunked sentences:\n",
            "1. <chunk_0> <chunk_0> <chunk_0> <chunk_0> <chunk_0> <chunk_0> <chunk_0> <chunk_0> <chunk_0> <chunk_0> <chunk_0> <chunk_0> <chunk_0> <chunk_0> <chunk_0> <chunk_0> <chunk_0> <chunk_0> <chunk_0> <chunk_0>\n",
            "2. \n",
            "3. <chunk_0> <chunk_0> <chunk_0> <chunk_0> <chunk_0> <chunk_0> <chunk_0> <chunk_0> <chunk_0> <chunk_0> <chunk_0> <chunk_0> <chunk_0> <chunk_0> <chunk_0> <chunk_0>\n",
            "4. <chunk_0> <chunk_0> <chunk_0> <chunk_0> <chunk_0> <chunk_0> <chunk_0> <chunk_0> <chunk_0> <chunk_0> <chunk_0> <chunk_0> <chunk_0> <chunk_0> <chunk_0> <chunk_0> <chunk_0> <chunk_0> <chunk_0> <chunk_0> <chunk_0> <chunk_0>\n",
            "5. <chunk_0> <chunk_0> <chunk_0> <chunk_0> <chunk_0> <chunk_0> <chunk_0> <chunk_0> <chunk_0> <chunk_0> <chunk_0> <chunk_0> <chunk_0> <chunk_0> <chunk_0> <chunk_0> <chunk_0> <chunk_0> <chunk_0> <chunk_0> <chunk_0> <chunk_0>\n",
            "6. <chunk_0> <chunk_0> <chunk_0> <chunk_0> <chunk_0> <chunk_0> <chunk_0> <chunk_0> <chunk_0> <chunk_0> <chunk_0> <chunk_0> <chunk_0> <chunk_0> <chunk_0> <chunk_0>\n",
            "7. <chunk_0> <chunk_0> <chunk_0> <chunk_0> <chunk_0> <chunk_0> <chunk_0> <chunk_0> <chunk_0> <chunk_0> <chunk_0> <chunk_0> <chunk_0> <chunk_0> <chunk_0> <chunk_0> <chunk_0> <chunk_0> <chunk_0> <chunk_0> <chunk_0> <chunk_0>\n",
            "8. <chunk_0> <chunk_0> <chunk_0> <chunk_0> <chunk_0> <chunk_0> <chunk_0> <chunk_0> <chunk_0> <chunk_0> <chunk_0> <chunk_0> <chunk_0> <chunk_0> <chunk_0> <chunk_0> <chunk_0>\n",
            "9. <chunk_0> <chunk_0> <chunk_0> <chunk_0> <chunk_0>\n",
            "10. \n",
            "\n",
            "Clusters:\n",
            "<chunk_0>: ,, ▁the, s, ▁“, .\n"
          ]
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "3.  Collocation-based attention layer:\n",
        "    \n",
        "    -   Tokenize the corpus using SentencePiece.\n",
        "    -   Calculate collocation likelihood for pairs or groups of tokens (e.g., using pointwise mutual information or log-likelihood ratio).\n",
        "    -   Train an attention mechanism that focuses on collocational likelihood rather than semantic similarity.\n",
        "    -   Combine the tokens with high collocational likelihood into chunks.\n",
        "    -   Replace the original tokens in the corpus with the corresponding chunk tokens."
      ],
      "metadata": {
        "id": "WRni5zkaXWsg"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "!pip install sentencepiece nltk torch"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "BWJ1O9UlZjyv",
        "outputId": "a4730106-639d-4722-9aaf-959e417dbff8"
      },
      "execution_count": 10,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/\n",
            "Requirement already satisfied: sentencepiece in /usr/local/lib/python3.9/dist-packages (0.1.97)\n",
            "Requirement already satisfied: nltk in /usr/local/lib/python3.9/dist-packages (3.8.1)\n",
            "Requirement already satisfied: torch in /usr/local/lib/python3.9/dist-packages (1.13.1+cu116)\n",
            "Requirement already satisfied: joblib in /usr/local/lib/python3.9/dist-packages (from nltk) (1.1.1)\n",
            "Requirement already satisfied: click in /usr/local/lib/python3.9/dist-packages (from nltk) (8.1.3)\n",
            "Requirement already satisfied: tqdm in /usr/local/lib/python3.9/dist-packages (from nltk) (4.65.0)\n",
            "Requirement already satisfied: regex>=2021.8.3 in /usr/local/lib/python3.9/dist-packages (from nltk) (2022.10.31)\n",
            "Requirement already satisfied: typing-extensions in /usr/local/lib/python3.9/dist-packages (from torch) (4.5.0)\n"
          ]
        }
      ]
    },
    {
      "cell_type": "code",
      "source": [
        "import sentencepiece as spm\n",
        "import nltk\n",
        "from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder\n",
        "import torch\n",
        "import torch.nn as nn\n",
        "from torch.optim import Adam\n",
        "\n",
        "# Train a SentencePiece model\n",
        "spm.SentencePieceTrainer.train(input='corpus.txt', model_prefix='spm_model', vocab_size=3000)\n",
        "sp = spm.SentencePieceProcessor()\n",
        "sp.load('spm_model.model')\n",
        "\n",
        "# Tokenize the corpus using SentencePiece\n",
        "with open('corpus.txt', 'r', encoding='utf-8') as f:\n",
        "    corpus = f.read()\n",
        "tokenized_corpus = [sp.encode_as_ids(line) for line in corpus.split('\\n')]\n",
        "\n",
        "# Calculate collocation likelihood for pairs of tokens using pointwise mutual information (PMI)\n",
        "bigram_measures = BigramAssocMeasures()\n",
        "finder = BigramCollocationFinder.from_documents(tokenized_corpus)\n",
        "finder.apply_freq_filter(5)  # filter out bigrams with less than 5 occurrences\n",
        "pmi_scores = finder.score_ngrams(bigram_measures.pmi)\n",
        "\n",
        "# Train an attention mechanism that focuses on collocational likelihood\n",
        "class Attention(nn.Module):\n",
        "    def __init__(self, vocab_size):\n",
        "        super(Attention, self).__init__()\n",
        "        self.embedding = nn.Embedding(vocab_size, vocab_size)\n",
        "        self.softmax = nn.Softmax(dim=-1)\n",
        "\n",
        "    def forward(self, x):\n",
        "        embedded_x = self.embedding(x)\n",
        "        attention_scores = torch.matmul(embedded_x, embedded_x.transpose(-2, -1))\n",
        "        attention_probs = self.softmax(attention_scores)\n",
        "        return attention_probs\n",
        "\n",
        "vocab_size = len(sp)\n",
        "attention = Attention(vocab_size)\n",
        "optimizer = Adam(attention.parameters(), lr=0.001)\n",
        "\n",
        "num_epochs = 10\n",
        "for epoch in range(num_epochs):\n",
        "    for line in tokenized_corpus:\n",
        "        line_tensor = torch.tensor(line, dtype=torch.long)\n",
        "        optimizer.zero_grad()\n",
        "        output = attention(line_tensor)\n",
        "        loss = -torch.mean(torch.log(torch.diag(output)))\n",
        "        loss.backward()\n",
        "        optimizer.step()\n",
        "    print(f'Epoch {epoch + 1}/{num_epochs}, Loss: {loss.item()}')\n",
        "\n",
        "# Combine tokens with high collocational likelihood into chunks\n",
        "pmi_threshold = 8  # adjust this threshold based on the dataset and desired granularity\n",
        "high_pmi_bigrams = [(token_pair, score) for token_pair, score in pmi_scores if score >= pmi_threshold]\n",
        "chunk_counter = 0\n",
        "token_to_chunk = {}\n",
        "for (token1, token2), _ in high_pmi_bigrams:\n",
        "    chunk = f'<chunk_{chunk_counter}>'\n",
        "    token_to_chunk[(token1, token2)] = chunk\n",
        "    chunk_counter += 1\n",
        "\n",
        "# Replace the original tokens in the corpus with the corresponding chunk tokens\n",
        "chunked_corpus = []\n",
        "for line in tokenized_corpus:\n",
        "    chunked_line = []\n",
        "    skip_next = False\n",
        "    for i, token in enumerate(line):\n",
        "        if i < len(line) - 1 and (token, line[i + 1]) in token_to_chunk:\n",
        "            chunked_line.append(token_to_chunk[(token, line[i + 1])])\n",
        "            skip_next = True\n",
        "        elif not skip_next:\n",
        "            chunked_line.append(sp.id_to_piece(token))\n",
        "        else:\n",
        "            skip_next = False\n",
        "    chunked_corpus.append(chunked_line)\n",
        "\n",
        "# # Save the chunked corpus\n",
        "# with open('chunked_corpus.txt', 'w', encoding='utf-8') as f:\n",
        "#     for chunked_line in chunked_corpus:\n",
        "#         f.write(' '.join(chunked_line) + '\\n')\n",
        "\n",
        "# Print the first 10 chunked sentences in a human-readable format\n",
        "print(\"First 10 chunked sentences:\")\n",
        "for i, line in enumerate(chunked_corpus[:10]):\n",
        "    print(f\"{i + 1}. {' '.join(line)}\")\n",
        "\n",
        "# Print high PMI token pairs\n",
        "print(\"\\nHigh PMI token pairs:\")\n",
        "for token_pair, chunk in token_to_chunk.items():\n",
        "    token1, token2 = token_pair\n",
        "    print(f\"{chunk}: {sp.id_to_piece(token1)} {sp.id_to_piece(token2)}\")\n"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "TOieJv_qXacZ",
        "outputId": "5aa45c7c-3627-44d7-cb44-0d9075a23ffa"
      },
      "execution_count": 14,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "Epoch 1/10, Loss: nan\n",
            "Epoch 2/10, Loss: nan\n",
            "Epoch 3/10, Loss: nan\n",
            "Epoch 4/10, Loss: nan\n",
            "Epoch 5/10, Loss: nan\n",
            "Epoch 6/10, Loss: nan\n",
            "Epoch 7/10, Loss: nan\n",
            "Epoch 8/10, Loss: nan\n",
            "Epoch 9/10, Loss: nan\n",
            "Epoch 10/10, Loss: nan\n",
            "First 10 chunked sentences:\n",
            "1. ▁The <chunk_67> ▁eBook ▁of ▁Alice ’ s ▁Adventures ▁in ▁Wonderland , ▁by ▁Le wis ▁Ca r roll\n",
            "2. \n",
            "3. ▁This ▁eBook ▁is ▁for ▁the ▁use ▁of ▁anyone ▁any where ▁in ▁the <chunk_10> ▁and\n",
            "4. ▁most ▁other ▁part s ▁of ▁the ▁world ▁at ▁no ▁cost ▁and ▁with <chunk_3> ▁no ▁restrictions\n",
            "5. ▁what so ever . <chunk_70> ▁copy ▁it , ▁give ▁it ▁away ▁or ▁re - use ▁it ▁under ▁the ▁terms\n",
            "6. ▁of ▁the <chunk_67> ▁License ▁included ▁with ▁this ▁eBook ▁or ▁on line ▁at\n",
            "7. <chunk_43> . <chunk_17> . org . ▁If ▁you ▁are ▁not ▁locat ed ▁in ▁the <chunk_10> , ▁you\n",
            "8. ▁will ▁have ▁to ▁check ▁the ▁laws ▁of ▁the <chunk_51> ▁where ▁you ▁are ▁locat ed ▁before\n",
            "9. ▁us ing ▁this ▁eBook .\n",
            "10. \n",
            "\n",
            "High PMI token pairs:\n",
            "<chunk_0>: ▁permi ssion\n",
            "<chunk_1>: ▁gui nea\n",
            "<chunk_2>: ▁white ▁kid\n",
            "<chunk_3>: ▁al most\n",
            "<chunk_4>: ▁exce pt\n",
            "<chunk_5>: ▁\" Project\n",
            "<chunk_6>: ▁Y OU\n",
            "<chunk_7>: ▁dre w\n",
            "<chunk_8>: ▁k ept\n",
            "<chunk_9>: ▁run ning\n",
            "<chunk_10>: ▁United ▁States\n",
            "<chunk_11>: ▁few ▁minutes\n",
            "<chunk_12>: ▁Whi le\n",
            "<chunk_13>: ▁sh ook\n",
            "<chunk_14>: ▁f etch\n",
            "<chunk_15>: ▁whe ther\n",
            "<chunk_16>: ▁beautiful ▁Soup\n",
            "<chunk_17>: g utenberg\n",
            "<chunk_18>: at ▁least\n",
            "<chunk_19>: CHA PT\n",
            "<chunk_20>: form ation\n",
            "<chunk_21>: PT ER\n",
            "<chunk_22>: ▁set ▁forth\n",
            "<chunk_23>: rch ive\n",
            "<chunk_24>: Of ▁course\n",
            "<chunk_25>: ▁forgot ten\n",
            "<chunk_26>: ▁sever al\n",
            "<chunk_27>: ▁March ▁Hare\n",
            "<chunk_28>: ▁writ ten\n",
            "<chunk_29>: F .3\n",
            "<chunk_30>: ive ▁Foundation\n",
            "<chunk_31>: ▁three ▁gardeners\n",
            "<chunk_32>: ▁O F\n",
            "<chunk_33>: ▁good ▁deal\n",
            "<chunk_34>: ▁How ever\n",
            "<chunk_35>: ▁w aving\n",
            "<chunk_36>: ▁White ▁Rabbit\n",
            "<chunk_37>: shi re\n",
            "<chunk_38>: ▁1. E\n",
            "<chunk_39>: ▁electronic ▁works\n",
            "<chunk_40>: ▁next ▁witness\n",
            "<chunk_41>: ▁their ▁slates\n",
            "<chunk_42>: ▁w ish\n",
            "<chunk_43>: ▁w ww\n",
            "<chunk_44>: ▁paragraph ▁1.\n",
            "<chunk_45>: ▁A fter\n",
            "<chunk_46>: ▁feet ▁high\n",
            "<chunk_47>: ▁Mock ▁Turtle\n",
            "<chunk_48>: ▁poo r\n",
            "<chunk_49>: ▁A rch\n",
            "<chunk_50>: ▁opportunit y\n",
            "<chunk_51>: ▁countr y\n",
            "<chunk_52>: ▁your ▁Majesty\n",
            "<chunk_53>: ▁their ▁heads\n",
            "<chunk_54>: ▁Literary ▁A\n",
            "<chunk_55>: ▁great ▁hurry\n",
            "<chunk_56>: e shi\n",
            "<chunk_57>: ▁Ch e\n",
            "<chunk_58>: ▁* ▁*\n",
            "<chunk_59>: ▁any ▁rate\n",
            "<chunk_60>: ▁offended ▁tone\n",
            "<chunk_61>: ▁ga ve\n",
            "<chunk_62>: ▁right ▁size\n",
            "<chunk_63>: tm ▁License\n",
            "<chunk_64>: ▁beg ▁your\n",
            "<chunk_65>: ▁Gutenberg ▁Literary\n",
            "<chunk_66>: ▁its ▁mouth\n",
            "<chunk_67>: ▁Project ▁Gutenberg\n",
            "<chunk_68>: ▁1. F\n",
            "<chunk_69>: tm ▁electronic\n",
            "<chunk_70>: ▁You ▁may\n",
            "<chunk_71>: ▁brea d\n",
            "<chunk_72>: <unk> 0\n",
            "<chunk_73>: ▁electronic ▁work\n",
            "<chunk_74>: ▁stoo d\n",
            "<chunk_75>: ▁( she\n",
            "<chunk_76>: Let ▁me\n",
            "<chunk_77>: m ▁afraid\n",
            "<chunk_78>: ▁another ▁moment\n",
            "<chunk_79>: ▁is —‘\n",
            "<chunk_80>: ▁hard ly\n",
            "<chunk_81>: ▁very ▁politely\n",
            "<chunk_82>: ▁not ▁protected\n",
            "<chunk_83>: — oop\n",
            "<chunk_84>: ▁Whe n\n",
            "<chunk_85>: ▁left ▁off\n"
          ]
        }
      ]
    }
  ]
}