avidale/subparagraphs.ipynb

## subparagraphs.ipynb
{
  "nbformat": 4,
  "nbformat_minor": 0,
  "metadata": {
    "colab": {
      "name": "subparagraphs.ipynb",
      "provenance": [],
      "collapsed_sections": [],
      "authorship_tag": "ABX9TyPPFlpnRjBayY9yB+TN3dl6",
      "include_colab_link": true
    },
    "kernelspec": {
      "name": "python3",
      "display_name": "Python 3"
    }
  },
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "view-in-github",
        "colab_type": "text"
      },
      "source": [
        "<a href=\"https://colab.research.google.com/gist/avidale/e4450da902d36bb14c595987943120dc/subparagraphs.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "DXxaAlxZy8tA",
        "colab_type": "text"
      },
      "source": [
        "The goal is to split a text into meaningful subparagraphs - see https://stackoverflow.com/questions/62164280.\n",
        "\n",
        "\"Meaningfulness\" is measured by the cosine similarity of consecutive sentence vectors: we want neighboring sentences in the same subparagraph to be similar.\n"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "46T3HNB7k310",
        "colab_type": "code",
        "outputId": "dc301953-e6b8-4bd3-ca6e-e7799d8cc2a3",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 51
        }
      },
      "source": [
        "from sklearn.datasets import fetch_20newsgroups\n",
        "twenty_train = fetch_20newsgroups(subset='train', shuffle=True, random_state=42)"
      ],
      "execution_count": 0,
      "outputs": [
        {
          "output_type": "stream",
          "text": [
            "Downloading 20news dataset. This may take a few minutes.\n",
            "Downloading dataset from https://ndownloader.figshare.com/files/5975967 (14 MB)\n"
          ],
          "name": "stderr"
        }
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "zB6ngeWYnGah",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "!python -m spacy download en_core_web_sm"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "Q5dDyE8clGPH",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "import spacy\n",
        "import numpy as np\n",
        "nlp = spacy.load('en_core_web_sm')"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "BkIo9Celygia",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "text = twenty_train.data[1]"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "PnDEuQcImj_l",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "doc = nlp(text)"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "pxTZmweynRpz",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
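        "# split the document into sentences\n",
        "sents = list(doc.sents)\n",
        "# L2-normalize each sentence vector, so that a dot product of two rows is their cosine similarity\n",
        "vecs = np.stack([sent.vector / sent.vector_norm for sent in sents])"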
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "umXXkkrtzqTE",
        "colab_type": "text"
      },
      "source": [
        "The similarity threshold below should be tuned to make the segmentation as meaningful as possible: a higher value produces more, shorter subparagraphs."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "yPQgOj1un-eB",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "threshold = 0.5"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "r29io_MipgQT",
        "colab_type": "code",
        "outputId": "fe588920-f7b0-44db-8d15-403ff6ce628f",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 34
        }
      },
      "source": [
        "clusters = [[0]]\n",
        "for i in range(1, len(sents)):\n",
        "    if np.dot(vecs[i], vecs[i-1]) < threshold:\n",
        "        # Start a new cluster: here we use only the similarity between neighboring pairs of sentences.\n",
        "        # Instead, we could compare the sentence with the whole current cluster (a \"weakest link\" or \"strongest link\" approach).\n",
        "        # Potentially, this could improve the quality of the clustering.\n",
        "        clusters.append([])\n",
        "    clusters[-1].append(i)\n",
        "print(clusters)"
      ],
      "execution_count": 0,
      "outputs": [
        {
          "output_type": "stream",
          "text": [
            "[[0], [1], [2], [3], [4], [5], [6, 7, 8], [9], [10], [11, 12], [13], [14], [15, 16], [17], [18]]\n"
          ],
          "name": "stdout"
        }
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "DjnHMZGrwMyV",
        "colab_type": "code",
        "outputId": "2b847ccc-554e-4d78-b7bc-904315b56782",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 867
        }
      },
      "source": [
        "for cluster in clusters:\n",
        "    print(' '.join([sents[i].text for i in cluster]))\n",
        "    print('---------------------------------------')"
      ],
      "execution_count": 0,
      "outputs": [
        {
          "output_type": "stream",
          "text": [
            "From: guykuo@carson.u.washington.edu\n",
            "---------------------------------------\n",
            "(Guy Kuo)\n",
            "\n",
            "---------------------------------------\n",
            "Subject:\n",
            "---------------------------------------\n",
            "SI Clock Poll - Final Call\n",
            "\n",
            "---------------------------------------\n",
            "Summary:\n",
            "---------------------------------------\n",
            "Final call for SI clock reports\n",
            "\n",
            "---------------------------------------\n",
            "Keywords: SI,acceleration,clock,upgrade\n",
            " Article-I.D.: shelley.1qvfo9INNc3s\n",
            "Organization: University of Washington\n",
            "Lines: 11\n",
            "\n",
            "---------------------------------------\n",
            "NNTP-Posting-Host:\n",
            "---------------------------------------\n",
            "carson.u.washington.edu\n",
            "\n",
            "\n",
            "---------------------------------------\n",
            "A fair number of brave souls who upgraded their SI clock oscillator have\n",
            "shared their experiences for this poll. Please send a brief message detailing\n",
            "your experiences with the procedure.\n",
            "---------------------------------------\n",
            "Top speed attained, CPU rated speed,\n",
            "add on cards and adapters, heat sinks, hour of usage per day, floppy disk\n",
            "functionality with 800 and 1.4\n",
            "---------------------------------------\n",
            "m floppies are especially requested.\n",
            "\n",
            "\n",
            "---------------------------------------\n",
            "I will be summarizing in the next two days, so please add to the network\n",
            "knowledge base if you have done the clock upgrade and haven't answered this\n",
            "poll.\n",
            "---------------------------------------\n",
            "Thanks.\n",
            "\n",
            "\n",
            "---------------------------------------\n",
            "Guy Kuo <guykuo@u.washington.edu>\n",
            "\n",
            "---------------------------------------\n"
          ],
          "name": "stdout"
        }
      ]
    }
  ]
}
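
The notebook's comments mention a "weakest link" or "strongest link" alternative to comparing only neighboring sentences, but do not implement it. The following is a minimal sketch of one possible reading, not the author's method: it assumes "weakest link" means the minimum similarity between the new sentence and every sentence already in the current cluster, and "strongest link" means the maximum. The function name `split_by_linkage` and its `linkage` parameter are hypothetical, and the toy vectors stand in for the normalized `vecs` array.

```python
def split_by_linkage(vecs, threshold, linkage="weakest"):
    """Group consecutive unit vectors into clusters of indices.

    A new cluster starts when the linkage score between vector i and the
    current cluster falls below `threshold`.
    Assumption: "weakest" = min similarity to the cluster members,
    "strongest" = max similarity to the cluster members.
    """
    if linkage not in ("weakest", "strongest"):
        raise ValueError("linkage must be 'weakest' or 'strongest'")
    if len(vecs) == 0:
        return []

    def dot(u, v):
        # For unit-length vectors the dot product is the cosine similarity.
        return sum(float(a) * float(b) for a, b in zip(u, v))

    reduce_fn = min if linkage == "weakest" else max
    clusters = [[0]]
    for i in range(1, len(vecs)):
        sims = [dot(vecs[i], vecs[j]) for j in clusters[-1]]
        if reduce_fn(sims) < threshold:
            clusters.append([])
        clusters[-1].append(i)
    return clusters


# Toy unit vectors that drift gradually from (1, 0) to (0, 1).
toy_vecs = [(1.0, 0.0), (0.8, 0.6), (0.28, 0.96), (0.0, 1.0)]
print(split_by_linkage(toy_vecs, 0.5, "weakest"))
print(split_by_linkage(toy_vecs, 0.5, "strongest"))
```

In this toy case every neighboring pair is similar, so the neighbor-only rule would keep all four vectors together. The "weakest link" variant splits once the new vector has drifted too far from the start of the cluster, which may help with texts whose topic shifts gradually. In the notebook, the function could be called as `split_by_linkage(vecs, threshold)`.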