@avidale
Last active May 25, 2023 21:25
subparagraphs.ipynb
{
"nbformat": 4,
"nbformat_minor": 0,
"metadata": {
"colab": {
"name": "subparagraphs.ipynb",
"provenance": [],
"collapsed_sections": [],
"authorship_tag": "ABX9TyPPFlpnRjBayY9yB+TN3dl6",
"include_colab_link": true
},
"kernelspec": {
"name": "python3",
"display_name": "Python 3"
}
},
"cells": [
{
"cell_type": "markdown",
"metadata": {
"id": "view-in-github",
"colab_type": "text"
},
"source": [
"<a href=\"https://colab.research.google.com/gist/avidale/e4450da902d36bb14c595987943120dc/subparagraphs.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "DXxaAlxZy8tA",
"colab_type": "text"
},
"source": [
"The goal is to split a text into meaningful subparagraphs - see https://stackoverflow.com/questions/62164280.\n",
"\n",
"\"Meaningfulness\" will be measured by similarity of consecutive sentence vectors: we want neighboring sentences in the same subparagraph to be similar. \n"
]
},
{
"cell_type": "code",
"metadata": {
"id": "46T3HNB7k310",
"colab_type": "code",
"outputId": "dc301953-e6b8-4bd3-ca6e-e7799d8cc2a3",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 51
}
},
"source": [
"from sklearn.datasets import fetch_20newsgroups\n",
"twenty_train = fetch_20newsgroups(subset='train', shuffle=True, random_state=42)"
],
"execution_count": 0,
"outputs": [
{
"output_type": "stream",
"text": [
"Downloading 20news dataset. This may take a few minutes.\n",
"Downloading dataset from https://ndownloader.figshare.com/files/5975967 (14 MB)\n"
],
"name": "stderr"
}
]
},
{
"cell_type": "code",
"metadata": {
"id": "zB6ngeWYnGah",
"colab_type": "code",
"colab": {}
},
"source": [
"!python -m spacy download en_core_web_sm"
],
"execution_count": 0,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"id": "Q5dDyE8clGPH",
"colab_type": "code",
"colab": {}
},
"source": [
"import spacy\n",
"import numpy as np\n",
"nlp = spacy.load('en_core_web_sm')"
],
"execution_count": 0,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"id": "BkIo9Celygia",
"colab_type": "code",
"colab": {}
},
"source": [
"text = twenty_train.data[1]"
],
"execution_count": 0,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"id": "PnDEuQcImj_l",
"colab_type": "code",
"colab": {}
},
"source": [
"doc = nlp(text)"
],
"execution_count": 0,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"id": "pxTZmweynRpz",
"colab_type": "code",
"colab": {}
},
"source": [
"sents = list(doc.sents)\n",
"# unit-normalize each sentence vector so that a dot product equals cosine similarity\n",
"vecs = np.stack([sent.vector / sent.vector_norm for sent in sents])"
],
"execution_count": 0,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "umXXkkrtzqTE",
"colab_type": "text"
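},
"source": [
"The similarity threshold below should be tuned to make the segmentation as meaningful as possible: a new subparagraph starts whenever the cosine similarity between two neighboring sentences falls below it."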
]
},
{
"cell_type": "code",
"metadata": {
"id": "yPQgOj1un-eB",
"colab_type": "code",
"colab": {}
},
"source": [
"threshold = 0.5"
],
"execution_count": 0,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"id": "r29io_MipgQT",
"colab_type": "code",
"outputId": "fe588920-f7b0-44db-8d15-403ff6ce628f",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 34
}
},
"source": [
"clusters = [[0]]\n",
"for i in range(1, len(sents)):\n",
"    if np.dot(vecs[i], vecs[i-1]) < threshold:\n",
"        # Here we use only the similarity between neighboring pairs of sentences.\n",
"        # Instead, we could use a \"weakest link\" or \"strongest link\" approach,\n",
"        # which could potentially improve the quality of the clustering.\n",
"        clusters.append([])\n",
"    clusters[-1].append(i)\n",
"print(clusters)"
],
"execution_count": 0,
"outputs": [
{
"output_type": "stream",
"text": [
"[[0], [1], [2], [3], [4], [5], [6, 7, 8], [9], [10], [11, 12], [13], [14], [15, 16], [17], [18]]\n"
],
"name": "stdout"
}
]
},
{
"cell_type": "code",
"metadata": {
"id": "DjnHMZGrwMyV",
"colab_type": "code",
"outputId": "2b847ccc-554e-4d78-b7bc-904315b56782",
"colab": {
"base_uri": "https://localhost:8080/",
"height": 867
}
},
"source": [
"for cluster in clusters:\n",
"    print(' '.join([sents[i].text for i in cluster]))\n",
"    print('---------------------------------------')"
],
"execution_count": 0,
"outputs": [
{
"output_type": "stream",
"text": [
"From: guykuo@carson.u.washington.edu\n",
"---------------------------------------\n",
"(Guy Kuo)\n",
"\n",
"---------------------------------------\n",
"Subject:\n",
"---------------------------------------\n",
"SI Clock Poll - Final Call\n",
"\n",
"---------------------------------------\n",
"Summary:\n",
"---------------------------------------\n",
"Final call for SI clock reports\n",
"\n",
"---------------------------------------\n",
"Keywords: SI,acceleration,clock,upgrade\n",
" Article-I.D.: shelley.1qvfo9INNc3s\n",
"Organization: University of Washington\n",
"Lines: 11\n",
"\n",
"---------------------------------------\n",
"NNTP-Posting-Host:\n",
"---------------------------------------\n",
"carson.u.washington.edu\n",
"\n",
"\n",
"---------------------------------------\n",
"A fair number of brave souls who upgraded their SI clock oscillator have\n",
"shared their experiences for this poll. Please send a brief message detailing\n",
"your experiences with the procedure.\n",
"---------------------------------------\n",
"Top speed attained, CPU rated speed,\n",
"add on cards and adapters, heat sinks, hour of usage per day, floppy disk\n",
"functionality with 800 and 1.4\n",
"---------------------------------------\n",
"m floppies are especially requested.\n",
"\n",
"\n",
"---------------------------------------\n",
"I will be summarizing in the next two days, so please add to the network\n",
"knowledge base if you have done the clock upgrade and haven't answered this\n",
"poll.\n",
"---------------------------------------\n",
"Thanks.\n",
"\n",
"\n",
"---------------------------------------\n",
"Guy Kuo <guykuo@u.washington.edu>\n",
"\n",
"---------------------------------------\n"
],
"name": "stdout"
}
]
},
{
"cell_type": "code",
"metadata": {
"id": "bn5W9KBSwuGw",
"colab_type": "code",
"colab": {}
},
"source": [
""
],
"execution_count": 0,
"outputs": []
}
]
}
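The comment in the clustering cell mentions "weakest link" and "strongest link" alternatives without spelling them out. One possible reading, offered as an assumption rather than as the author's method: compare each new sentence with every sentence already in the current subparagraph, and use the minimum ("weakest") or maximum ("strongest") of those similarities instead of only the similarity to the immediate neighbor. The function name `segment`, its `link` parameter, and the toy vectors below are hypothetical and exist only for illustration.

```python
import numpy as np

def segment(vecs, threshold=0.5, link="neighbor"):
    """Group consecutive rows of `vecs` (assumed unit-normalized) into clusters.

    link="neighbor"  : compare with the previous sentence only (as in the notebook)
    link="weakest"   : use the minimum similarity to the current cluster (assumed reading)
    link="strongest" : use the maximum similarity to the current cluster (assumed reading)
    """
    if len(vecs) == 0:
        return []
    scorers = {
        "neighbor": lambda sims: sims[-1],
        "weakest": np.min,
        "strongest": np.max,
    }
    score = scorers[link]
    clusters = [[0]]
    for i in range(1, len(vecs)):
        # cosine similarities between sentence i and every member of the current cluster
        sims = vecs[clusters[-1]] @ vecs[i]
        if score(sims) < threshold:
            clusters.append([])
        clusters[-1].append(i)
    return clusters

# Toy 2-D unit vectors at 0, 40, 80 and 170 degrees (not real sentence embeddings).
angles = np.deg2rad([0, 40, 80, 170])
toy = np.stack([np.cos(angles), np.sin(angles)], axis=1)

print(segment(toy, 0.5, "neighbor"))
print(segment(toy, 0.5, "weakest"))
print(segment(toy, 0.5, "strongest"))
```

With the notebook's variables, the call would be `segment(vecs, threshold, link="weakest")`. Under this reading, the "weakest" variant tends to produce shorter subparagraphs, because a sentence must be similar to every sentence already in the group, not just to the one before it.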