ceo-letters-chunking.ipynb
{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"id": "view-in-github",
"colab_type": "text"
},
"source": [
"<a href=\"https://colab.research.google.com/gist/wesslen/44d097819019b56f919fa4f84eb33b32/ceo-letters-chunking.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "lXaRKvUg8aw0"
},
"source": [
"## Intro\n",
"\n",
"This notebook will use LangChain's textsplitter and spaCy to break up all of the documents into sentences. Per the [Tips of BERTopic](https://maartengr.github.io/BERTopic/getting_started/tips_and_tricks/tips_and_tricks.html#document-length), if using sentence-transformers, it recommends before running either prep data on a sentence level or a paragraph level.\n",
"\n",
"For simplicity, I'm showing how to do this on a sentence-level."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "SIsOgj0_QV2c",
"outputId": "2a269677-a126-467e-f19a-48230b692145"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m289.1/289.1 kB\u001b[0m \u001b[31m4.4 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
"\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m113.7/113.7 kB\u001b[0m \u001b[31m3.7 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
"\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m53.0/53.0 kB\u001b[0m \u001b[31m1.6 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
"\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m141.1/141.1 kB\u001b[0m \u001b[31m1.4 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
"\u001b[?25h"
]
}
],
"source": [
"%pip install -qU langchain-text-splitters\n",
"%pip install --upgrade --quiet spacy"
]
},
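{
"cell_type": "markdown",
"metadata": {},
"source": [
"Before processing the full corpus, here's a minimal sketch of what the splitter does on a toy passage (the sample text below is made up for illustration). `SpacyTextSplitter` segments text into sentences with spaCy, then merges them into chunks of up to `chunk_size` characters; this assumes the default `en_core_web_sm` model is available, as it is on Colab."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from langchain_text_splitters import SpacyTextSplitter\n",
"\n",
"# Toy passage (hypothetical text, not from the CEO letters)\n",
"sample = (\n",
"    \"To our shareholders: revenue grew again this year. \"\n",
"    \"We invested heavily in new products. \"\n",
"    \"Thank you for your continued support.\"\n",
")\n",
"\n",
"# A small chunk_size forces roughly one sentence per chunk\n",
"demo_splitter = SpacyTextSplitter(chunk_size=60)\n",
"for chunk in demo_splitter.split_text(sample):\n",
"    print(repr(chunk))"
]
},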
{
"cell_type": "markdown",
"metadata": {
"id": "AGFP46pX97md"
},
"source": [
"Make sure to upload the zip file of the CEO Letters if running in Colab."
]
},
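{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you haven't uploaded it yet, the optional cell below is a convenience sketch that uses Colab's `files.upload()` helper to open an interactive file picker (Colab only)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Optional, Colab only: pick ceo-letters-shareholders.zip from your machine.\n",
"# Skip this cell if the zip is already in the working directory.\n",
"from google.colab import files\n",
"files.upload()"
]
},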
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"id": "1PlpFojLfIE6"
},
"outputs": [],
"source": [
"%%capture\n",
"!unzip ceo-letters-shareholders.zip"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"background_save": true,
"base_uri": "https://localhost:8080/"
},
"id": "89CM3U4Se-4E",
"outputId": "621d4645-95ca-4f6a-af3b-ab3f99133a3f"
},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"/usr/local/lib/python3.10/dist-packages/spacy/pipeline/lemmatizer.py:211: UserWarning: [W108] The rule-based lemmatizer did not find POS annotation for one or more tokens. Check that your pipeline includes components that assign token.pos, typically 'tagger'+'attribute_ruler' or 'morphologizer'.\n",
" warnings.warn(Warnings.W108)\n",
"WARNING:langchain_text_splitters.base:Created a chunk of size 1397, which is longer than the specified 1000\n",
"WARNING:langchain_text_splitters.base:Created a chunk of size 1301, which is longer than the specified 1000\n",
"WARNING:langchain_text_splitters.base:Created a chunk of size 1512, which is longer than the specified 1000\n",
"WARNING:langchain_text_splitters.base:Created a chunk of size 1422, which is longer than the specified 1000\n",
"WARNING:langchain_text_splitters.base:Created a chunk of size 1117, which is longer than the specified 1000\n",
"WARNING:langchain_text_splitters.base:Created a chunk of size 1467, which is longer than the specified 1000\n",
"WARNING:langchain_text_splitters.base:Created a chunk of size 1921, which is longer than the specified 1000\n",
"WARNING:langchain_text_splitters.base:Created a chunk of size 2834, which is longer than the specified 1000\n",
"WARNING:langchain_text_splitters.base:Created a chunk of size 2184, which is longer than the specified 1000\n",
"WARNING:langchain_text_splitters.base:Created a chunk of size 1079, which is longer than the specified 1000\n",
"WARNING:langchain_text_splitters.base:Created a chunk of size 1107, which is longer than the specified 1000\n",
"WARNING:langchain_text_splitters.base:Created a chunk of size 1030, which is longer than the specified 1000\n",
"WARNING:langchain_text_splitters.base:Created a chunk of size 1006, which is longer than the specified 1000\n",
"WARNING:langchain_text_splitters.base:Created a chunk of size 1416, which is longer than the specified 1000\n",
"WARNING:langchain_text_splitters.base:Created a chunk of size 1222, which is longer than the specified 1000\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
" text filename\n",
"0 To our stockholders, \\n\\nFiscal 2021 was a tru... Cisco_Robbins_2021.txt\n",
"1 At Cisco, we are incredibly proud of how our t... Cisco_Robbins_2021.txt\n",
"2 In the fourth quarter, we experienced the stro... Cisco_Robbins_2021.txt\n",
"3 We also achieved software subscription revenue... Cisco_Robbins_2021.txt\n",
"4 Power of our portfolio— Leading with innovatio... Cisco_Robbins_2021.txt\n"
]
}
],
"source": [
"import os\n",
"import pandas as pd\n",
"from langchain_text_splitters import SpacyTextSplitter\n",
"\n",
"# Directory containing the .txt files\n",
"directory = \".\"\n",
"\n",
"# Initialize the text splitter with the desired configuration\n",
"text_splitter = SpacyTextSplitter(chunk_size=1000)\n",
"\n",
"# Prepare to collect data in a list\n",
"data = []\n",
"\n",
"# Loop through each file in the directory\n",
"for filename in os.listdir(directory):\n",
" if filename.endswith(\".txt\"):\n",
" # Full path to the file\n",
" file_path = os.path.join(directory, filename)\n",
"\n",
" # Read the file content\n",
" with open(file_path, 'r') as file:\n",
" doc = file.read()\n",
"\n",
" # Create documents (chunks) from the file content\n",
" texts = text_splitter.create_documents([doc])\n",
"\n",
" # Filter and collect chunks with a sufficient number of words\n",
" for text in texts:\n",
" if len(text.page_content.split()) > 5:\n",
" data.append({\n",
" 'text': text.page_content,\n",
" 'filename': filename\n",
" })\n",
"\n",
"# Convert the list of dictionaries to a DataFrame\n",
"df = pd.DataFrame(data, columns=['text', 'filename'])\n",
"\n",
"# Display the DataFrame structure\n",
"print(df.head())\n"
]
},
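{
"cell_type": "markdown",
"metadata": {},
"source": [
"Note the `Created a chunk of size ..., which is longer than the specified 1000` warnings above: the splitter never cuts inside a sentence, so any single sentence (or unbroken span) longer than `chunk_size` characters is emitted as an oversized chunk. The sketch below is a quick sanity check that counts how many chunks exceed the 1,000-character target."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Count chunks that exceeded the 1,000-character target\n",
"n_long = (df['text'].str.len() > 1000).sum()\n",
"print(f\"{n_long} of {len(df)} chunks exceed 1,000 characters\")"
]
},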
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 122
},
"id": "_bjUIZT858Z9",
"outputId": "bce85972-d3c8-49bd-e226-c6f845695617"
},
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"'We continue to provide COVID-19 relief around the world through donations of products, cash and personal protective equipment.\\n\\nAnd while the pandemic is rightfully top of mind, there are many people who’ve also faced natural disasters around the world, and we’ve responded to support those dealing with fires, floods, typhoons, hurricanes and other emergencies.\\n\\nOur Children’s Safe Drinking Water Program continues to provide clean drinking water to those who lack access, reaching 18 billion liters of clean water provided since the program began, thanks to the collaboration of more than 150 partners, and we’re on our way to 25 billion liters by 2025.\\n\\nEthics & Corporate Responsibility\\u2009—\\u2009good governance\\u2009—\\u2009 is the foundation for everything we do at P&G, including our Citizenship work.'"
],
"application/vnd.google.colaboratory.intrinsic+json": {
"type": "string"
}
},
"metadata": {},
"execution_count": 5
}
],
"source": [
"df.iloc[2000][\"text\"]"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "QWmotMc1443S",
"outputId": "24e7ee86-6634-46e3-fbf3-059a610dbbf2"
},
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"(5225, 2)"
]
},
"metadata": {},
"execution_count": 6
}
],
"source": [
"df.shape"
]
},
{
"cell_type": "markdown",
"source": [
"You can now save the DataFrame as a `csv` file and load it in a new notebook for the BERTopic analysis; a minimal example follows (the output filename is just a suggestion)."
],
"metadata": {
"id": "SDfD_zBYAlMD"
}
},
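{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Save the chunks for downstream analysis\n",
"df.to_csv(\"ceo-letters-chunks.csv\", index=False)"
]
}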
],
"metadata": {
"colab": {
"provenance": [],
"authorship_tag": "ABX9TyN9prDp8Vgj/ivfpxj8nf9I",
"include_colab_link": true
},
"kernelspec": {
"display_name": "Python 3",
"name": "python3"
},
"language_info": {
"name": "python"
}
},
"nbformat": 4,
"nbformat_minor": 0
}