@BarisSari
Last active July 10, 2021 21:14
sentence-tokenization.ipynb
{
"nbformat": 4,
"nbformat_minor": 0,
"metadata": {
"colab": {
"name": "sentence-tokenization.ipynb",
"provenance": [],
"collapsed_sections": [],
"authorship_tag": "ABX9TyP0UwTWOzv2EhDJbdGnQwTR",
"include_colab_link": true
},
"kernelspec": {
"name": "python3",
"display_name": "Python 3"
}
},
"cells": [
{
"cell_type": "markdown",
"metadata": {
"id": "view-in-github",
"colab_type": "text"
},
"source": [
"<a href=\"https://colab.research.google.com/gist/BarisSari/1b9c73f8a433cdbb0da0d2d9fa376691/sentence-tokenization.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
]
},
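{
"cell_type": "code",
"metadata": {
"id": "SGchc1tMDfCX"
},
"source": [
"!pip install pandas nltk spacy\n",
"# English model required by spacy.load(\"en_core_web_sm\") further down\n",
"!python -m spacy download en_core_web_sm"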
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "0VCOMiHrEqRQ",
"outputId": "09487e71-4192-4edd-b881-139fcc8f8873"
},
"source": [
"import nltk\n",
"\n",
"# Download Punkt Sentence Tokenizer\n",
"nltk.download('punkt')"
],
"execution_count": 12,
"outputs": [
{
"output_type": "stream",
"text": [
"[nltk_data] Downloading package punkt to /root/nltk_data...\n",
"[nltk_data] Package punkt is already up-to-date!\n"
],
"name": "stdout"
},
{
"output_type": "execute_result",
"data": {
"text/plain": [
"True"
]
},
"metadata": {
"tags": []
},
"execution_count": 12
}
]
},
{
"cell_type": "code",
"metadata": {
"id": "HLovth7afRFr"
},
"source": [
"import os\n",
"\n",
"# On Jupyter Notebook or Colab\n",
"DIR_PATH = os.getcwd()\n",
"# On Python module\n",
"#DIR_PATH = os.path.dirname(__file__)\n",
"\n",
"FILE_PATH = os.path.join(DIR_PATH, \"scripts.csv\")"
],
"execution_count": 13,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "eJJrXlJID1bg",
"outputId": "69824a34-4eee-4efa-ddec-45f24953dabf"
},
"source": [
"import pandas as pd\n",
"\n",
"# Read the file\n",
"df = pd.read_csv(FILE_PATH)\n",
"# Remove NaN values\n",
"df = df[~df[\"Dialogue\"].isna()]\n",
"# Assign first_dialogue to the first row's \"Dialogue\" column\n",
"first_dialogue = df.loc[0, \"Dialogue\"]\n",
"print(first_dialogue)"
],
"execution_count": 14,
"outputs": [
{
"output_type": "stream",
"text": [
"Do you know what this is all about? Do you know, why were here? To be out, this is out...and out is one of the single most enjoyable experiences of life. People...did you ever hear people talking about We should go out? This is what theyre talking about...this whole thing, were all out now, no one is home. Not one person here is home, were all out! There are people tryin to find us, they dont know where we are. (on an imaginary phone) Did you ring?, I cant find him. Where did he go? He didnt tell me where he was going. He must have gone out. You wanna go out you get ready, you pick out the clothes, right? You take the shower, you get all ready, get the cash, get your friends, the car, the spot, the reservation...Then youre standing around, whatta you do? You go We gotta be getting back. Once youre out, you wanna get back! You wanna go to sleep, you wanna get up, you wanna go out again tomorrow, right? Where ever you are in life, its my feeling, youve gotta go.\n"
],
"name": "stdout"
}
]
},
{
"cell_type": "code",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 204
},
"id": "HmqACyrSD5-f",
"outputId": "cd478bef-63d0-40a9-e154-9852f2c93bdc"
},
"source": [
"df.head()"
],
"execution_count": 15,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Unnamed: 0</th>\n",
" <th>Character</th>\n",
" <th>Dialogue</th>\n",
" <th>EpisodeNo</th>\n",
" <th>SEID</th>\n",
" <th>Season</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>0</td>\n",
" <td>JERRY</td>\n",
" <td>Do you know what this is all about? Do you kno...</td>\n",
" <td>1.0</td>\n",
" <td>S01E01</td>\n",
" <td>1.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>1</td>\n",
" <td>JERRY</td>\n",
" <td>(pointing at Georges shirt) See, to me, that b...</td>\n",
" <td>1.0</td>\n",
" <td>S01E01</td>\n",
" <td>1.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>2</td>\n",
" <td>GEORGE</td>\n",
" <td>Are you through?</td>\n",
" <td>1.0</td>\n",
" <td>S01E01</td>\n",
" <td>1.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>3</td>\n",
" <td>JERRY</td>\n",
" <td>You do of course try on, when you buy?</td>\n",
" <td>1.0</td>\n",
" <td>S01E01</td>\n",
" <td>1.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>4</td>\n",
" <td>GEORGE</td>\n",
" <td>Yes, it was purple, I liked it, I dont actuall...</td>\n",
" <td>1.0</td>\n",
" <td>S01E01</td>\n",
" <td>1.0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Unnamed: 0 Character ... SEID Season\n",
"0 0 JERRY ... S01E01 1.0\n",
"1 1 JERRY ... S01E01 1.0\n",
"2 2 GEORGE ... S01E01 1.0\n",
"3 3 JERRY ... S01E01 1.0\n",
"4 4 GEORGE ... S01E01 1.0\n",
"\n",
"[5 rows x 6 columns]"
]
},
"metadata": {
"tags": []
},
"execution_count": 15
}
]
},
{
"cell_type": "code",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "wkD7e6AdEEQ-",
"outputId": "10afeda4-3c6c-4860-99ba-38296c1e963e"
},
"source": [
"# Naive baseline: str.split on every period (breaks on ellipses, ignores '?' and '!')\n",
"first_dialogue.split(\".\")"
],
"execution_count": 16,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"['Do you know what this is all about? Do you know, why were here? To be out, this is out',\n",
" '',\n",
" '',\n",
" 'and out is one of the single most enjoyable experiences of life',\n",
" ' People',\n",
" '',\n",
" '',\n",
" 'did you ever hear people talking about We should go out? This is what theyre talking about',\n",
" '',\n",
" '',\n",
" 'this whole thing, were all out now, no one is home',\n",
" ' Not one person here is home, were all out! There are people tryin to find us, they dont know where we are',\n",
" ' (on an imaginary phone) Did you ring?, I cant find him',\n",
" ' Where did he go? He didnt tell me where he was going',\n",
" ' He must have gone out',\n",
" ' You wanna go out you get ready, you pick out the clothes, right? You take the shower, you get all ready, get the cash, get your friends, the car, the spot, the reservation',\n",
" '',\n",
" '',\n",
" 'Then youre standing around, whatta you do? You go We gotta be getting back',\n",
" ' Once youre out, you wanna get back! You wanna go to sleep, you wanna get up, you wanna go out again tomorrow, right? Where ever you are in life, its my feeling, youve gotta go',\n",
" '']"
]
},
"metadata": {
"tags": []
},
"execution_count": 16
}
]
},
{
"cell_type": "code",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "ipRRtATkEW7l",
"outputId": "c407388d-52c0-43e7-a96d-49757faac9d3"
},
"source": [
"from nltk.tokenize import sent_tokenize\n",
"\n",
"sent_tokenize(first_dialogue)"
],
"execution_count": 17,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"['Do you know what this is all about?',\n",
" 'Do you know, why were here?',\n",
" 'To be out, this is out...and out is one of the single most enjoyable experiences of life.',\n",
" 'People...did you ever hear people talking about We should go out?',\n",
" 'This is what theyre talking about...this whole thing, were all out now, no one is home.',\n",
" 'Not one person here is home, were all out!',\n",
" 'There are people tryin to find us, they dont know where we are.',\n",
" '(on an imaginary phone) Did you ring?, I cant find him.',\n",
" 'Where did he go?',\n",
" 'He didnt tell me where he was going.',\n",
" 'He must have gone out.',\n",
" 'You wanna go out you get ready, you pick out the clothes, right?',\n",
" 'You take the shower, you get all ready, get the cash, get your friends, the car, the spot, the reservation...Then youre standing around, whatta you do?',\n",
" 'You go We gotta be getting back.',\n",
" 'Once youre out, you wanna get back!',\n",
" 'You wanna go to sleep, you wanna get up, you wanna go out again tomorrow, right?',\n",
" 'Where ever you are in life, its my feeling, youve gotta go.']"
]
},
"metadata": {
"tags": []
},
"execution_count": 17
}
]
},
{
"cell_type": "code",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "N7PqnpEygfdw",
"outputId": "b4ff9cda-13a4-4042-ed16-eb571c461920"
},
"source": [
"%%timeit -n 10\n",
"# .loc slicing is label-based and inclusive, so this covers index labels 0 through 5000\n",
"df.loc[:5000, \"Dialogue\"].apply(sent_tokenize)"
],
"execution_count": 18,
"outputs": [
{
"output_type": "stream",
"text": [
"10 loops, best of 3: 236 ms per loop\n"
],
"name": "stdout"
}
]
},
{
"cell_type": "code",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "Bx5WECM2FQIb",
"outputId": "703aba6d-1d96-4cd8-ed84-9d134a022d7a"
},
"source": [
"import spacy\n",
"# Use spaCy's dependency parser to set sentence boundaries\n",
"nlp = spacy.load(\"en_core_web_sm\")\n",
"[str(sent) for sent in nlp(first_dialogue).sents]"
],
"execution_count": 19,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"['Do you know what this is all about?',\n",
" 'Do you know, why were here?',\n",
" 'To be out, this is out...and out is one of the single most enjoyable experiences of life.',\n",
" 'People...did you ever hear people talking about We should go out?',\n",
" 'This is what theyre talking about...',\n",
" 'this whole thing, were all out now, no one is home.',\n",
" 'Not one person here is home, were all out!',\n",
" 'There are people tryin to find us, they dont know where we are.',\n",
" '(on an imaginary phone) Did you ring?',\n",
" ', I cant find him.',\n",
" 'Where did he go?',\n",
" 'He didnt tell me where he was going.',\n",
" 'He must have gone out.',\n",
" 'You wanna go out',\n",
" 'you get ready',\n",
" ', you pick out the clothes, right?',\n",
" 'You take the shower, you get all ready, get the cash, get your friends, the car, the spot, the reservation...',\n",
" 'Then youre standing around, whatta you do?',\n",
" 'You go',\n",
" 'We gotta be getting back.',\n",
" 'Once youre out, you wanna get back!',\n",
" 'You wanna go to sleep, you wanna get up, you wanna go out again tomorrow,',\n",
" 'right?',\n",
" 'Where ever you are in life, its my feeling, youve gotta go.']"
]
},
"metadata": {
"tags": []
},
"execution_count": 19
}
]
},
{
"cell_type": "code",
"metadata": {
"id": "mWuGSjNdMJI7"
},
"source": [
"#%%timeit -n 10\n",
"# WARNING: takes a long time!\n",
"#df.loc[:5000, \"Dialogue\"].apply(lambda x: [sent.text for sent in nlp(x).sents])"
],
"execution_count": 21,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "ToodwNUzLkBZ",
"outputId": "e6db702a-760a-4bab-ac5d-6645344cff3b"
},
"source": [
"from spacy.lang.en import English\n",
"# Use spaCy's rule-based sentencizer instead of the dependency parser\n",
"nlp = English() # blank English pipeline, no trained model\n",
"# spaCy v2 API; on spaCy v3+ replace the next two lines with nlp.add_pipe(\"sentencizer\")\n",
"sentencizer = nlp.create_pipe(\"sentencizer\")\n",
"nlp.add_pipe(sentencizer)\n",
"[str(sent) for sent in nlp(first_dialogue).sents]"
],
"execution_count": 22,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"['Do you know what this is all about?',\n",
" 'Do you know, why were here?',\n",
" 'To be out, this is out...and out is one of the single most enjoyable experiences of life.',\n",
" 'People...did you ever hear people talking about We should go out?',\n",
" 'This is what theyre talking about...this whole thing, were all out now, no one is home.',\n",
" 'Not one person here is home, were all out!',\n",
" 'There are people tryin to find us, they dont know where we are. (',\n",
" 'on an imaginary phone) Did you ring?,',\n",
" 'I cant find him.',\n",
" 'Where did he go?',\n",
" 'He didnt tell me where he was going.',\n",
" 'He must have gone out.',\n",
" 'You wanna go out you get ready, you pick out the clothes, right?',\n",
" 'You take the shower, you get all ready, get the cash, get your friends, the car, the spot, the reservation...Then youre standing around, whatta you do?',\n",
" 'You go We gotta be getting back.',\n",
" 'Once youre out, you wanna get back!',\n",
" 'You wanna go to sleep, you wanna get up, you wanna go out again tomorrow, right?',\n",
" 'Where ever you are in life, its my feeling, youve gotta go.']"
]
},
"metadata": {
"tags": []
},
"execution_count": 22
}
]
},
{
"cell_type": "code",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "2pXpftxcnlfz",
"outputId": "eada0a94-363b-4204-8b2e-79f598ad82c3"
},
"source": [
"%%timeit -n 10\n",
"df.loc[:5000, \"Dialogue\"].apply(lambda x: [sent.text for sent in nlp(x).sents])"
],
"execution_count": 23,
"outputs": [
{
"output_type": "stream",
"text": [
"10 loops, best of 3: 219 ms per loop\n"
],
"name": "stdout"
}
]
},
{
"cell_type": "code",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "QvYfpERJIRKT",
"outputId": "bc4a6452-0289-49fd-c1bd-8a1bd6e4b56d"
},
"source": [
"import re\n",
"# Split on whitespace that follows '.' or '?', skipping abbreviation-like patterns;\n",
"# '!' is not treated as a sentence end by this rule\n",
"rule = r\"(?<!\\w\\.\\w.)(?<![A-Z][a-z]\\.)(?<=\\.|\\?)\\s\"\n",
"re.split(rule, first_dialogue)"
],
"execution_count": 24,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": [
"['Do you know what this is all about?',\n",
" 'Do you know, why were here?',\n",
" 'To be out, this is out...and out is one of the single most enjoyable experiences of life.',\n",
" 'People...did you ever hear people talking about We should go out?',\n",
" 'This is what theyre talking about...this whole thing, were all out now, no one is home.',\n",
" 'Not one person here is home, were all out! There are people tryin to find us, they dont know where we are.',\n",
" '(on an imaginary phone) Did you ring?, I cant find him.',\n",
" 'Where did he go?',\n",
" 'He didnt tell me where he was going.',\n",
" 'He must have gone out.',\n",
" 'You wanna go out you get ready, you pick out the clothes, right?',\n",
" 'You take the shower, you get all ready, get the cash, get your friends, the car, the spot, the reservation...Then youre standing around, whatta you do?',\n",
" 'You go We gotta be getting back.',\n",
" 'Once youre out, you wanna get back! You wanna go to sleep, you wanna get up, you wanna go out again tomorrow, right?',\n",
" 'Where ever you are in life, its my feeling, youve gotta go.']"
]
},
"metadata": {
"tags": []
},
"execution_count": 24
}
]
},
{
"cell_type": "code",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "tfSscvTri0Wk",
"outputId": "de47bb32-b4da-485e-ea69-94ee8ebcab10"
},
"source": [
"%%timeit -n 10\n",
"df.loc[:5000, \"Dialogue\"].apply(lambda x: re.split(rule, x))"
],
"execution_count": 25,
"outputs": [
{
"output_type": "stream",
"text": [
"10 loops, best of 3: 26.2 ms per loop\n"
],
"name": "stdout"
}
]
},
{
"cell_type": "code",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/"
},
"id": "jxkl0c-YjGPI",
"outputId": "38d5be4f-4566-4c8e-ce97-32360133d6f7"
},
"source": [
"%%timeit -n 10\n",
"# Sentencizer pipeline (no dependency parser), batched through nlp.pipe\n",
"tokenized_data = []\n",
"first_5000_rows = df.loc[:5000, \"Dialogue\"]\n",
"for doc in nlp.pipe(first_5000_rows, batch_size=20):\n",
"    tokenized_data.append([sent.text for sent in doc.sents])"
],
"execution_count": 27,
"outputs": [
{
"output_type": "stream",
"text": [
"10 loops, best of 3: 186 ms per loop\n"
],
"name": "stdout"
}
]
},
{
"cell_type": "code",
"metadata": {
"id": "7xQf3YpHhGnj"
},
"source": [
"# Transform data using spaCy\n",
"nlp = spacy.load(\"en_core_web_sm\")\n",
"# WARNING: takes a long time!\n",
"df[\"Dialogue\"] = df[\"Dialogue\"].apply(lambda x: [sent.text for sent in nlp(x).sents])"
],
"execution_count": 28,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"id": "rcgRauYoiL9L"
},
"source": [
"df = df.explode(\"Dialogue\", ignore_index=True)"
],
"execution_count": 29,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"id": "6wgo87wAzR20"
},
"source": [
"df.rename(columns={\"Unnamed: 0\": \"Dialogue ID\"}, inplace=True)\n",
"df.index.name = \"Sentence ID\""
],
"execution_count": 30,
"outputs": []
},
{
"cell_type": "code",
"metadata": {
"colab": {
"base_uri": "https://localhost:8080/",
"height": 235
},
"id": "5T3ooPLszySR",
"outputId": "db4f4985-bedb-4efd-ccb2-8f83f2f1400b"
},
"source": [
"df.head()"
],
"execution_count": 31,
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Dialogue ID</th>\n",
" <th>Character</th>\n",
" <th>Dialogue</th>\n",
" <th>EpisodeNo</th>\n",
" <th>SEID</th>\n",
" <th>Season</th>\n",
" </tr>\n",
" <tr>\n",
" <th>Sentence ID</th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>0</td>\n",
" <td>JERRY</td>\n",
" <td>Do you know what this is all about?</td>\n",
" <td>1.0</td>\n",
" <td>S01E01</td>\n",
" <td>1.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>0</td>\n",
" <td>JERRY</td>\n",
" <td>Do you know, why were here?</td>\n",
" <td>1.0</td>\n",
" <td>S01E01</td>\n",
" <td>1.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>0</td>\n",
" <td>JERRY</td>\n",
" <td>To be out, this is out...and out is one of the...</td>\n",
" <td>1.0</td>\n",
" <td>S01E01</td>\n",
" <td>1.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>0</td>\n",
" <td>JERRY</td>\n",
" <td>People...did you ever hear people talking abou...</td>\n",
" <td>1.0</td>\n",
" <td>S01E01</td>\n",
" <td>1.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>0</td>\n",
" <td>JERRY</td>\n",
" <td>This is what theyre talking about...</td>\n",
" <td>1.0</td>\n",
" <td>S01E01</td>\n",
" <td>1.0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Dialogue ID Character ... SEID Season\n",
"Sentence ID ... \n",
"0 0 JERRY ... S01E01 1.0\n",
"1 0 JERRY ... S01E01 1.0\n",
"2 0 JERRY ... S01E01 1.0\n",
"3 0 JERRY ... S01E01 1.0\n",
"4 0 JERRY ... S01E01 1.0\n",
"\n",
"[5 rows x 6 columns]"
]
},
"metadata": {
"tags": []
},
"execution_count": 31
}
]
},
{
"cell_type": "code",
"metadata": {
"id": "UsTZ2qZwz2MX"
},
"source": [
"df.to_csv(\"scripts_tokenized.csv\")"
],
"execution_count": 32,
"outputs": []
}
]
}
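
The regular-expression cell in the notebook splits only after "." or "?", which is why "were all out! There are people..." stays in one piece in its output. A minimal sketch of one possible variation, assuming "!" should also end a sentence. The name `rule_with_bang` and the sample text are hypothetical and not part of the notebook:

```python
import re

# Hypothetical variant of the notebook's rule: the final lookbehind
# accepts "!" in addition to "." and "?".
rule_with_bang = r"(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=[.?!])\s"

# Sample text stitched together from phrases in the first dialogue.
sample = (
    "Not one person here is home, were all out! "
    "There are people tryin to find us. "
    "Where did he go? He must have gone out."
)

# Split on whitespace that follows sentence-ending punctuation.
sentences = re.split(rule_with_bang, sample)
for sentence in sentences:
    print(sentence)
```

The two negative lookbehinds are unchanged, so short abbreviations such as "Dr." still do not trigger a split. This remains a heuristic and may misfire on other abbreviations or on quoted speech.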
@EKELE-NNOROM

Great stuff, would really appreciate df.apply(fn) on a column with fn being a set of functions
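
One possible reading of this request is applying several functions in sequence to each value of a column. A minimal sketch under that assumption, using only the standard library. `compose`, `split_sentences`, `strip_each`, and `drop_empty` are hypothetical helpers, and the splitter is a simplified stand-in rather than the notebook's rule:

```python
import re
from functools import reduce


def compose(*fns):
    """Return a function that applies the given functions left to right."""
    return lambda value: reduce(lambda acc, fn: fn(acc), fns, value)


def split_sentences(text):
    # Simplified splitter for illustration: break after ".", "?" or "!".
    return re.split(r"(?<=[.?!])\s+", text)


def strip_each(sentences):
    # Remove surrounding whitespace from every sentence.
    return [s.strip() for s in sentences]


def drop_empty(sentences):
    # Discard empty strings left over from trailing whitespace.
    return [s for s in sentences if s]


pipeline = compose(split_sentences, strip_each, drop_empty)

dialogues = ["Are you through? ", "He must have gone out. Where did he go?"]
result = [pipeline(d) for d in dialogues]
print(result)
```

Because `pipeline` is a plain callable, it could be passed to pandas in the same way as the tokenizers in the notebook, for example `df["Dialogue"].apply(pipeline)`.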