Extracting conversations from Project Gutenberg. Code examples released under CC0 https://creativecommons.org/choose/zero/, other text released under CC BY 4.0 https://creativecommons.org/licenses/by/4.0/
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Extracting conversations from Project Gutenberg\n",
"\n",
"By [Allison Parrish](http://www.decontextualize.com/)\n",
"\n",
"This is some simple code that extracts \"conversations\" from Project Gutenberg plain text files and writes them out as a plain text file. The code operates purely on rules; there's no fancy machine learning model. Here are the assumptions:\n",
"\n",
"* Paragraphs in a text are separated by blank lines;\n",
"* Quotes start and end with either `\"` or `”`;\n",
"* Inside of a paragraph, any text enclosed in quotes constitutes part of a conversational turn;\n",
"* All of the quoted text inside a paragraph is spoken by one speaker;\n",
"* Conversations consist of quotes from consecutive paragraphs containing quotes;\n",
"* A conversation ends when there's an intervening paragraph with no quotes in it.\n",
"\n",
"These aren't great assumptions, but for my current purposes they're *good enough*(tm).\n",
"\n",
"To start, go to [Project Gutenberg](http://www.gutenberg.org/) and download a plain text file (usually labelled \"UTF8\"). Before you continue, check that the quotations in the file correspond to the assumptions above, and pick a different text if they don't. Put the name of the file in the quotes below. (I'm using [this file](http://www.gutenberg.org/cache/epub/27200/pg27200.txt).)"
]
},
{
"cell_type": "code",
"execution_count": 140,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"filename = \"pg27200.txt\""
]
},
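{
"cell_type": "markdown",
"metadata": {},
"source": [
"One quick way to check the file against the quoting assumptions is to count the quote characters it contains. The cell below is only a sketch: `count_quote_chars` is a hypothetical helper that is not used anywhere else in this notebook, the set of characters it tallies is a guess at the usual candidates, and it assumes the file is UTF-8 encoded. If most of the dialogue turns out to be marked with characters other than `\"` or `”`, the rules above probably won't fit the text well."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from collections import Counter\n",
"\n",
"def count_quote_chars(text, chars='\"“”‘’'):\n",
"    # hypothetical helper: tally how often each candidate quote character occurs\n",
"    counts = Counter(ch for ch in text if ch in chars)\n",
"    return {ch: counts[ch] for ch in chars}\n",
"\n",
"# assumption: the downloaded file is UTF-8 encoded\n",
"with open(filename, encoding=\"utf-8\") as fh:\n",
"    print(count_quote_chars(fh.read()))"
]
},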
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Getting paragraphs\n",
"\n",
"The function below takes a list of lines of text and returns *paragraphs*. For the purposes of this notebook, a paragraph is any run of non-empty lines; blank lines separate one paragraph from the next."
]
},
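{
"cell_type": "code",
"execution_count": 141,
"metadata": {},
"outputs": [],
"source": [
"def get_paragraphs(lines):\n",
"    paragraphs = []\n",
"    current_para = \"\"\n",
"    for line in lines:\n",
"        if len(line.strip()) > 0:\n",
"            # non-blank line: keep accumulating the current paragraph\n",
"            current_para += line\n",
"        else:\n",
"            # blank line: close off the current paragraph\n",
"            # (consecutive blank lines yield empty paragraphs)\n",
"            paragraphs.append(current_para.strip())\n",
"            current_para = \"\"\n",
"    # keep whatever was accumulated after the last blank line\n",
"    paragraphs.append(current_para.strip())\n",
"    return paragraphs"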
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Run the cell below to open the file you just downloaded and get its paragraphs. It'll display the total number of paragraphs that were found."
]
},
{
"cell_type": "code",
"execution_count": 142,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"5563"
]
},
"execution_count": 142,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"paragraphs = get_paragraphs(open(filename).readlines())\n",
"len(paragraphs)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Just to spot-check, the cell below picks a paragraph at random and displays it. Run it a few times to make sure the code is producing satisfactory results."
]
},
{
"cell_type": "code",
"execution_count": 144,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"\"I do not understand you,\" said Death. \"Will you have your child\n",
"back? or shall I carry him away to a place that you do not know?\"\n"
]
}
],
"source": [
"import random\n",
"print(random.choice(paragraphs))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Getting conversations\n",
"\n",
"This code extracts conversations using regular expressions. It finds every stretch of text between quotes in a paragraph and joins them together to form one \"turn.\" A run of consecutive paragraphs that contain quotations is considered to be a \"conversation.\" A file might have multiple conversations, so this function returns a list of lists."
]
},
{
"cell_type": "code",
"execution_count": 145,
"metadata": {
"scrolled": false
},
"outputs": [],
"source": [
"import re\n",
"def get_conversations(paragraphs, quotechrs = '”\"'):\n",
"    conversations = []\n",
"    current = []\n",
"    for p in paragraphs:\n",
"        # every quoted span in the paragraph, quotes stripped and newlines flattened\n",
"        quote_parts = [item.strip(quotechrs).replace(\"\\n\", \" \")\n",
"            for item in re.findall(r\"[%s]\\S[^%s]*[%s]\" % (quotechrs,quotechrs,quotechrs), p)]\n",
"        if len(quote_parts) > 0:\n",
"            # replace comma at end with period\n",
"            quote_parts = [re.sub(r\",$\", \".\", item) for item in quote_parts]\n",
"            current.append(\" \".join(quote_parts))\n",
"        elif len(current) > 0:\n",
"            # a paragraph without quotes ends the current conversation\n",
"            conversations.append(current[:])\n",
"            current = []\n",
"    conversations.append(current[:])\n",
"    return conversations"
]
},
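{
"cell_type": "markdown",
"metadata": {},
"source": [
"To see how these rules play out, the cell below runs `get_conversations` (defined in the cell above) on a few made-up paragraphs. The sample text is invented for illustration and does not come from the Gutenberg file. The two quoted paragraphs at the start should end up as one two-turn conversation, the unquoted paragraph should end it, and the last quote should form a second conversation on its own. Note how the trailing comma inside a quote becomes a period."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# invented sample paragraphs, for illustration only\n",
"sample = [\n",
"    '\"Hello,\" said the fox. \"Are you lost?\"',\n",
"    '\"Not at all,\" said the hen.',\n",
"    'The fox went away.',\n",
"    '\"Good riddance!\"',\n",
"]\n",
"for conv in get_conversations(sample):\n",
"    print(conv)"
]
},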
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Get the conversations and see how many there are:"
]
},
{
"cell_type": "code",
"execution_count": 146,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"1003"
]
},
"execution_count": 146,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"conversations = get_conversations(paragraphs)\n",
"len(conversations)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"And now print out a random conversation:"
]
},
{
"cell_type": "code",
"execution_count": 154,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"- Shall we be hanged and roasted?\n",
"- No, certainly not. I will teach you to fly, and when you have learnt, we will fly into the meadows, and pay a visit to the frogs, who will bow themselves to us in the water, and cry 'Croak, croak,' and then we shall eat them up; that will be fun.\n",
"- And what next?\n",
"- Then. all the storks in the country will assemble together, and go through their autumn manoeuvres, so that it is very important for every one to know how to fly properly. If they do not, the general will thrust them through with his beak, and kill them. Therefore you must take pains and learn, so as to be ready when the drilling begins.\n",
"- Then we may be killed after all, as the boys say; and hark! they are singing again.\n",
"- Listen to me, and not to them. After the great review is over, we shall fly away to warm countries far from hence, where there are mountains and forests. To Egypt, where we shall see three-cornered houses built of stone, with pointed tops that reach nearly to the clouds. They are called Pyramids, and are older than a stork could imagine; and in that country, there is a river that overflows its banks, and then goes back, leaving nothing but mire; there we can walk about, and eat frogs in abundance.\n",
"- Oh, o--h!\n",
"- Yes, it is a delightful place; there is nothing to do all day long but eat, and while we are so well off out there, in this country there will not be a single green leaf on the trees, and the weather will be so cold that the clouds will freeze, and fall on the earth in little white rags.\n",
"- Will the naughty boys freeze and fall in pieces?\n",
"- No, they will not freeze and fall into pieces. but they will be very cold, and be obliged to sit all day in a dark, gloomy room, while we shall be flying about in foreign lands, where there are blooming flowers and warm sunshine.\n"
]
}
],
"source": [
"for turn in random.choice(conversations):\n",
"    print(\"-\", turn)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Writing conversations to a file\n",
"\n",
"The following code takes the conversations found by the code above and writes them to a file, one turn per line in order, with a blank line after each conversation. First, set the name of the file here:"
]
},
{
"cell_type": "code",
"execution_count": 155,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"output_file = \"gutenberg_conversations.txt\""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This cell defines a function that writes conversations to an open file handle:"
]
},
{
"cell_type": "code",
"execution_count": 156,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"def write_conversations(fh, conversations):\n",
"    for conv in conversations:\n",
"        for line in conv:\n",
"            print(line, file=fh)\n",
"        # blank line marks the end of this conversation\n",
"        print(\"\", file=fh)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Finally, this cell runs everything together to produce the output."
]
},
{
"cell_type": "code",
"execution_count": 157,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"with open(output_file, \"w\") as fh:\n",
"    write_conversations(fh,\n",
"        get_conversations(\n",
"            get_paragraphs(open(filename).readlines())))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Check the output with the command-line `head` command to make sure it worked:"
]
},
{
"cell_type": "code",
"execution_count": 158,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"God's kindness to us men is beyond all limits. God, Thy kindness towards us all is without limits.\r\n",
"\r\n",
"What is the matter with you?\r\n",
"Well, the matter with me is. that I cannot collect my thoughts, and am unable to grasp the meaning of what you said to-day in church--that there are so many wicked people, and that they should burn eternally. Alas! eternally--how long! I am only a woman and a sinner before God, but I should not have the heart to let even the worst sinner burn for ever, and how could our Lord to do so, who is so infinitely good, and who knows how the wickedness comes from without and within? No, I am unable to imagine that, although you say so.\r\n",
"\r\n",
"If any one shall find rest in the grave and mercy before our Lord you shall certainly do so.\r\n",
"\r\n",
"Not even you can find eternal rest! You suffer, you best and most pious woman?\r\n",
"Yes.\r\n",
"And can I not obtain rest in the grave for you?\r\n",
"Yes.\r\n",
"And how?\r\n",
"Give me one hair--only one single hair--from the head of the sinner for whom the fire shall never be extinguished, of the sinner whom God will condemn to eternal punishment in hell.\r\n",
"Yes, one ought to be able to redeem you so easily, you pure, pious woman.\r\n",
"Follow me. It is thus granted to us. By my side you will be able to fly wherever your thoughts wish to go. Invisible to men, we shall penetrate into their most secret chambers; but with sure hand you must find out him who is destined to eternal torture, and before the cock crows he must be found!\r\n",
"Yes, therein, as I believed, as I knew it. are living those who are abandoned to the eternal fire.\r\n",
"Our ball can compare favourably with the king's. Miserable beggars, who are looking in, you are nothing in comparison to me.\r\n",
"Pride. do you see him?\r\n",
"The footman? He is but a poor fool, and not doomed to be tortured eternally by fire!\r\n",
"Only a fool!\r\n",
"\r\n",
"He is ill! That is madness--a joyless madness--besieged by fear and dreadful dreams!\r\n",
"\r\n",
"Be quiet, monster--sleep! This happens every night!\r\n",
"Every night! Yes, every night he comes and tortures me! In my violence I have done this and that. I was born with an evil mind, which has brought me hither for the second time; but if I have done wrong I suffer punishment for it. One thing, however, I have not yet confessed. When I came out a little while ago, and passed by the yard of my former master, evil thoughts rose within me when I remembered this and that. I struck a match a little bit on the wall; probably it came a little too close to the thatched roof. All burnt down--a great heat rose, such as sometimes overcomes me. I myself helped to rescue cattle and things, nothing alive burnt, except a flight of pigeons, which flew into the fire, and the yard dog, of which I had not thought; one could hear him howl out of the fire, and this howling I still hear when I wish to sleep; and when I have fallen asleep, the great rough dog comes and places himself upon me, and howls, presses, and tortures me. Now listen to what I tell you! You can snore; you are snoring the whole night, and I hardly a quarter of an hour!\r\n"
]
}
],
"source": [
"!head -25 $output_file"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you're using macOS and running this notebook locally, you can run the following cell to open a Finder window to the current directory, where you should be able to see your file:"
]
},
{
"cell_type": "code",
"execution_count": 139,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"!open ."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.5"
}
},
"nbformat": 4,
"nbformat_minor": 2
}