oxinabox/CharacterSplitter.ipynb

## CharacterSplitter.ipynb
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "##A demonstration of how to use Python to select chapters of an ebook based on the POV character named in their titles\n",
    "\n",
    " - This is a simple demonstration; it only catches the basic case\n",
    " - You'll likely need to customise it on a per-book basis\n",
    " - For demonstration, I am using a particular ebook containing the first four books of George R.R. Martin's A Song of Ice and Fire\n",
    " - It is your responsibility to ensure the legality of this in your jurisdiction\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "----------------------\n",
    "\n",
    "####The Libraries\n",
    "We are using Python 3 today, but this code should work almost without change in Python 2. Two libraries are required:\n",
    "\n",
    " - [ebooklib](https://github.com/aerkalov/ebooklib) is for reading and writing the epubs as a whole -- they are basically zip archives\n",
    " - [BeautifulSoup](http://www.crummy.com/software/BeautifulSoup/bs4/doc/) is for reading the HTML files within them\n",
    "\n",
    "Both can be installed with `pip`.\n",
    "\n",
    "We are also going to use the standard library component:\n",
    "\n",
    " - [re](https://docs.python.org/3/library/re.html) for regular expressions "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "import ebooklib\n",
    "from ebooklib import epub\n",
    "from bs4 import BeautifulSoup\n",
    "import re"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "##What to Keep:\n",
    "`ebooklib.epub` breaks the epub up into items. These are files within the zip archive.\n",
    "Most books have one item (i.e. one file) per chapter. That is the case for our book.\n",
    "\n",
    "Of these items, there are three categories we want to keep:\n",
    "\n",
    "- items that are not chapters at all -- these could be pictures, metadata or something else; we don't know.\n",
    "- chapters that are universal, e.g. the prologue, the dedication or the appendix.\n",
    "- chapters that are about the character we are interested in"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "def is_not_chapter(item):\n",
    "    return item.get_type() != ebooklib.ITEM_DOCUMENT\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#####Recognising universal chapters\n",
    "All the normal chapters in our case have ids along the lines of `b01-c01`, meaning book 1 (as it is a compilation), chapter 1. Special chapters like the appendix don't follow this pattern. We can check for it with a regex."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "def is_univeral_chapter(chapter):\n",
    "    # Raw string so that \\d reaches the regex engine unchanged\n",
    "    return not re.match(r\"b\\d\\d.c\\d\\d\", chapter.get_id())"
   ]
  },
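  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a quick sanity check of the pattern, here is a minimal sketch that runs the same regex against a handful of ids. The ids in `sample_ids` are made up for illustration; the real ids depend on how your particular epub was built, so treat this as something to adapt rather than a definitive test."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "import re\n",
    "\n",
    "# Hypothetical item ids, for illustration only\n",
    "sample_ids = [\"b01-c01\", \"b04-c46\", \"appendix\", \"dedication\"]\n",
    "for item_id in sample_ids:\n",
    "    # Same pattern as above: 'b', two digits, any one character, 'c', two digits\n",
    "    is_normal = bool(re.match(r\"b\\d\\d.c\\d\\d\", item_id))\n",
    "    print(item_id, \"->\", \"normal chapter\" if is_normal else \"universal chapter\")"
   ]
  },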
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "##Is it about our character?\n",
    "\n",
    "In this particular book all the character names are in the chapter headings.\n",
    "However, it represents them in two different ways. In some sections the name is in an `<h1>` element; in others it is in a `<p class=\"ct\">` element. We'll check for both.\n",
    "\n",
    "Notice that this is a higher-order function: it returns a function. That makes it work nicely with `filter`, which is useful for testing once you've already stripped the items down to just the normal chapters.\n",
    "\n",
    "`filter(is_character(\"JON\"), chapters)`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "def is_character(name):\n",
    "    def inner(chapter):\n",
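    "        # Name the parser explicitly so the result does not depend on what is installed\n",
    "        soup = BeautifulSoup(chapter.get_content(), 'html.parser')\n",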
    "        heading_matchers = [lambda: soup.find_all('h1'),\n",
    "                            lambda: soup.find_all(class_='ct')\n",
    "                           ]\n",
    "        headings=[]\n",
    "        for matcher in heading_matchers:\n",
    "            headings = matcher()\n",
    "            if len(headings)>0: break\n",
    "        else:\n",
    "            return False\n",
    "        \n",
    "        assert(len(headings)==1)\n",
    "        heading = headings[0]\n",
    "        chapter_character_name = heading.text.strip()\n",
    "        return chapter_character_name == name\n",
    "        \n",
    "    return inner"
   ]
  },
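  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To see the two heading styles side by side, here is a small sketch that applies the same lookup to two hand-written HTML snippets. The snippets in `samples` are invented for illustration and only mimic the `<h1>` and `<p class=\"ct\">` forms described above; a real chapter file will contain much more markup."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "from bs4 import BeautifulSoup\n",
    "\n",
    "# Invented snippets that mimic the two heading styles\n",
    "samples = ['<html><body><h1> JON </h1><p>text</p></body></html>',\n",
    "           '<html><body><p class=\"ct\">JON</p><p>text</p></body></html>']\n",
    "for html in samples:\n",
    "    soup = BeautifulSoup(html, 'html.parser')\n",
    "    # Try <h1> first, then fall back to elements with class 'ct'\n",
    "    headings = soup.find_all('h1') or soup.find_all(class_='ct')\n",
    "    print(repr(headings[0].text.strip()))"
   ]
  },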
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "###Bring our conditions together\n",
    "Another higher order function, again to make it work with `filter`.\n",
    "In this case it is a closure."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "def keep_item(character_name):\n",
    "    is_our_character = is_character(character_name)\n",
    "    def inner(item):\n",
    "        return (is_not_chapter(item)\n",
    "                or is_univeral_chapter(item)\n",
    "                or is_our_character(item))\n",
    "    return inner"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "###Combine it all, with a read and a write\n",
    "We'll also modify the title, so the new book doesn't get confused with the original.\n",
    "There is also a helper function below to work out the new filename."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "def rewrite_book_by_character(filename, character):\n",
    "    book = epub.read_epub(filename)\n",
    "    book.items = list(filter(keep_item(character), book.items))\n",
    "    book.title += \": \" + character + \" POVs_ONLY\"\n",
    "    \n",
    "    new_filename = get_new_filename(filename,character)\n",
    "    epub.write_epub(new_filename, book, {})\n",
    "    return new_filename\n",
    "\n",
    "def get_new_filename(filename,character):\n",
    "    import os.path\n",
    "    filename_base, ext = os.path.splitext(filename)\n",
    "    new_filename = filename_base +\"_\" + character+\"_ONLY\"+ext\n",
    "    return new_filename\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "collapsed": true
   },
   "source": [
    "##Give it a go"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<a href='asoiaf01-04_JON_ONLY.epub' target='_blank'>asoiaf01-04_JON_ONLY.epub</a><br>"
      ],
      "text/plain": [
       "/home/wheel/oxinabox/programming/book_char_split/asoiaf01-04_JON_ONLY.epub"
      ]
     },
     "execution_count": 28,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from IPython.display import FileLink\n",
    "filename = rewrite_book_by_character('asoiaf01-04.epub', \"JON\")\n",
    "FileLink(filename)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "####The MIT License (MIT)\n",
    "\n",
    "Copyright (c) 2015 Lyndon White\n",
    "\n",
    "Permission is hereby granted, free of charge, to any person obtaining a copy\n",
    "of this software and associated documentation files (the \"Software\"), to deal\n",
    "in the Software without restriction, including without limitation the rights\n",
    "to use, copy, modify, merge, publish, distribute, sublicense, and/or sell\n",
    "copies of the Software, and to permit persons to whom the Software is\n",
    "furnished to do so, subject to the following conditions:\n",
    "\n",
    "The above copyright notice and this permission notice shall be included in\n",
    "all copies or substantial portions of the Software.\n",
    "\n",
    "THE SOFTWARE IS PROVIDED \"AS IS\", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR\n",
    "IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,\n",
    "FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE\n",
    "AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER\n",
    "LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,\n",
    "OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN\n",
    "THE SOFTWARE."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.4.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}