rtpg/e-Book Generation.ipynb

## e-Book Generation.ipynb
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Easy EPUB generation\n",
    "=====================\n",
    "\n",
    "This is a quick script I use to build e-books for my Kindle from content I already have. Sometimes it's blog posts, sometimes it's text files I have lying about. It's usually simple HTML that I can easily drop into a book.\n",
    "\n",
    "Feel free to use this for your own needs! There's more explanation in this [blog post](http://rtpg.co/2016/09/17/make-an-ebook.html)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Get the Content\n",
    "=============\n",
    "\n",
    "Try to get permission before scraping someone's website! Or at least throw in a short sleep between requests.\n",
    "\n",
    "The recommended route is to save the results to disk, so you can regenerate your e-book while tweaking the setup without re-running a bunch of requests.\n",
    "\n",
    "I'm saving each page to its own file, but [shelve](https://docs.python.org/3.5/library/shelve.html) is a straightforward way to persist structured data in Python."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "collapsed": false,
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "import os\n",
    "import time\n",
    "\n",
    "import requests\n",
    "\n",
    "url_template = \"http://example.com/%s.html\"\n",
    "build_dir = \"build/\"\n",
    "\n",
    "# create the output directory if it doesn't exist yet\n",
    "os.makedirs(build_dir, exist_ok=True)\n",
    "\n",
    "# (local filename, remote URL) pairs for pages 1 through 222\n",
    "urls = [\n",
    "    (os.path.join(build_dir, \"%s.html\" % i), url_template % i)\n",
    "    for i in range(1, 223)\n",
    "]\n",
    "\n",
    "for filename, url in urls:\n",
    "    print(\"Getting\", url)\n",
    "    response = requests.get(url)\n",
    "    response.raise_for_status()  # stop rather than save an error page as a chapter\n",
    "    with open(filename, 'wb') as f:\n",
    "        f.write(response.content)\n",
    "    time.sleep(1)  # be polite to the server\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Generate e-Book Content Blocks\n",
    "========\n",
    "\n",
    "We'll use the wonderful [ebooklib](https://github.com/aerkalov/ebooklib/).\n",
    "\n",
    "First, we'll build one chapter per block of content. EPUB chapters are simply HTML, so we'll just drop the content in directly without much modification.\n",
    "\n",
    "For scraping HTML, [PyQuery](https://pythonhosted.org/pyquery/) is a much nicer solution than BeautifulSoup, in my opinion."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "import os\n",
    "\n",
    "from pyquery import PyQuery as pq\n",
    "from ebooklib import epub\n",
    "\n",
    "\n",
    "def make_chapter(filename):\n",
    "    \"\"\"\n",
    "    Build a chapter from your content. Edit based on your needs.\n",
    "    \"\"\"\n",
    "    # the selectors below are specific to the site being scraped\n",
    "    page = pq(filename=os.path.join(build_dir, filename))\n",
    "    content = page.find('#c1')\n",
    "    title = content.find('h1').text()\n",
    "    date = page.find('.s i').text()  # date is added to the beginning of the article\n",
    "\n",
    "    chapter = epub.EpubHtml(\n",
    "        title=date + ' : ' + title,  # this is what will show up in the TOC\n",
    "        file_name=filename,\n",
    "        lang='en'\n",
    "    )\n",
    "    chapter.content = '<i>' + date + '</i>' + content.html()\n",
    "\n",
    "    return chapter\n",
    "\n",
    "\n",
    "def page_number(filename):\n",
    "    # assumes the files are named \"<number>.html\", as saved by the download cell\n",
    "    return int(os.path.splitext(filename)[0])\n",
    "\n",
    "\n",
    "# os.listdir returns names in arbitrary order, so sort numerically to keep chapters in sequence\n",
    "chapters = [\n",
    "    make_chapter(filename)\n",
    "    for filename in sorted(os.listdir(build_dir), key=page_number)\n",
    "]\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Building the Book\n",
    "========\n",
    "\n",
    "Now that you have the book chapters, it's time to bind them all together and produce the book!\n",
    "\n",
    "Here we're just setting up the metadata and the chapter ordering. If you're building something fancier, you may also want to create a cover or re-order chapters. Check the `ebooklib` documentation for some good ideas.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# create book object\n",
    "book = epub.EpubBook()\n",
    "\n",
    "# set metadata\n",
    "book.set_identifier('some_unique_identifier')\n",
    "book.set_title('The Excellent Blog Posts of your Favorite Author')\n",
    "book.set_language('en')\n",
    "book.add_author('Bloggasaurus Rex')\n",
    "\n",
    "# add the chapters themselves\n",
    "for chapter in chapters:\n",
    "    book.add_item(chapter)\n",
    "\n",
    "# world's most basic CSS. Works fine for Kindle\n",
    "style = 'BODY {color: white;}'\n",
    "nav_css = epub.EpubItem(uid=\"style_nav\", file_name=\"style/nav.css\", media_type=\"text/css\", content=style)\n",
    "book.add_item(nav_css)\n",
    "\n",
    "# set the table of contents\n",
    "book.toc = chapters\n",
    "# the spine defines the reading order. Here we'll define it as the chapters from before\n",
    "book.spine = ['nav'] + chapters\n",
    "\n",
    "# add default NCX and Nav file\n",
    "book.add_item(epub.EpubNcx())\n",
    "book.add_item(epub.EpubNav())\n",
    "\n",
    "# finally, write book to file\n",
    "epub.write_epub('final.epub', book, {})"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.5.1"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}