@rtpg
Last active September 17, 2016 10:40
My EPUB generation script
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Easy EPUB generation\n",
"=====================\n",
" \n",
" This is a quick script I use to build e-books for my Kindle from content I have on hand. Sometimes it's blog posts, sometimes it's text files I have lying about. It's usually simple HTML that I can easily throw into a book.\n",
" \n",
" Feel free to use this for your own needs! There is more explanation in this [blog post](http://rtpg.co/2016/09/17/make-an-ebook.html)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Get the Content\n",
"=============\n",
"\n",
" Try to make sure you get permission before scraping someone's website! Or at least throw in a short sleep between requests.\n",
" \n",
" The recommended route is to save these results to disk so you can regenerate your e-book while tweaking the setup without re-running a bunch of requests.\n",
" \n",
" I'm saving each page to its own file, but [shelve](https://docs.python.org/3.5/library/shelve.html) is a straightforward way to persist structured data in Python."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"collapsed": false,
"scrolled": true
},
"outputs": [],
"source": [
"import os\n",
"import time\n",
"\n",
"import requests\n",
"\n",
"url_template = \"http://example.com/%s.html\"\n",
"build_dir = \"build/\"\n",
"\n",
"if not os.path.exists(build_dir):\n",
"    os.makedirs(build_dir)\n",
"\n",
"# (local filename, remote URL) pairs for pages 1 through 222\n",
"urls = [\n",
"    (build_dir + \"%s.html\" % i, url_template % i) for i in range(1, 223)\n",
"]\n",
"\n",
"for filename, url in urls:\n",
"    print(\"Getting\", url)\n",
"    response = requests.get(url)\n",
"    response.raise_for_status()  # don't save error pages as chapters\n",
"    with open(filename, 'wb') as f:\n",
"        f.write(response.content)\n",
"    time.sleep(1)  # be polite to the server\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Generate e-Book Content Blocks\n",
"========\n",
"\n",
"Using the wonderful [ebooklib](https://github.com/aerkalov/ebooklib/).\n",
"\n",
"\n",
"First, we'll build one chapter per block of content. EPUB chapters are simply HTML, so we'll just throw that in directly without much modification.\n",
"\n",
"For scraping HTML, [PyQuery](https://pythonhosted.org/pyquery/) is a much nicer solution than BeautifulSoup in my opinion."
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import os\n",
"\n",
"from pyquery import PyQuery as pq\n",
"from ebooklib import epub\n",
"\n",
"\n",
"def make_chapter(filename):\n",
"    \"\"\"\n",
"    Build a chapter from your content. Edit based on your needs.\n",
"    \"\"\"\n",
"    page = pq(filename=os.path.join(build_dir, filename))\n",
"    content = page.find('#c1')\n",
"    title = content.find('h1').text()\n",
"    date = page.find('.s i').text()  # adding date to beginning of article\n",
"\n",
"    chapter = epub.EpubHtml(\n",
"        title=date + ' : ' + title,  # this is what will show up in the TOC\n",
"        file_name=filename,\n",
"        lang='en'\n",
"    )\n",
"    chapter.content = '<i>' + date + '</i>' + content.html()\n",
"\n",
"    return chapter\n",
"\n",
"\n",
"# os.listdir returns names in arbitrary order, so sort numerically.\n",
"# This assumes the 1.html, 2.html, ... names written by the download step.\n",
"filenames = sorted(\n",
"    (name for name in os.listdir(build_dir) if name.endswith('.html')),\n",
"    key=lambda name: int(os.path.splitext(name)[0])\n",
")\n",
"chapters = [make_chapter(filename) for filename in filenames]\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Building the Book\n",
"========\n",
"\n",
"Now that you have the book chapters, it's time to bind them all together and produce the book!\n",
"\n",
"Here we're just setting up the metadata and the chapter ordering. If you're building something fancier, you'll also want to do something like create a cover or re-order chapters. Check the `ebooklib` documentation for some good ideas.\n"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# create book object\n",
"book = epub.EpubBook()\n",
"\n",
"# set metadata\n",
"book.set_identifier('some_unique_identifier')\n",
"book.set_title('The Excellent Blog Posts of your Favorite Author')\n",
"book.set_language('en')\n",
"book.add_author('Bloggasaurus Rex')\n",
"\n",
"# add the chapters themselves\n",
"for chapter in chapters:\n",
"    book.add_item(chapter)\n",
"\n",
"# world's most basic CSS. Works fine for Kindle\n",
"style = 'BODY {color: white;}'\n",
"nav_css = epub.EpubItem(uid=\"style_nav\", file_name=\"style/nav.css\", media_type=\"text/css\", content=style)\n",
"book.add_item(nav_css)\n",
"\n",
"# set the table of contents\n",
"book.toc = chapters\n",
"# the spine defines the reading order. Here we'll define it as the chapters from before\n",
"book.spine = ['nav'] + chapters\n",
"\n",
"# add default NCX and Nav file\n",
"book.add_item(epub.EpubNcx())\n",
"book.add_item(epub.EpubNav())\n",
"\n",
"# finally, write book to file\n",
"epub.write_epub('final.epub', book, {})"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.5.1"
}
},
"nbformat": 4,
"nbformat_minor": 1
}
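
The notebook above recommends saving fetched pages to disk (for example with `shelve`) and pausing between requests. Here is a minimal sketch of how those two ideas might be combined. The helper name `fetch_cached` and its `fetch` parameter are hypothetical, invented for this example; they are not part of the original script, `requests`, or `shelve`. The cache is assumed to be any dict-like object, so a plain `dict` or an open `shelve.Shelf` should both work.

```python
import time


def fetch_cached(url, cache, fetch, delay=1.0):
    """Return the body for `url`, calling `fetch` only on a cache miss.

    cache: dict-like store keyed by URL string (assumption: a plain dict
           or an open shelve.Shelf).
    fetch: callable taking a URL and returning the page body as bytes
           (hypothetical hook so the network call can be swapped out).
    delay: seconds to sleep after a real fetch, to go easy on the server.
    """
    if url not in cache:
        cache[url] = fetch(url)
        time.sleep(delay)  # pause only when a request was actually made
    return cache[url]


# Offline demonstration with a stand-in fetcher and an in-memory cache.
demo_cache = {}
demo_body = fetch_cached(
    "http://example.com/1.html",
    demo_cache,
    lambda u: b"<h1>stub page</h1>",
    delay=0,
)
```

For real use, one option is to pass `shelve.open("build/pages")` as `cache` and `lambda u: requests.get(u).content` as `fetch`; re-running the loop would then only touch the network for pages that are not stored yet.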