@olasitarska
Created November 18, 2012 10:11
Builds epub book out of Paul Graham's essays.
# -*- coding: utf-8 -*-
"""
Builds epub book out of Paul Graham's essays: http://paulgraham.com/articles.html
Author: Ola Sitarska <ola@sitarska.com>
Copyright: Licensed under the GPL-3 (http://www.gnu.org/licenses/gpl-3.0.html)
This script requires python-epub-library: http://code.google.com/p/python-epub-builder/
"""
import re, ez_epub, urllib2, genshi
from BeautifulSoup import BeautifulSoup


def addSection(link, title):
    # Relative links are essays hosted on paulgraham.com; absolute links point elsewhere.
    if not 'http' in link:
        page = urllib2.urlopen('http://www.paulgraham.com/'+link).read()
        soup = BeautifulSoup(page)
        soup.prettify()
    else:
        page = urllib2.urlopen(link).read()

    section = ez_epub.Section()
    try:
        section.title = title
        print section.title

        if not 'http' in link:
            # The essay body normally sits in the first <font> of the 455px-wide table.
            font = str(soup.findAll('table', {'width':'455'})[0].findAll('font')[0])
            if not 'Get funded by' in font and not 'Watch how this essay was' in font and not 'Like to build things?' in font and not len(font)<100:
                content = font
            else:
                # The first <font> was a promo or too short: collect the <p> tags instead.
                content = ''
                for par in soup.findAll('table', {'width':'455'})[0].findAll('p'):
                    content += str(par)

            # Paragraphs are separated by double line breaks.
            for p in content.split("<br /><br />"):
                section.text.append(genshi.core.Markup(p))

            #exception for Subject: Airbnb
            for pre in soup.findAll('pre'):
                section.text.append(genshi.core.Markup(pre))
        else:
            # External page: treat blank lines as paragraph breaks.
            for p in str(page).replace("\n","<br />").split("<br /><br />"):
                section.text.append(genshi.core.Markup(p))
    except:
        # Any failure leaves the section with whatever text was collected so far.
        pass

    return section


book = ez_epub.Book()
book.title = "Paul Graham's Essays"
book.authors = ['Paul Graham']

# The essay index: every link in the second 455px-wide table becomes a section.
page = urllib2.urlopen('http://www.paulgraham.com/articles.html').read()
soup = BeautifulSoup(page)
soup.prettify()
links = soup.findAll('table', {'width': '455'})[1].findAll('a')

sections = []
for link in links:
    sections.append(addSection(link['href'], link.text))

book.sections = sections
book.make(book.title)

@dwinston

I'm getting an error about an invalid java call, which I suppose comes from the "subprocess.call(['java', '-jar', checkerPath, epubPath], shell = True)" line in epub.py. I have Java installed. Details:

java version "1.6.0_24"
OpenJDK Runtime Environment (IcedTea6 1.11.5) (6b24-1.11.5-0ubuntu1~10.04.2)
OpenJDK Client VM (build 20.0-b12, mixed mode, sharing)

Any ideas?
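
One possible cause, assuming a POSIX system such as the Ubuntu install above: when `subprocess.call` is given a list together with `shell=True`, only the first item is run as the command and the remaining items are passed to the shell itself, so `java` would start without `-jar` or the two paths. This is an assumption about the failure, not something confirmed in this thread. A minimal Python 3 sketch of the list form without a shell follows; `build_checker_command` and `run_command` are hypothetical helper names, not part of epub.py.

```python
import subprocess
import sys


def build_checker_command(checker_path, epub_path):
    """Return the argument list for the checker jar (hypothetical helper)."""
    return ["java", "-jar", checker_path, epub_path]


def run_command(args):
    """Run an argument list directly, without a shell, and return its exit code."""
    return subprocess.call(args)


if __name__ == "__main__":
    # Demonstrated with the current Python interpreter so the sketch runs
    # even where Java is not installed; every list item reaches the program.
    code = run_command([sys.executable, "-c", "import sys; sys.exit(3)"])
    print(code)
```

With the list form and no `shell=True`, each element is delivered to the program as a separate argument, so paths containing spaces need no quoting.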

@sarp

sarp commented Nov 23, 2012

On iPhone 5, I get this error when I open the generated epub file in iBooks: "This page contains the following errors: error on line 13 at column 7: Opening and ending tag mismatch: font line 0 and p"

@c10b10

c10b10 commented Feb 16, 2013

What deps does this have?

@gsdatta

gsdatta commented Aug 27, 2015

Quick fix - it should be width 435 now.

@SergeAx

SergeAx commented May 30, 2016

One should change '455' to '435' at lines 28, 33 and 59 for this code to work.
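
Since the table width has already changed once, a lookup that accepts more than one width may be less brittle than a hard-coded value. The sketch below is a Python 3 illustration using the standard library's `html.parser` in place of BeautifulSoup, so that it runs without extra packages. `EssayLinkParser` is a hypothetical name, and unlike the gist it collects links from every matching table rather than picking the second one by index.

```python
from html.parser import HTMLParser


class EssayLinkParser(HTMLParser):
    """Collect (href, text) pairs for links inside tables of an accepted width."""

    def __init__(self, widths=("435", "455")):
        super().__init__()
        self.widths = set(widths)
        self.table_stack = []      # one bool per open <table>: does its width match?
        self.current_href = None   # href of the <a> currently being read, if any
        self.current_text = []
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "table":
            self.table_stack.append(attrs.get("width") in self.widths)
        elif tag == "a" and any(self.table_stack) and "href" in attrs:
            self.current_href = attrs["href"]
            self.current_text = []

    def handle_endtag(self, tag):
        if tag == "table" and self.table_stack:
            self.table_stack.pop()
        elif tag == "a" and self.current_href is not None:
            text = "".join(self.current_text).strip()
            self.links.append((self.current_href, text))
            self.current_href = None

    def handle_data(self, data):
        if self.current_href is not None:
            self.current_text.append(data)


if __name__ == "__main__":
    parser = EssayLinkParser()
    parser.feed('<table width="435"><tr><td><a href="a.html">A</a></td></tr></table>')
    print(parser.links)
```

Passing a different `widths` tuple is all that is needed if the page layout changes again.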

@malthejorgensen

In order to get valid HTML (which is what .epub contains) you also need to remove the <font> tags (beyond just changing the table width to 435 as @gsdatta and @SergeAx said).
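
A minimal Python 3 sketch of the `<font>` removal, assuming it is enough to drop the opening and closing tags and keep whatever they wrap. It uses a regular expression from the standard library rather than BeautifulSoup, and `strip_font_tags` is a hypothetical helper name, not code from any of the forks. Applying it before the content is split into paragraphs avoids leaving an opening `<font>` in one fragment and its closing tag in another.

```python
import re

# Matches <font ...> and </font>, case-insensitively, but not the enclosed content.
FONT_TAG = re.compile(r"</?font\b[^>]*>", re.IGNORECASE)


def strip_font_tags(html):
    """Remove <font> tags from an HTML string while keeping their contents."""
    return FONT_TAG.sub("", html)


if __name__ == "__main__":
    sample = '<font size="2" face="verdana">Hi <b>there</b></font>'
    print(strip_font_tags(sample))
```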

As of today, April 12th, 2017, only 4 forks of this gist actually changed the code:

Forks that fix the table width and remove <font> tags:

Forks that fix the table width (but don't remove <font> tags):

Very interesting fork (a very large script that does a lot of stuff and is basically a rewrite):

@alexshevchuk

Added a fork with:

  • new modern libs urllib3 and bs4
  • width fix
  • minor syntax changes
