Created August 12, 2013 19:28
#!/usr/bin/env python
"""
Crawl O'Reilly's Book Catalog, extract RDFa about the books, and stash away
the triples in an rdflib BerkeleyDB (Sleepycat) store.

You will need the trunk version of rdflib installed, or otherwise available.
You will also need html5lib for lax, tagsoup parsing--O'Reilly's HTML for
its book pages isn't well-formed at the moment.
"""

import re
import urllib

from rdflib.graph import ConjunctiveGraph
from rdflib.term import URIRef

catalog_urls = [
    # catalog page URLs to crawl (the list is elided in this snippet)
]

# open (or create) the on-disk BerkeleyDB-backed store
graph = ConjunctiveGraph('Sleepycat')
graph.open('store', create=True)

for catalog_url in catalog_urls:
    html = urllib.urlopen(catalog_url).read()
    for book_url in re.findall(r'"(\d+/)"', html):
        # TODO: make this smarter, crawl if running at a different time
        if URIRef(book_url) in graph.subjects():
            continue
        print "fetching url=%s [current graph size=%s]" % (book_url, len(graph))
        try:
            # some urls in the catalog 404, believe it or not
            graph.parse(location=book_url, format='rdfa', lax=True)
        except Exception, e:
            print e

graph.serialize(open('catalog.rdf', 'w'))
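The regex above captures only the relative path of each book page (e.g. "12345/"), so those paths have to be resolved against the catalog page's URL before they can be fetched. A minimal stdlib-only sketch of that resolution step, using a made-up base URL and made-up ISBN paths for illustration:

```python
import re

try:
    from urlparse import urljoin  # Python 2
except ImportError:
    from urllib.parse import urljoin  # Python 3

# Hypothetical sample: a fragment of catalog HTML and a base URL.
# Neither is taken from O'Reilly's actual site.
html = '<a href="9780596516499/">Book One</a> <a href="9780596007973/">Book Two</a>'
base = 'http://oreilly.com/catalog/'

# Same pattern as the crawler, then resolve each relative path.
urls = [urljoin(base, path) for path in re.findall(r'"(\d+/)"', html)]
# urls[0] == 'http://oreilly.com/catalog/9780596516499/'
```

Passing absolute URLs like these to `graph.parse(location=...)` avoids relying on rdflib to guess the base of a relative reference.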