@mpevner
Last active December 13, 2015 21:58
gutenberg catalog thing
from lxml.etree import parse
catalog = parse('catalog.rdf')
book_tag = '{http://www.gutenberg.org/rdfterms/}etext'
books = catalog.findall(book_tag)
file_tag = '{http://www.gutenberg.org/rdfterms/}file'
files = catalog.findall(file_tag)
# From here you can do list(books[i]) to get the subelements of one book, i.e. the book data.
# books[i].values() returns the attribute values, i.e. the book id.
# Likewise, files[i].values() gets you the file name,
# and list(files[i]) includes an isFormatOf element whose .values() names the etext the file is for,
# e.g. list(files[i])[4].values() returns ['#etext1'].
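
The notes above suggest that every file record points back at its book through isFormatOf, so the two lists can be joined into a book id to file names lookup. Below is a minimal sketch of that idea, not the gist author's code. It runs on a tiny made-up sample instead of the real catalog.rdf. The attribute names (rdf:ID, rdf:about, rdf:resource), the dcterms namespace for isFormatOf, the file paths, and the helper name `files_by_book` are all assumptions made for illustration; check them against your copy of the catalog. It falls back to the standard-library ElementTree if lxml is not installed, since both offer the calls used here.

```python
from collections import defaultdict

try:
    from lxml import etree
except ImportError:  # the stdlib module offers the same calls used below
    import xml.etree.ElementTree as etree

RDF = 'http://www.w3.org/1999/02/22-rdf-syntax-ns#'
PG = 'http://www.gutenberg.org/rdfterms/'
DCTERMS = 'http://purl.org/dc/terms/'  # assumed namespace of isFormatOf

# Hypothetical two-file sample standing in for catalog.rdf.
SAMPLE = '''<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:pgterms="http://www.gutenberg.org/rdfterms/"
         xmlns:dcterms="http://purl.org/dc/terms/">
  <pgterms:etext rdf:ID="etext1"/>
  <pgterms:file rdf:about="files/etext1.txt">
    <dcterms:isFormatOf rdf:resource="#etext1"/>
  </pgterms:file>
  <pgterms:file rdf:about="files/etext1.zip">
    <dcterms:isFormatOf rdf:resource="#etext1"/>
  </pgterms:file>
</rdf:RDF>'''


def files_by_book(root):
    """Map each book id to the list of file names that reference it."""
    mapping = defaultdict(list)
    for f in root.findall('{%s}file' % PG):
        link = f.find('{%s}isFormatOf' % DCTERMS)
        if link is None:
            continue  # file record without a back-reference
        # '#etext1' -> 'etext1', so keys match the book ids
        book_id = link.get('{%s}resource' % RDF, '').lstrip('#')
        mapping[book_id].append(f.get('{%s}about' % RDF))
    return dict(mapping)


root = etree.fromstring(SAMPLE)
book_ids = [b.get('{%s}ID' % RDF) for b in root.findall('{%s}etext' % PG)]
print(book_ids)
print(files_by_book(root))
```

Looking elements and attributes up by qualified name, rather than by position as in `[4]`, keeps the lookup working if a record has its children in a different order.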