@jgomo3
Last active August 29, 2015 14:22
Strip the leaves of an HTML tree while keeping certain elements
"""I downloaded the HTML5 version of PRO_GIT_book_ to review the chapter
dedicated to branching (chapter 3) with some partners.

To guide a visual review of the chapter, I decided to keep only the
images and *code examples* and eliminate the rest.

The technique is to strip the leaves off the *HTML DOM* tree, except
those that are `<img>`, `<code>` or `<h*>` elements. Containers that
still hold kept elements are not deleted, as that could damage the
document structure.

This is my quick hack to strip the file `ch03.html` and print the
resulting *HTML* to `stdout`.

.. _PRO_GIT_book: https://progit2.s3.amazonaws.com/en/2015-05-26-7ac82/progit-en.510.zip
"""
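from lxml import etree

namespace = '{http://www.w3.org/1999/xhtml}'
# Tags to keep even when they are leaves: h1..h6, img and code.
keep = {namespace + tag
        for tag in ['h{}'.format(n) for n in range(1, 7)] + ['img', 'code']}

tree = etree.parse('ch03.html')
body = tree.getroot()[1]  # assumes <html> contains <head> followed by <body>

# Removing a leaf can turn its parent into a leaf, so repeat the pass
# until nothing else is removed.
kills = True
while kills:
    kills = False
    # './/*' limits the search to descendants of <body>; a bare '//*'
    # would also match (and strip) the elements inside <head>.
    for e in body.xpath('.//*'):
        if len(e) == 0 and e.tag not in keep:
            e.getparent().remove(e)
            kills = True

print(etree.tostring(tree, pretty_print=True, encoding='unicode'))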
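
For trying the idea without `ch03.html` or lxml at hand, here is a minimal sketch of the same leaf-stripping loop on a small inline document. It swaps in the standard library's `xml.etree.ElementTree`, which has no `getparent()`, so a child-to-parent map is rebuilt on every pass. The names `strip_leaves`, `KEEP` and `SAMPLE` are made up for this illustration and are not part of the script above.

```python
import xml.etree.ElementTree as ET

XHTML = '{http://www.w3.org/1999/xhtml}'
# Same keep-list as the script: h1..h6, img and code.
KEEP = {XHTML + t for t in ['h%d' % n for n in range(1, 7)] + ['img', 'code']}


def strip_leaves(body, keep=KEEP):
    """Repeatedly remove childless descendants of `body` whose tag is not in `keep`."""
    removed = True
    while removed:
        removed = False
        # ElementTree has no getparent(), so build a child -> parent map per pass.
        parents = {child: parent for parent in body.iter() for child in parent}
        for child, parent in parents.items():
            if len(child) == 0 and child.tag not in keep:
                parent.remove(child)
                removed = True
    return body


# Hypothetical sample document, only loosely shaped like a book chapter.
SAMPLE = (
    '<html xmlns="http://www.w3.org/1999/xhtml"><head><title>t</title></head>'
    '<body><h2>Branching</h2><p>Some prose.</p>'
    '<div><p>More prose.</p><pre><code>git branch testing</code></pre></div>'
    '<div><p><em>Only prose here.</em></p></div>'
    '<img src="basic-branching.png"/></body></html>'
)

root = ET.fromstring(SAMPLE)
body = root.find(XHTML + 'body')
strip_leaves(body)
remaining = [el.tag.replace(XHTML, '') for el in body.iter()]
print(remaining)  # ['body', 'h2', 'div', 'pre', 'code', 'img']
```

The second `<div>` disappears entirely because it takes three passes to peel off `<em>`, then `<p>`, then the emptied `<div>` itself, while the first `<div>` survives because it still wraps a kept `<code>` element.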