Last active
August 29, 2015 14:22
Strip the leaves of an HTML tree while keeping certain elements
"""I downloaded the HTML5 version of PRO_GIT_book_ in order to review
the chapter dedicated to branching (chapter 3) with some partners.

To guide a visual review of the chapter, I decided to grab only the
images and *code examples* and eliminate the rest.

The technique is to strip the leaves off the *HTML DOM* tree, except
those that are `<img>`, `<code>` or `<h*>` elements. The container
elements should not be deleted, as that could damage the document
structure.

This is my quick hack to strip the file `ch03.html` and print the
resulting *HTML* to `stdout`.

.. _PRO_GIT_book: https://progit2.s3.amazonaws.com/en/2015-05-26-7ac82/progit-en.510.zip
"""
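from lxml import etree

namespace = '{http://www.w3.org/1999/xhtml}'
# Tags to keep even when they are leaves: <h1>..<h6>, <img> and <code>.
keep = {namespace + name
        for name in ['h{}'.format(level) for level in range(1, 7)] + ['img', 'code']}

# Pass the filename so lxml opens the file and handles the encoding itself.
tree = etree.parse('ch03.html')
body = tree.getroot()[1]

# Removing a leaf can turn its parent into a new leaf, so repeat until a
# full pass removes nothing.
while True:
    kills = False
    # Note: "//*" is an absolute path, so it matches every element in the
    # document, not only the descendants of <body>.
    for e in body.xpath("//*"):
        if len(e) == 0 and e.tag not in keep:
            e.getparent().remove(e)
            kills = True
    if not kills:
        break

print(etree.tostring(tree, pretty_print=True, encoding='unicode'))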
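
The leaf-stripping loop above can also be tried without `ch03.html` or lxml. The following is a minimal sketch of the same idea using the standard library's `xml.etree.ElementTree` in place of lxml. The function name `strip_leaves`, the constants `XHTML` and `KEEP`, and the inline `SAMPLE` document are assumptions made up for illustration; they are not part of the original script.

```python
import xml.etree.ElementTree as ET

XHTML = '{http://www.w3.org/1999/xhtml}'
# Hypothetical keep-set mirroring the script: headings, images and code.
KEEP = {XHTML + name
        for name in ('h1', 'h2', 'h3', 'h4', 'h5', 'h6', 'img', 'code')}


def strip_leaves(root, keep):
    """Repeatedly remove childless elements whose tag is not in `keep`.

    Returns the number of elements removed. The root itself is never removed.
    """
    removed = 0
    while True:
        # ElementTree has no getparent(), so build a child -> parent map
        # on every pass (the tree changes between passes).
        parents = {child: parent for parent in root.iter() for child in parent}
        leaves = [e for e in parents if len(e) == 0 and e.tag not in keep]
        if not leaves:
            return removed
        for leaf in leaves:
            parents[leaf].remove(leaf)
            removed += 1


# Made-up sample document standing in for ch03.html.
SAMPLE = (
    '<html xmlns="http://www.w3.org/1999/xhtml"><head><title>t</title></head>'
    '<body><h2>Branching</h2><p>Some prose.</p>'
    '<div><p>More prose <code>git branch</code> here.</p><img src="a.png"/></div>'
    '</body></html>'
)

root = ET.fromstring(SAMPLE)
count = strip_leaves(root, KEEP)
remaining = [e.tag.replace(XHTML, '') for e in root.iter()]
print(count)      # 3: <title>, the prose-only <p>, then the emptied <head>
print(remaining)  # ['html', 'body', 'h2', 'div', 'p', 'code', 'img']
```

Note that a container which still holds a kept element survives together with its own text, so the `<p>` around `<code>` keeps its surrounding prose. Only elements that end up with no children at all are removed.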