jgomo3/html2pres.py

## html2pres.py
"""I downloaded the html5 version of PRO_GIT_book_ in order to review
with some partners the chapter dedicated to branching (chapter 3).

To guide a visual review of the chapter, I decided to keep only the
images and *code examples* and eliminate the rest.

The technique is to strip the leaves off the *HTML DOM* tree, except
those that are `<img>`, `<code>` or `<h*>` elements, and to repeat
until no removable leaf is left. Container elements are not deleted
while they still have children, as that could damage the document
structure.

This is my quick hack to strip the file `ch03.html` and print the
resulting *HTML* to `stdout`.

.. _PRO_GIT_book: https://progit2.s3.amazonaws.com/en/2015-05-26-7ac82/progit-en.510.zip

"""

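from lxml import etree

# The script expects XHTML input, where every tag carries this namespace.
namespace = '{http://www.w3.org/1999/xhtml}'
# Tags to preserve: headings h1..h6, images and code samples.
keep = [namespace + e for e in ['h{}'.format(level) for level in range(1, 7)] + ['img', 'code']]

tree = etree.parse('ch03.html')
body = tree.getroot()[1]  # assumes <html> holds <head> first, then <body>

# Removing a leaf can turn its parent into a new leaf, so repeat the
# sweep until a full pass removes nothing.
kills = True
while kills:
    kills = False
    # './/*' limits the search to descendants of <body>; '//*' would
    # also match the elements in <head>.
    for e in body.xpath('.//*'):
        if len(e) == 0 and e.tag not in keep:
            e.getparent().remove(e)
            kills = True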
print(etree.tostring(tree, pretty_print=True, encoding='unicode'))
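
The leaf-stripping idea described in the docstring can also be tried without `ch03.html`. The following is a minimal sketch, assuming `lxml` is installed; the `strip_leaves` helper, the `KEEP` set and the `SAMPLE` document are hypothetical names made up for this illustration and are not part of the original script.

```python
from lxml import etree

XHTML_NS = '{http://www.w3.org/1999/xhtml}'
# Hypothetical keep-set: headings h1..h6, images and code samples.
KEEP = {XHTML_NS + name
        for name in ['h{}'.format(n) for n in range(1, 7)] + ['img', 'code']}


def strip_leaves(body, keep=KEEP):
    """Repeatedly remove childless elements under `body` whose tag is not in `keep`."""
    removed = True
    while removed:
        removed = False
        # xpath() returns a plain list, so removing elements while
        # looping over it is safe.
        for element in body.xpath('.//*'):
            if len(element) == 0 and element.tag not in keep:
                element.getparent().remove(element)
                removed = True
    return body


# Made-up sample document, standing in for a chapter of the book.
SAMPLE = b"""<html xmlns="http://www.w3.org/1999/xhtml">
<head><title>Sample</title></head>
<body>
<div><h2>Branches</h2><p>Some prose <em>to drop</em>.</p>
<pre><code>git branch testing</code></pre>
<p><img src="a.png" alt="figure"/></p></div>
</body>
</html>"""

root = etree.fromstring(SAMPLE)
body = root[1]
strip_leaves(body)

# Local tag names that survive, in document order.
remaining = [etree.QName(e).localname for e in body.iter()]
print(remaining)
```

The first pass removes `<em>`, which leaves its parent `<p>` childless, so the second pass removes that `<p>` as well. The `<pre>` and the second `<p>` survive because they still contain a kept `<code>` or `<img>`.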