How to extract just the text from html page articles
"""
A series of functions to extract just the text from html page articles
"""
from lxml import etree
default_encoding = "utf-8"
def newyorker_fp (html_text, page_encoding=default_encoding):
"""For the articles found on the 'Financial Page' section of the New Yorker's website
(e.g., http://www.newyorker.com/talk/financial/2012/06/04/120604ta_talk_surowiecki)
Article text is found in:
<body>
<div id="articletext"> <!-- there is only one of these -->
"""
myparser = etree.HTMLParser(encoding=page_encoding)
tree = etree.HTML(html_text, parser=myparser)
data = []
for node in tree.xpath('//div[@id="articletext"]'):
data.append( ''.join(node.xpath('descendant-or-self::text()')) )
return ''.join(data)
def facta_print (html_text, page_encoding=default_encoding):
"""For the articles found on Facta's free digital site
(e.g., http://facta.co.jp/article/201211043-print.html)
Article text is found in:
<body>
<div id="container">
<div id="contents">
<div class="content"> <!-- there is only one of these -->
"""
myparser = etree.HTMLParser(encoding=page_encoding)
tree = etree.HTML(html_text, parser=myparser)
data = []
for node in tree.xpath('//div[@class="content"]'):
data.append( ''.join(node.xpath('descendant-or-self::text()')) )
return ''.join(data)
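
The two functions above differ only in the XPath expression that locates the article container, so the shared logic could be factored into one helper. The following is a minimal sketch of that idea, not part of the original gist: the name `extract_text`, its `xpath_expr` parameter, and the sample markup are all hypothetical, and the sample is passed as raw bytes on the assumption that the caller has not decoded the page yet.

```python
from lxml import etree


def extract_text(html_text, xpath_expr, page_encoding="utf-8"):
    """Hypothetical helper: join the text of every node matched by xpath_expr."""
    parser = etree.HTMLParser(encoding=page_encoding)
    tree = etree.HTML(html_text, parser=parser)
    return "".join(
        # Collect all text nodes beneath each matching container, in document order
        "".join(node.xpath("descendant-or-self::text()"))
        for node in tree.xpath(xpath_expr)
    )


# Made-up page (raw bytes) with one article container and one unrelated div
sample = (
    b'<html><body>'
    b'<div id="articletext"><p>Hello, <b>world</b>.</p></div>'
    b'<div id="footer">Ignored</div>'
    b'</body></html>'
)
article = extract_text(sample, '//div[@id="articletext"]')
```

With a helper like this, supporting another site would only require a new XPath expression, for example `'//div[@class="content"]'` for the Facta print pages.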
@jonashaag

commented Nov 2, 2012

How is descendant-or-self::text() different from .text_content()? Btw, did you know there's .cssselect()?

@dpapathanasiou

Owner Author

commented Nov 10, 2012

They seem to do the same thing: descendant-or-self::text() matches what .text_content() does, according to the description of .text_content() on the lxml.html documentation page.

I didn't know about .cssselect() though, so thanks for letting me know!
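
The equivalence discussed in this thread can be checked with a small sketch. It assumes a made-up HTML fragment and parses it with `lxml.html.fromstring` rather than `etree.HTML`, because `.text_content()` is a method of `lxml.html` elements and is not available on the plain elements produced by `etree.HTMLParser`.

```python
import lxml.html

# Made-up fragment resembling the Facta article container
snippet = '<div class="content"><p>First <em>paragraph</em>.</p><p>Second.</p></div>'
node = lxml.html.fromstring(snippet)

# Approach used in the gist: join every text node under the element
via_xpath = "".join(node.xpath("descendant-or-self::text()"))

# Approach suggested in the comment: let lxml.html collect the text
via_text_content = node.text_content()

same = via_xpath == via_text_content
```

For this fragment both approaches produce the same string. The `.cssselect()` method mentioned above is left out of the sketch because, as far as I know, it depends on the separately installed `cssselect` package.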
