@dpapathanasiou
Created October 27, 2012 15:18
How to extract just the text from html page articles
"""
A series of functions to extract just the text from html page articles
"""
from lxml import etree

default_encoding = "utf-8"


def newyorker_fp(html_text, page_encoding=default_encoding):
    """For the articles found on the 'Financial Page' section of the New Yorker's website
    (e.g., http://www.newyorker.com/talk/financial/2012/06/04/120604ta_talk_surowiecki)

    Article text is found in:

    <body>
      <div id="articletext"> <!-- there is only one of these -->
    """
    myparser = etree.HTMLParser(encoding=page_encoding)
    tree = etree.HTML(html_text, parser=myparser)
    data = []
    for node in tree.xpath('//div[@id="articletext"]'):
        data.append(''.join(node.xpath('descendant-or-self::text()')))
    return ''.join(data)


def facta_print(html_text, page_encoding=default_encoding):
    """For the articles found on Facta's free digital site
    (e.g., http://facta.co.jp/article/201211043-print.html)

    Article text is found in:

    <body>
      <div id="container">
        <div id="contents">
          <div class="content"> <!-- there is only one of these -->
    """
    myparser = etree.HTMLParser(encoding=page_encoding)
    tree = etree.HTML(html_text, parser=myparser)
    data = []
    for node in tree.xpath('//div[@class="content"]'):
        data.append(''.join(node.xpath('descendant-or-self::text()')))
    return ''.join(data)
@jonashaag

How is descendant-or-self::text() different from .text_content()? Btw, did you know there's .cssselect()?

@dpapathanasiou
Author

They seem to do the same thing: descendant-or-self::text() selects every text node at or below the element, which matches how the lxml.html documentation page describes .text_content().
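
A quick way to check that equivalence is to run both approaches over the same markup. The snippet below is only a sketch: the `SAMPLE` page and the two helper names are made up for illustration. Note that `.text_content()` is a method of `lxml.html` elements, so the second helper parses with `lxml.html.fromstring` rather than `etree.HTML`.

```python
from lxml import etree, html

# Hypothetical sample page, shaped like the New Yorker layout described in the gist.
SAMPLE = (b'<html><body><div id="articletext">'
          b'<p>Hello, <b>world</b>!</p><p>Bye.</p>'
          b'</div></body></html>')


def text_via_xpath(page):
    """Same approach as the gist: etree parser plus an XPath text() query."""
    tree = etree.HTML(page, parser=etree.HTMLParser(encoding="utf-8"))
    node = tree.xpath('//div[@id="articletext"]')[0]
    return ''.join(node.xpath('descendant-or-self::text()'))


def text_via_text_content(page):
    """Alternative: lxml.html elements expose .text_content() directly."""
    root = html.fromstring(page)
    node = root.xpath('//div[@id="articletext"]')[0]
    return node.text_content()


print(text_via_xpath(SAMPLE) == text_via_text_content(SAMPLE))
```

On this sample both helpers return the same string, with the markup stripped and the text of nested elements concatenated in document order.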

I didn't know about .cssselect() though, so thanks for letting me know!
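
For reference, here is a rough sketch of what the `.cssselect()` variant could look like. It assumes the page is parsed with `lxml.html`, and it relies on the separate `cssselect` package being installed; if it is missing, the sketch falls back to the equivalent XPath query. The `SAMPLE` page and the `article_text` helper are hypothetical names, not part of the gist.

```python
from lxml import html

# Hypothetical sample page with one article div and one unrelated div.
SAMPLE = (b'<html><body>'
          b'<div id="articletext"><p>First.</p><p>Second.</p></div>'
          b'<div id="footer">ignored</div>'
          b'</body></html>')


def article_text(page, element_id="articletext"):
    """Return the text of every <div> with the given id, using a CSS selector."""
    root = html.fromstring(page)
    try:
        # .cssselect() raises ImportError when the cssselect package is absent.
        nodes = root.cssselect('div#%s' % element_id)
    except ImportError:
        # Fallback: the same query in XPath, with the id passed as a variable.
        nodes = root.xpath('//div[@id=$wanted]', wanted=element_id)
    return ''.join(node.text_content() for node in nodes)


print(article_text(SAMPLE))
```

One difference to keep in mind when porting `facta_print`: the CSS selector `div.content` matches any div whose class list contains `content`, while the XPath test `@class="content"` only matches when the attribute is exactly that string.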
