@mittenchops
Created February 9, 2014 18:45
HTML Cleaner
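import requests
# Note: since lxml 5.2 the cleaner ships as the separate lxml_html_clean package;
# there, use `from lxml_html_clean import Cleaner` instead.
from lxml.html.clean import Cleaner

url = "http://en.wikipedia.org/wiki/Zipf%27s_law"
html = requests.get(url).text

# allow_tags=[''] allows no real tag, so each remaining element loses its tag but keeps its text.
# remove_unknown_tags must be False whenever allow_tags is set.
cleaner = Cleaner(
    allow_tags=[''],
    remove_unknown_tags=False,
    scripts=True,         # drop <script> elements
    javascript=True,      # drop inline JavaScript
    style=True,           # drop <style> elements
    comments=True,        # drop HTML comments
    page_structure=True,  # drop <html>, <head>, <title>
)

# clean_html() leaves a wrapping <div>; strip it and turn tabs into spaces.
cleaned_text = cleaner.clean_html(html).replace("\t", " ").replace("<div>", "").replace("</div>", "")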
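
For checking the idea without lxml or network access, here is a minimal standard-library sketch of the same goal (keep the text, drop the tags). It swaps in `html.parser` for the lxml `Cleaner`, so it is not equivalent to the snippet above; `TextExtractor` and `strip_tags` are hypothetical names introduced only for this illustration.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Hypothetical helper: collects text nodes, skipping <script>/<style> bodies."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip_depth = 0  # > 0 while inside a skipped element

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Comments never reach handle_data, so they are dropped implicitly.
        if not self._skip_depth:
            self.parts.append(data)


def strip_tags(html):
    """Return the text content of `html` with whitespace runs collapsed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join("".join(parser.parts).split())
```

Calling `strip_tags(html)` on the downloaded page gives a rough plain-text version. Unlike the lxml approach it does not repair malformed markup, so results on messy pages may differ.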