@bradyjiang
Created July 21, 2019 15:47
Solution 1: lxml.html.clean.Cleaner
from lxml.html.clean import Cleaner

# Leave page_structure off so Cleaner does not replace the <html> tag with a <div>:
# http://stackoverflow.com/questions/15556391/lxml-clean-html-replaces-html-tag-with-div
cleaner = Cleaner(page_structure=False)

# According to
# http://stackoverflow.com/questions/8554035/remove-all-javascript-tags-and-style-tags-from-html-with-python-and-the-lxml-mod
# Cleaner is a better general solution than strip_elements: you usually want to remove more than
# just the <script> tags, for example inline handlers such as onclick="..." attributes on other tags.
cleaner.javascript = True
cleaner.scripts = True

# Turn this on in the future if necessary.
# cleaner.style = True

# Remove <base> elements entirely, including their content.
cleaner.kill_tags = ["base"]

# str_html is the HTML document to clean, passed in as a string.
cleaned_html = cleaner.clean_html(str_html)
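A minimal sketch of the same configuration applied to a small document, to show what gets stripped. The `clean_page` helper and the `sample_html` string are hypothetical names introduced here for illustration and are not part of the original snippet. It assumes `lxml.html.clean` is importable; on recent lxml releases that module lives in the separate `lxml_html_clean` package, which may need to be installed.

```python
from lxml.html.clean import Cleaner


def clean_page(str_html):
    """Hypothetical helper wrapping the Cleaner settings used above."""
    cleaner = Cleaner(page_structure=False)  # keep <html>, <head>, <body>
    cleaner.javascript = True                # drop inline JavaScript such as onclick handlers
    cleaner.scripts = True                   # drop <script> elements
    cleaner.kill_tags = ["base"]             # drop <base> elements and their content
    return cleaner.clean_html(str_html)


# Illustrative input only: a page with a <base> tag, a <script>, and an onclick handler.
sample_html = (
    "<html><head><base href='http://example.com/'>"
    "<script>alert('x')</script></head>"
    "<body><p onclick='steal()'>Hello</p></body></html>"
)

cleaned = clean_page(sample_html)
print(cleaned)
```

Passing a string to `clean_html` returns a string, so the result can be written out or parsed again directly. The paragraph text should survive, while the script, the `onclick` attribute, and the `<base>` tag should be gone.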