Brady Jiang bradyjiang

bradyjiang / 20190720-bs4.py
Created July 21, 2019 15:52
Solution 2: BeautifulSoup
from bs4 import BeautifulSoup
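# 20190720, from: https://stackoverflow.com/questions/30565404/remove-all-style-scripts-and-html-tags-from-an-html-page
# str_html is expected to hold the HTML document as a string.
soup = BeautifulSoup(str_html, "html.parser")
# Remove every <head> element together with everything nested inside it.
for s in soup(["head"]):
    s.decompose()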
cleaned_html = str(soup)
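The snippet above leaves `str_html` undefined, so here is a minimal runnable sketch of the same idea. The function name `strip_head` and the sample markup are hypothetical, added only for illustration, and the `"html.parser"` backend is an assumption (the gist does not say which parser was used).

```python
from bs4 import BeautifulSoup


def strip_head(str_html: str) -> str:
    """Return the document with every <head> element removed.

    strip_head is a hypothetical helper name, not part of the gist.
    """
    # Assumption: the standard-library parser backend is good enough here.
    soup = BeautifulSoup(str_html, "html.parser")
    # Calling the soup object is shorthand for soup.find_all(...).
    for tag in soup(["head"]):
        # decompose() removes the tag and all of its children from the tree.
        tag.decompose()
    return str(soup)


# Made-up sample input for illustration.
sample = (
    "<html><head><title>t</title><script>alert(1)</script></head>"
    "<body><p>Hello</p></body></html>"
)
cleaned_html = strip_head(sample)
print(cleaned_html)
```

Note that this version only drops what lives inside `<head>`; a `<script>` or `<style>` element placed in `<body>` is left untouched.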
bradyjiang / cleaner.py
Created July 21, 2019 15:47
Solution 1: lxml.html.clean.Cleaner
from lxml.html.clean import Cleaner
# To keep Cleaner from replacing the <html> tag with <div>, set page_structure=False:
# http://stackoverflow.com/questions/15556391/lxml-clean-html-replaces-html-tag-with-div
cleaner = Cleaner(page_structure=False)
# According to: http://stackoverflow.com/questions/8554035/remove-all-javascript-tags-and-style-tags-from-html-with-python-and-the-lxml-mod
# Cleaner is a better general solution than strip_elements: besides <script> tags,
# it also removes inline handlers such as onclick="..." attributes on other tags.
# Both options below already default to True; they are set explicitly for clarity.
cleaner.javascript = True
cleaner.scripts = True
# Turn this on in the future if necessary.
# cleaner.style = True
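The configuration above never actually runs the cleaner, so the following is a small sketch of how it might be applied. The sample markup is made up for illustration, and the example assumes an lxml installation where `lxml.html.clean` is importable (newer lxml releases ship this module separately as the `lxml_html_clean` package).

```python
from lxml.html.clean import Cleaner

# Same settings as the gist: keep <html>/<head>/<body>, drop scripts and JS handlers.
cleaner = Cleaner(page_structure=False, javascript=True, scripts=True)

# Made-up sample input: one <script> tag and one inline onclick handler.
sample = (
    "<html><head><script>alert(1)</script></head>"
    '<body><p onclick="track()">Hello</p></body></html>'
)

# clean_html accepts a string and returns the cleaned markup as a string.
cleaned_html = cleaner.clean_html(sample)
print(cleaned_html)
```

Passing a string to `clean_html` returns a string; passing a parsed lxml document returns a cleaned copy of that document instead.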