Clean HTML Pages
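# Strip presentational markup from an HTML page.
# Assumes Python 3 with beautifulsoup4 and lxml installed; recent lxml
# releases may need the separate lxml_html_clean package for this import.
import lxml.html.clean as clean
from bs4 import BeautifulSoup

input_file = 'input.html'
output_file = 'output.html'

# Read the original page.
with open(input_file, 'r') as f:
    orig_content = f.read()

# Round-trip through BeautifulSoup to repair malformed markup first.
soup = BeautifulSoup(orig_content, 'html.parser')
result = str(soup)

# Drop <meta>, <style>, and page-structure tags; unwrap <font> and <span> but keep their text.
strip = clean.Cleaner(meta=True, style=True, page_structure=True, remove_tags=['FONT', 'font', 'span'])
content = strip.clean_html(result)

# Write the cleaned markup.
with open(output_file, 'w') as new_content:
    new_content.write(content)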