@tshrinivasan
Created March 8, 2017 03:12
Clean HTML Pages
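# Needs lxml and beautifulsoup4; newer lxml releases may also need the separate lxml_html_clean package.
import lxml.html.clean as clean
from bs4 import BeautifulSoup

input_file = 'input.html'
output_file = 'output.html'

# Read the original page ('rw' is not a valid open() mode, so open read-only)
with open(input_file, 'r') as f:
    orig_content = f.read()

# Round-trip through BeautifulSoup to repair malformed markup before cleaning
soup = BeautifulSoup(orig_content, 'html.parser')
result = str(soup)

# Remove <meta> and <style>, drop structural tags (<html>, <head>, <title>),
# and unwrap font/span tags while keeping their contents
strip = clean.Cleaner(meta=True, style=True, page_structure=True, remove_tags=['FONT', 'font', 'span'])
content = strip.clean_html(result)

# Write the cleaned markup
with open(output_file, 'w') as f:
    f.write(content)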
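
The `remove_tags` option used above drops the listed tags but pulls their content up into the parent element. As a rough illustration of that behaviour only, here is a minimal sketch built on the standard library's `html.parser` instead of lxml. `TagStripper` and `strip_tags` are hypothetical names introduced for this example and are not part of lxml or BeautifulSoup; unlike `Cleaner`, this sketch does not repair markup, silently drops comments and doctype declarations, and does nothing about `<meta>`, `<style>`, or page structure.

```python
from html.parser import HTMLParser


class TagStripper(HTMLParser):
    """Hypothetical helper: re-emits HTML, omitting the listed tags but keeping their text."""

    def __init__(self, remove_tags):
        # Keep character references as written instead of decoding them
        super().__init__(convert_charrefs=False)
        self.remove_tags = {t.lower() for t in remove_tags}
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # html.parser passes tag names in lowercase, so <FONT> matches 'font'
        if tag not in self.remove_tags:
            self.parts.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/>
        if tag not in self.remove_tags:
            self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag not in self.remove_tags:
            self.parts.append('</%s>' % tag)

    def handle_data(self, data):
        self.parts.append(data)

    def handle_entityref(self, name):
        self.parts.append('&%s;' % name)

    def handle_charref(self, name):
        self.parts.append('&#%s;' % name)


def strip_tags(html, remove_tags=('font', 'span')):
    """Return html with the given tags removed and their contents left in place."""
    parser = TagStripper(remove_tags)
    parser.feed(html)
    parser.close()
    return ''.join(parser.parts)


if __name__ == '__main__':
    sample = '<p><FONT color="red">Hello</FONT> <span>world</span></p>'
    print(strip_tags(sample))
```

For real pages the lxml `Cleaner` remains the better choice, since it works on a parsed tree rather than on a token stream.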