@yubessy
Created June 29, 2014 03:03
Clean up HTML
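# -*- coding: utf-8 -*-
# stdlib
import re

# Whitespace runs immediately before "<" and immediately after ">"
LEFT_SPACES = re.compile(r'\s+<')
RIGHT_SPACES = re.compile(r'>\s+')
# <script> elements, matched non-greedily up to the first closing tag
SCRIPT_TAG = re.compile(r'<script[^>]*>.*?</script>')
# HTML comments, including multi-line ones
COMMENT = re.compile(r'<!--[\s\S]*?-->')


def cleanup(html):
    """
    Remove unnecessary elements from html.
    """
    # Newlines become spaces first, so "." in SCRIPT_TAG can span former line breaks
    html = html.replace('\n', ' ')
    html = LEFT_SPACES.sub(' <', html)
    html = RIGHT_SPACES.sub('> ', html)
    html = SCRIPT_TAG.sub('', html)
    html = COMMENT.sub('', html)
    return html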
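
A minimal usage sketch, assuming the patterns and step order of the gist above. The definitions are repeated only so the snippet runs on its own, and the `sample` string is a hypothetical input, not part of the gist.

```python
import re

# Patterns repeated from the gist so this snippet is self-contained.
LEFT_SPACES = re.compile(r'\s+<')
RIGHT_SPACES = re.compile(r'>\s+')
SCRIPT_TAG = re.compile(r'<script[^>]*>.*?</script>')
COMMENT = re.compile(r'<!--[\s\S]*?-->')


def cleanup(html):
    """Collapse whitespace around tags, then drop <script> elements and comments."""
    html = html.replace('\n', ' ')
    html = LEFT_SPACES.sub(' <', html)
    html = RIGHT_SPACES.sub('> ', html)
    html = SCRIPT_TAG.sub('', html)
    html = COMMENT.sub('', html)
    return html


# Hypothetical sample input for illustration only.
sample = """<div>
  <!-- navigation -->
  <script type="text/javascript">
    var x = 1;
  </script>
  <p>Hello</p>
</div>"""

result = cleanup(sample)
print(repr(result))  # '<div>   <p>Hello</p> </div>'
```

Because whitespace is normalized before the script and comment are removed, the single spaces that surrounded them remain and add up to a three-space gap in the output. The patterns are also case-sensitive, so an uppercase `<SCRIPT>` element would pass through unchanged; compiling `SCRIPT_TAG` with `re.IGNORECASE` would be one way to cover that case.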