@ZoeyYoung
Last active December 20, 2015 05:28
Python: regular expression to strip unwanted attributes from HTML tags
import re
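# Attribute names to strip, written as regex fragments.
bad_attrs = ['width', 'height', 'style', '[-a-z]*color',
             'background[-a-z]*', 'on[a-z]+']  # on[a-z]+ covers event handlers such as onclick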
single_quoted = "'[^']*'"  # '...' value, including an empty one
double_quoted = '"[^"]*"'  # "..." value, including an empty one
non_space = '[^ "\'>]+'
# One tag containing one unwanted attribute. Group 1 is everything before
# the attribute, group 2 is everything after it.
cstr = ("<"  # open
        "([^>]+) "  # prefix
        "(?:%s) *" % ('|'.join(bad_attrs),) +  # undesirable attributes
        '= *(?:%s|%s|%s)' % (non_space, single_quoted, double_quoted) +  # value
        "([^>]*)"  # postfix
        ">")  # end
htmlstrip = re.compile(cstr, re.I)
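def clean_attributes(html):
    """Remove unwanted attributes (those listed in bad_attrs above) from HTML tags.

    Example: <div id="main" class="content" style="font-size:18px;">content</div>
    becomes: <div id="main" class="content">content</div>
    """
    # Each pass strips at most one attribute per tag, so repeat until none match.
    while htmlstrip.search(html):
        html = htmlstrip.sub(r'<\1\2>', html)
    return html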
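
A short usage sketch of the same technique, showing why the substitution has to be repeated: each match consumes a whole tag, so one pass removes only one attribute per tag. The names `BAD_ATTRS`, `PATTERN`, `strip_once`, `strip_all` and the sample markup are illustrative and not part of the gist. The sketch assumes event handlers are matched with `on[a-z]+` and that quoted values may be empty.

```python
import re

# Assumption: same attribute list as the gist, with on[a-z]+ for event handlers.
BAD_ATTRS = ['width', 'height', 'style', '[-a-z]*color',
             'background[-a-z]*', 'on[a-z]+']
# Unquoted, single-quoted, or double-quoted attribute value.
VALUE = '|'.join(['[^ "\'>]+', "'[^']*'", '"[^"]*"'])
PATTERN = re.compile(
    '<([^>]+) (?:%s) *= *(?:%s)([^>]*)>' % ('|'.join(BAD_ATTRS), VALUE),
    re.I)


def strip_once(html):
    """Run a single substitution pass (at most one attribute per tag)."""
    return PATTERN.sub(r'<\1\2>', html)


def strip_all(html):
    """Repeat single passes until no unwanted attribute is left."""
    passes = 0
    while PATTERN.search(html):
        html = strip_once(html)
        passes += 1
    return html, passes


sample = '<td id="c1" width="80" bgcolor="#fff" onclick="go()">cell</td>'
cleaned, passes = strip_all(sample)
print(cleaned)  # <td id="c1">cell</td>
print(passes)   # 3
```

Because the prefix group is greedy, attributes are removed from right to left: `onclick` first, then `bgcolor`, then `width`. A regex of this kind does not parse HTML, so text that looks like an attribute inside another attribute's quoted value can also be stripped; for untrusted or irregular markup a real HTML parser is the safer choice.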