Clean HTML
import re


def clean_html(html):
    """Remove HTML markup from the given string."""
    # Remove inline JavaScript / CSS, including the element contents.
    x = re.sub(r'(?is)<(script|style).*?>.*?(</\1>)', '', html.strip())
    # Remove HTML comments. This must be done before removing regular tags,
    # since comments can contain '>' characters.
    x = re.sub(r'(?s)<!--(.*?)-->[\n]?', '', x)
    # Replace each remaining tag with a space.
    x = re.sub(r'(?s)<.*?>', ' ', x)
    # Remove HTML entities. Note: remove_entities is not defined in this
    # snippet and must be provided by the caller.
    x = remove_entities(x)
    # Collapse runs of spaces into a single space.
    x = re.sub(r'[ ]+', ' ', x)
    return x
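
The snippet calls remove_entities without defining it, so it cannot run on its own. Below is a minimal self-contained sketch, assuming the helper simply deletes entity references such as &amp;, &#38; and &#x26;. That behaviour is a guess based on the "remove html entities" comment, and the regex in remove_entities is a hypothetical stand-in, not the author's helper.

```python
import re


def remove_entities(text):
    # Hypothetical stand-in (assumption): delete named, decimal and hex
    # entity references instead of decoding them.
    return re.sub(r'&(?:#\d+|#[xX][0-9a-fA-F]+|[A-Za-z][A-Za-z0-9]*);', '', text)


def clean_html(html):
    """Remove HTML markup from the given string."""
    # Same steps as the gist: script/style blocks, comments, tags,
    # entities, then whitespace.
    x = re.sub(r'(?is)<(script|style).*?>.*?(</\1>)', '', html.strip())
    x = re.sub(r'(?s)<!--(.*?)-->[\n]?', '', x)
    x = re.sub(r'(?s)<.*?>', ' ', x)
    x = remove_entities(x)
    x = re.sub(r'[ ]+', ' ', x)
    return x


sample = ('<html><head><style>p {color: red}</style></head>'
          '<body><!-- note --><p>Hello &amp; <b>world</b></p></body></html>')
print(repr(clean_html(sample).strip()))
```

Because every tag is replaced by a space, the result can keep a leading or trailing space, hence the strip() in the usage line. If entities should be decoded rather than dropped, the standard library's html.unescape is an option.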