Jacob Childress braveulysses

braveulysses / sanitize_html.py
Created May 29, 2009 20:25
HTML sanitization using Python and BeautifulSoup
def sanitize(untrusted_html, additional_tags=None):
    """Strips potentially harmful tags and attributes from HTML, but preserves
    all tags in a whitelist.

    Passing the list additional_tags will add the specified tags to the whitelist.

    The sanitizer does NOT encode reserved characters into XML entities. It is up
    to the template code, if any, to take care of that.

    Based on the work of:
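
The preview of sanitize_html.py stops before the function body, so the following is only a rough sketch of the whitelist idea the docstring describes, not the author's BeautifulSoup implementation. It uses Python 3's standard-library `html.parser` instead. The names `sanitize_sketch`, `DEFAULT_TAGS`, `ALLOWED_ATTRS`, `SAFE_SCHEMES` and `DROP_CONTENT` are hypothetical, and the whitelist contents are chosen purely for illustration. Unlike the sanitizer described above, this sketch does re-escape text, because the parser decodes entities while reading.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical whitelists, for illustration only (not taken from the gist).
DEFAULT_TAGS = {"a", "b", "blockquote", "code", "em", "i", "li", "ol", "p", "strong", "ul"}
ALLOWED_ATTRS = {"href", "title"}
SAFE_SCHEMES = ("http://", "https://", "mailto:")
DROP_CONTENT = {"script", "style"}  # these elements are removed together with their content


class _WhitelistFilter(HTMLParser):
    def __init__(self, allowed_tags):
        super().__init__(convert_charrefs=True)
        self.allowed_tags = allowed_tags
        self.parts = []
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in DROP_CONTENT:
            self._skip_depth += 1
            return
        if self._skip_depth or tag not in self.allowed_tags:
            return  # drop the tag itself; its text content is still kept
        kept = []
        for name, value in attrs:
            if name not in ALLOWED_ATTRS or value is None:
                continue
            if name == "href" and not value.strip().lower().startswith(SAFE_SCHEMES):
                continue  # rejects javascript: and other unlisted schemes
            kept.append(' %s="%s"' % (name, escape(value, quote=True)))
        self.parts.append("<%s%s>" % (tag, "".join(kept)))

    def handle_endtag(self, tag):
        if tag in DROP_CONTENT:
            if self._skip_depth:
                self._skip_depth -= 1
            return
        if not self._skip_depth and tag in self.allowed_tags:
            self.parts.append("</%s>" % tag)

    def handle_data(self, data):
        if not self._skip_depth:
            # Entities were decoded by the parser, so encode them again.
            self.parts.append(escape(data, quote=False))


def sanitize_sketch(untrusted_html, additional_tags=None):
    """Keep whitelisted tags and attributes; drop everything else."""
    allowed = set(DEFAULT_TAGS)
    if additional_tags:
        allowed.update(tag.lower() for tag in additional_tags)
    parser = _WhitelistFilter(allowed)
    parser.feed(untrusted_html)
    parser.close()  # flush any buffered trailing text
    return "".join(parser.parts)
```

For example, `sanitize_sketch('<h1>Title</h1>')` drops the heading tags and keeps the text, while passing `additional_tags=['h1']` preserves them. A sketch like this has not been hardened against real-world attack strings and should not be relied on as a security boundary.
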
braveulysses / strip_tags.py
Created May 29, 2009 20:24
Strip HTML tags using BeautifulSoup
from BeautifulSoup import BeautifulStoneSoup  # BeautifulSoup 3.x


def strip(untrusted_html):
    """Strips out all tags from untrusted_html, leaving only text.

    Converts XML entities to Unicode characters. This is desirable because it
    reduces the likelihood that a filter further down the text processing chain
    will double-encode the XML entities."""
    soup = BeautifulStoneSoup(untrusted_html, convertEntities=BeautifulStoneSoup.ALL_ENTITIES)
    # Join every text node in document order; the tags themselves are discarded.
    safe_html = ''.join(soup.findAll(text=True))
    return safe_html
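
The gist above targets the BeautifulSoup 3 API. As a hedged alternative, here is a minimal sketch of the same tag-stripping idea using only Python 3's standard-library `html.parser`. This is a swapped-in technique, not the author's method; the names `strip_tags` and `_TextOnly` are hypothetical, and edge cases such as comments or script bodies may be handled differently than with BeautifulSoup.

```python
from html.parser import HTMLParser


class _TextOnly(HTMLParser):
    """Collects text nodes and ignores every tag."""

    def __init__(self):
        # convert_charrefs=True decodes entities such as &amp; into characters,
        # mirroring the entity conversion described in the docstring above.
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def strip_tags(untrusted_html):
    """Return only the text of untrusted_html, with entities decoded."""
    parser = _TextOnly()
    parser.feed(untrusted_html)
    parser.close()  # flush any buffered trailing text
    return "".join(parser.chunks)


print(strip_tags("<p>Fish &amp; <b>chips</b></p>"))  # prints: Fish & chips
```

Because entities are decoded, the result is plain text rather than HTML; it still needs escaping by the template layer before being placed back into a page.
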