@joffilyfe
Last active April 22, 2020 12:55
import html
import re

# Matches numeric (&#60;, &#x3C;) and named (&lt;) character references.
_charref = re.compile(r'&(#[0-9]+;?'
                      r'|#[xX][0-9a-fA-F]+;?'
                      r'|[^\t\n\f <&#;]{1,32};?)')


def html_safe_decode(string, forbidden=("&lt;", "&gt;", "&amp;")):
    """
    Decode HTML character references, leaving those listed in `forbidden` untouched.

    >>> html_safe_decode("26. Cohen J. The Earth is Round (p.05&gt;Am Psychol 1994; 49: 997-1003")
    '26. Cohen J. The Earth is Round (p.05&gt;Am Psychol 1994; 49: 997-1003'
    """
    if '&' not in string:
        return string

    def replace_charref(match):
        reference = "&" + match.group(1)
        # Exact-text comparison: only the spellings listed in `forbidden` are kept.
        if reference in forbidden:
            return reference
        return html.unescape(reference)

    return _charref.sub(replace_charref, string)

robertatakenaka commented Apr 22, 2020

>>> html_safe_decode("&ccedil;&atilde;&lt;&amp;&copy;&gt;ˆ&#091;&#93;&#60;")
'çã&lt;&amp;©&gt;ˆ[]<'
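
In the result above, the numeric reference `&#60;` is still decoded to a literal `<`, because the gist compares the reference text against the `forbidden` list rather than the character it decodes to. If that is not wanted, one possible approach is to check the decoded character instead. The sketch below is only an illustration under that assumption: the name `html_strict_safe_decode` and the `PROTECTED_CHARS` constant are hypothetical and not part of the gist.

```python
import html
import re

# Same pattern as in the gist.
_charref = re.compile(r'&(#[0-9]+;?'
                      r'|#[xX][0-9a-fA-F]+;?'
                      r'|[^\t\n\f <&#;]{1,32};?)')

# Hypothetical constant: characters that should stay escaped in the output.
PROTECTED_CHARS = frozenset('<>&')


def html_strict_safe_decode(string, protected=PROTECTED_CHARS):
    """Hypothetical variant: decide by the decoded character, not the reference text."""
    if '&' not in string:
        return string

    def replace_charref(match):
        reference = match.group(0)
        decoded = html.unescape(reference)
        # Keep the original reference when decoding would produce a protected
        # character (this also leaves unknown references such as "&foo;" alone).
        if any(ch in protected for ch in decoded):
            return reference
        return decoded

    return _charref.sub(replace_charref, string)


print(html_strict_safe_decode("&ccedil;&lt;&#60;&#x3C;&copy;"))
```

With this variant, `&lt;`, `&#60;` and `&#x3C;` all stay escaped, while references such as `&ccedil;` and `&copy;` are decoded as before.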
