Created February 10, 2016 04:27
Convert HTML entities to human- (and machine-) readable characters, and convert Thai PUA (Private Use Area) code points to standard Thai code points
import html
import re
# Thai glyph variants in the Private Use Area (decimal code point) -> standard Thai character
pua = {
    '63233': 'ิ',
    '63234': 'ี',
    '63235': 'ึ',
    '63236': 'ื',
    '63237': '่',
    '63238': '้',
    '63242': '่',
    '63243': '้',
    '63246': '์',
    '63248': 'ั',
    '63250': '็',
    '63251': '่',
    '63252': '้'
}
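def thaiPUA(matchobj):
    # Replace a known PUA entity; leave any other numeric entity intact for html.unescape()
    return pua.get(matchobj.group(1), matchobj.group(0))

# Numeric character references with five or more decimal digits, e.g. &#63233;
p = re.compile(r'&#(\d{5,});')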
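# Assumes the input file is UTF-8 (or plain ASCII with entities); the output is written as UTF-8
with open('constitution-draft-20160129.html', 'r', encoding='utf-8') as inputf, \
        open('new.html', 'w', encoding='utf-8') as outputf:
    for line in inputf:
        text = p.sub(thaiPUA, line)
        outputf.write(html.unescape(text))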
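To try the conversion without the input file, the same two steps (PUA substitution, then entity decoding) can be applied to an in-memory string. This is only a minimal sketch: `PUA_TO_THAI`, `ENTITY`, and `convert` are hypothetical names, the table is a three-entry subset of the one above written with escapes, and the sample strings are made up for illustration.

```python
import html
import re

# Hypothetical subset of the script's table, written as escapes so the
# combining marks stay legible: PUA code point (decimal) -> standard Thai.
PUA_TO_THAI = {
    "63233": "\u0e34",  # THAI CHARACTER SARA I
    "63237": "\u0e48",  # THAI CHARACTER MAI EK
    "63248": "\u0e31",  # THAI CHARACTER MAI HAN-AKAT
}

# Numeric character references with five or more decimal digits.
ENTITY = re.compile(r"&#(\d{5,});")


def convert(text):
    """Replace known PUA entities, then decode the remaining HTML entities."""
    def replace(match):
        # Entities that are not in the table are passed through unchanged
        # so that html.unescape() can decode them afterwards.
        return PUA_TO_THAI.get(match.group(1), match.group(0))

    return html.unescape(ENTITY.sub(replace, text))


# KO KAI + PUA variant of MAI HAN-AKAT + NO NU, all given as entities.
sample = "&#3585;&#63248;&#3609;"
print([hex(ord(ch)) for ch in convert(sample)])
```

Four-digit entities such as `&#3585;` never match the pattern, so they are decoded by `html.unescape` alone; only the five-digit PUA references go through the lookup table.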