@pekkaklarck
Last active December 23, 2015 23:39
Python implementation of the UTF-8 decoding algorithm by Björn Höhrmann, explained at http://bjoern.hoehrmann.de/utf-8/decoder/dfa/. Notes: 1) This is based on the slightly performance-enhanced version near the end of the article. 2) My C skills are rather limited, so it's possible that there are bugs. Simple test strings …
UTF8_ACCEPT = 0
UTF8_REJECT = 12
UTF8D = (
    # First 256 entries: map each byte value to a character class.
    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
    1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1, 9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,
    7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7, 7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,
    8,8,2,2,2,2,2,2,2,2,2,2,2,2,2,2, 2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,
    10,3,3,3,3,3,3,3,3,3,3,3,3,4,3,3, 11,6,6,6,5,8,8,8,8,8,8,8,8,8,8,8,
    # Remaining entries: transition table, indexed as 256 + state + class.
    0,12,24,36,60,96,84,12,12,12,48,72, 12,12,12,12,12,12,12,12,12,12,12,12,
    12, 0,12,12,12,12,12, 0,12, 0,12,12, 12,24,12,12,12,12,12,24,12,24,12,12,
    12,12,12,12,12,12,12,24,12,12,12,12, 12,24,12,12,12,12,12,12,12,24,12,12,
    12,12,12,12,12,12,12,36,12,36,12,12, 12,36,12,12,12,12,12,36,12,36,12,12,
    12,36,12,12,12,12,12,12,12,12,12,12,
)
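def decode(string):
    # Python 2: `string` is a byte string, so ord() yields each byte's value.
    codep = 0
    state = UTF8_ACCEPT
    for char in string:
        byte = ord(char)
        type = UTF8D[byte]  # character class of this byte
        if state != UTF8_ACCEPT:
            # Inside a multi-byte sequence: append the low six bits.
            codep = (byte & 0x3f) | (codep << 6)
        else:
            # Start of a character: the class doubles as a shift that
            # masks off the length-prefix bits of a lead byte.
            codep = (0xff >> type) & (byte)
        state = UTF8D[256 + state + type]
        if state == UTF8_ACCEPT:
            # A complete code point has been decoded.
            print codep, unichr(codep)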
decode("Bj\xc3\xb6rn H\xc3\xb6hrmann")
decode(u'\u2603'.encode('utf-8'))
decode(u'\U0001F649'.encode('utf-8'))
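
The code above targets Python 2 (`print` statement, `unichr`, `ord` on byte strings). For Python 3, the same state machine could be sketched roughly as follows. This is an adaptation, not part of the original gist: the name `iter_codepoints`, the generator interface, and raising `ValueError` on malformed or truncated input are all assumptions made for illustration. The table is copied unchanged from the code above.

```python
UTF8_ACCEPT = 0
UTF8_REJECT = 12

# Same table as above: 256 byte-class entries, then the transition table.
UTF8D = (
    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
    0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
    1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1, 9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,
    7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7, 7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,
    8,8,2,2,2,2,2,2,2,2,2,2,2,2,2,2, 2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,
    10,3,3,3,3,3,3,3,3,3,3,3,3,4,3,3, 11,6,6,6,5,8,8,8,8,8,8,8,8,8,8,8,
    0,12,24,36,60,96,84,12,12,12,48,72, 12,12,12,12,12,12,12,12,12,12,12,12,
    12, 0,12,12,12,12,12, 0,12, 0,12,12, 12,24,12,12,12,12,12,24,12,24,12,12,
    12,12,12,12,12,12,12,24,12,12,12,12, 12,24,12,12,12,12,12,12,12,24,12,12,
    12,12,12,12,12,12,12,36,12,36,12,12, 12,36,12,12,12,12,12,36,12,36,12,12,
    12,36,12,12,12,12,12,12,12,12,12,12,
)


def iter_codepoints(data):
    """Yield one code point per character in UTF-8 encoded `data` (bytes).

    Hypothetical helper: raises ValueError on malformed or truncated input,
    which the Python 2 version above does not check for.
    """
    codep = 0
    state = UTF8_ACCEPT
    for byte in data:  # iterating over bytes yields ints in Python 3
        byte_class = UTF8D[byte]
        if state != UTF8_ACCEPT:
            codep = (byte & 0x3F) | (codep << 6)
        else:
            codep = (0xFF >> byte_class) & byte
        state = UTF8D[256 + state + byte_class]
        if state == UTF8_ACCEPT:
            yield codep
        elif state == UTF8_REJECT:
            raise ValueError("invalid UTF-8 byte 0x%02x" % byte)
    if state != UTF8_ACCEPT:
        raise ValueError("truncated UTF-8 sequence")


if __name__ == "__main__":
    samples = (
        b"Bj\xc3\xb6rn H\xc3\xb6hrmann",
        "\u2603".encode("utf-8"),
        "\U0001F649".encode("utf-8"),
    )
    for sample in samples:
        # Print only ASCII so the output does not depend on the console encoding.
        print(" ".join("U+%04X" % cp for cp in iter_codepoints(sample)))
```

Yielding code points instead of printing them keeps the state machine separate from output, and the explicit checks for `UTF8_REJECT` and for a non-accepting final state surface malformed input that would otherwise be skipped silently.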