Skip to content

Instantly share code, notes, and snippets.

@richmilne
Created March 22, 2024 09:10
Show Gist options
  • Star 0 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
  • Save richmilne/981e5a9f3a40dde4e53c1f9179a2829a to your computer and use it in GitHub Desktop.
Save richmilne/981e5a9f3a40dde4e53c1f9179a2829a to your computer and use it in GitHub Desktop.
In both Python 2 and 3, unambiguously define a sentinel Unicode string (&Ж中𐍆).
# encoding: utf8
"""
Unambiguously define, from hex data, a sentinel Unicode string (&Ж中𐍆).
The string contains a 1-, 2-, 3-and 4-byte Unicode character. Useful for finding
encoding errors in editors and transfer systems!
"""
import sys
py2 = True if sys.version_info.major == 2 else False
codes = "26 d096 e4b8ad f0908d86"
bytes_ = [codes[i:i+2] for i in range(0, len(codes), 2)]
bytes_ = [int(byte, 16) for byte in bytes_ if byte.strip()]
if py2:
bytes_ = [chr(char) for char in bytes_]
unistr = unicode(''.join(bytes_), 'utf8', 'strict')
unistr = unistr.encode('utf8')
print unistr
else:
unistr = bytes(bytes_).decode('utf8')
print(type(unistr), unistr)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment