Created
May 8, 2013 05:31
-
-
Save etrepum/5538443 to your computer and use it in GitHub Desktop.
Go home Python, you're drunk.
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
$ echo 'print repr(u"\ud834\udd20")' > tmp.py; python -c 'import tmp'; python -c 'import tmp' | |
u'\ud834\udd20' | |
u'\U0001d120' |
It again prooves that best unicode support is just a bytearray + some additional accessors over it.
Did you file a bug?
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
So in the first case Python loads tmp.py from disk, compiles it as a code object, writes it to a pyc file (using the marshal module), and then executes the in-memory code object.
The second case Python loads tmp.pyc from disk, deserializes it to a code object (using the marshal module), and then it executes this code object.
Unpaired surrogates don't round-trip as-is in the marshal module apparently. At least not when they're constants in a code object.
Note that you'll need a wide unicode build of Python 2.x to reproduce this (sys.maxunicode > 65535). They're more common on Linux. It's not the default configure setting anywhere, but Linux distros always compile it this way.