Skip to content

Instantly share code, notes, and snippets.

@etrepum
Created May 8, 2013 05:31
Show Gist options
  • Save etrepum/5538443 to your computer and use it in GitHub Desktop.
Save etrepum/5538443 to your computer and use it in GitHub Desktop.
Go home Python, you're drunk.
$ echo 'print repr(u"\ud834\udd20")' > tmp.py; python -c 'import tmp'; python -c 'import tmp'
u'\ud834\udd20'
u'\U0001d120'
@etrepum
Copy link
Author

etrepum commented May 8, 2013

So in the first case Python loads tmp.py from disk, compiles it as a code object, writes it to a pyc file (using the marshal module), and then executes the in-memory code object.

The second case Python loads tmp.pyc from disk, deserializes it to a code object (using the marshal module), and then it executes this code object.

Unpaired surrogates don't round-trip as-is in the marshal module apparently. At least not when they're constants in a code object.

>>> co = compile('print repr(u"\ud834\udd20")', filename='tmp.py', mode='single')
>>> co.co_consts
(u'\ud834\udd20', None)
>>> import marshal
>>> marshal.loads(marshal.dumps(co)).co_consts
(u'\U0001d120', None)

Note that you'll need a wide unicode build of Python 2.x to reproduce this (sys.maxunicode > 65535). They're more common on Linux. It's not the default configure setting anywhere, but Linux distros always compile it this way.

@maxlapshin
Copy link

It again prooves that best unicode support is just a bytearray + some additional accessors over it.

@calroc
Copy link

calroc commented May 9, 2013

Did you file a bug?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment