Unicode in Python 2 and 3
My notes on unicode handling in Python 2 and 3, with runnable tests.
So far these notes are very heavily based on Ned Batchelder's Pragmatic Unicode presentation: they're essentially my notes on that talk, with its examples in the form of runnable tests. I may add more to these notes over time though.
Running the tests
Unicode is a 1:1 mapping of 1.1 million code points (integers) to characters, starting with the ASCII characters.
For some reason, instead of just being written as integers, unicode code points are written in hex with a U+ prefix, for example U+2119. Each code point also has an all-caps ASCII name, like DOUBLE-STRUCK CAPITAL P.
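In Python 3 the stdlib unicodedata module gives you access to these names, and ord() gives you the integer code point, so you can see both for any character:

```python
import unicodedata

c = "\u2119"  # ℙ
# ord() gives the integer code point; format it as U+XXXX by hand.
print("U+{:04X}".format(ord(c)))                      # U+2119
print(unicodedata.name(c))                            # DOUBLE-STRUCK CAPITAL P
print(unicodedata.lookup("DOUBLE-STRUCK CAPITAL P"))  # ℙ
```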
In literal unicode strings in Python a unicode code point like U+2119 can be written with a \u escape sequence:

>>> print(u"\u2119")
ℙ
But this doesn't work in byte strings; there the \u sequence is interpreted literally instead of as an escape sequence:

>>> print(b"\u2119")
\u2119
(Byte strings have the hexadecimal byte escape, \xNN, instead.)
(This is the same in either Python 2 or 3.)
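For example (Python 3 syntax shown), each \xNN escape is one byte; here are the three UTF-8 bytes of U+2119:

```python
# Three \xNN byte escapes spelling the UTF-8 encoding of U+2119.
b = b"\xe2\x84\x99"
print(b.decode("utf-8"))  # ℙ
```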
Some code points need a capital \U and 8 hex digits, because their values don't fit in the 4 hex digits that \u allows.
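A quick demonstration in Python 3: code points above U+FFFF need the long form.

```python
print("\u2119")           # ℙ  (U+2119 fits in 4 hex digits)
print("\U0001F600")       # 😀 (U+1F600 needs the 8-digit \U form)
print(len("\U0001F600"))  # 1 -- one code point, one character
```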
Not all of the code points in unicode have been assigned a character yet.
Strings on disk or on the network etc are strings of bytes. Bytes need to be decoded to turn them into characters, and to do that you need to know what encoding the bytes are in (ASCII, UTF-8, ...).
UTF-8 is by far the most popular encoding for bytes.
Some encodings (including UTF-8) are actually part of the unicode standard.
Many encodings have ASCII as a subset. For example an ASCII byte string is also by definition a UTF-8 byte string, can be decoded using UTF-8 and will produce the same unicode string.
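A quick check of that claim in Python 3: the same ASCII bytes decode identically under several encodings.

```python
data = b"hello"  # pure ASCII bytes
# ASCII, UTF-8 and Latin-1 all agree on the ASCII range,
# so decoding the same bytes with any of them gives the same string.
print(data.decode("ascii"))
print(data.decode("utf-8"))
print(data.decode("latin-1"))
```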
5 Facts Of Life
Ned Batchelder's 5 Facts Of Life about unicode in Python:
Input and output to and from a program is always bytes, not unicode. (Files on disk, network connections, ...).
If it involves people using text to communicate, your program needs to support unicode. It needs to be able to handle more than just ASCII.
You need to deal with both bytes and unicode. You can't write a program that only uses one or the other. You have to use both and explicitly convert between them as needed.
You can't infer the encoding of a byte string from the byte string itself. You have to be told out of band.
So where are you gonna get the encoding from?

- The Content-Type HTTP header, HTML <meta> tags
- -*- coding: comments in source code files, file type specifications, ...
You can also guess the encoding but it will only be a guess - decoding might succeed but produce the wrong characters (a false positive).
Declared encodings can be wrong, which will result either in false positives (and garbage characters appearing in your app) or in UnicodeDecodeErrors.
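Both failure modes can be demonstrated in Python 3. Latin-1 maps every possible byte to a character, so decoding with it never fails, it just silently produces the wrong characters; ASCII raises instead:

```python
raw = "café".encode("utf-8")  # b'caf\xc3\xa9'
# Wrong codec, but decoding "succeeds" -- a silent false positive.
print(raw.decode("latin-1"))  # cafÃ©
# Wrong codec that can't represent the bytes -- raises instead.
try:
    raw.decode("ascii")
except UnicodeDecodeError as e:
    print("ascii decode failed:", e)
```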
3 Pain Relief Tips
And his 3 unicode pain relief tips:
Unicode sandwich. Bytes on the outside, unicode on the inside. Decode / encode at the edges of your program, as soon as possible / as late as possible. (This is often done for you by libraries and frameworks; for example Django decodes byte strings and gives you them as unicode already.)
On the insides of your program it should be all unicode, never a byte string.
If you find a byte string in your code trace it all the way back to the source - where did the byte string come from? And then change your code to decode it at the source.
Always explicitly know whether a string you have is unicode or a byte string and, if it's a byte string, what encoding it's in. Never just say "it's a string".
If you have to, you can use type() to ask whether something's a byte string or a unicode string.
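In Python 3 that looks like this (isinstance() is the more idiomatic form of the same check):

```python
u = "hello"
b = b"hello"
print(type(u))  # <class 'str'>
print(type(b))  # <class 'bytes'>
# isinstance is usually preferred over comparing type() directly.
print(isinstance(u, str))    # True
print(isinstance(b, bytes))  # True
```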
You can't tell from looking at a stream of bytes what encoding it's in. You have to be told the encoding, e.g. in the Content-Type HTTP header or HTML <meta> tag etc, or based on the spec of the content type you're reading.
Sometimes the source tells you the wrong encoding. When this happens you're screwed. If it decodes successfully using the wrong encoding it will probably have decoded to the wrong unicode characters, but you have no way of knowing. If it raises on decoding then you can try falling back on other encodings, but again one of them may decode successfully but you'll have no way of knowing if it's the right characters. (You may be able to apply some heuristics - e.g. check only for expected characters such as ASCII ones, reject any unicode strings that come out with garbage characters in them.)
Throughout your test suite have tests that throw non-ASCII characters into your code.
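Such a test might look like this (Python 3, unittest; shout() is a made-up toy function standing in for your real code):

```python
import unittest

def shout(text):
    # Hypothetical function under test; real code would do something useful.
    return text.upper()

class TestNonAscii(unittest.TestCase):
    def test_non_ascii_input(self):
        # Deliberately push non-ASCII characters through the code path.
        self.assertEqual(shout("café ℙython"), "CAFÉ ℙYTHON")

if __name__ == "__main__":
    unittest.main()
```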
Python 2 vs Python 3
Unicode is the biggest difference between Python 2 and 3:
In Python 2 the str type is for byte strings, unicode is for unicode strings, and bytes is an alias for str. In Python 3 str is unicode strings and bytes is byte strings (and there's no unicode type).
Literal strings are ASCII-encoded byte strings by default in Python 2, unicode in Python 3.
But in Python 2:

- A # -*- coding: utf-8 -*- comment at the top of the file turns the literal byte strings into UTF-8 instead of ASCII.
- from __future__ import unicode_literals turns literal strings into unicode instead of byte strings.
In either Python 2 or Python 3 you can force a literal byte string with b"..." or force a literal unicode string with u"...".
Python 2 automatically decodes byte strings using the ASCII codec (sys.getdefaultencoding()), or sometimes automatically encodes unicode strings using ASCII, whenever you try to do an operation that combines byte strings and unicode strings (concatenating strings, string formatting, comparing strings for equality, dictionary indexing using strings as keys, printing strings, ...).
This means that all sorts of operations can raise UnicodeDecodeError if you pass a mix of byte strings and unicode strings into them.
Because ASCII is used it's not likely that these implicit decodes will succeed without error but produce the wrong string (a false positive) - ASCII is a subset of most encodings. Instead you'll get UnicodeDecodeErrors when non-ASCII characters appear.
Python 3 just raises TypeError; it doesn't let you combine byte strings and unicode strings.
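For example:

```python
# Python 3 refuses to mix the two types rather than guessing an encoding.
try:
    "hello" + b"world"
except TypeError as e:
    print("TypeError:", e)
# You have to convert explicitly:
print("hello" + b"world".decode("utf-8"))  # helloworld
```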
A unicode string and a byte string can be equal in Python 2 if implicit encoding of the unicode string succeeds and they turn out to have the same bytes.
In Python 3 a unicode string and a byte string are never equal.
This can for example cause dictionary lookups that worked in Python 2 to fail in Python 3.
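For example, in Python 3 a byte-string key and the "same" unicode string are simply different keys:

```python
d = {b"key": "value"}   # dictionary keyed by a byte string
# In Python 2, d[u"key"] would find the entry via implicit ASCII encoding;
# in Python 3 the unicode string "key" is a different, absent key.
print(d.get("key"))     # None
print(d.get(b"key"))    # value
print("key" == b"key")  # False
```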
In Python 2 byte strings have an encode method!
Even though they're already encoded. Python 2 will implicitly decode the byte string using ASCII to get a unicode string and then encode that.
In Python 3 byte strings don't have an encode method.
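In Python 3 each type only has the conversion method that makes sense for it:

```python
# bytes can only be decoded, str can only be encoded.
print(hasattr(b"hello", "decode"))  # True
print(hasattr(b"hello", "encode"))  # False
print(hasattr("hello", "encode"))   # True
print(hasattr("hello", "decode"))   # False
```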
In Python 2 reading a string from a file using "r" mode returned a byte string. In Python 3 "r" mode returns unicode strings, but "rb" mode still returns byte strings.

open("hello.txt", "r").read()
open("hello.txt", "rb").read()
"r"mode in Python 3 uses
locale.getpreferredencoding()for the implicit decoding from bytes to unicode. You can override this with the
open()doesn't have any
encodingparameter in Python 2.
"r"mode can raise
UnicodeDecodeErrorin Python 3, it couldn't in Python 2.
You should always specify an encoding when reading text from file in Python 3.
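A sketch of the whole round trip in Python 3, using a temporary file so it's self-contained:

```python
import os
import tempfile

# Write UTF-8 text to a temp file, then read it back in both modes.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "hello.txt")
    with open(path, "w", encoding="utf-8") as f:
        f.write("ℙƴ☂ℌøἤ")

    with open(path, "r", encoding="utf-8") as f:  # text mode -> str
        text = f.read()
    with open(path, "rb") as f:                   # binary mode -> bytes
        raw = f.read()

print(type(text))  # <class 'str'>
print(type(raw))   # <class 'bytes'>
print(raw.decode("utf-8") == text)  # True
```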
- You can use escapes like \u2119 to insert any unicode character into a unicode literal by code point.
stdin and stdout
These are "pre-opened files" with a certain encoding that depends on the system.
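In Python 3 you can inspect each stream's encoding directly (the values depend on your locale / environment):

```python
import sys

# Each standard stream is a text wrapper with its own encoding,
# chosen from the locale / environment when the interpreter starts.
print(sys.stdin.encoding)
print(sys.stdout.encoding)
print(sys.stderr.encoding)
```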
What encoding are literal byte strings in?
Note this is all Python 2. In Python 3 literal non-ASCII chars are not allowed in literal byte strings: https://docs.python.org/3.3/reference/lexical_analysis.html#string-and-bytes-literals
By default literal byte strings are ASCII-encoded. Save this program to an encoding.py file:

#!/usr/bin/env python2
print "ℙƴ☂ℌøἤ"
./encoding.py will crash with:
SyntaxError: Non-ASCII character '\xe2' in file encoding.py on line 1, but no encoding declared; see http://python.org/dev/peps/pep-0263/ for details
Since string literals are ASCII by default you can't put non-ASCII characters in them.
Using a unicode string literal (print u"ℙƴ☂ℌøἤ") will also crash with the same SyntaxError.
You can put non-ASCII characters in string literals if you escape them. This will work, using UTF-8 escape codes:
#!/usr/bin/env python2
print "\xe2\x84\x99\xc6\xb4\xe2\x98\x82\xe2\x84\x8c\xc3\xb8\xe1\xbc\xa4"
Prints out ℙƴ☂ℌøἤ (this may depend on the fact that my terminal's encoding happens to be UTF-8, matching the escape codes; otherwise it'd print garbage).
Or, better, you can use escaped unicode code points. This will print out ℙƴ☂ℌøἤ, note the leading u (this should work regardless of the encoding of the terminal you run it from, I think):

#!/usr/bin/env python2
print u"\u2119\u01b4\u2602\u210c\u00f8\u1f24"
But if you put a # -*- coding: utf-8 -*- comment at the top of your file then Python understands string literals to be UTF-8, so this program will work:

#!/usr/bin/env python2
# -*- coding: utf-8 -*-
print "ℙƴ☂ℌøἤ"
It prints out ℙƴ☂ℌøἤ. As does this:

#!/usr/bin/env python2
# -*- coding: utf-8 -*-
print "ℙƴ☂ℌøἤ".decode("utf-8")
If you tried to decode the string literal using anything other than UTF-8 it might not crash (depending on what encoding you used) but would print out the wrong characters. This prints out the wrong string:
#!/usr/bin/env python2
# -*- coding: utf-8 -*-
print "ℙƴ☂ℌøἤ".decode("utf-16")
It prints out 蓢욙芘蓢쎌꒼.
If you're interacting with Python in a REPL it's the same, but the default encoding depends on the environment. It's usually UTF-8; I think it comes from the terminal's locale settings (it shows up as sys.stdout.encoding):

Python 2.7.14 (default, Sep 23 2017, 22:06:14)
[GCC 7.2.0] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> sys.stdout.encoding
'UTF-8'
>>> print "ℙƴ☂ℌøἤ"
ℙƴ☂ℌøἤ
>>> print "ℙƴ☂ℌøἤ".decode("utf-8")
ℙƴ☂ℌøἤ
>>> print "ℙƴ☂ℌøἤ".decode("utf-16")
蓢욙芘蓢쎌꒼
Note that Python will usually show unicode code points and bytes in escaped form. It's only when you print a string that the actual characters appear:

>>> s = "ℙƴ☂ℌøἤ"
>>> s
'\xe2\x84\x99\xc6\xb4\xe2\x98\x82\xe2\x84\x8c\xc3\xb8\xe1\xbc\xa4'
>>> u = s.decode("utf8")
>>> u
u'\u2119\u01b4\u2602\u210c\xf8\u1f24'
>>> e = u.encode("utf8")
>>> e
'\xe2\x84\x99\xc6\xb4\xe2\x98\x82\xe2\x84\x8c\xc3\xb8\xe1\xbc\xa4'
>>> print(s)
ℙƴ☂ℌøἤ
>>> print(u)
ℙƴ☂ℌøἤ
>>> print(e)
ℙƴ☂ℌøἤ
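Python 3 behaves similarly for byte strings; for unicode strings, repr() shows printable non-ASCII characters as-is, and ascii() gives the fully escaped form:

```python
s = "\u2119"            # ℙ
b = s.encode("utf-8")
print(repr(b))   # b'\xe2\x84\x99' -- bytes repr shows hex escapes
print(ascii(s))  # '\u2119'        -- fully escaped form of the unicode string
print(s)         # ℙ               -- print shows the actual character
```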