Unicode in Python 2 and 3

My notes on unicode handling in Python 2 and 3, with runnable tests.

So far this is very heavily based on Ned Batchelder's Pragmatic Unicode: it's basically my notes on that presentation, with its examples turned into runnable tests. I may add more to these notes in time though.

Requires

tox (the tox.ini below runs the tests under Python 2.7 and Python 3.6).

Running the tests

$ tox

Notes

Unicode

Unicode is a 1:1 mapping between about 1.1 million code points (the integers from 0 to 0x10FFFF) and characters, starting with the ASCII characters.

For some reason instead of just being written as integers unicode code points are written in hex with a U+ prefix, for example: U+2119. Each code point also has an all-caps ASCII name like DOUBLE-STRUCK CAPITAL P.
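For example, you can look up a code point's integer value and name with the standard library's unicodedata module:

>>> ord(u"\u2119")
8473
>>> import unicodedata
>>> unicodedata.name(u"\u2119")
'DOUBLE-STRUCK CAPITAL P'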

In literal unicode strings in Python a unicode code point like U+2119 can be written with \u like "\u2119":

>>> print(u"\u2119")

But this doesn't work in byte strings: the \u sequence is interpreted literally instead of as an escape sequence:

>>> print(b"\u2119")
\u2119

(Byte strings have the byte hexadecimal escape, \x, instead.)

(The \u escape is ignored the same way in both Python 2 and 3. The output shown is Python 2's; in Python 3 print shows the byte string's repr, b'\u2119'.)

Code points above U+FFFF need a capital \U and 8 hex digits: "\U0001F600".
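For example (U+1F600 is GRINNING FACE):

>>> print(u"\U0001F600")
😀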

Not all of the code points in unicode have been assigned a character yet.

Encodings

Strings on disk or on the network etc are strings of bytes. Bytes need to be decoded to turn them into characters, and to do that you need to know what encoding the bytes are in (ASCII, UTF-8, ...).

UTF-8 is by far the most popular encoding for bytes.

Some encodings (including UTF-8) are actually part of the unicode standard.

Many encodings have ASCII as a subset. For example, an ASCII byte string is by definition also a valid UTF-8 byte string: decoding it as UTF-8 succeeds and produces the same unicode string.
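A quick check that ASCII bytes decode the same way under both codecs:

>>> b"hello".decode("ascii") == b"hello".decode("utf-8")
True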

5 Facts Of Life

Ned Batchelder's 5 Facts Of Life about unicode in Python:

  1. Input and output to and from a program is always bytes, not unicode. (Files on disk, network connections, ...).

  2. If it involves people using text to communicate, your program needs to support unicode. It needs to be able to handle more than just ASCII.

  3. You need to deal with both bytes and unicode. You can't write a program that only uses one or the other. You have to use both and explicitly convert between them as needed.

  4. You can't infer the encoding of a byte string from the byte string itself. You have to be told out of band.

    So where are you gonna get the encoding from? Content-Type HTTP header, HTML <meta> tags, the encoding attribute on <?xml> declarations, -*- coding: comments in source code files, file type specifications, ...

    You can also guess the encoding but it will only be a guess - decoding might succeed but produce the wrong characters (a false positive).

  5. Declared encodings can be wrong, which will either result in false positives (and garbage characters appearing in your app) or UnicodeDecodeErrors.

3 Pain Relief Tips

And his 3 unicode pain relief tips:

  1. Unicode sandwich. Bytes on the outsides, unicode on the inside. Decode / encode at the edges of your program, as soon as possible / as late as possible; see the sketch after this list. (This is often done for you by libraries and frameworks, e.g. Django decodes byte strings and gives them to you as unicode already, as does the json lib, etc.)

    On the insides of your program it should be all unicode, never a byte string.

    If you find a byte string in your code trace it all the way back to the source - where did the byte string come from? And then change your code to decode it at the source.

  2. Always explicitly know whether a string you have is unicode or a byte string and, if it's a byte string, what encoding it's in. Never just say "it's a string".

    If you have to you can use type() to ask whether something's a byte or unicode string.

    You can't tell from looking at a stream of bytes what encoding it's in. You have to be told the encoding, e.g. in the Content-Type HTTP header or HTML <meta> tag etc, or based on the spec of the content type you're reading.

    Sometimes the source tells you the wrong encoding. When this happens you're screwed. If it decodes successfully using the wrong encoding it will probably have decoded to the wrong unicode characters, but you have no way of knowing. If it raises on decoding then you can try falling back on other encodings, but again one of them may decode successfully but you'll have no way of knowing if it's the right characters. (You may be able to apply some heuristics - e.g. check only for expected characters such as ASCII ones, reject any unicode strings that come out with garbage characters in them.)

  3. Throughout your test suite have tests that throw non-ASCII characters into your code.
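A minimal sketch of the unicode sandwich (the read_greeting / write_greeting helpers and the assumption that the file format is specified to be UTF-8 are made up for illustration):

def read_greeting(path):
    # Edge of the program: bytes come in, decode immediately.
    with open(path, "rb") as f:
        return f.read().decode("utf-8")

def write_greeting(path, greeting):
    # Edge of the program: keep unicode inside, encode as late as possible.
    with open(path, "wb") as f:
        f.write(greeting.encode("utf-8"))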

Python 2 vs Python 3

Unicode is the biggest difference between Python 2 and 3:

  • In Python 2 the str type is for byte strings and unicode is for unicode strings, and bytes is an alias for str. In Python 3 str is unicode strings and bytes is byte strings (and there's no unicode).

  • Literal strings are ASCII-encoded byte strings by default in Python 2, unicode in Python 3.

    But in Python 2:

    • A # -*- coding: utf-8 -*- comment at the top of the file turns the literal byte strings into UTF-8 instead of ASCII.
    • A from __future__ import unicode_literals turns literal strings into unicode instead of byte strings.
  • In either Python 2 or Python 3 you can force a literal byte string with b"..." or force a literal unicode string with u"...".

  • Python 2 automatically decodes byte strings using the ASCII codec (sys.getdefaultencoding()), or sometimes automatically encodes unicode strings using ASCII, whenever you try to do an operation that combines byte strings and unicode strings (concatenating strings, string formatting, comparing strings for equality, dictionary indexing using strings as keys, printing strings, ...)

    This means that all sorts of operations can produce UnicodeDecodeError if you mix byte strings and unicode strings in them.

    Because ASCII is used, these implicit decodes are unlikely to succeed but produce the wrong string (a false positive), since ASCII is a subset of most encodings. Instead you'll get a UnicodeDecodeError as soon as a non-ASCII byte appears.

    Python 3 just raises TypeError, it doesn't let you combine byte strings and unicode strings.

  • A unicode string and a byte string can be equal in Python 2 if implicit encoding of the unicode string succeeds and they turn out to have the same bytes.

    In Python 3 a unicode string and a byte string are never equal.

    This can for example cause dictionary lookups that worked in Python 2 to fail in Python 3.

  • In Python 2 byte strings have an encode method!

    Even though they're already encoded. Python 2 will implicitly decode the byte string using ASCII to get a unicode string and then encode that.

    In Python 3 byte strings don't have an encode method.

  • In Python 2 reading a string from a file using "r" mode returns a byte string. In Python 3 "r" mode returns unicode strings, but "rb" mode still returns byte strings.

    open("hello.txt", "r").read()
    open("hello.txt", "rb").read()
    

    The "r" mode in Python 3 uses locale.getpreferredencoding() for the implicit decoding from bytes to unicode. You can override this with the encoding argument to open().

    open() doesn't have any encoding parameter in Python 2.

    So open() in "r" mode can raise UnicodeDecodeError in Python 3, it couldn't in Python 2.

    You should always specify an encoding when reading text from file in Python 3.
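    For example, pinning the encoding explicitly (assuming the file is known to be UTF-8):

    open("hello.txt", "r", encoding="utf-8").read()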

Unicode literals

  • You can use \u2119 (or \U0001F600 for code points above U+FFFF) to insert any unicode character into a unicode literal by code point.

Unicode strings

Unicode strings have an .encode(encoding) method that returns a byte string. It raises UnicodeEncodeError if the encoding can't represent some character in the string.

Byte strings

Byte strings have a .decode(encoding) method that returns a unicode string. It raises UnicodeDecodeError if the bytes aren't valid in the given encoding.
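For example (Python 3 reprs shown; Python 2 displays the same values with \u and \x escapes):

>>> u"Hi \u2602".encode("utf-8")
b'Hi \xe2\x98\x82'
>>> b'Hi \xe2\x98\x82'.decode("utf-8")
'Hi ☂'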

Printing strings

print encodes unicode strings using your terminal's encoding (sys.stdout.encoding), which is probably UTF-8.

stdin and stdout

These are "pre-opened files" with a certain encoding that depends on the system.

What encoding are literal byte strings in?

Note this is all Python 2. In Python 3 literal non-ASCII chars are not allowed in literal byte strings: https://docs.python.org/3.3/reference/lexical_analysis.html#string-and-bytes-literals

By default literal byte strings are ASCII-encoded. Save this program to an encoding.py file:

#!/usr/bin/env python2
print "ℙƴ☂ℌøἤ"

Running ./encoding.py will crash with:

SyntaxError: Non-ASCII character '\xe2' in file ./encoding.py on line 2, but no encoding declared; see http://python.org/dev/peps/pep-0263/ for details

Since the source file is assumed to be ASCII by default, you can't put non-ASCII characters in string literals.

Using a unicode string literal (print u"ℙƴ☂ℌøἤ") will also crash with the same error.

You can put non-ASCII characters in byte string literals if you escape them. This will work, spelling out each character's UTF-8 bytes with \x escapes:

#!/usr/bin/env python2
print "\xe2\x84\x99\xc6\xb4\xe2\x98\x82\xe2\x84\x8c\xc3\xb8\xe1\xbc\xa4"

Prints out ℙƴ☂ℌøἤ (this depends on my terminal's encoding happening to be UTF-8, which matches the escaped bytes; on a terminal with a different encoding it would print garbage).

Or, better, you can use escaped unicode code points. Note the leading u; this prints out ℙƴ☂ℌøἤ and should work regardless of the encoding of the terminal you run it from:

#!/usr/bin/env python2
print u"\u2119\u01b4\u2602\u210c\u1f24"

(See the Python docs on encoding declarations: https://docs.python.org/3.6/reference/lexical_analysis.html#encoding-declarations)

But if you put a # -*- coding: utf-8 -*- comment at the top of your file then Python decodes the source file, and therefore its string literals, as UTF-8, so this program will work:

#!/usr/bin/env python2
# -*- coding: utf-8 -*-
print "ℙƴ☂ℌøἤ"

It prints out ℙƴ☂ℌøἤ. As does this:

#!/usr/bin/env python2
# -*- coding: utf-8 -*-
print "ℙƴ☂ℌøἤ".decode("utf-8")

If you tried to decode the string literal using anything other than UTF-8 it might not crash (depending on what encoding you used) but would print out the wrong characters. This prints out the wrong string:

#!/usr/bin/env python2
# -*- coding: utf-8 -*-
print "ℙƴ☂ℌøἤ".decode("utf-16")

It prints out 蓢욙芘蓢쎌꒼.

If you're interacting with Python in a REPL it's the same but the default encoding depends on the environment. It's usually UTF-8. I think it comes from sys.stdout.encoding:

Python 2.7.14 (default, Sep 23 2017, 22:06:14) 
[GCC 7.2.0] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> sys.stdout.encoding
'UTF-8'
>>> print "ℙƴ☂ℌøἤ"
ℙƴ☂ℌøἤ
>>> print "ℙƴ☂ℌøἤ".decode("utf-8")
ℙƴ☂ℌøἤ
>>> print "ℙƴ☂ℌøἤ".decode("utf-16")
蓢욙芘蓢쎌꒼

Note that the REPL shows the repr of a string, which escapes unicode code points and bytes. It's only when you print a string that you see the actual characters:

>>> s = "ℙƴ☂ℌøἤ"
>>> s
'\xe2\x84\x99\xc6\xb4\xe2\x98\x82\xe2\x84\x8c\xc3\xb8\xe1\xbc\xa4'
>>> u = s.decode("utf8")
>>> u
u'\u2119\u01b4\u2602\u210c\xf8\u1f24'
>>> e = u.encode("utf8")
>>> e
'\xe2\x84\x99\xc6\xb4\xe2\x98\x82\xe2\x84\x8c\xc3\xb8\xe1\xbc\xa4'
>>> print(s)
ℙƴ☂ℌøἤ
>>> print(u)
ℙƴ☂ℌøἤ
>>> print(e)
ℙƴ☂ℌøἤ
.gitignore

__pycache__
.pytest_cache
.tox

hello.txt

Hello world

test_unicode.py
# -*- coding: utf-8 -*-
import sys

import pytest

is_python3 = sys.version_info[0] > 2

if is_python3:
    unicode_type = str
    bytes_type = bytes
else:
    unicode_type = unicode
    bytes_type = str

# Skip tests marked @python2 if we're running in Python 3.
python2 = pytest.mark.skipif(is_python3, reason="This test only works in Python 2")

# Skip tests marked @python3 if we're running in Python 2.
python3 = pytest.mark.skipif(not is_python3, reason="This test only works in Python 3")


class TestLiteralStrings(object):
    @python2
    def test_in_python_2_literal_strings_are_byte_strings_by_default(self):
        # Each character in this string represents a byte (or sequence of bytes
        # for certain characters) in either ASCII or the encoding given in the
        # -*- coding comment at the top of the file.
        assert type("byte_string") == str

    @python3
    def test_in_python_3_string_literals_are_unicode(self):
        # The type of a string literal in Python 3 is "str", as in Python 2, but
        # in Python 3 "str" means a unicode string (sequence of unicode code
        # points) not a byte string (sequence of encoded bytes) as in Python 2!
        assert type("Hi") == str

    def test_you_can_also_do_a_literal_byte_string_with_a_b_prefix(self):
        assert type(b"byte_string") == bytes_type

    def test_literal_unicode_strings_have_u_prefix(self):
        # Each character in this string represents a code point.
        assert type(u"unicode_string") == unicode_type

    @python2
    def test_in_python_2_bytes_is_an_alias_for_str(self):
        assert bytes == str

    @python3
    def test_theres_no_type_called_unicode_in_python3(self):
        # There's no "unicode" type in Python 3 ("str" is the unicode type).
        with pytest.raises(NameError, match="^name 'unicode' is not defined"):
            unicode


class TestLength(object):
    def test_the_length_of_a_unicode_string_is_its_number_of_code_points(self):
        assert len(u"Hi \u2119\u01b4\u2602\u210c\xf8\u1f24") == 9

    def test_the_length_of_a_byte_string_is_its_number_of_bytes(self):
        # The length of a byte string counts its number of bytes not its number
        # of "characters" or unicode code points.
        assert len(u"Hi \u2119\u01b4\u2602\u210c\xf8\u1f24".encode("utf-8")) == 19


class TestDefaultEncoding(object):
    @python2
    def test_in_python_2_the_default_encoding_is_ascii(self):
        assert sys.getdefaultencoding() == "ascii"

    @python3
    def test_in_python_3_the_default_encoding_is_utf8(self):
        assert sys.getdefaultencoding() == "utf-8"


class TestEncode(object):
    def test_encoding_a_unicode_string_turns_it_into_a_byte_string(self):
        byte_string = u"Hi \u2119\u01b4\u2602\u210c\xf8\u1f24".encode("utf-8")
        assert type(byte_string) == bytes_type
        assert byte_string == b'Hi \xe2\x84\x99\xc6\xb4\xe2\x98\x82\xe2\x84\x8c\xc3\xb8\xe1\xbc\xa4'

    def test_encode_raises_UnicodeEncodeError(self):
        # Not every encoding supports all possible unicode characters.
        # For example ASCII only supports ASCII chars.
        # If you try to encode a unicode string containing non-ASCII chars using
        # the ASCII encoding it'll raise UnicodeEncodeError.
        with pytest.raises(UnicodeEncodeError, match="^'ascii' codec can't encode character"):
            u"Hello ✋".encode("ascii")

    @python2
    def test_in_python_2_you_can_encode_a_byte_string(self):
        # Byte strings have an encode method in Python 2!
        # It implicitly decodes the byte string to unicode and then encodes
        # the unicode string.
        original_byte_string = b"hello"
        new_byte_string = original_byte_string.encode("utf-8")
        assert type(new_byte_string) == str
        assert new_byte_string == 'hello'

    @python2
    def test_in_python_2_encoding_a_byte_string_can_raise_UnicodeDecodeError(self):
        byte_string = b"Hi \u2119\u01b4\u2602\u210c\xf8\u1f24"
        with pytest.raises(UnicodeDecodeError):
            byte_string.encode("ascii")

    @python3
    def test_in_python_3_you_cant_encode_a_byte_string(self):
        byte_string = b"Hi \u2119\u01b4\u2602\u210c\xf8\u1f24"
        with pytest.raises(AttributeError):
            byte_string.encode("ascii")

    @python2
    def test_in_python_2_encoding_encodes_to_ascii_by_default(self):
        # By default encode() uses the system default encoding, which is ascii.
        with pytest.raises(UnicodeEncodeError, match="^'ascii' codec can't encode character"):
            u"Hi \u2119\u01b4\u2602\u210c\xf8\u1f24".encode()

    def test_you_can_tell_encode_to_replace_incompatible_chars_with_question_marks_instead_of_crashing(self):
        unicode_string = u"Hi \u2119\u01b4\u2602\u210c\xf8\u1f24"
        byte_string = unicode_string.encode("ascii", "replace")
        assert byte_string == b"Hi ??????"

    def test_you_can_tell_encode_to_replace_incompatible_chars_with_XML_instead_of_crashing(self):
        unicode_string = u"Hi \u2119\u01b4\u2602\u210c\xf8\u1f24"
        byte_string = unicode_string.encode("ascii", "xmlcharrefreplace")
        # This output can actually be used in an XML or HTML file and will render
        # correctly in a browser.
        assert byte_string == b"Hi &#8473;&#436;&#9730;&#8460;&#248;&#7972;"

    def test_you_can_tell_encode_to_omit_incompatible_chars_instead_of_crashing(self):
        unicode_string = u"Hi \u2119\u01b4\u2602\u210c\xf8\u1f24"
        byte_string = unicode_string.encode("ascii", "ignore")
        assert byte_string == b"Hi "


class TestDecode(object):
    def test_decoding_a_byte_string_turns_it_into_a_unicode_string(self):
        byte_string = b'Hi \xe2\x84\x99\xc6\xb4\xe2\x98\x82\xe2\x84\x8c\xc3\xb8\xe1\xbc\xa4'
        unicode_string = byte_string.decode("utf-8")
        assert type(unicode_string) == unicode_type
        assert unicode_string == u'Hi \u2119\u01b4\u2602\u210c\xf8\u1f24'

    def test_decode_raises_UnicodeDecodeError(self):
        utf8_byte_string = b"Hello \xe2\x9c\x8b"
        with pytest.raises(UnicodeDecodeError, match="^'ascii' codec can't decode byte"):
            utf8_byte_string.decode("ascii")

    def test_decoding_a_utf8_string_as_ascii_will_work_if_there_are_no_non_ascii_chars(self):
        # Since UTF8 is a super-set of ASCII, as long as the byte string doesn't
        # contain any non-ASCII characters then decoding a UTF8 string as ASCII
        # will work (but you should never do this!)
        utf8_byte_string = b"Hello :wave:"
        assert utf8_byte_string.decode("ascii") == u"Hello :wave:"

    @pytest.mark.parametrize("wrong_encoding", ("iso8859-1", "utf-16-le", "utf-16-be", "shift-jis"))
    def test_decoding_using_the_wrong_encoding_sometimes_works(self, wrong_encoding):
        utf8_byte_string = b'\x48\x69\xe2\x84\x99\xc6\xb4\xe2\x98\x82\xe2\x84\x8c\xc3\xb8\xe1\xbc\xa4'
        correct_unicode = utf8_byte_string.decode("utf-8")
        wrong_unicode = utf8_byte_string.decode(wrong_encoding)
        # The string decodes without error, but it produces the wrong code points,
        # the wrong string of characters. This is one way that you can end up
        # displaying garbage characters to users.
        assert wrong_unicode != correct_unicode

    def test_decoding_with_utf8_can_also_raise_UnicodeDecodeError(self):
        # Decoding a byte string using the UTF8 codec can also raise
        # UnicodeDecodeError, for example if the string is encoded using some
        # other encoding (neither UTF8 nor ASCII) or is just an invalid byte
        # string.
        with pytest.raises(UnicodeDecodeError, match="codec can't decode byte"):
            b"\x78\x9a\xbc\xde\xf0".decode("utf-8")

    @python2
    def test_in_python_2_decode_decodes_from_ascii_by_default(self):
        with pytest.raises(UnicodeDecodeError, match="^'ascii' codec can't decode byte"):
            b'Hi \xe2\x84\x99\xc6\xb4\xe2\x98\x82\xe2\x84\x8c\xc3\xb8\xe1\xbc\xa4'.decode()

    @python3
    def test_in_python_3_decode_decodes_from_utf8_by_default(self):
        with pytest.raises(UnicodeDecodeError, match="^'utf-8' codec can't decode byte"):
            b'\x78\x9a\xbc\xde\xf0'.decode()

    def test_you_can_tell_decode_to_omit_incompatible_bytes(self):
        utf8_byte_string = b"Hello \xe2\x9c\x8b"
        unicode_string = utf8_byte_string.decode("ascii", "ignore")
        assert unicode_string == u"Hello "

    def test_you_can_tell_decode_to_replace_incompatible_bytes_with_question_marks(self):
        utf8_byte_string = b"Hello \xe2\x9c\x8b"
        unicode_string = utf8_byte_string.decode("ascii", "replace")
        # In this case it uses the unicode REPLACEMENT CHARACTER (U+FFFD), not an
        # ASCII question mark.
        # Note that there are three of them here - one for every incompatible
        # _byte_ in the UTF8 byte string. The single ✋ character is three bytes.
        assert unicode_string == u"Hello ���"


class TestImplicitDecoding(object):
    @python2
    def test_in_python_2_concatenating_strings_implicitly_decodes_byte_strings_to_unicode_strings(self):
        # The byte string is implicitly decoded to a unicode string using the
        # system's default encoding (ascii by default).
        concatenated = u"Hello " + b"world"
        assert type(concatenated) == unicode_type
        assert concatenated == u"Hello world"

    @python2
    def test_in_python_2_concatenating_strings_can_raise_UnicodeDecodeError(self):
        with pytest.raises(UnicodeDecodeError, match="^'ascii' codec can't decode byte"):
            # The second string here is a UTF8 byte string containing a non-ASCII
            # character. Python will try to decode it to unicode using ASCII and
            # crash.
            u"Hello " + b"\xe2\x9c\x8b"

    @python3
    def test_in_python_3_you_cannot_concatenate_unicode_with_bytes(self):
        # Python 3 never tries to implicitly decode byte strings using a default
        # encoding when you try to concatenate them with unicode strings, format
        # them, etc.
        #
        # Instead, it explicitly refuses to let you do that.
        with pytest.raises(TypeError, match="^must be str, not bytes$"):
            "Hello " + b"world"

    @python2
    def test_in_python_2_formatting_strings_can_raise_UnicodeDecodeError(self):
        with pytest.raises(UnicodeDecodeError, match="^'ascii' codec can't decode byte"):
            u"Hello %s" % b"\xe2\x9c\x8b"

    @python3
    def test_python_3_calls_repr_when_you_format_byte_strings_into_unicode_strings(self):
        # Unlike, for example, the + operator, the % operator (when used with
        # %s) will call str() on its argument.
        #
        # In this case the argument is a byte string and str() on a byte string
        # just falls back on repr() which returns "b'\\xe1\\x9c\\x8b'".
        assert u'Hello %s' % b'\xe1\x9c\x8b' == u"Hello b'\\xe1\\x9c\\x8b'"

    @python3
    def test_in_python_3_you_cant_format_unicode_strings_into_byte_strings(self):
        with pytest.raises(TypeError, match="^%b requires a bytes-like object, or an object that implements __bytes__, not 'str'"):
            assert b'\xe1\x9c\x8b %s' % u"Hello"

    @python2
    def test_in_python_2_byte_strings_and_unicode_strings_can_be_equal(self):
        assert b"Hello" == u"Hello"

    @python3
    def test_in_python_3_byte_strings_and_unicode_strings_cannot_be_equal(self):
        assert b"Hello" != u"Hello"

    @python2
    def test_in_python_2_you_can_use_a_byte_string_to_match_a_unicode_dict_key(self):
        assert {u"hello": u"world"}[b"hello"] == u"world"

    @python3
    def test_in_python_3_you_cannot_use_a_byte_string_to_match_a_unicode_dict_key(self):
        with pytest.raises(KeyError):
            {u"hello": u"world"}[b"hello"]

    @python2
    def test_in_python_2_you_can_use_a_unicode_string_to_match_a_byte_string_dict_key(self):
        assert {b"hello": b"world"}[u"hello"] == b"world"

    @python3
    def test_in_python_3_you_cannot_use_a_unicode_string_to_match_a_byte_string_dict_key(self):
        with pytest.raises(KeyError):
            {b"hello": b"world"}[u"hello"]

    @python2
    def test_trying_to_encode_a_byte_string_implicitly_decodes_it_to_unicode_using_ascii_first(self):
        utf8_byte_string = "Hello ✋"
        # This is Python 2 trying to be even more "helpful". Even though it makes
        # no sense to call .encode() on an (already-encoded) byte string, if you
        # try to do so Python 2 will try to implicitly decode that byte string to
        # unicode using ascii first.
        #
        # This means that calling **encode()** can raise Unicode**Decode**Error!
        #
        # In Python 3 byte strings just don't have an encode() method.
        with pytest.raises(UnicodeDecodeError, match="^'ascii' codec can't decode byte"):
            utf8_byte_string.encode("utf-8")


class TestReadingFromFiles(object):
    @python2
    def test_in_python_2_reading_text_from_file_returns_bytes_by_default(self):
        # Reading text from a file in "r" mode returns bytes in Python 2.
        text = open("hello.txt", "r").read()
        assert type(text) == bytes_type
        assert text == b"Hello world\n"

    @python3
    def test_in_python_3_reading_text_from_file_returns_unicode_by_default(self):
        # Reading text from a file in "r" mode returns unicode in Python 3.
        #
        # This decodes the bytes of the file using locale.getpreferredencoding()
        # (UTF-8 on Ubuntu).
        text = open("hello.txt", "r").read()
        assert type(text) == unicode_type
        assert text == u"Hello world\n"

    def test_you_can_read_bytes_from_file_too(self):
        text = open("hello.txt", "rb").read()
        assert type(text) == bytes_type
        assert text == b"Hello world\n"
tox.ini

[tox]
skipsdist = true
envlist = py27, py36

[testenv]
deps = pytest
commands = pytest test_unicode.py