@tarekziade
Created October 17, 2012 20:09
Hard to find error in Python 2

So you have a data mapping that you format into a string:

>>> data = {'a': 'é', 'b': 's'}
>>> '%(b)s %(a)s' % data
's \xc3\xa9'

Nice. Python is simply formatting the string and everything works.

Now what happens if one of the values is unicode:

>>> data = {'a': 'é', 'b': u's'}
>>> '%(b)s %(a)s' % data
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)

That breaks because, as soon as one value is unicode, the formatting tries to decode the other byte strings to unicode as well, using the default 'ascii' codec.
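The implicit step can be reproduced by hand. This is only a sketch of what the conversion amounts to, not the interpreter's actual code path; the names raw and outcome are made up for the illustration.

```python
# The UTF-8 bytes for 'é', i.e. what the 'a' value holds in the example.
raw = b'\xc3\xa9'

try:
    # Roughly what the implicit conversion does: decode with 'ascii'.
    raw.decode('ascii')
    outcome = 'decoded'
except UnicodeDecodeError as exc:
    # Same error message as in the traceback.
    outcome = str(exc)

print(outcome)
```

Byte 0xc3 is outside the 7-bit ascii range, so the decode fails on the very first byte.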

I find this error completely illogical since I am trying to get a string. Why not reject the formatting explicitly instead of trying to convert things (in the wrong way)?

Like:

>>> '%(b)s %(a)s' % data
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: the 'b' value is of type <unicode> and the formatting is <str>
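
A behaviour like this can be approximated today with a small wrapper. The sketch below is hypothetical: strict_format is not part of Python, and it only checks string values against the type of the template. The usage passes a unicode template so that the same code runs on Python 2 and Python 3; there the offending value is the byte string 'a'.

```python
def strict_format(template, mapping):
    """Format template % mapping, refusing mixed string types.

    Hypothetical helper: raises TypeError instead of letting Python
    convert between byte strings and unicode strings implicitly.
    """
    string_types = (type(b''), type(u''))
    for key, value in mapping.items():
        # Only string values are checked; numbers etc. pass through.
        if isinstance(value, string_types) and type(value) is not type(template):
            raise TypeError(
                "the %r value is of type <%s> and the formatting is <%s>"
                % (key, type(value).__name__, type(template).__name__)
            )
    return template % mapping


try:
    strict_format(u'%(b)s %(a)s', {'a': b'\xc3\xa9', 'b': u's'})
    message = None
except TypeError as exc:
    message = str(exc)

print(message)
```

With a byte string template and a unicode 'b' value on Python 2, the same check would report 'b' instead, as in the traceback proposed above.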
@sourcefrog

It certainly is annoying, but in terms of Python's implementation it makes a certain amount of sense.

Generally speaking when we're trying to combine byte strings and unicode strings there are a few possible strategies: require explicit conversion; convert only when it's pretty safe (ascii); or assume the byte strings are utf-8.

To some extent this is configurable in Python through sys.setdefaultencoding, but, rather annoyingly, it's hard to set that once the program has started. So you tend to be stuck with the default, which is often ascii. To me that's often a worse choice than the other two, because programs will often pass simple tests but fail unnecessarily when given utf-8. As you see here.

In the first case, you're combining a set of byte strings. Obviously you can join them together.

In the second case, you're trying to combine a byte string (containing utf-8) and a unicode string. There are a couple of semantics we could want there: 1- just error out and require explicit conversion; 2- make the conservative restriction that the byte string must contain only ascii; or 3- assume it's utf-8.

You want #1 in this case, and I think that would often be reasonable in helping people make sure their code is correct, at the cost of sometimes needing to write more explicit code. But then being explicit is supposed to be a Python value.

I think #3 would also be a reasonable pragmatic choice: most unicode data in the world is utf-8, and things that aren't are unlikely to be confused for utf-8, so you'll still get an error.
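
A quick way to see why assuming utf-8 is fairly safe, as a sketch only: the variable names are made up, and Latin-1 is just one example of a non-utf-8 encoding.

```python
utf8_bytes = b'\xc3\xa9'   # 'é' encoded as UTF-8
latin1_bytes = b'\xe9'     # 'é' encoded as Latin-1

# Valid UTF-8 decodes to the expected character.
decoded = utf8_bytes.decode('utf-8')

try:
    # A lone 0xe9 byte is not valid UTF-8, so this raises.
    latin1_bytes.decode('utf-8')
    latin1_result = 'decoded'
except UnicodeDecodeError:
    latin1_result = 'rejected'

print(repr(decoded), latin1_result)
```

So data in another encoding tends to be rejected rather than silently misread, although short or unusual inputs could still slip through.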

Unfortunately Python often does #2, which means you can have lots of lurking bugs, like this, that only show up when the code gets non-ascii data.

In theory, you can use setdefaultencoding to choose between any of these three - if that worked, it would be great, you could pick whatever was appropriate for your particular environment.

Unfortunately in practice Python makes it perversely hard to change the encoding even at startup time. So you basically need to find and fix every bug like this by yourself.
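
You can check what your interpreter gives you with a couple of lines. This is a sketch for a normally started interpreter: on Python 2 the site module deletes sys.setdefaultencoding during startup, and on Python 3 the function does not exist at all.

```python
import sys

# 'ascii' on a stock Python 2, 'utf-8' on Python 3.
default_encoding = sys.getdefaultencoding()

# False once startup has finished, so there is nothing left to call.
can_change = hasattr(sys, 'setdefaultencoding')

print(default_encoding)
print(can_change)
```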

Incidentally, if you're formatting a string to be read by a human, you probably want to make the format string unicode. And if the 'a' value holds text, it should be decoded to a unicode object earlier on.
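
A minimal sketch of that approach, assuming the byte strings are utf-8. to_text is a hypothetical helper, not a standard function.

```python
def to_text(value, encoding='utf-8'):
    """Return value as unicode text, decoding byte strings explicitly."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


# Mixed input, as in the failing example: one byte string, one unicode.
raw = {'a': b'\xc3\xa9', 'b': u's'}

# Decode at the boundary, then work with unicode only.
data = dict((key, to_text(value)) for key, value in raw.items())
result = u'%(b)s %(a)s' % data

print(repr(result))
```

Once every value and the format string are unicode, no implicit ascii decoding is involved, and you encode back to bytes only when writing the result out.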
