{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Unicode and Strings in Python 2\n",
"\n",
"When dealing with strings and unicode in most languages, there are a few rules to live by. This is no different in Python, regardless of whether you are talking about 2 or 3.\n",
"\n",
"* Unicode is an abstraction; bytes are real.\n",
"* You need to know whether you are dealing with bytes or unicode.\n",
"* With few exceptions, your application will receive bytes as input and will have to output bytes. You should convert those bytes to unicode as soon as possible. Think of your application as a unicode sandwich--bytes in, bytes out, unicode in between.\n",
"* If you are dealing with bytes, you need to know what encoding those bytes are in."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Strings and bytes\n",
"\n",
"In Python 2, plain string literals are byte strings. Let's look at some basic properties of strings:"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"collapsed": false,
"scrolled": true
},
"outputs": [
{
"data": {
"text/plain": [
"True"
]
},
"execution_count": 1,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"'Ø' == b'Ø'"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"True"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"'Ø' == b'\\xc3\\x98'"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"2"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"len('Ø') # make sure you understand why this is the case or you will be lost"
]
},
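{
"cell_type": "markdown",
"metadata": {},
"source": [
"If it isn't obvious why the length is 2, it helps to look at the individual bytes. The following is just an illustration and assumes, as everywhere in this notebook, that the source is UTF-8 encoded, so the single character `Ø` is stored as the two bytes `\\xc3` and `\\x98`:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"[hex(ord(b)) for b in 'Ø'] # iterating over a byte string yields one byte at a time"
]
},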
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Unicode\n",
"\n",
"We can explicitly declare a string as unicode using the `u''` syntax. Unicode codepoints can be specified by prefixing the four-digit hex value of the codepoint with `\\u`:"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"True"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"u'Ø' == u'\\u00d8'"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"1"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"len(u'Ø') # compare to our example with bytes"
]
},
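{
"cell_type": "markdown",
"metadata": {},
"source": [
"A quick sanity check that a unicode string holds codepoints rather than bytes: `ord()` on a one-character unicode string gives back the codepoint as an integer, which we can compare against the hex value used above."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"ord(u'Ø') == 0xd8 # one character, one codepoint"
]
},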
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Converting between strings and unicode\n",
"\n",
"The two methods you need to remember are `.decode()` and `.encode()`, but you need to make sure you are using them in the correct situations. `.decode()` is used to convert a byte string into a unicode string and `.encode()` is used to convert a unicode string into a byte string. Let's look at a few examples:"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"True"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"b'\\xc3\\x98'.decode('utf-8') == u'\\u00d8'"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"True"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"'Ø'.decode('utf-8') == u'\\u00d8'"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"True"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"u'\\u00d8'.encode('utf-8') == b'\\xc3\\x98'"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"True"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"u'Ø'.encode('utf-8') == 'Ø'"
]
},
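{
"cell_type": "markdown",
"metadata": {},
"source": [
"Putting `.decode()` and `.encode()` together gives you the unicode sandwich from the rules at the top: decode bytes at the edges of your program, do the real work in unicode, and encode again on the way out. Here is a minimal sketch of that pattern--the function name `shout` and the choice of UTF-8 are just assumptions for the sake of the example:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"def shout(raw):\n",
"    text = raw.decode('utf-8')   # bytes in: decode at the boundary\n",
"    text = text.upper()          # do the real work in unicode\n",
"    return text.encode('utf-8')  # bytes out: encode at the boundary\n",
"\n",
"shout(b'\\xc3\\xb8') == b'\\xc3\\x98' # ø in, Ø out"
]
},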
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you stopped here, you would understand enough about unicode in Python 2 for about 95% of what you'll write. Just remember to make your unicode sandwich and you'll be fine. But let's dig a little deeper and look at what can go wrong and why."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## When shit gets fucked up\n",
"\n",
"The first thing you need to understand is that there's a fair amount of implicit conversion going on under the hood in python 2. Let's generate our first error--technically, it's a warning:"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"/home/mgraves/.local/venvs/jupyter2/lib/python2.7/site-packages/ipykernel/__main__.py:1: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal\n",
" if __name__ == '__main__':\n"
]
},
{
"data": {
"text/plain": [
"False"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"b'\\xc3\\x98' == u'\\u00d8'"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Hopefully, you are not surprised that the comparison evaluates to `False`. We've already seen above that these are two different beasts, but if you look at the warning, you'll see that we haven't really even compared these two objects. It's a false `False`. Python tries to convert our byte string into a unicode string, but can't, so it just says these things aren't equal and calls it a day. But we know we can convert `b'\\xc3\\x98'` to unicode, because we did exactly that earlier, so what gives?\n",
"\n",
"Let's look at a different example:"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {
"collapsed": false,
"scrolled": true
},
"outputs": [
{
"data": {
"text/plain": [
"True"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"b'f' == u'f'"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"What the hell? I'm going to drag this out a little longer and introduce our second error, one you have almost certainly seen if you've written much python:"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {
"collapsed": false
},
"outputs": [
{
"ename": "UnicodeDecodeError",
"evalue": "'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)",
"output_type": "error",
"traceback": [
"\u001b[1;31m---------------------------------------------------------------------------\u001b[0m",
"\u001b[1;31mUnicodeDecodeError\u001b[0m Traceback (most recent call last)",
"\u001b[1;32m<ipython-input-12-754acf09319a>\u001b[0m in \u001b[0;36m<module>\u001b[1;34m()\u001b[0m\n\u001b[1;32m----> 1\u001b[1;33m \u001b[1;34mb'\\xc3\\x98'\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mdecode\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;34m'ascii'\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m",
"\u001b[1;31mUnicodeDecodeError\u001b[0m: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)"
]
}
],
"source": [
"b'\\xc3\\x98'.decode('ascii')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Oh god, the dreaded `UnicodeDecodeError`. You should know enough now to understand why we're getting this, though. We started with a byte string and tried to decode it as ASCII. The first byte in our sequence, `\\xc3`, isn't a valid ASCII byte, so clearly this byte string isn't ASCII.\n",
"\n",
"On the other hand, the following works as we would expect:"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"u'f'"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"b'f'.decode('ascii')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can make a pretty good guess as to what's going on when we try to compare a byte string with a unicode string. Python will try to decode the byte string to unicode, assuming the byte string is ASCII encoded. This solves our first mystery.\n",
"\n",
"Let's take a look at another error:"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {
"collapsed": false
},
"outputs": [
{
"ename": "UnicodeDecodeError",
"evalue": "'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)",
"output_type": "error",
"traceback": [
"\u001b[1;31m---------------------------------------------------------------------------\u001b[0m",
"\u001b[1;31mUnicodeDecodeError\u001b[0m Traceback (most recent call last)",
"\u001b[1;32m<ipython-input-14-cc2138a028bb>\u001b[0m in \u001b[0;36m<module>\u001b[1;34m()\u001b[0m\n\u001b[1;32m----> 1\u001b[1;33m \u001b[1;34m'Ø'\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mencode\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;34m'utf-8'\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m",
"\u001b[1;31mUnicodeDecodeError\u001b[0m: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)"
]
}
],
"source": [
"'Ø'.encode('utf-8')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Wait, why are we getting a decode error when we try to encode? And why is it trying to use ASCII when I explicitly said UTF-8? I can't tell you the number of times I've asked this. But the question we should really be asking is: why are we trying to encode a byte string? (We will be asking ourselves a different question in Python 3.)\n",
"\n",
"Here's a great example of why we always need to know what we are dealing with. Byte strings are already encoded, so this is a pretty silly thing to ask for. Given what we've seen already, and the error message here, we can guess at what's happening. `.encode()` takes a unicode string and converts it to a byte string, but we're giving it a byte string here. Python first tries to turn our byte string into unicode before encoding it to UTF-8. And what encoding does it assume our byte string is in? You guessed it: ASCII.\n",
"\n",
"You should understand now why this works:"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"'f'"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"'f'.encode('utf-8')"
]
},
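{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make the hidden step visible, here is roughly what calling `.encode()` on a byte string boils down to--a sketch of the behaviour, not the actual implementation. `sys.getdefaultencoding()` reports which codec Python 2 falls back on for these implicit conversions:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import sys\n",
"\n",
"print(sys.getdefaultencoding()) # 'ascii' on a stock Python 2\n",
"\n",
"# roughly what 'f'.encode('utf-8') does behind the scenes\n",
"'f'.decode(sys.getdefaultencoding()).encode('utf-8')"
]
},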
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Another common source of that same `UnicodeDecodeError` is string concatenation. The same mechanism is at work: Python will attempt to convert to unicode on your behalf, assuming ASCII for byte strings:"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {
"collapsed": false
},
"outputs": [
{
"ename": "UnicodeDecodeError",
"evalue": "'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)",
"output_type": "error",
"traceback": [
"\u001b[1;31m---------------------------------------------------------------------------\u001b[0m",
"\u001b[1;31mUnicodeDecodeError\u001b[0m Traceback (most recent call last)",
"\u001b[1;32m<ipython-input-16-9bfb7add2141>\u001b[0m in \u001b[0;36m<module>\u001b[1;34m()\u001b[0m\n\u001b[1;32m----> 1\u001b[1;33m \u001b[1;34mu'f'\u001b[0m \u001b[1;33m+\u001b[0m \u001b[1;34m'ØØ'\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m",
"\u001b[1;31mUnicodeDecodeError\u001b[0m: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)"
]
}
],
"source": [
"u'f' + 'ØØ'"
]
},
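{
"cell_type": "markdown",
"metadata": {},
"source": [
"The fix is to be explicit: decode the byte string yourself before mixing it with unicode. This assumes the bytes are UTF-8, which happens to be true in this notebook, but it is exactly the kind of thing you need to know about your own data:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"u'f' + 'ØØ'.decode('utf-8') # decode first, then concatenate unicode with unicode"
]
},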
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## When shit gets fucked up and we don't realize it\n",
"\n",
"In the cases we've seen so far, python does a pretty good job of telling us when we've done something stupid. But the worst pitfalls of encoding problems are the ones we don't discover until it's too late.\n",
"\n",
"Let's start with a contrived example, first:"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {
"collapsed": false
},
"outputs": [
{
"data": {
"text/plain": [
"u'\\xc3\\x98'"
]
},
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"u'\\u00d8'.encode('utf-8').decode('latin-1')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The first thing to note is that we haven't generated any errors. We started with a unicode string, converted it to a byte string and converted it back to unicode, just the way we are supposed to. No errors here. The next thing to note is that we didn't end up where we started. The character we started with, Ø, is represented by a different sequence of bytes in UTF-8 than it is in latin-1, and because latin-1 maps every possible byte to a character, decoding our two UTF-8 bytes with the wrong codec silently produces two characters instead of one.\n",
"\n",
"Now, let's imagine a more realistic scenario:"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Ø\n",
"Ø\n"
]
}
],
"source": [
"with open('foo.txt', 'wb') as fp:\n",
" fp.write(u'\\u00d8'.encode('utf-8'))\n",
"\n",
"# some time later...\n",
"with open('foo.txt') as fp:\n",
" txt = fp.read()\n",
"\n",
"print(txt.decode('utf-8'))\n",
"print(txt.decode('latin-1'))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Here we see our contrived example played out in an all-too-common scenario. Files get passed around and edited, and we are never really sure what encoding has been used. We can try to determine the encoding based on heuristic analysis, but in this case, we see that both encodings are equally plausible. There is really no substitute for knowing what encoding you are dealing with."
]
},
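{
"cell_type": "markdown",
"metadata": {},
"source": [
"When you do know the encoding, you can push the decoding to the very edge of your program instead of carrying byte strings around. One way to do that in Python 2 is `io.open()`, which takes an `encoding` argument and hands you unicode directly. A small sketch, reusing the `foo.txt` file written above:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import io\n",
"\n",
"# decoding happens at the boundary; fp.read() returns unicode\n",
"with io.open('foo.txt', encoding='utf-8') as fp:\n",
"    txt = fp.read()\n",
"\n",
"txt == u'\\u00d8'"
]
},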
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Source code encoding and PEP 0263\n",
"\n",
"You may have wondered exactly what's going on when I type `'Ø'` in the interpreter. When I put that in my source code, it's important to understand how Python will see it. A Python module is a file, which is to say a sequence of bytes (the interpreter in a terminal is file-like). Just like in our previous example, when the file is read it needs to be decoded so that when Python sees the byte sequence `\\xc3\\x98` it knows to read it as `Ø`. What encoding is used in the file depends on your editor and your environment. In most cases, this will be UTF-8, but it can vary.\n",
"\n",
"PEP 0263 addresses source code encoding by defining a syntax that explicitly tells Python how your source code is encoded. This is why you will often see people place the following at the top of their Python code:\n",
"\n",
"`# -*- coding: utf-8 -*-`"
]
}
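,
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Without a declaration, Python 2 assumes ASCII and will refuse to compile a file that contains non-ASCII bytes. As an illustration, here is what a small module might look like with the declaration in place--the names are made up, but the point is that the coding line tells Python how to decode the bytes that follow, so the `u'Ø'` literal comes out as the codepoint you expect:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# -*- coding: utf-8 -*-\n",
"# with this line at the top of a .py file, Python knows the bytes below are UTF-8\n",
"\n",
"name = 'Ø'    # still a byte string: '\\xc3\\x98'\n",
"uname = u'Ø'  # decoded using the declared encoding: u'\\u00d8'\n",
"\n",
"uname == name.decode('utf-8')"
]
}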
],
"metadata": {
"kernelspec": {
"display_name": "Python 2",
"language": "python",
"name": "python2"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 2
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython2",
"version": "2.7.9"
}
},
"nbformat": 4,
"nbformat_minor": 0
}