Encoding in Python, once and for all

You've all run into the following error:

UnicodeDecodeError: 'some codec' can't decode character 'whatsit' in position x: ordinal not in range(z)

And then you suffered quite a bit getting out of it.

The problem is that most of the time, ignoring encoding works: we work in homogeneous environments and always with data in the same format, or a more or less compatible format.

But text is complicated, terribly complicated, and when the shit hits the fan, if you don't know what you're doing, you'll never get out of it.

This is especially true with Python, because:

  • By default, Python crashes on encoding errors, whereas other languages (like PHP) manage to produce something (something meaningless, something that can corrupt your whole database, but that doesn't crash).
  • Python is used in heterogeneous environments. When you code in JS in the browser, you almost never have to worry about encoding: the browser handles almost everything for you. In Python, whenever you read a file and display it in a terminal, there are potentially 3 different encodings involved.
  • Python 2.7 has very strict defaults, not necessarily adapted to modern computing (code files are assumed to be ASCII, for example).

By the end of this article, you'll know how to get out of all the crappy encoding situations.

Rule number 1: Plain text doesn't exist.

When you have text somewhere (a terminal, a file, a database...), it's inevitably represented as a sequence of 0s and 1s.

The mapping between these sequences of 0s and 1s and letters is made by a huge table containing all the letters on one side and all the combinations of 0s and 1s on the other. There's no magic involved. It's a big table stored somewhere in your computer. If you don't have this table, you can't read text. Even the simplest text.

Unfortunately, in the early days of computing, almost every country created its own table, and these tables are incompatible with each other: for the same combination of 0s and 1s, they give a different character, or nothing at all.

The bad news is that they're still in use today.
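
You can see the incompatibility for yourself. Here's a minimal Python 3 sketch (the byte value is an arbitrary example): the same single byte gives three different results depending on the table used.

python3
>>> data = b'\xe9'          # one single byte
>>> data.decode('latin-1')  # the Western European table says...
'é'
>>> data.decode('koi8_r')   # ...while the Russian table says
'И'
>>> data.decode('ascii')    # and the ASCII table says nothing at all
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe9 in position 0: ordinal not in range(128)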

These tables are called encodings, and there are many of them. Here's a list of those supported by Python:

>>> import encodings
>>> print(''.join('- ' + e + '\n' for e in sorted(set(encodings.aliases.aliases.values()))))
- ascii
- base64_codec
- big5
- big5hkscs
- bz2_codec
- cp037
- cp1026
- cp1125
- cp1140
- cp1250
- cp1251
- cp1252
- cp1253
- cp1254
- cp1255
- cp1256
- cp1257
- cp1258
- cp273
- cp424
- cp437
- cp500
- cp775
- cp850
- cp852
- cp855
- cp857
- cp858
- cp860
- cp861
- cp862
- cp863
- cp864
- cp865
- cp866
- cp869
- cp932
- cp949
- cp950
- euc_jis_2004
- euc_jisx0213
- euc_jp
- euc_kr
- gb18030
- gb2312
- gbk
- hex_codec
- hp_roman8
- hz
- iso2022_jp
- iso2022_jp_1
- iso2022_jp_2
- iso2022_jp_2004
- iso2022_jp_3
- iso2022_jp_ext
- iso2022_kr
- iso8859_10
- iso8859_11
- iso8859_13
- iso8859_14
- iso8859_15
- iso8859_16
- iso8859_2
- iso8859_3
- iso8859_4
- iso8859_5
- iso8859_6
- iso8859_7
- iso8859_8
- iso8859_9
- johab
- koi8_r
- kz1048
- latin_1
- mac_cyrillic
- mac_greek
- mac_iceland
- mac_latin2
- mac_roman
- mac_turkish
- mbcs
- ptcp154
- quopri_codec
- rot_13
- shift_jis
- shift_jis_2004
- shift_jisx0213
- tactis
- tis_620
- utf_16
- utf_16_be
- utf_16_le
- utf_32
- utf_32_be
- utf_32_le
- utf_7
- utf_8
- uu_codec
- zlib_codec

And some have several names (aliases), so counting those, there are even more:

>>> len(encodings.aliases.aliases.keys())
326

When you display text on a terminal with a simple print command, your computer implicitly looks for the table it thinks is the most suitable, and does the translation. Even for the simplest text. Even for a single space.

But above all, it means that your own code IS in an encoding. And you MUST know which one.
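
You can ask Python which tables are in play around you. A quick check (the values will differ on your machine):

python3
>>> import sys, locale
>>> sys.stdout.encoding            # what your terminal speaks
'utf-8'
>>> locale.getpreferredencoding()  # what open() assumes by default
'UTF-8'
>>> sys.getfilesystemencoding()    # how file names are encoded
'utf-8'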

Rule number 2: utf8 is the universal language, use it!

There's a standard that tries to bring together all the world's languages, and it's called Unicode. Unicode is a gigantic table containing combinations of 0s and 1s on one side, and the characters of all possible languages on the other: Chinese, Arabic, French, Spanish, Russian...

Well, it doesn't contain absolutely everything yet, but it covers enough ground to eliminate 99.999999999% of the world's text communication problems between machines.

The downside of Unicode is that it is slower and takes up more space than other representations of the same text. But today, the crappiest phone has 10 times the necessary power, so this is no longer a concern: it can be used almost anywhere (except perhaps in severely constrained embedded systems) without even thinking about it. All major languages, all major services, all major software support Unicode.

There are several concrete implementations of Unicode, the most famous of which is UTF-8.

The moral is: By default, use utf-8.

Once, at a job interview, a guy criticized me for using UTF8 because "it posed encoding problems". Please understand: utf-8 poses no encoding problems whatsoever. It's all the other codecs in the world that pose encoding problems. UTF-8 is about the only one that doesn't pose any.

UTF-8 is the only encoding to and from which, today, you can convert (virtually) any other codec in the world. It's an Esperanto. It's a Rosetta Stone. It is to text what gold is to economics.

If you have an "encoding problem" with UTF8, it's because you don't know what encoding your text is currently in, or how to convert it. That's all there is to it.
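
And converting is just a decode followed by an encode. A minimal sketch (here latin-1 stands in for whatever encoding your source really uses):

python3
>>> latin1_bytes = 'Noël'.encode('latin-1')        # pretend this came from an old system
>>> latin1_bytes
b'No\xebl'
>>> latin1_bytes.decode('latin-1').encode('utf8')  # decode from theirs, encode to UTF8
b'No\xc3\xabl'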

There's almost no reason not to use UTF8 today (except on old systems or systems where resources are so limited that you wouldn't use Python anyway).

Use utf8. Everywhere. All the time.

If you're communicating with a system that doesn't understand UTF8, convert.

But keep your part in UTF8.

Rule number 3: master the encoding of your code

The file in which you write your code is in an encoding and is not linked to your OS. Your editor takes care of that. Learn how to set your editor to use the encoding you want.

And the encoding you want is UTF8.

If you don't know what encoding your code is in, you can't manipulate text and guarantee bug-free output.

You just CAN'T.

So, reflex: configure your text editor to save all your new files in UTF8 by default. Now. Right now.

Look in the editor's documentation, in the help or Google it, but do it.

Then declare this encoding on the first line of each code file with the following expression:

# coding: encoding

For example :

# coding: utf8

This is a specific feature of Python: if the encoding of the file is different from the language's default encoding, it must be declared, otherwise the program will crash on the first conversion. In Python 2.7, the default encoding is ASCII, so you almost always have to declare it. In Python 3, the default encoding is UTF8, so you can omit it if you use it. Which is what you'll be doing after reading this article.
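
For the record, here's roughly what Python 2.7 says if you forget the declaration and use a non-ASCII character anyway (the file name is just for illustration, and the exact message varies a bit between versions):

$ python2.7 script.py
  File "script.py", line 1
SyntaxError: Non-ASCII character '\xc3' in file script.py on line 1, but no encoding declared; see http://python.org/dev/peps/pep-0263/ for details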

Next, there are two types of string in Python:

  • The encoded string: type str in Python 2.7, bytes in Python 3.
  • The decoded string: type unicode in Python 2.7, and str in Python 3 (sic).

Demo:

$ python2.7
Python 2.7.3 (default, Aug  1 2012, 05:14:39)
[GCC 4.6.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> type('chaine') # bytes => encoded
<type 'str'>
>>> type(u'chaine') # unicode => decoded
<type 'unicode'>
$ python3
Python 3.2.3 (default, Oct 19 2012, 20:10:41)
[GCC 4.6.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> type("chaine") # unicode => decoded
<class 'str'>
>>> type(b"chaine") # bits => coded
<class 'bytes'>

Your goal is to have only 'unicode' strings in your code.

In Python 3, this is automatic. All strings are 'unicode' (called str in this version - I know, I know, it's confusing as hell) by default.

In Python 2.7, on the other hand, you need to prefix the string with a u. So, in your code, ALL your strings must be declared as follows:

u"your string"

Yes, it's a pain. But it's essential. Once again, there's no alternative (say it in Thatcher's voice if it turns you on).

If you want, you can enable Python 3 behavior in Python 2.7 by putting this at the beginning of EACH of your modules:

from __future__ import unicode_literals

This only affects the current file, never other modules.

You can also add it to your iPython startup configuration.
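
With the import, Python 2.7 string literals behave like Python 3's. A quick shell check:

python2.7
>>> type('chaine')
<type 'str'>
>>> from __future__ import unicode_literals
>>> type('chaine')
<type 'unicode'>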

To sum up:

  • Set your editor to UTF8.
  • For python 2.7:
    • Put # coding: utf8 at the beginning of your modules.
    • Prefix all your strings with u or do from __future__ import unicode_literals at the start of each module.

If you don't do this, your code will work. Most of the time. And then, one day, in a particular situation, it won't work. At all.

Oh, and it doesn't matter if you have old modules in other encodings. As long as you use 'unicode' objects everywhere, they'll work seamlessly together.
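
That's because once decoded, the original encoding no longer matters. A small illustration (the two byte strings stand in for data coming from differently-encoded sources):

python3
>>> from_latin1 = b'caf\xe9'.decode('latin-1')
>>> from_utf8 = b'th\xc3\xa9'.decode('utf8')
>>> from_latin1 + ' et ' + from_utf8  # once decoded, they mix seamlessly
'café et thé'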

Rule number 4: decode all program inputs

The difficult part of this tip is knowing what an input is.

I'll give you a simple definition: anything that isn't part of your program code and is processed in your program is input.

The text of files, the names of those files, the return of system calls, the return of a parsed command line, user input on a terminal, the return of an SQL query, the downloading of data from the Web, and so on.

These are all inputs.

Like all texts in the world, inputs are in an encoding. And you MUST know which one.

Don't get me wrong, if you don't know the encoding of your inputs, it'll work most of the time, and then one day, it'll crash.

There's no alternative (bis).

And there's no way of reliably detecting an encoding.

So, either the supplier of the data gives you this information (database settings, software documentation, OS configuration, customer spec, phone call to the supplier...), or you're screwed.

You can't read a simple file if you don't know its encoding. Period.

If it has worked for you so far, you've been lucky: most of your files were in the encoding of your editor and your system. As long as you're working on your machine, everything's fine.

If you're reading an HTML page, the encoding is often declared in the META tag or in a header.

If you're reading from or writing to a terminal, the terminal's encoding can be accessed with sys.stdin.encoding and sys.stdout.encoding.

If you're manipulating file names, the encoding of the current file system can be retrieved with sys.getfilesystemencoding().

But sometimes there's no other way of obtaining this information than to ask the person who produced the data. Sometimes, even, the declared encoding is wrong.

In any case, you need this information.

And once you have it, you need to decode the received text.

The simplest way to do this is:

your_string = your_string.decode('codec_name')

In Python 3, the text you receive will be of type bytes, and decode() returns (if you give it the right codec ;-)) its str (unicode) version.

For example, to obtain a (unicode) str from a bytes string encoded in UTF8:

python3
>>> a_string = b'Cha\xc3\xaene' # my file is coded in UTF8, so the string is in UTF8
>>> type(a_string)
<class 'bytes'>
>>> a_string = a_string.decode('utf8')
>>> type(a_string)
<class 'str'>
>>> a_string
'Chaîne'

So as soon as you read a file, retrieve an answer from a database or pass arguments from a terminal, call decode() on the received string.
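
For instance, in Python 2.7, sys.argv arrives as 'str' (bytes). A sketch of decoding it on arrival (the terminal's encoding is the usual suspect; the fallback is there because sys.stdin.encoding is None when input is piped):

python2.7
import sys

# decode command-line arguments as soon as they enter the program
encoding = sys.stdin.encoding or 'utf8'
args = [arg.decode(encoding) for arg in sys.argv[1:]]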

Rule number 5: encode all program outputs

The difficult part of this tip is knowing what an output is.

Again, a simple definition: any data you process that will be read by something other than your code is output.

A print() in a terminal is an output, a write() in a file is an output, an UPDATE in SQL is an output, sending in a socket is an output, and so on.

The rest of the world can't read Python's unicode objects. If you write these objects to a file, terminal or database, Python will automatically convert them to a bytes object, and the encoding used will depend on the context.

Unfortunately, there's a limit to Python's ability to decide on the right encoding.

So, just as you need to know the encoding of a text as input, you need to know the encoding expected by the system you're communicating with as output: know the encoding of the terminal, database or file system you're writing to.

If you can't find out (Web page, API, etc.), use UTF8.

To do this, simply call encode() on any object of type str:

a_string = a_string.encode('codec_name')

For example, to convert a unicode str object to a utf8 bytes object:

python3
>>> a_string = 'Chaîne'
>>> type(a_string)
<class 'str'>
>>> a_string = a_string.encode('utf8')
>>> type(a_string)
<class 'bytes'>
>>> a_string
b'Cha\xc3\xaene'

Summary of rules

  • Plain text doesn't exist.
  • Use UTF8. Now. Everywhere.
  • For python2.7, in your code, specify the file encoding and declare your strings as 'unicode'.
  • On input, know the encoding of your data, and decode with decode().
  • On output, encode in the encoding expected by the system that will receive the data, or if you can't tell, in UTF8, with encode().

I know you're itching to see a real-life case, so here's a pseudo-program:

# Only needed for python2.7
# coding: utf-8

# Only needed for python2.7
# All strings are in unicode (even docstrings)
from __future__ import unicode_literals

"""
A crappy script that downloads lots of pages and saves them
in an SQLite database.

The performed operations are written to a log file.
"""

import re
import urllib2
import sqlite3

pages = (
    ('Snippets de Sebsauvage', 'http://www.sebsauvage.net/python/snyppets/'),
    ('Top 50 de bashfr', 'http://danstonchat.com/top50.html'),
)

# Database creation
conn = sqlite3.connect(r"backup.db")
c = conn.cursor()

try:
    c.execute('''
        CREATE TABLE pages (
            id INTEGER PRIMARY KEY,
            name TEXT,
            html TEXT
        )'''
    )
except sqlite3.OperationalError:
    pass

log = open('backup.log', 'a')  # append to the log file

for name, page in pages:

    # this is a very fragile way of downloading and
    # parsing HTML. Use scrapy and beautifulsoup instead
    # if you're doing a real crawler
    response = urllib2.urlopen(page)
    html = response.read(100000)

    # I retrieve the encoding on the fly
    encoding = re.findall(r'<meta.*?charset=["\']*(.+?)["\'>]', html, flags=re.I)[0]

    # html becomes unicode
    html = html.decode(encoding)

    # here I can do whatever processing I want with my string
    # and at the end of the program...

    # the sqlite lib converts all unicode objects to UTF8 by default
    # because this is sqlite's default encoding, so passing unicode
    # strings works, and all strings in my program are unicode
    # thanks to my __future__ import
    c.execute("""INSERT INTO pages (name, html) VALUES (?, ?)""", (name, html))

    # I write in my file in UTF8 because that's what I want to be able to read
    # later
    msg = "Page '{}' saved\n".format(name)
    log.write(msg.encode('utf8'))

    # note that if I don't encode(), either:
    # - I have a 'unicode' object and it crashes
    # - I have a 'str' object and it works, but my file ends up in
    # whatever encoding the initial string was in (here that would also
    # be UTF8, but that's not always the case)

conn.commit()
c.close()

log.close()

A few tips (more relevant to python 2.7)

Some libraries accept both 'unicode' and 'str' objects:

python2.7
>>> from logging import basicConfig, getLogger
>>> basicConfig()
>>> log = getLogger()
>>> log.warn("Détécé")
WARNING:root:Détécé
>>> log.warn(u"Détécé")
WARNING:root:Détécé

And that's not necessarily a good thing, because if it's written to a log file afterwards, it can cause problems.

Others need clarification:

python2.7
>>> import re
>>> re.search('é', 'télé')
<_sre.SRE_Match object at 0x7fa4d3f77238>
>>> re.search(u'é', u'télé', re.UNICODE)
<_sre.SRE_Match object at 0x7fa4d3f772a0>

The re module, for example, will produce incorrect results on a 'unicode' string if the re.UNICODE flag is not specified.

Others do not accept 'str' objects:

python2.7
>>> import io
>>> io.StringIO(u'é')
<_io.StringIO object at 0x14a96d0>
>>> io.StringIO(u'é'.encode('utf8'))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: initial_value must be unicode or None, not str

Others do not accept unicode objects:

python2.7
>>> import base64
>>> base64.encodestring(u'é'.encode('utf8'))
'w6k=\n'
>>> base64.encodestring(u'é')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.7/base64.py", line 315, in encodestring
    pieces.append(binascii.b2a_base64(chunk))
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 0: ordinal not in range(128)

This may be for performance reasons (some operations are faster on a 'str' object), or for historical reasons, ignorance or laziness.

You can't guess in advance. It's often written in the documentation, otherwise you'll have to test it in the shell.

A well-designed library will ask for unicode and return unicode, freeing your mind. For example, requests and the Django ORM do this, and communicate with the rest of the world (in this case the Web and databases) in the best possible encoding, automatically and transparently. When this is possible, of course: sometimes you'll have to force the encoding because the supplier of your data declares the wrong one. There's nothing you can do about it; it's the same in every language in the world.

Finally, there are shortcuts for certain operations, so use them whenever possible. For example, to read a file, instead of doing a plain open(), you can do:

from codecs import open

# codecs.open() has exactly the same API, including support for "with"
f = open('file', encoding='encoding')

Strings read this way are automatically returned as 'unicode' objects, instead of the 'str' objects you'd otherwise have to decode by hand.
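
In Python 3, you don't even need the codecs module: the built-in open() accepts the same encoding parameter and hands you str (unicode) objects directly:

python3
# the built-in open() decodes for you
with open('file', encoding='utf8') as f:
    text = f.read()  # already a str (unicode) object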

Last-chance tools

If you don't know how to encode your inputs or outputs, you still have a few options left.

Be aware, however, that these options are hacks, things to try when everything described above has gone wrong.

If you do your job right, this shouldn't happen very often. Once or twice a year max, unless you're working in a very, very crappy environment.

First, let's talk about the inputs.

If you receive an object and can't find the encoding, you can force imperfect decoding with decode() by specifying the errors parameter.

It can take the following values:

  • 'strict': raise an exception on error. This is the default behavior.
  • 'ignore': any character that causes an error is ignored.
  • 'replace': any character that causes an error is replaced by a question mark.
python3
>>> utf8_bytes = 'Père Noël'.encode('utf8')
>>> print(utf8_bytes.decode('ascii'))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 1: ordinal not in range(128)
>>> print(utf8_bytes.decode('ascii', errors='ignore'))
Pre Nol
>>> print(utf8_bytes.decode('ascii', errors='replace'))
P��re No��l

Mozilla also comes to the rescue with its chardet lib, which you need to install:

pip install chardet

And which TRIES (meaning it can fail and get it wrong) to detect the encoding used:

python3
>>> import chardet
>>> chardet.detect(u'Le Père Noël est une ordure'.encode('utf8'))
{'encoding': 'utf-8', 'confidence': 0.7525, 'language': ''}
>>> chardet.detect(u"Le Père Noël est une ordure, j'ai dit, 'culé".encode('utf8'))
{'encoding': 'utf-8', 'confidence': 0.87625, 'language': ''}

It works pretty well, but don't expect miracles. The more text there is, the more precise it is, and the closer the confidence parameter is to 1.

Now let's talk about the output, i.e. the case where the system that's going to receive your data is stupid enough to crash as soon as you give it anything other than ASCII.

I don't want to rat anybody out, but I'm looking at the American administration. Subtly. Insistently.

Firstly, encode() accepts the same values for errors as decode(). But as an added bonus, it accepts 'xmlcharrefreplace', very handy for XML files:

python3
>>> u"Et là-bas, tu vois, c'est la coulée du grand bronze".encode('ascii', errors='xmlcharrefreplace')
b"Et l&#224;-bas, tu vois, c'est la coul&#233;e du grand bronze"

Last but not least, you can try to obtain an acceptable text by replacing special characters with their closest ASCII equivalent.

With the Latin alphabet, this is very easy:

python3
>>> import unicodedata
>>> unicodedata.normalize('NFKD', u"éçûö").encode('ascii', 'ignore')
b'ecuo'

For more advanced stuff like Cyrillic or Mandarin, you need to install "unidecode":

pip install unidecode
python3
>>> from unidecode import unidecode
>>> print(unidecode(u"In Russian, Moscow is written Москва"))
In Russian, Moscow is written Moskva