Python 2.7. Unicode Errors Simply Explained

I know I'm about 5 years late with this article, but people are still using Python 2.x, so I think the subject is still relevant.

Some facts first:

  • Unicode is an international encoding standard for use with different languages and scripts
  • In Python 2.x, there are two types that deal with text.
    1. str is an 8-bit string.
    2. unicode is for strings of unicode code points.
      A code point is a number that maps to a particular abstract character. It is written using the notation U+12ca to mean the character with value 0x12ca (4810 decimal)
  • Encoding (noun) is a map of Unicode code points to a sequence of bytes. (Synonyms: character encoding, character set, codeset). Popular encodings: UTF-8, ASCII, Latin-1, etc.
  • Encoding (verb) is the process of converting unicode to the bytes of a str, and decoding is the reverse operation.
  • Python 2.x uses ASCII as a default encoding. (More about this later)
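The facts above can be sketched in a few lines. This is only an illustration: the variable names are mine, and the non-ASCII characters are written as escapes so the snippet itself stays pure ASCII.

```python
# One unicode string, seen as code points and as encoded bytes.
text = u'caf\xe9'                        # 4 code points; the last one is U+00E9

utf8_bytes = text.encode('utf-8')        # encoding (verb): unicode -> bytes
latin1_bytes = text.encode('latin-1')    # same text, different encoding (noun)

# U+12CA is just a number: 0x12ca in hex, 4810 in decimal.
code_point = ord(u'\u12ca')

# Decoding is the reverse operation: bytes -> unicode.
round_trip = utf8_bytes.decode('utf-8')
```

Note that the same four characters take five bytes in UTF-8 but four bytes in Latin-1, which is why you always need to know which encoding a byte string is in.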

SyntaxError: Non-ASCII character

When you see something like this

SyntaxError: Non-ASCII character '\xd0' in file /tmp/p.py on line 2, but no encoding declared; see http://www.python.org/peps/pep-0263.html for details

you just need to declare the encoding in the first or second line of your file. All you need is the string coding=utf8 or coding: utf8 somewhere in a comment on that line. Python doesn't care what goes before or after that string, so the following will work fine too:

# -*- encoding: utf-8 -*-

Notice the dash in utf-8. Python has many aliases for UTF-8 encoding, so you should not worry about dashes or case sensitivity.
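To see why both spellings work, here is a rough sketch. The regular expression is a simplified approximation of the pattern described in PEP 263, not the interpreter's actual code, and declared_encoding is a hypothetical helper name. codecs.lookup() is the standard library call that resolves encoding aliases.

```python
import codecs
import re

# Simplified approximation of the PEP 263 pattern (an assumption, not CPython's code).
CODING_RE = re.compile(r'coding[:=]\s*([-\w.]+)')

def declared_encoding(comment_line):
    """Return the encoding name found in a comment line, or None."""
    match = CODING_RE.search(comment_line)
    return match.group(1) if match else None

# "encoding:" works because it contains "coding:".
emacs_style = declared_encoding('# -*- encoding: utf-8 -*-')
plain_style = declared_encoding('# coding=utf8')

# Different spellings resolve to the same canonical codec name.
canonical = set(codecs.lookup(name).name
                for name in ('utf8', 'UTF-8', 'utf_8', 'UTF8'))
```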

UnicodeEncodeError Explained

>>> str(u'café')
Traceback (most recent call last):
  File "<input>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 3: ordinal not in range(128)

The str() function encodes a string. We passed it a unicode string, and it tried to encode it using the default encoding, which is ASCII. Now the error makes sense: ASCII is a 7-bit encoding that cannot represent characters outside the range 0..127, and é is code point 0xe9 (233 decimal).
Here we called str() explicitly, but something in your code may call it implicitly, and you will get the same UnicodeEncodeError.

How to fix: encode the unicode string manually using .encode('utf8') before passing it to str()
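A minimal sketch of the failure and the fix. It calls .encode('ascii') explicitly, which is what str() does behind the scenes when the default encoding is ASCII, so the snippet does not depend on the interpreter's default. The 'replace' error handler at the end is an extra option I'm adding for illustration; it loses information, so use it only when that is acceptable.

```python
text = u'caf\xe9'

# Encoding to ASCII fails because U+00E9 is outside 0..127.
try:
    text.encode('ascii')
    encode_failed = False
except UnicodeEncodeError:
    encode_failed = True

# The fix: pick an encoding that can represent the character.
utf8_bytes = text.encode('utf8')

# Lossy alternative: substitute '?' for anything ASCII cannot represent.
lossy_bytes = text.encode('ascii', 'replace')
```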

UnicodeDecodeError Explained

>>> utf_string = u'café'
>>> byte_string = utf_string.encode('utf8')
>>> unicode(byte_string)
Traceback (most recent call last):
  File "<input>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 3: ordinal not in range(128)

Let's say we somehow obtained a byte string byte_string which contains encoded UTF-8 characters. We could get this by simply using a library that returns str type.
Then we passed the string to a function that converts it to unicode. In this example we explicitly call unicode(), but some functions may call it implicitly and you'll get the same error.
Again, Python uses ASCII as the default encoding, so it tries to decode the bytes as ASCII. Since byte 0xc3 (195 decimal) is outside the ASCII range 0..127, it fails with UnicodeDecodeError.

How to fix: decode the str manually using .decode('utf8') before passing it to your function.
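The same idea as a small sketch. It decodes with 'ascii' explicitly to reproduce what the implicit conversion does, then applies the fix. The variable names are illustrative.

```python
byte_string = b'caf\xc3\xa9'      # UTF-8 bytes, e.g. returned by some library

# Decoding as ASCII fails on the first byte above 127.
try:
    byte_string.decode('ascii')
    decode_failed = False
    bad_position = None
except UnicodeDecodeError as exc:
    decode_failed = True
    bad_position = exc.start      # index of the offending byte

# The fix: decode with the encoding the bytes were actually written in.
text = byte_string.decode('utf8')
```

The position reported by the exception matches the "position 3" in the traceback above: it is the index of byte 0xc3.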

Rule of Thumb

Make sure your code works only with unicode strings internally, decoding str on input and encoding to a particular encoding on output. Learn the libraries you are using, and find the places where they return str. Decode each str before the return value is passed further into your code.
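One way to picture this rule is a function that decodes at its input boundary, works on unicode in the middle, and encodes at its output boundary. This is a sketch under my own assumptions: shout_utf8 is a made-up name, and UTF-8 is assumed for both input and output.

```python
def shout_utf8(raw_bytes, encoding='utf-8'):
    """Upper-case encoded text: decode on input, encode on output."""
    text = raw_bytes.decode(encoding)    # input boundary: bytes -> unicode
    result = text.upper()                # internal work happens on unicode
    return result.encode(encoding)       # output boundary: unicode -> bytes

shouted = shout_utf8(b'caf\xc3\xa9')
```

Working on unicode in the middle matters here: upper-casing the text turns U+00E9 into U+00C9, which a byte-level operation on the UTF-8 data has no reliable way to do.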

I use this helper function in my code:

def force_to_unicode(text):
    "If text is unicode, it is returned as is. If it's str, convert it to Unicode using UTF-8 encoding"
    return text if isinstance(text, unicode) else text.decode('utf8')
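A quick usage sketch of the helper. The function is repeated so the snippet is self-contained. The try/except fallback at the top is my addition, not part of the original helper; it only exists so the snippet also runs on an interpreter that has no unicode builtin.

```python
try:
    unicode                # builtin in Python 2
except NameError:
    unicode = str          # assumption: fallback where `unicode` does not exist

def force_to_unicode(text):
    "If text is unicode, it is returned as is. If it's str, convert it to Unicode using UTF-8 encoding"
    return text if isinstance(text, unicode) else text.decode('utf8')

from_bytes = force_to_unicode(b'caf\xc3\xa9')    # bytes are decoded
from_text = force_to_unicode(u'caf\xe9')         # unicode passes through untouched
twice = force_to_unicode(from_bytes)             # safe to call more than once
```

Because unicode input is returned unchanged, you can call the helper at every boundary without worrying about decoding something twice.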

Source: https://docs.python.org/2/howto/unicode.html

@BloodWind-NexR BloodWind-NexR commented Oct 21, 2016

Good information to me.
Thank you.

@svilella svilella commented Jan 13, 2018

The helper function is very... helpful! Thanks a lot.

@samruben samruben commented Feb 13, 2018

Awesome explanation!

@r-pankevicius r-pankevicius commented Feb 19, 2018

There is Python built-in function unicode() that acts the same as str(), it handles not only string but exceptions... That's for Python 2, but Python 3 doesn't have this function; they have changed unicode handling.

@luisfranciscocesar luisfranciscocesar commented May 2, 2018

not working

@rbravo86 rbravo86 commented May 29, 2018

Thank you very much

@ghost ghost commented Jul 22, 2018

what about this?
Help plss
ansii_problem

@vanwars vanwars commented Aug 15, 2018

This is great. Thank you.

@TSMaitry TSMaitry commented Dec 21, 2018

Wow, it's working. Thank you.
