@judy2k
Last active December 31, 2015 02:09
These are some notes for a potential short talk on Python & Unicode.

Python & Unicode

import codecs

# Plain open() returns a str: raw bytes, not characters.
with open('a_unicode_file.txt', 'r') as f:
    text = f.read()
print text
print 'type:', type(text)       # str is a container for binary data
print 'bytes:', len(text)       # The number of bytes, not characters!
print ' '.join(repr(b) for b in text)
print 'first byte:', text[:1]   # Prints an invalid character: one byte of a multi-byte sequence

try:
    # str.encode() first does an implicit decode('ascii'), which fails
    # here because the first bytes are not valid ASCII.
    print text.encode('utf-8')
except UnicodeDecodeError as e:
    print e

print 'codecs'
print '------'
# codecs.open() decodes while reading, so read() returns unicode.
with codecs.open('a_unicode_file.txt', 'r', 'utf-8') as f:
    text = f.read()
print text
print 'type:', type(text)
print 'chars:', len(text)       # The number of characters this time
print ' '.join(repr(c) for c in text)
print 'first char:', text[:1]   # A whole character

Basics

An encoding is a set of rules for converting sequences of one or more bytes into characters, and back again.

Unicode is not an encoding!

Unicode does not map bytes to characters! Unicode maps each character to a number (its code point), essentially an id for each character.
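
As a rough illustration of "an id for each character", the sketch below looks up a few code points with the built-in ord. No bytes or encodings are involved at this stage. The u'' prefix is there so the snippet means the same thing on Python 2 and Python 3, and the euro sign is written as an escape to sidestep source-file encoding questions.

```python
# A code point is just a number attached to a character.
euro = u'\u20ac'              # EURO SIGN, code point 0x20AC

assert ord(u'A') == 0x41
assert ord(u'B') == 0x42
assert ord(euro) == 0x20AC
assert len(euro) == 1         # one character, however many bytes it later needs

code_points = [hex(ord(ch)) for ch in u'AB' + euro]
print(code_points)            # ['0x41', '0x42', '0x20ac']
```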

UTF-8 is an encoding

UTF-8 is variable width, and is a superset of ASCII. Characters beyond ASCII are represented with 2, 3, or 4 bytes.
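
A quick way to see the variable width is to encode one character from each size class and count the bytes. The sample characters below are my own picks for illustration, not part of the original notes.

```python
# One sample character per UTF-8 width (illustrative choices).
samples = [
    u'A',             # U+0041, plain ASCII
    u'\u00e9',        # U+00E9, e with acute accent
    u'\u20ac',        # U+20AC, euro sign
    u'\U0001f600',    # U+1F600, outside the Basic Multilingual Plane
]

widths = [len(ch.encode('utf-8')) for ch in samples]
assert widths == [1, 2, 3, 4]

# Superset of ASCII: an ASCII character encodes to the same single byte.
assert u'A'.encode('utf-8') == u'A'.encode('ascii') == b'A'
```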

Character  Code point  ASCII  UTF-8
A          0x41        41     41
B          0x42        42     42
€          0x20AC      N/A    E2 82 AC

How does code point 0x20AC become the bytes E2 82 AC?

E2       82       AC
11100010 10000010 10101100
  1. The first byte begins with 1110. This marks the start of a 3-byte sequence: it belongs with the next 2 bytes, each of which begins with 10.
  2. Removing the prefix from each byte leaves 0010 000010 101100. Regrouped into fours, that is 0010 0000 1010 1100 = 0x20AC.
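
The two steps above can be sketched with plain bit operations. This is only a hand-rolled decode of this one 3-byte sequence, to show where the bits go; real code should simply call decode('utf-8'). A bytearray is used so that indexing yields integers on both Python 2 and 3.

```python
raw = bytearray(b'\xe2\x82\xac')      # the three bytes E2 82 AC

# Step 1: check the prefixes (1110 on the lead byte, 10 on the others).
assert raw[0] >> 4 == 0b1110
assert raw[1] >> 6 == 0b10
assert raw[2] >> 6 == 0b10

# Step 2: mask off the prefixes and join the remaining 4 + 6 + 6 bits.
code_point = ((raw[0] & 0x0F) << 12) | ((raw[1] & 0x3F) << 6) | (raw[2] & 0x3F)

assert code_point == 0x20AC
assert raw.decode('utf-8') == u'\u20ac'   # the codec agrees
```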

Other Common Misunderstandings

  • str does not contain text data! str contains binary data, i.e. bytes.
  • Only unicode contains text, i.e. characters.
  • You can't reliably tell the encoding of a text file from its contents; you can only guess.
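
To see why the encoding can only be guessed, note that the same bytes decode without error under several encodings and produce different text each time. The three codecs below are just examples I chose; nothing in the bytes says which one the writer intended.

```python
raw = b'\xe2\x82\xac'                 # three bytes of unknown origin

as_utf8 = raw.decode('utf-8')         # one character: the euro sign
as_latin1 = raw.decode('latin-1')     # three characters, one per byte
as_cp1252 = raw.decode('cp1252')      # three characters, but not the same three

assert as_utf8 == u'\u20ac'
assert as_latin1 == u'\xe2\x82\xac'
assert as_cp1252 == u'\xe2\u201a\xac'

# All three decodes "succeeded", yet they disagree about the text.
assert len(set([as_utf8, as_latin1, as_cp1252])) == 3
```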

Golden Rules

  • Decode everything that comes in
  • Keep everything as unicode inside your program
  • Encode everything as it goes out (UTF-8 is safest)
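
A minimal sketch of the three rules in one place. The function shout and its encoding parameter are hypothetical names made up for this example: bytes come in, bytes go out, and everything in between works on decoded text.

```python
def shout(raw_in, encoding='utf-8'):
    """Hypothetical helper: upper-case some incoming bytes."""
    text = raw_in.decode(encoding)    # 1. decode everything that comes in
    text = text.upper()               # 2. work on characters, not bytes
    return text.encode('utf-8')       # 3. encode as it goes out

# "café" arriving as UTF-8 and as Latin-1 both leave as UTF-8 "CAFÉ".
from_utf8 = shout(b'caf\xc3\xa9')
from_latin1 = shout(b'caf\xe9', encoding='latin-1')

assert from_utf8 == b'CAF\xc3\x89'
assert from_latin1 == b'CAF\xc3\x89'
```

Calling upper() on the undecoded bytes would, at best, leave the accented character untouched, which is the kind of quiet bug the rules are meant to prevent.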

Reading files

  • Use codecs.open to read text files. This will give you unicode.
  • Use open to read binary files. This will give you str.
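
The difference between the two ways of reading can be sketched against a throwaway file. The temporary file is only scaffolding for the example, and binary mode ('rb', 'wb') is spelled out here as an assumption so the snippet behaves the same on Python 2 and Python 3.

```python
import codecs
import os
import tempfile

fd, path = tempfile.mkstemp()         # scratch file for the demonstration
os.close(fd)
try:
    with open(path, 'wb') as f:
        f.write(b'\xe2\x82\xac' + b'5')        # "€5" as UTF-8: 4 bytes

    with codecs.open(path, 'r', 'utf-8') as f:
        text = f.read()                        # decoded characters
    with open(path, 'rb') as f:
        data = f.read()                        # raw bytes
finally:
    os.remove(path)

assert text == u'\u20ac5' and len(text) == 2   # 2 characters
assert data == b'\xe2\x82\xac5' and len(data) == 4   # 4 bytes
```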

Writing files

  • Use codecs.open to write text files. This means you only have to worry about encoding when you open the file for writing.
  • Use open to write binary files. The bytes in any str you write to the file will be written byte-for-byte.
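
Writing works the same way in reverse. In this sketch (again using a temporary file purely as scaffolding) the encoding is named once in codecs.open, unicode goes into write, and the bytes on disk come out encoded.

```python
import codecs
import os
import tempfile

fd, path = tempfile.mkstemp()         # scratch file for the demonstration
os.close(fd)
try:
    # The encoding is chosen once, when the file is opened.
    with codecs.open(path, 'w', 'utf-8') as f:
        f.write(u'\u20ac5')           # write text, not bytes

    # Read the raw bytes back to see what actually landed on disk.
    with open(path, 'rb') as f:
        on_disk = f.read()
finally:
    os.remove(path)

assert on_disk == b'\xe2\x82\xac' + b'5'
```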

Converting between str and unicode

  • str.decode(encoding) -> unicode
  • unicode.encode(encoding) -> str
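
A small round trip shows both directions. The b'' and u'' prefixes stand in for str and unicode respectively, so the sketch also runs on Python 3. The last part shows that encode can fail too, when the target encoding has no representation for a character.

```python
raw = b'\xe2\x82\xac'                 # bytes (str in Python 2)

text = raw.decode('utf-8')            # bytes -> text
assert text == u'\u20ac'
assert not isinstance(text, bytes)

assert text.encode('utf-8') == raw    # text -> bytes, back where we started

# Encoding to something that cannot represent the character raises an error.
try:
    text.encode('ascii')
    encode_failed = False
except UnicodeEncodeError:
    encode_failed = True
assert encode_failed
```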

Don't make this mistake!

str.encode(encoding) implicitly does decode('ascii') first and then encode(encoding). If your str contains non-ASCII bytes, the first step will fail with a UnicodeDecodeError, even though you asked for an encode!
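
The hidden first step can be reproduced by writing it out explicitly, which is what this sketch does so that it behaves the same on Python 2 and Python 3 (where bytes has no encode method at all). It is a stand-in for the implicit behaviour described above, not the interpreter's actual code path.

```python
raw = b'\xe2\x82\xac'                 # non-ASCII bytes

# What str.encode('utf-8') effectively attempts in Python 2:
try:
    raw.decode('ascii').encode('utf-8')
    failed = False
except UnicodeDecodeError:            # a *decode* error from an "encode" call
    failed = True
assert failed

# With ASCII-only data the same chain succeeds, which is why the bug
# tends to stay hidden until real-world input shows up.
assert b'abc'.decode('ascii').encode('utf-8') == b'abc'
```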
