A guide to character encoding aware development
By Branden Clark - https://clark.re
Working with different character encodings is something that I have struggled with for a few years now. Googling for answers usually gives you a solution, but only for that one particular error. I haven't found any "overview of unicode" pages, or "guidelines for working with unicode" posts, so here is my attempt at trying to create a guide for beginners as well as a reference with solutions to common issues.
What are character encodings
Just like everything else computers work with, a computer needs to be able to represent characters in a string as a sequence of bytes. Also, just like everything else, there are a bunch of competing standards for doing so.
For example, here is the "Administrator" string in Russian, encoded with several different encodings.
>>> 'Aдминистратор'.encode('utf-8')
b'A\xd0\xb4\xd0\xbc\xd0\xb8\xd0\xbd\xd0\xb8\xd1\x81\xd1\x82\xd1\x80\xd0\xb0\xd1\x82\xd0\xbe\xd1\x80'
>>> 'Aдминистратор'.encode('utf-16le')
b'A\x004\x04<\x048\x04=\x048\x04A\x04B\x04@\x040\x04B\x04>\x04@\x04'
>>> 'Aдминистратор'.encode('cp1251')
b'A\xe4\xec\xe8\xed\xe8\xf1\xf2\xf0\xe0\xf2\xee\xf0'
>>> 'Aдминистратор'.encode('cp866')
b'A\xa4\xac\xa8\xad\xa8\xe1\xe2\xe0\xa0\xe2\xae\xe0'
Notice how the actual byte values corresponding to the same string are sometimes completely different, and other times quite similar.
Why does the encoding matter
Plenty of high-level editors and languages are smart enough to abstract this away from you. When you open a text file in Notepad++ or Atom, you will likely see "Aдминистратор", regardless of whether the file is encoded in UTF-8 or UTF-16. As humans, as long as we can read the text with our eyes, that is good enough for us, and we don't care how the computer represents it. But what if this file is a configuration file, and needs to be digested and parsed by a language like C or Python?
Following the previous example, I wrote the bytes for "Aдминистратор" to different files and tried to read them as text in Python 3.6.
>>> open('utf8.txt', 'r').read()
'Aдминистратор'
>>> open('utf16.txt', 'r').read()
'A\x004\x04<\x048\x04=\x048\x04A\x04B\x04@\x040\x04B\x04>\x04@\x04'
>>> open('cp1251.txt', 'r').read()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3.6/codecs.py", line 321, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe4 in position 1: invalid continuation byte
>>> open('cp866', 'r').read()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3.6/codecs.py", line 321, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa4 in position 1: invalid start byte
Since I tried opening the files in text mode, Python automatically tries to convert the bytes from the file into a string. Doing this requires an encoding, and in Python 3.6, if an encoding is not passed to open(), the default is locale.getpreferredencoding() [1]. On my system this happens to be UTF-8, but you cannot assume that is the case everywhere.
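As a minimal sketch of the point above, the snippet below writes the example string with one encoding and then reads it back with a matching and a mismatching `encoding` argument. The temporary file and the choice of cp1251 are assumptions made purely for illustration.

```python
import locale
import os
import tempfile

text = 'Aдминистратор'

# The encoding open() falls back to when none is passed (system dependent).
default_encoding = locale.getpreferredencoding(False)

# Write raw bytes in a known encoding, the way another program might have.
fd, path = tempfile.mkstemp(suffix='.txt')
with os.fdopen(fd, 'wb') as f:
    f.write(text.encode('cp1251'))

try:
    # Passing the encoding explicitly removes the dependency on the locale.
    with open(path, 'r', encoding='cp1251') as f:
        explicit = f.read()

    # Decoding the same bytes as UTF-8 fails, as in the traceback above.
    try:
        with open(path, 'r', encoding='utf-8') as f:
            f.read()
        wrong_failed = False
    except UnicodeDecodeError:
        wrong_failed = True
finally:
    os.remove(path)
```

Here `explicit` holds the original string regardless of what `default_encoding` happens to be on the machine running the code.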
Similarly, if I write "Aдминистратор" to a file opened in text mode, Python will use my default encoding to write bytes to the file. If someone with a different default encoding then reads this file in text mode, they could get errors.
>>> open('utf8.txt', 'r').read()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3.6/codecs.py", line 321, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-16-le' codec can't decode byte 0x80 in position 24: truncated data
This person's computer has the default encoding set to utf-16-le, but the file was encoded in UTF-8, causing Python's automatic decode to fail.
Why do these examples work fine with English
If you tried some of the above examples with an English string like "Administrator", you may have noticed they worked fine. Compatibility comes into play here. Many character encodings use the same character-to-byte mapping as ASCII for the byte range 0x00-0x7F [2][3]. This means that a string composed entirely of ASCII characters will likely encode to the same byte values in different character encodings.
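A quick way to see this for yourself is sketched below. The list of encodings is just a sample chosen for illustration; UTF-16 is included as a counterexample because it is not ASCII-compatible.

```python
ascii_text = 'Administrator'

# A handful of ASCII-compatible encodings (sample chosen for illustration).
encodings = ['ascii', 'utf-8', 'cp1251', 'cp866', 'cp1252']
encoded = {name: ascii_text.encode(name) for name in encodings}

# True when every encoding produced exactly the same bytes.
all_same = len(set(encoded.values())) == 1

# UTF-16 uses two bytes per character here, so the bytes differ.
utf16 = ascii_text.encode('utf-16le')
```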
How do I tell what encoding a file is in
You can't, not with any real certainty. A sequence of bytes could potentially be valid for many different character encodings, even if it looks like gibberish to a human. For those who don't take no for an answer, chardet [27] will attempt to detect the character encoding and provide a confidence level.
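To illustrate why detection is guesswork, here is a rough sketch that simply tries a few candidate encodings and keeps the ones that decode without error. The helper name `plausible_encodings` and the candidate list are made up for this example; this is not how chardet works internally.

```python
def plausible_encodings(data, candidates=('utf-8', 'cp1251', 'cp866', 'utf-16le')):
    """Return {encoding: decoded text} for every candidate that decodes cleanly."""
    results = {}
    for name in candidates:
        try:
            results[name] = data.decode(name)
        except UnicodeDecodeError:
            continue  # these bytes are not valid in this encoding
    return results

data = 'Aдминистратор'.encode('cp1251')
matches = plausible_encodings(data)
```

More than one candidate decodes these bytes successfully, but only one of the results is the text that was intended. Nothing in the bytes themselves says which.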
Windows
This is where most of my pain comes from. In a (successful?) attempt to support everything, Microsoft has made it too easy for application developers to write mixed ANSI/Wide code. In some cases mixed code is required for interoperability with other applications, libraries, or services. This, coupled with information on the topic being spread across many different MSDN API pages, some of which are incomplete or just plain wrong, has led to developers improperly handling character encodings in their applications. Thankfully, in recent versions of Windows 10 Microsoft is finally leaning away from ANSI and towards UTF-8.
A vs W APIs
There are a lot of APIs in Windows that come in an "A" and a "W" version (e.g. CreateFileA, CreateFileW). The "A" APIs are not re-implementations of the "W" APIs with a different character encoding. In general, under the hood they convert your string inputs from ANSI to UTF-16, call the "W" API, and then convert string outputs from UTF-16 to ANSI.
ASCII vs ANSI
I have had the difference between the "A" and "W" APIs explained to me by many different developers as "ASCII and Wide" APIs. This is wrong: the "A" stands for ANSI [4], and this distinction is important.
While ANSI is sometimes used to refer to the character encoding of the Latin alphabet [5], the term is used broadly in Windows to refer to the Windows code pages [6]. You can find a list of the Windows code pages on MSDN [7]. Much like our earlier Python example using the "Aдминистратор" string, if you read UTF-8 bytes from a file and pass them to an "A" function you might get an error, or things might "work", but not as you want.
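The convert, call, convert-back pattern can be simulated in Python. The sketch below is only an analogy: `fake_wide_api` and `fake_ansi_api` are hypothetical stand-ins, not real Windows functions, and cp1251 is assumed to be the system ANSI code page.

```python
def fake_wide_api(text):
    """Stand-in for a "W" function: it works on already-decoded text."""
    return 'opened: ' + text

def fake_ansi_api(data, ansi_code_page='cp1251'):
    """Stand-in for an "A" function: convert in, call the wide version, convert out."""
    text = data.decode(ansi_code_page)       # ANSI bytes -> text
    result = fake_wide_api(text)             # the real work happens here
    return result.encode(ansi_code_page)     # text -> ANSI bytes

name = 'Aдминистратор'

# Bytes in the expected code page survive the round trip.
ok = fake_ansi_api(name.encode('cp1251')).decode('cp1251')

# UTF-8 bytes are silently misread as cp1251: no error, just the wrong text.
garbled = fake_ansi_api(name.encode('utf-8')).decode('cp1251')
```

The second call is the "works, but not as you want" case: nothing fails, yet the wide function never sees the intended string.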
What does wide mean
The ANSI code pages are designed to support only a limited character set. In other words, you won't be able to encode Russian characters with the US English code page, or Chinese characters with the Russian code page. This is obviously an issue because we need some way of reading other languages on our screens. Wide characters allow this. When Windows documentation refers to Unicode or Wide characters, it usually means UTF-16 encoded characters [12]. UTF-16 can encode nearly every character from nearly every language.
What is the size of a character
Many will say that ANSI characters are one byte and Wide characters are two bytes. While this is true most of the time, Windows also supports "Double-byte character sets" [11]: ANSI code pages used with the "A" functions where some characters take two bytes to encode. This is to accommodate the large number of characters used by East Asian languages like Japanese and Chinese. In addition, UTF-16 characters are not always two bytes. Even two bytes cannot cover every symbol used by every language, thus "Surrogates and Supplementary Characters" [13] were created and introduced 4-byte (32-bit) characters.
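Python's codecs make these size differences easy to observe. In this sketch, cp932 (Shift-JIS) is picked as an example of a double-byte code page and U+1F600 as an example of a supplementary character; both are choices made for illustration.

```python
def encoded_sizes(text, encoding):
    """Number of bytes each character of text occupies in the given encoding."""
    return [len(ch.encode(encoding)) for ch in text]

# A double-byte ANSI code page: ASCII stays one byte, kana take two.
dbcs_sizes = encoded_sizes('Aあ', 'cp932')

# UTF-16: most characters take two bytes, supplementary characters take four.
utf16_sizes = encoded_sizes('Aд\U0001F600', 'utf-16le')
```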
Thankfully, when trying to determine what size buffer to use to hold a string, we do not need to worry about this. MSDN pages referring to a length parameter will usually describe it as the "number of characters", but they don't mean characters in the sense described above.
Each of these functions takes a length count. For the "ANSI" version of each function, the length is specified as a BYTE count length of a string not including the NULL terminator. For the Unicode function, the length count is the byte count divided by sizeof(WCHAR), which is 2, not including the NULL terminator. [28]
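Put differently, the lengths are counts of storage units rather than user-visible characters. A small sketch of that arithmetic follows; the helper names are invented for this example, and the terminator is left out as in the quote.

```python
def ansi_length(text, code_page):
    """Length as an "A" function counts it: bytes in the given code page."""
    return len(text.encode(code_page))

def wide_length(text):
    """Length as a "W" function counts it: bytes divided by sizeof(WCHAR), i.e. 2."""
    return len(text.encode('utf-16le')) // 2

name = 'Aдминистратор'
ansi_units = ansi_length(name, 'cp1251')
wide_units = wide_length(name)

# One visible character can still need two units.
emoji_units = wide_length('\U0001F600')
kana_units = ansi_length('あ', 'cp932')
```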
Which code page do the A APIs use
The "A" APIs use the system's ANSI code page, which is configured when you change the system locale. Per MSDN, GetACP() [8] will return the current Windows ANSI code page for the system.
Fun note: While MSDN claims the "A" APIs use GetACP() [8], they don't actually ever call GetACP(). They call RtlAnsiStringToUnicodeString(), which references a global in ntdll that contains the current code page. GetACP() returns the value of a different global, this time in kernelbase. In any case, we can hope that Microsoft will update both globals as appropriate.
The console uses a different code page
Unfortunately, Windows does not have just one code page. In addition to the ANSI code page, which is returned by GetACP() [8], there is the OEM code page, which is the code page used by the console. You can get the OEM code page by running chcp in a console:
>chcp
Active code page: 437
Note that it is different from my ANSI code page.
>>> import ctypes
>>> ctypes.windll.kernel32.GetACP()
1252
While this will not cause you too much trouble with US English, other locales can have issues. Back to our Russian example:
>>> 'Aдминистратор'.encode('cp1251') # ANSI code page
b'A\xe4\xec\xe8\xed\xe8\xf1\xf2\xf0\xe0\xf2\xee\xf0'
>>> 'Aдминистратор'.encode('cp866') # OEM code page
b'A\xa4\xac\xa8\xad\xa8\xe1\xe2\xe0\xa0\xe2\xae\xe0'
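The practical consequence is that bytes produced under one code page are misread under the other. The following sketch assumes the Russian pairing shown above (cp1251 as ANSI, cp866 as OEM) and uses plain Python codecs rather than any Windows API.

```python
name = 'Aдминистратор'

# Bytes as an ANSI ("A") API would produce them on a Russian system.
ansi_bytes = name.encode('cp1251')

# What you get if those bytes are interpreted with the console's OEM code page.
as_console_sees_it = ansi_bytes.decode('cp866')

# The fix is an explicit conversion: decode with the source code page,
# then encode with the destination code page.
oem_bytes = ansi_bytes.decode('cp1251').encode('cp866')
```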
This also means that batch (.bat) files must be OEM encoded in order for CMD to execute them properly.
%s and %S format specifiers
Windows provides ANSI and Wide versions of all of its string formatting and printing functions [18]. The following applies to this whole family of functions, including vsnprintf, etc.
Microsoft provides the format specifiers %s and %S for dealing with ANSI and Wide strings. Unfortunately, the meaning changes depending on which function you use:
printf("%s", ansi_string);
printf("%S", wide_string);
wprintf(L"%s", wide_string);
wprintf(L"%S", ansi_string);
This differs from standard C behavior, where %s takes an ANSI string in both printf and wprintf, and %S isn't a valid specifier [25].
What's worse is that the function [19] and format specifier [18] MSDN pages don't deem it relevant to mention that the %s and %S format specifiers don't just "accept" ANSI and Wide strings: they perform a conversion similar to the "A" functions (but using mbstowcs/wcstombs) that depends on the system's configured ANSI code page. Since the ANSI code pages are limited and language/region specific, your string might not convert to or from Wide properly. This conversion behavior is documented for standard C [25].
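To get a feel for what such a conversion can do to your data, here is a loose Python analogy. It does not call the C runtime: `errors='replace'` merely stands in for a lossy Wide-to-ANSI conversion, and the real functions may fail outright instead of substituting characters.

```python
wide_string = 'Aдминистратор'

# Converting to a code page that lacks Cyrillic (US/Western European) loses data.
lossy = wide_string.encode('cp1252', errors='replace')

# Converting to a code page that contains every character is lossless.
intact = wide_string.encode('cp1251', errors='replace')
```

Whether a given string survives therefore depends on the machine's configuration, not on the code.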
These functions do have _l versions that enable you to pass a locale [26] to use for the conversion if the string is not encoded with your system's ANSI code page. However, until recently [20], creating a UTF-8 locale was not supported.
The set of available languages, country/region codes, and code pages includes all those supported by the Win32 NLS API except code pages that require more than two bytes per character, such as UTF-7 and UTF-8. If you provide a code page like UTF-7 or UTF-8, setlocale will fail, returning NULL. [21]
I don't know when this behavior changed, or whether it changed with an SDK version or a Windows version. They just removed all mention of it from the MSDN page.
To avoid confusion between %s and %S, and to be consistent, you can use size prefixes to make it clear whether you expect an ANSI or a Wide string, regardless of whether you are using printf or wprintf [24].
- %hS: always an ANSI string (MSVC extension, not ISO C compatible)
- %wS: always a Wide string (MSVC extension, not ISO C compatible)
- %lS: always a Wide string
MAX_PATH isn't the max
Python
Python has a few "gotchas" as well.
encode vs decode
encode: convert string to bytes
decode: convert bytes to string
In Python 2 it is easy to get confused because the string type str is also the bytes type, and you can call both encode and decode on it, and more than once in a row.
>>> 'Aдминистратор'.encode('utf-8').encode('utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xd0 in position 1: ordinal not in range(128)
While this often results in a UnicodeDecodeError, it doesn't always.
>>> 'banana'.encode('utf-16le').encode('utf-8')
'b\x00a\x00n\x00a\x00n\x00a\x00'
>>> 'banana'.decode('utf-8').decode('utf-8')
u'banana'
Notice how the string we got back from decode is surrounded by u''. This denotes that we got back a unicode object [10]. Since unicode objects store Unicode code points (as UTF-16 or UCS-4 under the hood, depending on how Python was built), they can be used for many different languages. In addition, since unicode is a different type than str, it gives you a way to tell whether you are working with encoded or decoded data.
Thankfully, this confusion is resolved in Python 3 by having a dedicated bytes [30] type, which you can only decode (giving you a str object), and a unified string type str, which you can only encode (giving you a bytes object). The unicode built-in was removed, as its previous purpose is now covered by str.
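A short Python 3 sketch of that separation is shown below. The `hasattr` checks are only there to demonstrate that the "wrong direction" methods no longer exist.

```python
text = 'Aдминистратор'           # str: decoded text
data = text.encode('utf-8')      # bytes: encoded data
round_trip = data.decode('utf-8')

# In Python 3 you cannot decode a str or encode a bytes object.
str_can_decode = hasattr(text, 'decode')
bytes_can_encode = hasattr(data, 'encode')
```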
A side effect of making the encode/decode procedure sane in Python 3 is that you can no longer use .decode('hex') for working with hex strings. Since nobody wants to go through the trouble of importing binascii, the bytes type provides fromhex(), and Python 3.5 added the matching hex() method.
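For reference, a minimal example of working with hex strings in Python 3 using `bytes.fromhex()` and `bytes.hex()`, with the `binascii` equivalent alongside for comparison. The hex string itself is arbitrary sample data.

```python
import binascii

hex_string = 'd0b4d0bc'

# Hex text -> bytes, without importing anything.
raw = bytes.fromhex(hex_string)

# bytes -> hex text (Python 3.5+).
back_to_hex = raw.hex()

# The binascii route gives the same result.
same_as_binascii = binascii.unhexlify(hex_string) == raw
```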
Unfortunately, not everything in Python behaves the same way if an encoding is not specified. Sometimes the default is locale.getpreferredencoding() [15] and other times it is utf-8 [16]. So make sure to double check the docs, and err on the side of specifying the encoding you want.
Universal newlines (a.k.a. "why is Python changing my file")
By default, when you open a file in text mode, Python will "translate" any newlines you read/write to/from os.linesep [1][17]. This means that if you are running on Windows and you write "hello!\n", Python will automatically change this and write "hello!\r\n" instead. New in Python 3, you can disable this behavior for a file by setting newline='' [1]. In Python 2 you must open the file in binary mode.
NOTE: Python automatically opens stdin/stdout/stderr in text mode, which subjects each to universal newline translation.
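The translation can be reproduced on any platform by passing the `newline` argument explicitly. In the sketch below, `newline='\r\n'` is used to imitate what the default does on Windows; the helper name and temporary file are assumptions for the example.

```python
import os
import tempfile

def written_bytes(newline):
    """Write 'hello!\\n' in text mode with the given newline setting; return the raw bytes."""
    fd, path = tempfile.mkstemp()
    os.close(fd)
    try:
        with open(path, 'w', encoding='utf-8', newline=newline) as f:
            f.write('hello!\n')
        with open(path, 'rb') as f:
            return f.read()
    finally:
        os.remove(path)

translated = written_bytes('\r\n')  # imitates the Windows default
untouched = written_bytes('')       # translation disabled
```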
Windows code pages
You can encode and decode in Python with any of the Windows code pages. In general the name is just cp followed by the code page number (e.g. "cp1251"). The codecs module documentation has a complete list [14].
If running on a Windows system, Python aliases "mbcs" to the system ANSI code page for convenience.
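If you want to check at runtime whether a given codec name is available, a small helper along these lines should do. The name `codec_available` is made up for this sketch.

```python
import codecs

def codec_available(name):
    """Return True if Python knows a codec by this name on the current system."""
    try:
        codecs.lookup(name)
        return True
    except LookupError:
        return False

has_cp1251 = codec_available('cp1251')
has_cp866 = codec_available('cp866')
has_mbcs = codec_available('mbcs')  # expected to be True only on Windows

# Aliases resolve to the canonical codec name.
canonical = codecs.lookup('windows-1251').name
```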
utf-16 vs utf-16-le
Some encodings include a Byte-Order Mark (BOM). The BOM is used to indicate the endianness of the character encoding and, when included, comes at the very start of the file (the first two bytes for UTF-16).
little endian: \xff\xfe
big endian: \xfe\xff
In this example, you can see the byte order is swapped between utf-16le and utf-16be, and that on my system utf-16 encodes the same as utf-16le, but includes the little endian BOM at the start.
>>> "Aдминистратор".encode('utf-16le')
b'A\x004\x04<\x048\x04=\x048\x04A\x04B\x04@\x040\x04B\x04>\x04@\x04'
>>> "Aдминистратор".encode('utf-16be')
b'\x00A\x044\x04<\x048\x04=\x048\x04A\x04B\x04@\x040\x04B\x04>\x04@'
>>> "Aдминистратор".encode('utf-16')
b'\xff\xfeA\x004\x04<\x048\x04=\x048\x04A\x04B\x04@\x040\x04B\x04>\x04@\x04'
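The difference matters on the decoding side as well. As a small sketch: the generic utf-16 codec consumes the BOM and uses it to pick the byte order, while the endian-specific codecs treat it as an ordinary character (U+FEFF). The sample string is arbitrary.

```python
import codecs

payload = 'Aд'

le_with_bom = codecs.BOM_UTF16_LE + payload.encode('utf-16le')
be_with_bom = codecs.BOM_UTF16_BE + payload.encode('utf-16be')

# 'utf-16' reads the BOM, selects the byte order, and strips it.
from_le = le_with_bom.decode('utf-16')
from_be = be_with_bom.decode('utf-16')

# 'utf-16le' assumes the byte order and leaves the BOM in the text.
kept_bom = le_with_bom.decode('utf-16le')
```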
subprocess does not support unicode args
This is Windows only and is fixed in Python 3+, but still affects Python 2.7. Under the hood, subprocess calls CreateProcessA. This means that your parameters need to be encoded with the system's ANSI code page, so passing a string encoded differently to subprocess.Popen could fail [29].
As long as your command only contains characters that are valid for the system's ANSI code page, you can work around this issue by encoding your command in the system's ANSI code page:
new_cmd = cmd.encode('mbcs')
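Before relying on that workaround, it may be worth checking that the command is actually representable. The sketch below is written in Python 3 syntax purely to illustrate the check; `encode_for_ansi_api` is a hypothetical helper, and the explicit code page names stand in for 'mbcs', which only exists on Windows.

```python
def encode_for_ansi_api(cmd, code_page):
    """Return cmd encoded in code_page, or None if some character cannot be represented."""
    try:
        return cmd.encode(code_page)
    except UnicodeEncodeError:
        return None

cmd = 'echo Aдминистратор'

# Representable on a system whose ANSI code page is cp1251 (Russian).
russian_system = encode_for_ansi_api(cmd, 'cp1251')

# Not representable on a system whose ANSI code page is cp1252 (US English).
english_system = encode_for_ansi_api(cmd, 'cp1252')
```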
General guidelines
- Try to stick to one encoding for the main logic of your program
- Avoid language/region specific encodings for your primary encoding
- Use either UTF-8 or UTF-16 and convert where necessary
- Minimize points where encoding conversions need to take place
- On Windows, use Wide/UTF-16 and the "W" functions.
- Be explicit wherever you can
- Open binary files in binary mode
- Pass the 'encoding' parameter
- Document the encoding of any inputs/outputs
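Taken together, the guidelines suggest a shape like the following sketch: decode once at the input boundary, keep the main logic on decoded text, and encode once at the output boundary. All function names here are invented for illustration, and the cp1251 input is just an example of a legacy source.

```python
def load_config(raw, encoding):
    """Input boundary: bytes in a documented encoding -> str."""
    return raw.decode(encoding)

def process(text):
    """Main logic: only ever sees decoded text."""
    return text.strip()

def save_config(text, encoding='utf-8'):
    """Output boundary: str -> bytes in one explicit, documented encoding."""
    return text.encode(encoding)

# Input arrives in a legacy code page; output is always UTF-8.
incoming = ' Aдминистратор\n'.encode('cp1251')
outgoing = save_config(process(load_config(incoming, 'cp1251')))
```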
This work is licensed under a Creative Commons Attribution 4.0 International License.