A guide to character encoding aware development

By Branden Clark - https://clark.re

Intro

What are character encodings
Why does the encoding matter
Why do these examples work fine with English
How do I tell what encoding a file is in


Windows

A vs W APIs

ASCII vs ANSI


What does wide mean
What is the size of a character
Which code page do the A APIs use
The console uses a different code page
%s and %S format specifiers
MAX_PATH isn't the max


Python

encode vs decode
Default encodings
Universal newlines
Windows code pages
utf-16 vs utf-16-le
subprocess does not support unicode args


General guidelines
References
License

Intro

Working with different character encodings is something that I have struggled with for a few years now. Googling for answers usually gives you a solution, but only for that one particular error. I haven't found any "overview of unicode" pages, or "guidelines for working with unicode" posts, so here is my attempt at trying to create a guide for beginners as well as a reference with solutions to common issues.
What are character encodings

Just like everything else computers work with, a computer needs to be able to represent characters in a string as a sequence of bytes. Also, just like everything else, there are a bunch of competing standards for doing so.
For example, here is the "Administrator" string in Russian, encoded with several different encodings.
>>> 'Aдминистратор'.encode('utf-8')
b'A\xd0\xb4\xd0\xbc\xd0\xb8\xd0\xbd\xd0\xb8\xd1\x81\xd1\x82\xd1\x80\xd0\xb0\xd1\x82\xd0\xbe\xd1\x80'
>>> 'Aдминистратор'.encode('utf-16le')
b'A\x004\x04<\x048\x04=\x048\x04A\x04B\x04@\x040\x04B\x04>\x04@\x04'
>>> 'Aдминистратор'.encode('cp1251')
b'A\xe4\xec\xe8\xed\xe8\xf1\xf2\xf0\xe0\xf2\xee\xf0'
>>> 'Aдминистратор'.encode('cp866')
b'A\xa4\xac\xa8\xad\xa8\xe1\xe2\xe0\xa0\xe2\xae\xe0'
Notice how the actual byte values corresponding to the same string are sometimes completely different, and other times quite similar.
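The sketch below is an illustration added to show the other direction: the same bytes decoded with the right and the wrong code page. The variable names are arbitrary.

```python
# Decode bytes with the encoding they were written in, and with a wrong one.
text = 'Aдминистратор'
cp1251_bytes = text.encode('cp1251')

right = cp1251_bytes.decode('cp1251')   # same encoding: round-trips cleanly
wrong = cp1251_bytes.decode('cp866')    # wrong code page: no error, but garbage

print(right == text)
print(wrong == text)
print(wrong)
```

Decoding with the wrong single-byte code page does not raise an error here, which is what makes this kind of bug easy to miss.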
Why does the encoding matter

Plenty of high level editors and languages are smart enough to abstract this away from you. When you open a text file in Notepad++ or Atom, you will likely see "Aдминистратор", regardless of whether the file is encoded in UTF-8 or UTF-16. As humans, as long as we can read the text with our eyes, that is good enough for us, and we don't care how the computer represents it. But what if this file is a configuration file, and needs to be digested and parsed by a language like C or Python?
Following the previous example, I wrote the bytes for "Aдминистратор" to different files and tried to read them as text in Python 3.6.
>>> open('utf8.txt', 'r').read()
'Aдминистратор'
>>> open('utf16.txt', 'r').read()
'A\x004\x04<\x048\x04=\x048\x04A\x04B\x04@\x040\x04B\x04>\x04@\x04'
>>> open('cp1251.txt', 'r').read()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3.6/codecs.py", line 321, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe4 in position 1: invalid continuation byte
>>> open('cp866.txt', 'r').read()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3.6/codecs.py", line 321, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa4 in position 1: invalid start byte
Since I tried opening the files in text mode, Python automatically tries to convert the bytes from the file into a string. Doing this requires an encoding, and for Python 3.6, if an encoding is not passed to open(), the default is locale.getpreferredencoding()¹. On my system this happens to be UTF-8, but you cannot assume that is the case everywhere.
Similarly, if I write "Aдминистратор" to a file opened with text mode, it will use my default encoding to write bytes to the file. If someone with a different default encoding then reads this file in text mode they could get errors.
>>> open('utf8.txt', 'r').read()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3.6/codecs.py", line 321, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-16-le' codec can't decode byte 0x80 in position 24: truncated data
This person's computer has the default encoding set to utf-16-le but the file was encoded in UTF-8, causing Python's automatic decode to fail.
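One way to avoid depending on the reader's default is to pass the encoding explicitly on both sides. The following is a minimal sketch; the temporary directory and file name are only there to keep the example self-contained.

```python
import os
import tempfile

text = 'Aдминистратор'

# Write and read with an explicit encoding instead of the locale default.
path = os.path.join(tempfile.mkdtemp(), 'utf8.txt')
with open(path, 'w', encoding='utf-8') as f:
    f.write(text)

with open(path, 'r', encoding='utf-8') as f:
    round_tripped = f.read()

# Binary mode skips decoding entirely and returns the raw bytes.
with open(path, 'rb') as f:
    raw = f.read()

print(round_tripped == text)
print(raw == text.encode('utf-8'))
```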
Why do these examples work fine with English

If you tried some of the above examples with an English string like "Administrator", you may have noticed they worked fine. Compatibility is once again at play here. Many character encodings use the same byte values as ASCII for the range 0x00-0x7F²,³. This means that a string composed entirely of ASCII characters will likely encode to the same byte values in different character encodings.
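A quick check of this claim, as a sketch: the list of encodings is just a sample chosen for illustration, and UTF-16 is included to show that not every encoding is ASCII-compatible.

```python
ascii_text = 'Administrator'

# Several ASCII-compatible encodings produce identical bytes for ASCII text.
encodings = ['ascii', 'utf-8', 'cp1251', 'cp866', 'cp1252']
encoded = {name: ascii_text.encode(name) for name in encodings}
all_same = len(set(encoded.values())) == 1

# UTF-16 is not ASCII-compatible: every character takes two bytes.
utf16 = ascii_text.encode('utf-16-le')

print(all_same)
print(utf16)
```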
How do I tell what encoding a file is in

You can't, not with any real certainty. A sequence of bytes could potentially be valid for many different character encodings, even if it looks like gibberish to a human. For those who don't take no for an answer, chardet²⁷ will attempt to detect the character encoding and provide a confidence level.
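To see why certainty is out of reach, the sketch below tries a handful of candidate encodings against the same bytes using only the standard library. The function name plausible_encodings and the candidate list are made up for this example; this is not how chardet works internally.

```python
def plausible_encodings(data, candidates):
    """Return the candidate encodings that decode `data` without error."""
    matches = []
    for name in candidates:
        try:
            data.decode(name)
        except UnicodeDecodeError:
            continue
        matches.append(name)
    return matches

data = 'Aдминистратор'.encode('cp1251')
print(plausible_encodings(data, ['utf-8', 'cp1251', 'cp866', 'latin-1']))
```

More than one candidate decodes cleanly, so a successful decode alone does not tell you which encoding was intended.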
Windows

This is where most of my pain comes from. In a (successful?) attempt to support everything, Microsoft has made it too easy for application developers to write mixed ANSI/Wide code. In some cases mixed code is required for interoperability with other applications, libraries, or services. This, coupled with information on the topic being spread across many different MSDN API pages, some of which are incomplete or just plain wrong, has led to developers improperly handling character encodings in their applications. Thankfully, in recent versions of Windows 10 Microsoft is finally leaning away from ANSI and towards UTF-8.
A vs W APIs

There are a lot of APIs in Windows that come in an "A" and a "W" version (e.g. CreateFileA, CreateFileW). The "A" APIs are not re-implementations of the "W" APIs with a different character encoding. In general, under the hood they convert your string inputs from ANSI to UTF-16, call the "W" API, and then convert string outputs from UTF-16 to ANSI.
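The convert, call, convert-back pattern can be modeled in a few lines. This is only a conceptual sketch: make_a_version and to_upper_w are hypothetical names, a Python str stands in for the UTF-16 string, and a codec name stands in for the ANSI code page. It does not call any real Windows API.

```python
def make_a_version(w_function, ansi_code_page):
    """Build a hypothetical "A" wrapper around a str-in, str-out "W" function."""
    def a_function(ansi_bytes):
        wide = ansi_bytes.decode(ansi_code_page)   # ANSI -> Unicode
        result = w_function(wide)                  # the "W" function does the work
        return result.encode(ansi_code_page)       # Unicode -> ANSI
    return a_function

def to_upper_w(text):
    # Stand-in for a real "W" API; it only upper-cases its input.
    return text.upper()

to_upper_a = make_a_version(to_upper_w, 'cp1251')
output = to_upper_a('Aдминистратор'.encode('cp1251'))
print(output.decode('cp1251'))
```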
ASCII vs ANSI

I have had the difference between the "A" and "W" APIs explained to me by many different developers as "ASCII and Wide" APIs. This is wrong: the "A" stands for ANSI⁴, and this distinction is important.
While ANSI is sometimes used to refer to the character encoding of the Latin alphabet⁵, the term is used broadly in Windows to refer to the Windows code pages⁶. You can find a list of the Windows code pages on MSDN⁷. Much like our earlier Python example using the "Aдминистратор" string, if you read UTF-8 bytes from a file and pass them to an "A" function you might get an error, or things might "work", but not as you want.
What does wide mean

The ANSI code pages are designed to support only a limited character set. In other words, you won't be able to encode Russian characters with the US English code page, or Chinese characters with the Russian code page. This is obviously an issue because we need some way of reading other languages on our screens. Wide characters allow this. When Windows documentation refers to Unicode or Wide characters they usually mean UTF-16 encoded characters¹². UTF-16 can encode nearly every character from nearly every language.
What is the size of a character

Many will say that ANSI characters are one byte and Wide characters are two bytes. While this is true most of the time, Windows also supports "Double-byte character sets"¹¹, ANSI code pages used with the "A" functions where some characters take two bytes to encode. This is to accommodate the large number of characters used by East Asian languages like Japanese and Chinese. In addition, UTF-16 characters are not always two bytes. Even two bytes cannot cover every symbol used by every language, thus "Surrogates and Supplementary Characters"¹³ were created and introduced 4 byte (32-bit) characters.
Thankfully, when trying to determine what size buffer to use to hold a string we do not need to worry about this. MSDN pages referring to a length parameter will usually describe it as the "number of characters", but they don't mean characters in the sense described above.

Each of these functions takes a length count. For the "ANSI" version of each function, the length is specified as a BYTE count length of a string not including the NULL terminator. For the Unicode function, the length count is the byte count divided by sizeof(WCHAR), which is 2, not including the NULL terminator.²⁸
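The byte counts discussed above can be checked from Python. In this sketch, cp932 (a Japanese double-byte code page) and the helper name wchar_count are choices made for illustration; U+1F600 is used as an example of a supplementary character.

```python
def wchar_count(text):
    """Number of 2-byte WCHAR units needed for `text`, excluding the terminator."""
    return len(text.encode('utf-16-le')) // 2

emoji = '\U0001F600'  # a supplementary character outside the 16-bit range

print(len('A'.encode('cp932')))        # ASCII letter in a double-byte code page
print(len('あ'.encode('cp932')))       # Japanese character in the same code page
print(len('д'.encode('utf-16-le')))    # ordinary UTF-16 character
print(len(emoji.encode('utf-16-le')))  # surrogate pair
print(wchar_count(emoji))              # length in WCHAR units, not in characters
```

The last line is the count a "W" length parameter expects: code units, not user-visible characters.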

Which code page do the A APIs use

The "A" APIs use the system's ANSI code page, which is configured when you change the system locale. Per MSDN, GetACP()⁸ will return the current Windows ANSI code page for the system.
Fun note: While MSDN claims the "A" APIs use GetACP()⁸, they don't actually ever call GetACP(). They call RtlAnsiStringToUnicodeString() which references a global in ntdll that contains the current code page. GetACP() returns the value of a different global, this time in kernelbase. In any case, we can hope that Microsoft will update both globals as appropriate.
The console uses a different code page

Unfortunately, Windows does not have just one code page. In addition to the ANSI code page, which is returned by GetACP()⁸, there is the OEM code page, which is used by the console and returned by GetOEMCP()⁹.
You can also get the OEM code page by running chcp in a console:
>chcp
Active code page: 437
Note that it is different from my ANSI code page.
>>> import ctypes
>>> ctypes.windll.kernel32.GetACP()
1252
While this will not cause you too much trouble with US English, other locales can have issues. Back to our Russian example:
>>> 'Aдминистратор'.encode('cp1251') # ANSI code page
b'A\xe4\xec\xe8\xed\xe8\xf1\xf2\xf0\xe0\xf2\xee\xf0'
>>> 'Aдминистратор'.encode('cp866') # OEM code page
b'A\xa4\xac\xa8\xad\xa8\xe1\xe2\xe0\xa0\xe2\xae\xe0'
This also means that batch (.bat) files must be OEM encoded in order for CMD to execute them properly.
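As a sketch of what that implies when generating a batch file from Python: the helper name write_batch_file is made up, and the OEM code page is hard-coded to cp866 as an assumption for a Russian system. On a real Windows machine you would look it up, for example with GetOEMCP().

```python
import os
import tempfile

def write_batch_file(path, script, oem_code_page):
    """Write `script` as a .bat file encoded in the given OEM code page."""
    # newline='\r\n' writes Windows line endings regardless of the host OS.
    with open(path, 'w', encoding=oem_code_page, newline='\r\n') as f:
        f.write(script)

path = os.path.join(tempfile.mkdtemp(), 'hello.bat')
write_batch_file(path, '@echo off\necho Aдминистратор\n', 'cp866')

# Read the file back in binary mode to inspect the bytes on disk.
with open(path, 'rb') as f:
    batch_bytes = f.read()
print(batch_bytes)
```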
%s and %S format specifiers

Windows provides ANSI and Wide versions of all their string formatting and printing functions¹⁸. The following applies to this whole family of functions, including vsnprintf, etc.
Microsoft provides the format specifiers %s and %S for dealing with ANSI and Wide strings. Unfortunately the meaning changes depending on which function you use:
printf("%s", ansi_string);
printf("%S", wide_string);
wprintf(L"%s", wide_string);
wprintf(L"%S", ansi_string);
This differs from standard C behavior where %s takes an ANSI string in both printf and wprintf, and %S isn't a valid specifier²⁵.
What's worse, neither the function¹⁹ nor the format specifier¹⁸ MSDN pages mention that the %s and %S format specifiers don't just "accept" ANSI and Wide strings: they perform a conversion similar to the "A" functions (but using mbstowcs/wcstombs) that depends on the system's configured ANSI code page. Since the ANSI code pages are limited and language/region specific, your string might not convert to or from Wide properly. This conversion behavior is documented for standard C²⁵.
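The effect of a code-page-dependent conversion can be imitated with Python's codecs. This is only a stand-in for the idea: it does not call the C runtime, and how the CRT itself reports or handles an unconvertible character may differ from what Python does here.

```python
wide_string = 'Aдминистратор'

# A Russian ANSI code page can represent every character in the string.
as_cp1251 = wide_string.encode('cp1251')

# A Western European code page cannot, so a strict conversion fails...
try:
    wide_string.encode('cp1252')
    converted = True
except UnicodeEncodeError:
    converted = False

# ...and a lossy conversion substitutes '?' for each unsupported character.
lossy = wide_string.encode('cp1252', errors='replace')

print(converted)
print(lossy)
```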
These functions do have _l versions that enable you to pass a locale²⁶ to use for conversion if the string is not encoded with your system's ANSI code page. However, until recently²⁰ creating a UTF-8 locale was not supported.

The set of available languages, country/region codes, and code pages includes all those supported by the Win32 NLS API except code pages that require more than two bytes per character, such as UTF-7 and UTF-8. If you provide a code page like UTF-7 or UTF-8, setlocale will fail, returning NULL.²¹

I don't know when this behavior changed or if it changed with an SDK version or a Windows version. They just removed all mention of it from the MSDN page.
To avoid confusion between %s and %S, and to be consistent, you can use size prefixes to make it clear you expect an ANSI vs a Wide string, regardless of whether you are using printf or wprintf²⁴.

%hs, %hS - always an ANSI string (MSVC extension, not ISO C compatible)
%ws, %wS - always a Wide string (MSVC extension, not ISO C compatible)
%ls, %lS - always a Wide string

MAX_PATH isn't the max

By now many of us are used to assuming MAX_PATH (260) is a good character limit for file paths on Windows. This limit is expanded to 32,767 characters for paths given to the Unicode ("W") functions when the path uses the "\\?\" prefix²²,²³.
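Per the Windows file naming documentation²³, extended-length paths are requested by prefixing the path with "\\?\" (or "\\?\UNC\" for network shares). The helper below is a hypothetical sketch of adding that prefix. It assumes the input is already an absolute, fully qualified path using backslashes, since Windows does not normalize paths that carry the prefix.

```python
def to_extended_length_path(path):
    r"""Prefix an absolute Windows path with \\?\ (hypothetical helper)."""
    if path.startswith('\\\\?\\'):
        return path                         # already extended-length
    if path.startswith('\\\\'):
        # UNC path: \\server\share\... becomes \\?\UNC\server\share\...
        return '\\\\?\\UNC\\' + path[2:]
    # Drive path: C:\dir\file becomes \\?\C:\dir\file
    return '\\\\?\\' + path

print(to_extended_length_path('C:\\dir\\file.txt'))
print(to_extended_length_path('\\\\server\\share\\file.txt'))
```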
Python

Python has a few "gotchas" as well.
encode vs decode

In short:
encode: convert string to bytes
decode: convert bytes to string

In Python 2 it is easy to get confused because the string type str is also the bytes type, and you can call both encode and decode on it, and more than once in a row.
>>> 'Aдминистратор'.encode('utf-8').encode('utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xd0 in position 1: ordinal not in range(128)
While this often results in a UnicodeDecodeError, it doesn't always.
>>> 'banana'.encode('utf-16le').encode('utf-8')
'b\x00a\x00n\x00a\x00n\x00a\x00'
>>> 'banana'.decode('utf-8').decode('utf-8')
u'banana'
Notice how the string we got back from decode is surrounded by u''. This denotes that we got back a unicode object¹⁰. Since unicode objects store Unicode code points rather than bytes in one particular code page (internally as UTF-16 or UCS-4, depending on how Python was built), they can also be used for many different languages. In addition, since unicode is a different type than str it gives you a way to differentiate if you are working with encoded or decoded data.
Thankfully, this confusion is resolved in Python 3 by having a dedicated bytes³⁰ type which you can only decode (giving you a str object), and a unified string type str which you can only encode (giving you a bytes object). The unicode built-in was removed, as its previous purpose is now covered by str.
A side effect of making the encode/decode procedure sane in Python 3 is you can no longer use .encode('hex') or .decode('hex') for working with hex strings. Since nobody wants to go through the trouble of importing binascii, the bytes type provides fromhex(), and Python 3.5 added hex() to go with it³¹.
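A short sketch of the Python 3 flow, encoding a str to bytes and then converting those bytes to and from a hex string (requires Python 3.5 or later for hex()):

```python
raw = 'д'.encode('utf-8')        # str -> bytes
hex_text = raw.hex()             # bytes -> hex string
back = bytes.fromhex(hex_text)   # hex string -> bytes
decoded = back.decode('utf-8')   # bytes -> str

print(hex_text)
print(decoded)
```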
Default encodings

Unfortunately, not everything in Python behaves the same way if an encoding is not specified. Sometimes the default is locale.getpreferredencoding()¹⁵ and other times it is utf-8¹⁶. So make sure to double check the docs and err on the side of specifying the encoding you want.
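The two defaults can be compared directly. This sketch only inspects them; the value printed for the locale encoding depends on the machine it runs on.

```python
import locale

text = 'Aдминистратор'

# The locale-dependent default; the value varies between systems.
print(locale.getpreferredencoding(False))

# str.encode() and bytes.decode() default to UTF-8 regardless of locale.
default_encoded = text.encode()
print(default_encoded == text.encode('utf-8'))
print(default_encoded.decode() == text)
```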
Universal newlines

(a.k.a. "why is python changing my file")
By default, when you open a file in text mode Python will "translate" newlines: line endings you read are converted to "\n", and any "\n" you write is converted to os.linesep¹,¹⁷. This means that if you are running on Windows and you write "hello!\n", Python will automatically change this and instead write "hello!\r\n". New in Python 3, you can disable this behavior for a file by setting newline=''¹. In Python 2 you must open the file in binary mode.
NOTE: Python automatically opens stdin/stdout/stderr in text mode, which subjects each to universal newline translation.
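The effect of the newline parameter on writing can be observed by reading the file back in binary mode. A minimal sketch, where written_bytes is a helper name made up for this example:

```python
import os
import tempfile

def written_bytes(newline):
    """Write 'hello!' plus a newline in text mode and return the raw bytes on disk."""
    path = os.path.join(tempfile.mkdtemp(), 'out.txt')
    with open(path, 'w', encoding='utf-8', newline=newline) as f:
        f.write('hello!\n')
    with open(path, 'rb') as f:
        return f.read()

print(written_bytes(''))       # no translation
print(written_bytes('\r\n'))   # "\n" is always written as CRLF
print(written_bytes(None))     # default: "\n" becomes os.linesep
```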
Windows code pages

You can encode and decode in Python with any of the Windows code pages. In general the name is just cp followed by the code page number (e.g. "cp1251"). The codecs module documentation has a complete list¹⁴.
If running on a Windows system, Python aliases "mbcs" to the system ANSI code page for convenience.
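Turning a code page number (for example one returned by GetACP()) into a codec name can be sketched as follows. The helper codec_name_for_code_page is hypothetical, and it assumes the plain "cp" plus number naming holds for the code page in question, which the codecs list¹⁴ should confirm.

```python
import codecs

def codec_name_for_code_page(code_page):
    """Map a Windows code page number to a Python codec name (hypothetical helper)."""
    name = 'cp{}'.format(code_page)
    codecs.lookup(name)   # raises LookupError if Python has no such codec
    return name

ansi_codec = codec_name_for_code_page(1251)
print(ansi_codec)
print(b'\xe4'.decode(ansi_codec))
```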
utf-16 vs utf-16-le

Some encodings include a Byte-Order Mark (BOM). The BOM is used to indicate the endianness of the character encoding, and when included in a UTF-16 file it will be the first two bytes of the file.
little endian: \xff\xfe
big endian: \xfe\xff

In this example, you can see the byte order is swapped between utf-16le and utf-16be, and that on my system utf-16 encodes the same as utf-16le, but includes the little endian BOM at the start.
>>> "Aдминистратор".encode('utf-16le')
b'A\x004\x04<\x048\x04=\x048\x04A\x04B\x04@\x040\x04B\x04>\x04@\x04'
>>> "Aдминистратор".encode('utf-16be')
b'\x00A\x044\x04<\x048\x04=\x048\x04A\x04B\x04@\x040\x04B\x04>\x04@'
>>> "Aдминистратор".encode('utf-16')
b'\xff\xfeA\x004\x04<\x048\x04=\x048\x04A\x04B\x04@\x040\x04B\x04>\x04@\x04'
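The difference also matters when decoding. In this sketch the BOM is added by hand with codecs.BOM_UTF16_LE so the result does not depend on the byte order of the machine running it.

```python
import codecs

text = 'Aдминистратор'
data = codecs.BOM_UTF16_LE + text.encode('utf-16-le')

# 'utf-16' reads the BOM to pick the byte order, then drops it.
with_bom_codec = data.decode('utf-16')

# 'utf-16-le' treats the BOM as an ordinary character (U+FEFF).
without_bom_codec = data.decode('utf-16-le')

print(with_bom_codec == text)
print(without_bom_codec == '\ufeff' + text)
```

If a decoded string unexpectedly starts with U+FEFF, an endian-specific codec was probably used on data that carried a BOM.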
subprocess does not support unicode args

This is Windows only and is fixed in Python 3+, but still affects Python 2.7. Under the hood, subprocess calls CreateProcessA. This means that your parameters need to be encoded with the system's ANSI code page, so passing a string encoded differently to subprocess.Popen could fail²⁹.
As long as your command only contains characters that are valid for the system's ANSI code page, you can work around this issue by encoding your command in the system's ANSI code page:
new_cmd = cmd.encode('mbcs')
General guidelines


Try to stick to one encoding for the main logic of your program

Avoid language/region specific encodings for your primary encoding
Use either UTF-8 or UTF-16 and convert where necessary


Minimize points where encoding conversions need to take place
On Windows, use Wide/UTF-16 and the "W" functions.
Be explicit wherever you can

Open binary files in binary mode
Pass the 'encoding' parameter
Document the encoding of any inputs/outputs


References

[1] https://docs.python.org/3.6/library/functions.html#open
[2] https://linux.die.net/man/7/utf8
[3] https://docs.microsoft.com/en-us/cpp/text/locales-and-code-pages?view=vs-2019
[4] https://docs.microsoft.com/en-us/windows/win32/learnwin32/working-with-strings
[5] https://en.wikipedia.org/wiki/ANSI_character_set
[6] https://docs.microsoft.com/en-us/windows/win32/intl/code-pages
[7] https://docs.microsoft.com/en-us/windows/win32/intl/code-page-identifiers
[8] https://docs.microsoft.com/en-us/windows/win32/api/winnls/nf-winnls-getacp
[9] https://docs.microsoft.com/en-us/windows/win32/api/winnls/nf-winnls-getoemcp
[10] https://docs.python.org/2/library/functions.html#unicode
[11] https://docs.microsoft.com/en-us/windows/win32/intl/double-byte-character-sets
[12] https://docs.microsoft.com/en-us/windows/win32/intl/unicode
[13] https://docs.microsoft.com/en-us/windows/win32/intl/surrogates-and-supplementary-characters
[14] https://docs.python.org/3/library/codecs.html#standard-encodings
[15] https://docs.python.org/3/library/locale.html#locale.getpreferredencoding
[16] https://docs.python.org/3/library/codecs.html?highlight=encode#codecs.encode
[17] https://docs.python.org/3/library/os.html#os.linesep
[18] https://docs.microsoft.com/en-us/cpp/c-runtime-library/format-specification-syntax-printf-and-wprintf-functions?view=vs-2019#type-field-characters
[19] https://docs.microsoft.com/en-us/cpp/c-runtime-library/reference/printf-printf-l-wprintf-wprintf-l?view=vs-2019
[20] https://github.com/MicrosoftDocs/cpp-docs/issues/1469
[21] https://docs.microsoft.com/en-us/previous-versions/visualstudio/visual-studio-2010/x99tb11d(v%3Dvs.100)
[22] https://docs.microsoft.com/en-us/cpp/c-runtime-library/path-field-limits?view=vs-2019
[23] https://docs.microsoft.com/en-us/windows/win32/fileio/naming-a-file?redirectedfrom=MSDN#maximum-path-length-limitation
[24] https://docs.microsoft.com/en-us/cpp/c-runtime-library/format-specification-syntax-printf-and-wprintf-functions?view=vs-2019#size-prefixes-for-printf-and-wprintf-format-type-specifiers
[25] https://en.cppreference.com/w/c/io/fwprintf
[26] https://docs.microsoft.com/en-us/cpp/c-runtime-library/reference/create-locale-wcreate-locale?view=vs-2019
[27] https://pypi.org/project/chardet/
[28] https://docs.microsoft.com/en-us/windows/win32/gdi/specifying-length-of-text-output-string
[29] https://bugs.python.org/issue19264
[30] https://docs.python.org/3/library/stdtypes.html#bytes
[31] https://docs.python.org/3/library/stdtypes.html#bytes.fromhex
License


This work is licensed under a Creative Commons Attribution 4.0 International License.