Created January 29, 2019 13:46
UTF-8 vs UTF-16; No One True Encoding

There has long been a consensus that UTF-8 is the right choice for files and networks, while UTF-16 can be slightly better for in-memory processing. However, the addition of emoji characters to Unicode weakens the UTF-16 arguments.

One typical argument for UTF-16 is that it is more efficient for Asian text, because most Asian characters take 2 bytes each instead of 3–4 bytes in UTF-8. People have already pointed out that typical files also contain lots of ASCII metadata (e.g. HTML tags/CSS/JS), so the savings usually cancel out.
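The size claim is easy to verify; a quick Python check of the encoded byte lengths:

```python
# A typical Han character (in the BMP) is smaller in UTF-16...
cjk = "漢"
assert len(cjk.encode("utf-8")) == 3      # 3 bytes in UTF-8
assert len(cjk.encode("utf-16-le")) == 2  # 2 bytes in UTF-16

# ...but ASCII metadata around the text doubles in size under UTF-16.
markup = "<p></p>"
assert len(markup.encode("utf-8")) == 7       # 1 byte per character
assert len(markup.encode("utf-16-le")) == 14  # 2 bytes per character
```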

The Old Advantage Is Gone

The real advantage, in my opinion, is that UTF-16 can largely be used as UCS-2, its 2-bytes-only predecessor. Many people would disagree and call that wrong. However, the characters that require 4 bytes in UTF-16 are mostly dead characters that most people will never use, so they deserve some extra handling instead of complicating normal strings. There are so many dead characters, and many are not in the current Unicode standard yet, because their value is so low while every character carries a cost.
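The UCS-2 boundary is the Basic Multilingual Plane: code points up to U+FFFF fit in one UTF-16 code unit, and only code points beyond that need a surrogate pair. A small sketch (the helper name is mine):

```python
def utf16_units(ch: str) -> int:
    """Number of UTF-16 code units needed for a single code point."""
    # BMP code points (<= U+FFFF) are one unit; anything above
    # needs a surrogate pair, i.e. two units.
    return 1 if ord(ch) <= 0xFFFF else 2

assert utf16_units("A") == 1       # ASCII
assert utf16_units("漢") == 1      # CJK, still inside the BMP
assert utf16_units("\U00020000") == 2  # U+20000, a rarely used Han character
```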

Average developers may not even know the complexities of encoding. Instead, they just enjoy coding in high-level languages based on UTF-16, like JavaScript. Even when they write code with the wrong assumption that a char is a character, they may never receive a bug report in practice, unless QA tests the quirky corners of Unicode.

However, emoji characters change this: they are used far more frequently, and they take 4 bytes. So developers must add them to their test cases, and they must learn how the encoding works.
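A concrete example: a common emoji sits outside the BMP, so it costs 4 bytes in both encodings, and UTF-16-based languages (JavaScript, Java) report its string length as 2:

```python
emoji = "😀"  # U+1F600, outside the BMP
assert len(emoji.encode("utf-8")) == 4      # 4 bytes in UTF-8
assert len(emoji.encode("utf-16-le")) == 4  # surrogate pair in UTF-16
# What a UTF-16 language would report as the string length:
assert len(emoji.encode("utf-16-le")) // 2 == 2
# Python's own str counts code points, so it says 1:
assert len(emoji) == 1
```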

UTF-16 still has a slight in-memory processing advantage, because most characters are still a single code unit, whereas UTF-8 forces you to scan over all the bytes of a character much more often. Emoji characters will erode this advantage a bit as well.
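The scanning cost can be made concrete: finding the n-th character in UTF-8 bytes requires walking from the start and skipping continuation bytes, since characters have no fixed width. A minimal sketch (the function name is mine):

```python
def nth_codepoint_utf8(data: bytes, n: int) -> str:
    """Return the n-th (0-based) character of a UTF-8 byte string.
    We cannot jump straight to an index: continuation bytes
    (0b10xxxxxx) must be skipped one by one from the start."""
    i = 0
    for _ in range(n):
        i += 1
        while i < len(data) and (data[i] & 0xC0) == 0x80:
            i += 1
    start = i
    i += 1
    while i < len(data) and (data[i] & 0xC0) == 0x80:
        i += 1
    return data[start:i].decode("utf-8")

text = "a漢b".encode("utf-8")  # 5 bytes for 3 characters
assert nth_codepoint_utf8(text, 1) == "漢"
assert nth_codepoint_utf8(text, 2) == "b"
```

In UCS-2 (or UTF-16 without surrogates) the same lookup is a constant-time array index, which is the in-memory advantage the paragraph describes.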

The Win of UTF-8 Is an Illusion

So more and more people think UTF-8 is the one true encoding, especially since it is compatible with ASCII. This is especially true for C/C++. However, that is more because C/C++ are really terrible at text processing unless you use third-party libraries. wchar_t is pretty much useless on Linux, while char is basically a byte. Most of the standard library string functions treat a byte as a character, so they do not work with multi-byte UTF-8 characters. So you pretend to be working with UTF-8 while in fact working with bytes. However, not all byte strings are valid UTF-8, so compatibility issues come up every so often.
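Python's bytes type behaves like a C char*, which makes the failure mode easy to demonstrate: length counts bytes, and byte-oriented slicing can cut a character in half, producing data that is no longer valid UTF-8:

```python
s = "日本"                 # 2 characters
b = s.encode("utf-8")      # 6 bytes
assert len(b) == 6         # strlen-style length sees bytes, not characters

broken = b[:4]             # byte-oriented truncation, as C string code does
try:
    broken.decode("utf-8")
    valid = True
except UnicodeDecodeError:
    valid = False
assert not valid           # the tail is no longer valid UTF-8
```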

Python 3 tried hard to hide all the Unicode complexity behind a string type, and to differentiate between strings and bytes. However, Linux inherits the C issues and simply hands Python bytes where strings are expected, which causes errors like invalid strings and gives Python 3's string design a bad name.

The varying length of a character also introduces issues for fixed-length buffers, especially in databases. For a 4-byte UTF-8 column, we can no longer say it stores 1–4 characters. Instead, it may hold 4, 2, or only 1 character, depending on the input. The typical solution is to reserve enough bytes for the maximum length, which wastes space and has some performance cost.
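A quick sketch of the capacity problem for a hypothetical 4-byte column:

```python
BUF = 4  # a hypothetical 4-byte column

def fits(s: str) -> bool:
    """Does the UTF-8 encoding of s fit in the fixed-size buffer?"""
    return len(s.encode("utf-8")) <= BUF

assert fits("abcd")      # 4 ASCII characters
assert fits("éé")        # 2 two-byte characters
assert fits("😀")        # just 1 four-byte emoji
assert not fits("漢漢")  # 2 three-byte characters = 6 bytes, too big
```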


So, after about 20 years, string encoding is still hard. The world will not unify on a single encoding any time soon.
