Created
February 12, 2019 13:54
UTF-16 = Dumbest Thing Ever
UTF-16 is the dumbest thing ever. It's the kind of thing only a committee could love.
All strings should be stored as UTF-8.
Supposedly UTF-16 encodes every character as 2 bytes, so that, unlike UTF-8, a string can be
indexed directly without having to read the whole string.
Except for those pesky supplementary (astral) plane characters. Which you can't possibly hope to
avoid, especially considering that most EMOJI are astral plane characters, requiring FOUR BYTES
to store in UTF-16. OR IN UTF-8! So you still have to parse the whole string to find a character
index because of those pesky surrogate pairs. Advantage nullified.
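
To make the surrogate-pair point concrete, here is a minimal Python sketch. The helper names
utf16_code_units and code_unit_offset are made up for this illustration and are not part of any
library; only str.encode and ord are standard.

```python
# Sketch: byte and code-unit counts for an astral-plane character, and why
# finding a character's position in UTF-16 still needs a scan.

def utf16_code_units(text: str) -> int:
    """Number of 16-bit code units needed to store text as UTF-16."""
    return len(text.encode("utf-16-le")) // 2


def code_unit_offset(text: str, char_index: int) -> int:
    """UTF-16 code-unit offset of the character at char_index.

    Walks every preceding character, because each one takes either one
    code unit (BMP) or two (a surrogate pair).
    """
    return sum(2 if ord(ch) > 0xFFFF else 1 for ch in text[:char_index])


emoji = "\U0001F600"  # U+1F600 GRINNING FACE, an astral-plane character
sample = "a" + emoji + "b"

print(len(emoji.encode("utf-16-le")))  # 4 bytes: one surrogate pair
print(len(emoji.encode("utf-8")))      # 4 bytes as well
print(utf16_code_units(sample))        # 4 code units for 3 characters
print(code_unit_offset(sample, 2))     # 3: 'b' is not at code unit 2
```

The offset of a character depends on everything before it, which is exactly the scan that
fixed-width indexing was supposed to avoid.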
Furthermore, UTF-16 is UTF-16, except when it isn't. Because UTF-16 strings are stored
differently on little-endian and big-endian machines! Everyone loves to ignore big-endian
architectures these days, but still, who knows what you're going to get? You don't want your
program to crash and burn, or quietly spit out garbage, just because someone fed it a
big-endian UTF-16 string, and someone most definitely will try.
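
A small Python sketch of the byte-order problem. The decode_utf16 helper and its little-endian
default are assumptions made for illustration; the codecs.BOM_UTF16_LE and codecs.BOM_UTF16_BE
constants are from the standard library.

```python
import codecs

text = "hi"
le = text.encode("utf-16-le")  # b'h\x00i\x00'
be = text.encode("utf-16-be")  # b'\x00h\x00i'


def decode_utf16(data: bytes, default: str = "utf-16-le") -> str:
    """Decode UTF-16 using the BOM if present, else fall back to a default byte order."""
    if data.startswith(codecs.BOM_UTF16_LE):
        return data[2:].decode("utf-16-le")
    if data.startswith(codecs.BOM_UTF16_BE):
        return data[2:].decode("utf-16-be")
    return data.decode(default)  # no BOM: this is only a guess


print(le)
print(be)
# Wrong byte order raises no error here; it just yields different characters.
print(ascii(be.decode("utf-16-le")))           # '\u6800\u6900'
print(decode_utf16(codecs.BOM_UTF16_BE + be))  # hi
```

Without a byte order mark the decoder can only guess, and a wrong guess on this input produces
valid-looking CJK characters rather than an error.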
And UTF-16 strings can't be handled in code the way the plain ASCII text of this file can,
because anything in the ASCII range comes out full of null bytes, so we have to rewrite all our
string processing code that normally handles ASCII or UTF-8 just to handle UTF-16!
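
The null-byte problem can be shown in a few lines of Python. The c_strlen function is a
hypothetical stand-in for how C-style, null-terminated string code sees a buffer; it is not a
real library call.

```python
def c_strlen(data: bytes) -> int:
    """Mimic C's strlen: count bytes up to the first null byte."""
    end = data.find(b"\x00")
    return len(data) if end == -1 else end


text = "hello"
as_utf8 = text.encode("utf-8")
as_utf16 = text.encode("utf-16-le")

print(as_utf16)                         # b'h\x00e\x00l\x00l\x00o\x00'
print(c_strlen(as_utf8))                # 5
print(c_strlen(as_utf16))               # 1: the string seems to end after 'h'
print(as_utf8 == text.encode("ascii"))  # True: ASCII text is already valid UTF-8
```

Byte-oriented code written for ASCII keeps working on UTF-8 input, while the same code sees
the UTF-16 buffer as a one-character string.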
So in summary, to the responsible programmer, UTF-16 has all the disadvantages of UTF-8,
extra disadvantages of its own (byte order and embedded null bytes), and none of the advantages.