@paulyc
Created February 12, 2019 13:54
UTF-16 = Dumbest Thing Ever
UTF-16 is the dumbest thing ever. It's the kind of thing only a committee could love.
All strings should be stored as UTF-8.
Supposedly UTF-16 encodes all characters as 2 bytes, so that unlike UTF-8, a string can be
easily indexed without having to read the whole string.
Except for those pesky supplementary ("astral") plane characters. Which you can't possibly hope to
avoid, especially considering that most EMOJI are astral plane characters, requiring FOUR BYTES
to store in UTF-16. OR IN UTF-8! So you still have to parse the whole string to find a character
index due to those pesky surrogate pairs. Advantage nullified.
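To make the surrogate pair point concrete, here is a minimal Python sketch. The sample string and the helper name code_unit_index are made up for illustration; only the standard str.encode call is relied on.

```python
# A three-character string whose middle character (U+1F600, an emoji)
# lives outside the Basic Multilingual Plane.
text = "a\U0001F600b"

# Encode without a byte order mark so only the characters are counted.
emoji_utf16_bytes = len("\U0001F600".encode("utf-16-le"))
emoji_utf8_bytes = len("\U0001F600".encode("utf-8"))


def code_unit_index(s: str, char_index: int) -> int:
    """Hypothetical helper: UTF-16 code unit offset of the character at
    char_index. It has to walk every character before that position."""
    units = 0
    for ch in s[:char_index]:
        # Anything above U+FFFF takes a surrogate pair (two 16-bit units).
        units += 2 if ord(ch) > 0xFFFF else 1
    return units


print(emoji_utf16_bytes, emoji_utf8_bytes)  # 4 4
print(code_unit_index(text, 2))             # 3, not 2: "b" starts at unit 3
```

The character at index 2 does not start at code unit 2, so constant-time indexing only works if you are certain the string has no surrogate pairs.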
Furthermore, UTF-16 is UTF-16, except when it isn't. Because UTF-16 comes in two byte orders,
UTF-16LE and UTF-16BE, usually matching whatever machine wrote the string, and the byte order
mark that would tell you which one you have is optional! Everyone loves to ignore big-endian
architectures these days, but still, who knows what you're going to get? You don't want your
program to crash and burn just because someone fed it a big-endian UTF-16 string,
and someone most definitely will try.
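A rough Python sketch of what goes wrong when the byte order is guessed incorrectly. The sample string is arbitrary; the codec names and codecs.BOM_UTF16_BE are from the standard library.

```python
import codecs

text = "hi"

le = text.encode("utf-16-le")  # bytes 68 00 69 00
be = text.encode("utf-16-be")  # bytes 00 68 00 69

# Decoding big-endian bytes as little-endian does not fail here.
# It silently produces two unrelated CJK characters (U+6800, U+6900).
wrong = be.decode("utf-16-le")

# With a byte order mark in front, the generic "utf-16" codec can
# work out the byte order by itself.
with_bom = (codecs.BOM_UTF16_BE + be).decode("utf-16")

print(le != be)         # True: same text, different bytes
print(wrong == text)    # False
print(with_bom == text) # True
```

In this case the wrong guess gives garbage rather than an exception, which is arguably worse than a crash.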
And UTF-16 strings can't be handled in code the way the plain ASCII text of this file can,
because they are invariably full of null bytes (every ASCII-range character is padded with
a 0x00 byte), so we have to rewrite all our string processing code that normally handles
ASCII or UTF-8 just to handle UTF-16!
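The null byte problem is easy to see in a short Python sketch. The c_view variable only imitates what a NUL-terminated C string routine would see; it is an illustration, not a real C call.

```python
utf16 = "hello".encode("utf-16-le")
utf8 = "hello".encode("utf-8")

# Every ASCII-range character picks up a 0x00 byte in UTF-16.
nul_count_utf16 = utf16.count(b"\x00")
nul_count_utf8 = utf8.count(b"\x00")

# Imitation of a NUL-terminated reader: it stops at the first 0x00,
# so it sees only the first character's low byte.
c_view = utf16.split(b"\x00", 1)[0]

print(nul_count_utf16, nul_count_utf8)  # 5 0
print(c_view)                           # b'h'
```

The UTF-8 bytes for the same text are identical to ASCII, which is why byte-oriented code keeps working on them.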
So in summary, to the responsible programmer, UTF-16 has all the disadvantages of UTF-8,
a couple of extra disadvantages of its own, and none of the advantages.