When Strings Attack

The Enduring Legacy of a Very Naughty Hack from 1995 / A Bit of Intl

This is my first attempt at a tech talk thing. The focus here is strings in ES. First we’ll talk about what a string is in ES, then some of the consequences that fall out of that (a few aren’t super obvious), and finally some of the cool tools available for working with strings safely & effectively. Because of the nature of the subject we’ll also get into some more general information about Unicode, encodings and broad concepts related to strings and language.

Caveat: I’m hardly an expert in this stuff, so there will likely be some details I describe inaccurately, but hopefully the truth:lies ratio is pretty good.

What is a string in ES

So, one common way to think of strings is as lists of characters. But that’s not what they are in ES (nor most languages). Another common way to think of strings is as lists of codepoints. That’s also not what they are in ES.

They’re lists of code units. Specifically, UInt16 code units that are almost, but not quite, the code units of the UTF16 encoding scheme.

I’ll step back a moment though and provide some definitions, cause the terminology here is probably unfamiliar if you aren’t super into encodings and stuff.

Simple primer first — we’ve got three tiers of abstraction here:

  • character: very human concept, think like ... the "idea" of the letter "A"
  • codepoint: an underlying integer that "backs" characters (not always 1:1)
  • code unit: an underlying integer specific to an encoding scheme that "backs" codepoints (not always 1:1). usually these are not something you need to be aware of when handling strings in a high level language
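
To make those tiers concrete, here’s a quick example we’ll be better equipped to appreciate by the end: a Swedish flag emoji is one character, two codepoints, and four uint16 code units.

const flag = '🇸🇪';

// two codepoints (a pair of regional indicator symbols)...
assert([ ...flag ].length === 2);

// ...four uint16 code units
assert(flag.length === 4);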

Let’s start going deeper with character. This is actually the hardest one.

What is a character

Characters are super abstract. In Unicode, one character usually maps to one codepoint ('a', 0x61), but sometimes multiple codepoints can make one character (like when combining marks are used with letters), and sometimes one codepoint can correspond to multiple characters (think ligature codepoints like ﬃ, or German ß, which uppercases to "SS").

Now, keep in mind that Unicode is not an encoding. We’re not talking about encoding yet. Unicode is a standard that defines a lot of different stuff, and what we’re talking about here is the mappings between codepoints and characters. Very few programming languages will have you work with real characters as such. Most applications don’t really need to worry about characters, and most of the time, one character maps directly to one codepoint, or it just doesn’t matter much, and that’s the end of the story. But take for example the letter ä. There’s this character, a lowercase a with a dieresis floating on top, and actually there’s not just one way to describe this character with codepoints.

The obvious way is via the codepoint 0xE4. That one codepoint maps to this character.

A less obvious way is via the codepoint sequence 0x61 0x308: ä. It’s the same character, but we "made it" with two different codepoint representations. The second one uses a combining mark. You also see this stuff in action when you use emojis with skin color modifiers. Those are one character, two codepoints. We’ll come back to this stuff later a bit. This terminology has nothing to do with ES, but I wanted to start clarifying these words.
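
Jumping ahead a little into JS syntax just to make that concrete:

const precomposed = '\u00E4';       // ä as one codepoint
const combining   = '\u0061\u0308'; // ä as 'a' plus a combining diaeresis

// They render identically, but they are different codepoint sequences:
assert(precomposed !== combining);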

In programming, when we say "character" we usually are speaking very loosely. We often really mean something like "string value of one codepoint". Until we get to the section about canonicalization, we won’t worry too much about this.

What is a codepoint

So, I mentioned before, codepoints aren’t part of any particular encoding. These are numbers that describe either single characters or parts of composed characters or in a few cases one even describes multiple characters (really).

All the integers from 0x0 to 0x10FFFF are "codepoints" in Unicode, but not all of these are valid codepoints. There are two ranges within there that are "blacked out", plus a handful of special cases where it gets fuzzy, like the noncharacters 0xFFFE and 0xFFFF, which exist partly so byte order marks can be detected. To be valid, though, a codepoint does not have to have a defined meaning.

The term for a "valid" codepoint — one which may appear in a well-formed Unicode string — is "Unicode scalar value".
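
As a quick sketch in code, the rule boils down to this (the excluded span covers both of the blacked-out ranges mentioned above; we’ll meet them again shortly):

const isScalarValue = cp =>
  cp >= 0x0 && cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF);

assert(isScalarValue(0x1F4A9) === true);
assert(isScalarValue(0xD83D) === false);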

In really nice modern languages, strings are lists of codepoints. But ES isn’t that modern.

WTF is a code unit

This is where we get to encoding-specific stuff. There are many ways to encode Unicode (utf8, utf16le, utf32, etc) or subsets of Unicode (ucs2, etc). The most common is UTF8. UTF8 is very efficient and clean. It’s usually the most compact way to represent Unicode text. But from a programming perspective, it’s sort of quirky. It doesn’t map neatly to how we like to pretend strings are simpler than they are. Every codepoint over 0x7F has to be described using 2 to 4 bytes. Unless you’re absolutely certain the text is constrained to ASCII codepoints, you can’t just iterate over bytes and get "characters". You always have to decode.

Each encoding has its own set of code units, which are fixed-length words. In UTF8, that length is a single byte, though some possible byte values (0xC0, 0xC1, and 0xF5 through 0xFF) never appear in valid UTF8. So UTF8 has 243 code units.
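
For instance, with a TextEncoder handy (available in browsers and modern node), you can watch the byte counts change:

const utf8 = new TextEncoder();

utf8.encode('a');  // -> Uint8Array [ 0x61 ]                   (1 byte)
utf8.encode('ä');  // -> Uint8Array [ 0xC3, 0xA4 ]             (2 bytes)
utf8.encode('💩'); // -> Uint8Array [ 0xF0, 0x9F, 0x92, 0xA9 ] (4 bytes)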

UTF32 is an interesting case. In UTF32, the word length is four bytes. That’s actually big enough to describe every codepoint in Unicode without any shenanigans. It’s just 1:1. That makes it awesome for programming actually. In a utf32 byte sequence, every four byte word is gonna just be a codepoint, no decoding needed (but you still need sanity checks for whether it’s a valid codepoint). The thing is, intuitive as this is, it’s kind of crazy to use a uint32 for every codepoint when the vast majority of text out there is gonna be composed of codepoints below 0x80 or at least below 0x10000.

Caring about how many bytes are in play was even more important in 1995. That’s when ES was invented. And at this point, the codepoints beyond the BMP (basic multilingual plane — 0x0 to 0xFFFF) were basically hypothetical. No software supported them, and the first codepoints eventually assigned up there were for shit like Byzantine musical symbols.

So Eich took a shortcut (which is not unique to ES). He went for the shittiest of all full Unicode encodings, UTF16. In UTF16, all codepoints in the BMP map to one two-byte word. So at the time, that meant it effectively had those nice 1:1 properties from UTF32. You could access a string by index like "foo"[0] and you’d probably never think about what it really meant. Pseudocode:

function getCharAt(underlyingBuffer, index) {
  // read the uint16 code unit at this index and wrap it back up in a string
  return String.fromCharCode(underlyingBuffer.readUInt16LE(index * 2));
}

Fast forward to 2009. There’s lots of stuff above the BMP now, in the SMP (supplemental multilingual plane) — even scripts used by a few obscure but living languages. But who cares? Nobody programming, anyway. None of this seems to matter yet.

Okay, fast forward one more year. It’s October of 2010 and...

EMOJI 👏 LAND 👏 IN 👏 THE 👏 SMP.

OH NO WE CARE WE CARE SO MUCCCCCH 🍆🍆🍆.

The consequences

You can see this early shortcut "leak" in two ways in ES today.

The first is that String.prototype.length abuses the shortcut by reporting the number of code units (pseudocode: buffer.length / 2), not the number of codepoints or characters:

assert("💩".length === 2);

The second comes into play with indexed string access and methods that do stuff like slicing strings. The indices point to these two-byte code units, not to codepoints. Code units don’t have string values. They are the symbols that get decoded to get codepoints, not the codepoints themselves; it just so happens that for all Unicode scalar values below 0x10000, the code units used to represent them are the same as their own values.

So ... what do we get?

assert("💩"[0].length === 1); // hmm
assert("💩".charCodeAt(0) === 0xD83D); // WHO ARE YOU???
assert("💩".charCodeAt(1) === 0xDCA9); // WHAT DO U WANT

If we log "💩"[0] or "💩"[1] we’ll see "�", the replacement character that gets shown for values which aren’t valid on their own. Those two items are code units that are not also Unicode scalar values — the mask is off, they were all code units all along!

Code units of what though? As mentioned earlier, code units are the symbols that appear in a specific encoding. In this case these two are code units of UTF16 specifically: the high and low surrogates. A pair of these together gets decoded to a single codepoint in UTF16. This is the crux of the quirk: ES permits addressing and even handling, as strings, sequences which use surrogate code units (an internal facet of the UTF16 encoding) as if they were scalar value codepoints / characters. This is an abstraction leak.
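
Just to demystify the pairing, this is the arithmetic UTF16 uses to put those two code units back together (you’d never do this by hand; it just shows there’s no magic):

const hi = 0xD83D;
const lo = 0xDCA9;
const codepoint = (hi - 0xD800) * 0x400 + (lo - 0xDC00) + 0x10000;

assert(codepoint === 0x1F4A9);
assert(String.fromCodePoint(codepoint) === '💩');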

Excepting cases where you know in advance that a string will be constrained to codepoints that can be described with a single code unit in UTF16, we should be handling strings quite differently than we may be used to. ES2015 introduced the tools to do so.

How can we deal with this

Well, first off I’m gonna point out that annoying as this is, it’s not unusual. A lot of languages have similar quirks and it’s actually kind of a miracle that JS is natively Unicode (or Unicode-ish) to begin with, given it’s 22 years old. So much of the shitstorm around Python 3 concerned them finally fixing the incredibly broken strings from Python <= 2 and people were like waaaah but I love having all these bugs and casually alienating users who don’t speak English by accident waaah. (Python 3 does strings excellently now I think, they cleaned up hardcore.)

Because ES cannot break the web, it’s hard to fix something this systemically deep entirely. So indexed access is still unsafe, as are charAt and charCodeAt. I mentioned earlier that it’s okay to use these if you’re totally confident about the nature of the string content, but personally I’d recommend just treating them as deprecated — we have better ways now.

I’m going to use the word "character" a bit loosely now, to refer to the string value of a given codepoint. This is what we usually think of when programming.

Iterating strings

ES2015 introduced iterators, which are a natural fit for cleaning up strings. Any operation which accesses String.prototype[Symbol.iterator] will correctly yield whole codepoints (well, their string values), not code units!

const str = '💩💩💩';

let i = 0;

for (const char of str) {
  assert(char === '💩');
  i++;
}

assert(i === 3);

Getting a string’s length

Same dealie!

assert(Array.from(str).length === 3);

// or

assert([ ...str ].length === 3);

Accessing a character by its index in the string

Same dealie!

assert([ ...str ][0] === '💩');

But if we do want the first char and the string is long, we may want to take a different approach. Above, we iterate over the whole string, but we really only need to pump the iterator one time:

assert(str[Symbol.iterator]().next().value === '💩');

Accessing a codepoint by its index in the string

This one is a bit different. On one hand it’s simple: swap out charCodeAt and use codePointAt. But the index you pass in is still a code unit index. Kind of lame, but it’s maybe understandable that they didn’t want to introduce an opaquely different concept of indexing in strings. If you try to access a codepoint at an index which points to the second code unit of a surrogate pair, it will return the surrogate code unit value instead, just like charCodeAt.

assert(str.codePointAt(0) === 0x1F4A9);

...but watch out for:

assert(str.codePointAt(1) === 0xDCA9);

For this reason you typically don’t want to use it this way, though you could if you advanced the index by 2 rather than 1 whenever you see a codepoint over 0xFFFF. However in the most common cases for accessing codepoints, you want all of them — so you can just do this:

const codepoints = [ ...str ].map(char => char.codePointAt(0));

// Or better yet, skipping the intermediate array:

const codepoints = Uint32Array.from(str, char => char.codePointAt(0));

That will be safe, but similar issues come up with things like slice(); these take code unit indices. So when figuring out what the indices are, keep that in mind.
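
For example, a slice boundary can land in the middle of a surrogate pair:

assert('💩💩'.slice(0, 2) === '💩');
assert('💩💩'.slice(0, 1) === '\uD83D'); // oops: a lone high surrogate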

Intl and string comparison

Alright, so let’s move on to something a bit different.

A classic "wat" people point to with js is something like the following:

[ 1, 2, 11, 22, 'abattoir', 'YOLO' ].sort();
// -> [ 1, 11, 2, 22, 'YOLO', 'abattoir' ]

The story here is pretty simple: the default sort is a string comparison. It’s a bit silly — numeric comparison is surely the more commonly useful case. The string comparisons in question are the same results you would get from using > or < to compare strings.

If we put aside the odd choice of default type comparison, there are still other aspects of the result which may be unintuitive, but they are not a wat. In fact this result is very much the only correct answer possible without additional context, and any assumptions about such a context would be PURE, HIGH OCTANE EVIL.

Specifically, the default string comparison is code unit order (loosely, byte order; the distinction is opaque to us here). The reason you can’t get 'abattoir' before 'YOLO' out of the box is that collation order is not an intrinsic property of a codepoint. It is a property of languages.
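
Concretely, the only reason 'YOLO' wins by default is that uppercase Latin letters have lower code unit values than lowercase ones:

assert('Y'.charCodeAt(0) === 0x59);
assert('a'.charCodeAt(0) === 0x61);
assert('YOLO' < 'abattoir');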

In other words, anything other than byte order would suck because it would either be nondeterministic (user locale) or default to, probably, English (colonialism is shitty).

Take, for example, the following array of strings:

const words = [ '1', '11', '2', '22', 'rövhål', 'Rugburn' ];

What order should these be in? The question makes no sense if you can’t say what language’s collation order to use. And there are a lot of languages, so this could be tricky ... but fortunately, since ES 5.1, it hasn’t been.

const english = new Intl.Collator('en');
const swedish = new Intl.Collator('sv');
words.sort(english.compare); // -> ["1", "11", "2", "22", "rövhål", "Rugburn"]
words.sort(swedish.compare); // -> ["1", "11", "2", "22", "Rugburn", "rövhål"]

So, cool, we have correct collation for alphabetic words per language, but we still have those annoying numbers. We can fix that, though.

const english = new Intl.Collator('en', { numeric: true });
words.sort(english.compare); // -> ["1", "2", "11", "22", "rövhål", "Rugburn"]

In fact there are quite a few additional options, because collation is a much more complex concern than may be obvious.
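
A couple of the other knobs, just as a taste (exact results may vary a little with an implementation’s locale data):

const caseless = new Intl.Collator('en', { sensitivity: 'base' });
assert(caseless.compare('a', 'ä') === 0); // accents (and case) ignored at base sensitivity

const upperFirst = new Intl.Collator('en', { caseFirst: 'upper' });
[ 'a', 'A' ].sort(upperFirst.compare); // -> ["A", "a"]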

About Intl

Many people don’t know about the Intl object, but it is not a DOM API — it is an intrinsic namespace object in ES like JSON and Math. Like the rest of ES, it is governed by TC39, but it has a distinct spec (ECMA-402, as opposed to ECMA-262) which is not tied to annual ES versions.

It also provides functionality for date and time formatting and numeric formatting. Because Intl is an unusual case in terms of size (it needs complete knowledge of Unicode, among other things), implementations of ES are permitted to omit it. Fortunately, the last holdout, iOS Safari, now supports it too, but to be safe you should still typically wrap Intl functionality in a try-catch with a "less smart" simple fallback in the catch block.
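
Something like this, for example (sortWords is just a hypothetical wrapper for illustration):

function sortWords(words, locale) {
  try {
    const collator = new Intl.Collator(locale, { numeric: true });
    return [ ...words ].sort(collator.compare);
  } catch (err) {
    // the "less smart" fallback: plain code unit ordering
    return [ ...words ].sort();
  }
}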

Bonus round: String canonicalization

I said earlier that we would come back to "characters", how they can be created with different codepoint sequences. Unicode explicitly defines how these alternative representations of character identities work, because it’s sometimes desirable to normalize strings so that two strings don’t use different representations for the same content, e.g. when deduplicating.

ES2015 introduced this as String.prototype.normalize(). It takes an argument indicating whether you want to normalize to canonical compositions (i.e. prefer, where available, precomposed single-codepoint forms) or canonical decompositions (prefer decomposed forms built from base characters plus combining marks). There are two other options for special backwards compatibility needs, but those are lossy transformations which you would rarely want. Usually you want the default, 'NFC' (canonical composition):

assert('\u00E4'.normalize() === '\u0061\u0308'.normalize());
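
Spelling the forms out explicitly:

assert('\u00E4'.normalize('NFD') === '\u0061\u0308');
assert('\u00E4'.normalize('NFD').length === 2); // decomposition: two codepoints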

Some additional info on encodings

If you’re curious about more implementation details and how this plays out in node:

From the perspective of an ES program, strings don’t "have" an encoding, so to speak, but the abstraction leaks described above imply an encoding which is like UCS2 except with surrogates permitted (i.e., all values from 0x0 to 0xFFFF are legal, and values from 0x10000 to 0x10FFFF are just "phenomena" that emerge when the string is viewed or accessed via ES2015 interfaces).

However in practice, this is not how engines actually encode strings internally, and even if it were, real world needs (where strings leave ES) mean there has to be a way to transcode the illegal Unicode into legal Unicode ... or at least into other invalid encodings.

Almost all text today is exchanged in UTF-8, so what emerged from this is a pseudo-standard called WTF-8.* For any valid Unicode string, WTF-8 and UTF-8 are identical: SMP codepoints are still encoded in the normal UTF-8 manner (UTF-8 does not use surrogates as code units; the code units in UTF8 are each a single byte — surrogates are code units only in UTF-16 and CESU-8). However for an ES-style string which may contain unpaired surrogates as if they were Unicode scalar values on their own, WTF-8 permits encoding these occurrences in the regular UTF-8 fashion (that is, they would become three byte sequences beginning with 0xED, which is technically not UTF-8).

That said, WTF-8 is mainly intended for internal interchange; you wouldn’t want to expose it, since it’s (potentially) garbage. Rather one normally would convert illegal unpaired surrogates to 0xFFFD (the famous question-mark-in-a-diamond character). This is indeed what, for example, node’s Buffer will do when it creates buffers from strings. If you need to actually write such strings as-is to a file, you’d have to do so in userland, e.g. via a custom WTF-8 encoder. But you would not want to do that, really.
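
You can see that replacement behavior directly in node:

const lonely = '\uD83D'; // an unpaired high surrogate

Buffer.from(lonely, 'utf8'); // -> <Buffer ef bf bd> (the UTF-8 bytes of U+FFFD)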

* At one point WTF-8 was used to mean something else, a joke proposal inspired by the (now fortunately uncommon) problem of agents presuming everything is in oldskool CP-1252 and dumping mojibake.

Although not super relevant to the above, I mention it because it’s classic. Some example passages:

  1. Terminology

The key words "WHAT", "DAMNIT", "GOOD GRIEF", "FOR HEAVEN'S SAKE", "RIDICULOUS", "BLOODY HELL", and "DIE IN A GREAT BIG CHEMICAL FIRE" in this memo are to be interpreted as described in [RFC2119].

For example, code that reads cuneiform text encoded in UTF-16 ignoring the surrogate pairs and the byte order mark, then writes out the 16-bit numbers in UTF-8 thereby making the previous data illegible. The justification for allowing such incompetent code was that there were no major implementations of the Unicode supplementary planes and no significant amounts of data containing bronze age writing. The issue has been dubbed the "Babylonian mess", and the relevant programmers have pledged to produce different bugs in the future.

  1. Security Considerations

Implementers of WTF-8 should not consider the security aspects of how they handle character data. After all, it is inconceivable that in any circumstances an attacker would be able to exploit an incautious parser by sending it an octet sequence.
