When Strings Attack

The Enduring Legacy of a Very Naughty Hack from 1995 / A Bit of Intl

This is my first attempt at a tech talk thing. The focus here is strings in ES. First we’ll talk about what a string is in ES, then some of the consequences that fall out of that (a few aren’t super obvious), and finally some of the cool tools available for working with strings safely & effectively. Because of the nature of the subject we’ll also get into some more general information about Unicode, encodings and broad concepts related to strings and language.

Caveat: I’m hardly an expert in this stuff, so there will likely be some details I describe inaccurately, but hopefully the truth:lies ratio is pretty good.

What is a string in ES

So, one common way to think of strings is as lists of characters. But that’s not what they are in ES (nor most languages). Another common way to think of strings is as lists of codepoints. That’s also not what they are in ES.

They’re lists of code units. Specifically, UInt16 code units that are almost, but not quite, the code units of the UTF16 encoding scheme.

I’ll step back a moment though and provide some definitions, cause the terminology here is probably unfamiliar if you aren’t super into encodings and stuff.

Simple primer first — we’ve got three tiers of abstraction here:

  • character: very human concept, think like ... the "idea" of the letter "A"
  • codepoint: an underlying integer that "backs" characters (not always 1:1)
  • code unit: an underlying integer specific to an encoding scheme that "backs" codepoints (not always 1:1). usually these are not something you need to be aware of when handling strings in a high level language
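
To make those tiers concrete, here’s a quick example we’ll be better equipped to appreciate by the end: a Swedish flag emoji is one character, two codepoints, and four uint16 code units.

const flag = '🇸🇪';

// two codepoints (a pair of regional indicator symbols)...
assert([ ...flag ].length === 2);

// ...four uint16 code units
assert(flag.length === 4);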

Let’s start going deeper with character. This is actually the hardest one.

What is a character

Characters are super abstract. In Unicode, one character usually maps to one codepoint ('a', 0x61), but sometimes multiple codepoints can make one character (like when combining marks are used with letters), and sometimes one codepoint can correspond to multiple characters (think ligature codepoints like ﬃ, or German ß, which uppercases to "SS").

Now, keep in mind that Unicode is not an encoding. We’re not talking about encoding yet. Unicode is a standard that defines a lot of different stuff, and what we’re talking about here is the mappings between codepoints and characters. Very few programming languages will have you work with real characters as such. Most applications don’t really need to worry about characters, and most of the time, one character maps directly to one codepoint, or it just doesn’t matter much, and that’s the end of the story. But take for example the letter ä. There’s this character, a lowercase a with a dieresis floating on top, and actually there’s not just one way to describe this character with codepoints.

The obvious way is via the codepoint 0xE4. That one codepoint maps to this character.

A less obvious way is via the codepoint sequence 0x61 0x308: ä. It’s the same character, but we "made it" with two different codepoint representations. The second one uses a combining mark. You also see this stuff in action when you use emojis with skin color modifiers. Those are one character, two codepoints. We’ll come back to this stuff later a bit. This terminology has nothing to do with ES, but I wanted to start clarifying these words.
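
Jumping ahead a little into JS syntax just to make that concrete:

const precomposed = '\u00E4';       // ä as one codepoint
const combining   = '\u0061\u0308'; // ä as 'a' plus a combining diaeresis

// They render identically, but they are different codepoint sequences:
assert(precomposed !== combining);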

In programming, when we say "character" we usually are speaking very loosely. We often really mean something like "string value of one codepoint". Until we get to the section about canonicalization, we won’t worry too much about this.

What is a codepoint

So, I mentioned before, codepoints aren’t part of any particular encoding. These are numbers that describe either single characters or parts of composed characters or in a few cases one even describes multiple characters (really).

All the integers from 0x0 to 0x10FFFF are "codepoints" in Unicode, but not all of these are valid codepoints. There are two ranges within there that are "blacked out", plus a handful of special cases where it gets fuzzy, like the noncharacters 0xFFFE and 0xFFFF, which exist partly so byte order marks can be detected. To be valid, though, a codepoint does not have to have a defined meaning.

The term for a "valid" codepoint — one which may appear in a well-formed Unicode string — is "Unicode scalar value".
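
As a quick sketch in code, the rule boils down to this (the excluded span covers both of the blacked-out ranges mentioned above; we’ll meet them again shortly):

const isScalarValue = cp =>
  cp >= 0x0 && cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF);

assert(isScalarValue(0x1F4A9) === true);
assert(isScalarValue(0xD83D) === false);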

In really nice modern languages, strings are lists of codepoints. But ES isn’t that modern.

WTF is a code unit

This is where we get to encoding-specific stuff. There are many ways to encode Unicode (utf8, utf16le, utf32, etc) or subsets of Unicode (ucs2, etc). The most common is UTF8. UTF8 is very efficient and clean. It’s usually the most compact way to represent Unicode text. But from a programming perspective, it’s sort of quirky. It doesn’t map neatly to how we like to pretend strings are simpler than they are. Every codepoint over 0x7F has to be described using 2 to 4 bytes. Unless you’re absolutely certain the text is constrained to ASCII codepoints, you can’t just iterate over bytes and get "characters". You always have to decode.

Each encoding has its own set of code units, which are fixed-length words. In UTF8, that length is a single byte, though some possible byte values (0xC0, 0xC1, and 0xF5 through 0xFF) never appear in valid UTF8. So UTF8 has 243 code units.
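
For instance, with a TextEncoder handy (available in browsers and modern node), you can watch the byte counts change:

const utf8 = new TextEncoder();

utf8.encode('a');  // -> Uint8Array [ 0x61 ]                   (1 byte)
utf8.encode('ä');  // -> Uint8Array [ 0xC3, 0xA4 ]             (2 bytes)
utf8.encode('💩'); // -> Uint8Array [ 0xF0, 0x9F, 0x92, 0xA9 ] (4 bytes)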

UTF32 is an interesting case. In UTF32, the word length is four bytes. That’s actually big enough to describe every codepoint in Unicode without any shenanigans. It’s just 1:1. That makes it awesome for programming actually. In a utf32 byte sequence, every four byte word is gonna just be a codepoint, no decoding needed (but you still need sanity checks for whether it’s a valid codepoint). The thing is, intuitive as this is, it’s kind of crazy to use a uint32 for every codepoint when the vast majority of text out there is gonna be composed of codepoints below 0x80 or at least below 0x10000.

Caring about how many bytes are in play was even more important in 1995. That’s when ES was invented. And at this point, the codepoints beyond the BMP (basic multilingual plane — 0x0 to 0xFFFF) were basically hypothetical. No software supported them, and the first codepoints eventually assigned up there were for shit like Byzantine musical symbols.

So Eich took a shortcut (which is not unique to ES). He went for the shittiest of all full Unicode encodings, UTF16. In UTF16, all codepoints in the BMP map to one two-byte word. So at the time, that meant it effectively had those nice 1:1 properties from UTF32. You could access a string by index like "foo"[0] and you’d probably never think about what it really meant. Pseudocode:

function getCharAt(underlyingBuffer, index) {
  // read the uint16 code unit at this index and wrap it back up in a string
  return String.fromCharCode(underlyingBuffer.readUInt16LE(index * 2));
}

Fast forward to 2009. There’s lots of stuff above the BMP now, in the SMP (supplemental multilingual plane) — even scripts used by a few obscure but living languages. But who cares? Nobody programming, anyway. None of this seems to matter yet.

Okay, fast forward one more year. It’s October of 2010 and...

EMOJI 👏 LAND 👏 IN 👏 THE 👏 SMP.

OH NO WE CARE WE CARE SO MUCCCCCH 🍆🍆🍆.

The consequences

You can see this early shortcut "leak" in two ways in ES today.

The first is that String.prototype.length abuses the shortcut by reporting the number of code units (pseudocode: buffer.length / 2), not the number of codepoints or characters:

assert("💩".length === 2);

The second comes into play with indexed string access and methods that do stuff like slicing strings. The indices point to these two-byte code units, not to codepoints. Code units don’t have string values. They are the symbols that get decoded to get codepoints, not the codepoints themselves; it just so happens that for all Unicode scalar values below 0x10000, the code units used to represent them are the same as their own values.

So ... what do we get?

assert("💩"[0].length === 1); // hmm
assert("💩".charCodeAt(0) === 0xD83D); // WHO ARE YOU???
assert("💩".charCodeAt(1) === 0xDCA9); // WHAT DO U WANT

If we log "💩"[0] or "💩"[1] we’ll see "�", the replacement character that gets shown for values which aren’t valid on their own. Those two items are code units that are not also Unicode scalar values — the mask is off, they were all code units all along!

Code units of what though? As mentioned earlier, code units are the symbols that appear in a specific encoding. In this case these two are code units of UTF16 specifically: the high and low surrogates. A pair of these together gets decoded to a single codepoint in UTF16. This is the crux of the quirk: ES permits addressing and even handling, as strings, sequences which use surrogate code units (an internal facet of the UTF16 encoding) as if they were scalar value codepoints / characters. This is an abstraction leak.
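
Just to demystify the pairing, this is the arithmetic UTF16 uses to put those two code units back together (you’d never do this by hand; it just shows there’s no magic):

const hi = 0xD83D;
const lo = 0xDCA9;
const codepoint = (hi - 0xD800) * 0x400 + (lo - 0xDC00) + 0x10000;

assert(codepoint === 0x1F4A9);
assert(String.fromCodePoint(codepoint) === '💩');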

Excepting cases where you know in advance that a string will be constrained to codepoints that can be described with a single code unit in UTF16, we should be handling strings quite differently than we may be used to. ES2015 introduced the tools to do so.

How can we deal with this

Well, first off I’m gonna point out that annoying as this is, it’s not unusual. A lot of languages have similar quirks and it’s actually kind of a miracle that JS is natively Unicode (or Unicode-ish) to begin with, given it’s 22 years old. So much of the shitstorm around Python 3 concerned them finally fixing the incredibly broken strings from Python <= 2 and people were like waaaah but I love having all these bugs and casually alienating users who don’t speak English by accident waaah. (Python 3 does strings excellently now I think, they cleaned up hardcore.)

Because ES cannot break the web, it’s hard to fix something this systemically deep entirely. So indexed access is still unsafe, as are charAt and charCodeAt. I mentioned earlier that it’s okay to use these if you’re totally confident about the nature of the string content, but personally I’d recommend just treating them as deprecated — we have better ways now.

I’m going to use the word "character" a bit loosely now, to refer to the string value of a given codepoint. This is what we usually think of when programming.

Iterating strings

ES2015 introduced iterators, which are a natural fit for cleaning up strings. Any operation which accesses String.prototype[Symbol.iterator] will correctly yield whole codepoints (well, their string values), not code units!

const str = '💩💩💩';

let i = 0;

for (const char of str) {
  assert(char === '💩');
  i++;
}

assert(i === 3);

Getting a string’s length

Same dealie!

assert(Array.from(str).length === 3);

// or

assert([ ...str ].length === 3);

Accessing a character by its index in the string

Same dealie!

assert([ ...str ][0] === '💩');

But if we do want the first char and the string is long, we may want to take a different approach. Above, we iterate over the whole string, but we really only need to pump the iterator one time:

assert(str[Symbol.iterator]().next().value === '💩');

Accessing a codepoint by its index in the string

This one is a bit different. On one hand it’s simple: swap out charCodeAt and use codePointAt. But the index you pass in is still a code unit index. Kind of lame, but it’s maybe understandable that they didn’t want to introduce an opaquely different concept of indexing in strings. If you try to access a codepoint at an index which points to the second code unit of a surrogate pair, it will return the surrogate code unit value instead, just like charCodeAt.

assert(str.codePointAt(0) === 0x1F4A9);

...but watch out for:

assert(str.codePointAt(1) === 0xDCA9);

For this reason you typically don’t want to use it this way, though you could if you advanced the index by 2 rather than 1 whenever you see a codepoint over 0xFFFF. However in the most common cases for accessing codepoints, you want all of them — so you can just do this:

const codepoints = [ ...str ].map(char => char.codePointAt(0));

// Or better yet, skipping the intermediate array:

const codepoints = Uint32Array.from(str, char => char.codePointAt(0));

That will be safe, but similar issues come up with things like slice(); these take code unit indices. So when figuring out what the indices are, keep that in mind.
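
For example, a slice boundary can land in the middle of a surrogate pair:

assert('💩💩'.slice(0, 2) === '💩');
assert('💩💩'.slice(0, 1) === '\uD83D'); // oops: a lone high surrogate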

Intl and string comparison

Alright, so let’s move on to something a bit different.

A classic "wat" people point to with js is something like the following:

[ 1, 2, 11, 22, 'abattoir', 'YOLO' ].sort();
// -> [ 1, 11, 2, 22, 'YOLO', 'abattoir' ]

The story here is pretty simple: the default sort is a string comparison. It’s a bit silly — numeric comparison is surely the more commonly useful case. The string comparisons in question are the same results you would get from using > or < to compare strings.

If we put aside the odd choice of default type comparison, there are still other aspects of the result which may be unintuitive, but they are not a wat. In fact this result is very much the only correct answer possible without additional context, and any assumptions about such a context would be PURE, HIGH OCTANE EVIL.

Specifically, the default string comparison is code unit order (loosely, byte order; the distinction is opaque to us here). The reason you can’t get 'abattoir' before 'YOLO' out of the box is that collation order is not an intrinsic property of a codepoint. It is a property of languages.
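
Concretely, the only reason 'YOLO' wins by default is that uppercase Latin letters have lower code unit values than lowercase ones:

assert('Y'.charCodeAt(0) === 0x59);
assert('a'.charCodeAt(0) === 0x61);
assert('YOLO' < 'abattoir');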

In other words, anything other than byte order would suck because it would either be nondeterministic (user locale) or default to, probably, English (colonialism is shitty).

Take, for example, the following array of strings:

const words = [ '1', '11', '2', '22', 'rövhål', 'Rugburn' ];

What order should these be in? The question makes no sense if you can’t say what language’s collation order to use. And there are a lot of languages, so this could be tricky ... but fortunately, since ES 5.1, it hasn’t been.

const english = new Intl.Collator('en');
const swedish = new Intl.Collator('sv');
words.sort(english.compare); // -> ["1", "11", "2", "22", "rövhål", "Rugburn"]
words.sort(swedish.compare); // -> ["1", "11", "2", "22", "Rugburn", "rövhål"]

So, cool, we have correct collation for alphabetic words per language, but we still have those annoying numbers. We can fix that, though.

const english = new Intl.Collator('en', { numeric: true });
words.sort(english.compare); // -> ["1", "2", "11", "22", "rövhål", "Rugburn"]

In fact there are quite a few additional options, because collation is a much more complex concern than may be obvious.
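
A couple of the other knobs, just as a taste (exact results may vary a little with an implementation’s locale data):

const caseless = new Intl.Collator('en', { sensitivity: 'base' });
assert(caseless.compare('a', 'ä') === 0); // accents (and case) ignored at base sensitivity

const upperFirst = new Intl.Collator('en', { caseFirst: 'upper' });
[ 'a', 'A' ].sort(upperFirst.compare); // -> ["A", "a"]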

About Intl

Many people don’t know about the Intl object, but it is not a DOM API — it is an intrinsic namespace object in ES like JSON and Math. Like the rest of ES, it is governed by TC39, but it has a distinct spec (ECMA-402, as opposed to ECMA-262) which is not tied to annual ES versions.

It also provides functionality for date and time formatting and numeric formatting. Because Intl is an unusual case in terms of size (it needs complete knowledge of Unicode, among other things), implementations of ES are permitted to omit it. Fortunately, the last holdout, iOS Safari, now supports it too, but to be safe you should still typically wrap Intl functionality in a try-catch with a "less smart" simple fallback in the catch block.
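
Something like this, for example (sortWords is just a hypothetical wrapper for illustration):

function sortWords(words, locale) {
  try {
    const collator = new Intl.Collator(locale, { numeric: true });
    return [ ...words ].sort(collator.compare);
  } catch (err) {
    // the "less smart" fallback: plain code unit ordering
    return [ ...words ].sort();
  }
}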

Bonus round: String canonicalization

I said earlier that we would come back to "characters", how they can be created with different codepoint sequences. Unicode explicitly defines how these alternative representations of character identities work, because it’s sometimes desirable to normalize strings so that two strings don’t use different representations for the same content, e.g. when deduplicating.

ES2015 introduced this as String.prototype.normalize(). It takes an argument indicating whether you want to normalize to canonical compositions (i.e. prefer, where available, precomposed single-codepoint forms) or canonical decompositions (prefer decomposed forms built from base characters plus combining marks). There are two other options for special backwards compatibility needs, but those are lossy transformations which you would rarely want. Usually you want the default, 'NFC' (canonical composition):

assert('\u00E4'.normalize() === '\u0061\u0308'.normalize());
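
Spelling the forms out explicitly:

assert('\u00E4'.normalize('NFD') === '\u0061\u0308');
assert('\u00E4'.normalize('NFD').length === 2); // decomposition: two codepoints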

Some additional info on encodings

If you’re curious about more implementation details and how this plays out in node:

From the perspective of an ES program, strings don’t "have" an encoding, so to speak, but the abstraction leaks described above imply an encoding which is like UCS2 except with surrogates permitted (i.e., all values from 0x0 to 0xFFFF are legal, and values from 0x10000 to 0x10FFFF are just "phenomena" that emerge when the string is viewed or accessed via ES2015 interfaces).

However in practice, this is not how engines actually encode strings internally, and even if it were, real world needs (where strings leave ES) mean there has to be a way to transcode the illegal Unicode into legal Unicode ... or at least into other invalid encodings.

Almost all text today is exchanged in UTF-8, so what emerged from this is a pseudo-standard called WTF-8.* For any valid Unicode string, WTF-8 and UTF-8 are identical: SMP codepoints are still encoded in the normal UTF-8 manner (UTF-8 does not use surrogates as code units; the code units in UTF8 are each a single byte — surrogates are code units only in UTF-16 and CESU-8). However for an ES-style string which may contain unpaired surrogates as if they were Unicode scalar values on their own, WTF-8 permits encoding these occurrences in the regular UTF-8 fashion (that is, they would become three byte sequences beginning with 0xED, which is technically not UTF-8).

That said, WTF-8 is mainly intended for internal interchange; you wouldn’t want to expose it, since it’s (potentially) garbage. Rather one normally would convert illegal unpaired surrogates to 0xFFFD (the famous question-mark-in-a-diamond character). This is indeed what, for example, node’s Buffer will do when it creates buffers from strings. If you need to actually write such strings as-is to a file, you’d have to do so in userland, e.g. via a custom WTF-8 encoder. But you would not want to do that, really.
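
You can see that replacement behavior directly in node:

const lonely = '\uD83D'; // an unpaired high surrogate

Buffer.from(lonely, 'utf8'); // -> <Buffer ef bf bd> (the UTF-8 bytes of U+FFFD)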

* At one point WTF-8 was used to mean something else, a joke proposal inspired by the (now fortunately uncommon) problem of agents presuming everything is in oldskool CP-1252 and dumping mojibake.

Although not super relevant to the above, I mention it because it’s classic. Some example passages:

  1. Terminology

The key words "WHAT", "DAMNIT", "GOOD GRIEF", "FOR HEAVEN'S SAKE", "RIDICULOUS", "BLOODY HELL", and "DIE IN A GREAT BIG CHEMICAL FIRE" in this memo are to be interpreted as described in [RFC2119].

For example, code that reads cuneiform text encoded in UTF-16 ignoring the surrogate pairs and the byte order mark, then writes out the 16-bit numbers in UTF-8 thereby making the previous data illegible. The justification for allowing such incompetent code was that there were no major implementations of the Unicode supplementary planes and no significant amounts of data containing bronze age writing. The issue has been dubbed the "Babylonian mess", and the relevant programmers have pledged to produce different bugs in the future.

  1. Security Considerations

Implementers of WTF-8 should not consider the security aspects of how they handle character data. After all, it is inconceivable that in any circumstances an attacker would be able to exploit an incautious parser by sending it an octet sequence.
