
@isaacs
Created February 17, 2012 04:50
{ "inline":
{ "unicode-support-in-js-today":"💩"
, "unicode-support-in-js-someday":"😁" }
, "surrogates":
{ "unicode-support-in-js-today":"\uf09f\u92a9"
, "unicode-support-in-js-someday":"\uf09f\u9881" }
}
function assert(x) {
if (!x) console.error("assertion failed")
else console.error("assertion passed")
}
{ "use unicode" // opt-in so we don't break the web
var x = "\u1F638" // > 2 byte unicode code point
var y = "😁" // face with open mouth and smiling eyes
assert(x.length === 1) // less important, but ideal
assert(y.length === 1) // less important, but ideal
assert(x === y) // unicode code points should match literals
console.log(x) // <-- should output a smiley, not "ὣ8"
console.log(y) // <-- should output a smiley, not mojibake
assert(JSON.stringify(y) === JSON.stringify(x))
assert(JSON.parse(JSON.stringify(y)) === y)
assert(JSON.parse(JSON.stringify(x)) === x)
assert(x.indexOf(y) === 0)
assert(y.indexOf(x) === 0)
var arr = ["a", "b", "c"]
var axbxc = arr.join(x)
var aybyc = arr.join(y)
assert(axbxc.split(x)[1] === arr[1])
assert(axbxc.split(y)[1] === arr[1])
// etc.
// They're just characters, and just strings.
// No special anything, just treat it like any other character.
}
@isaacs
Author

isaacs commented Feb 17, 2012

If you view this file in a font that supports it, you'll see a smiling face rather than a blank spot there.

The point of this is that JavaScript should not limit its Unicode support to only the code points that fit in 2 bytes, or require the programmer to deal with manually juggling surrogate pairs, translating through cesu-8, or any other absurd monkey business.

Any unicode character should be able to appear in any string anywhere, period. There are no other features required. Just sanity.
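
For reference, here is roughly what the manual surrogate-pair juggling looks like today (a sketch; just the standard UTF-16 arithmetic, nothing from this gist):

// ES5-era workaround: recover an astral code point from a surrogate pair by hand.
function codePointAt(str, i) {
  var hi = str.charCodeAt(i)
  if (hi >= 0xD800 && hi <= 0xDBFF) {
    var lo = str.charCodeAt(i + 1)
    if (lo >= 0xDC00 && lo <= 0xDFFF) {
      // combine the two 16-bit code units into the real code point
      return (hi - 0xD800) * 0x400 + (lo - 0xDC00) + 0x10000
    }
  }
  return hi
}
var grin = "\uD83D\uDE01"                      // 😁 written as a surrogate pair
console.log(grin.length)                       // 2, not 1
console.log(codePointAt(grin, 0).toString(16)) // "1f601"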

@BrendanEich

First, "use unicode" as a string literal expression-statement that changes semantics is web-breaking (same with "use strict" in ES5 -- ask Apple, Google, and Mozilla).

Second, a real pragma |use unicode;| would not affect all reachable scripts, only the enclosed lexical (block or program/function-body) scope, but strings are visible in the heap. Would scripts lacking |use unicode;| pragmas but able to reach a string literal created in a |use unicode;| block see 2 chars or only one?
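
Concretely (sketching the hypothetical pragma; this is not real syntax today):

function makeSmiley() {
  "use unicode"            // hypothetical opt-in
  return "\uD83D\uDE01"    // one character, under the pragma
}
// This caller never opted in, but the string lives in the shared heap.
// Does it see .length === 1 or .length === 2?
var s = makeSmiley()
console.log(s.length)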

This is not as easy as you think. We "prescriptivists" (who get FU from hotheads) aren't such idiots. We've been thinking about this for a while.

/be

@mranney

mranney commented Feb 17, 2012

I do not think this is easy. But users are generating text outside of the BMP. It is very sad that JavaScript does not have a good way to support this.

@isaacs
Author

isaacs commented Feb 17, 2012

I don't care about the pragma, or even really .length being "accurate". I care about being able to properly handle unicode characters outside the BMP.

What parts of the web would break if JavaScript switched from UCS-2 to UTF-16? Are those parts bigger than the parts that are broken today for want of this functionality?

@BrendanEich

@mranney: try this (Lars Hansen formerly of Opera and Adobe suggested it in ES4 days): we underspecify on purpose, let Node.js and Chinese-targeted browsers index by character not uint16, while most global browsers stick with backward compatibility. Interop take the hindmost and let Darwin sort it out. What do you think?

/be

@mranney

mranney commented Feb 17, 2012

I think that if we don't get a good way to represent non-BMP characters in our JSON and JavaScript that we (Voxer) won't be able to use JavaScript for very long. I don't think we are alone on this, but I'll bet that the pain isn't being felt much yet because JS isn't the only environment to have trouble with non-BMP chars. This is going to change though, because both iOS and Android are on Unicode 6 now.

@BrendanEich

@isaacs: that's just the problem: China et al. have coped with JS as it is, using \uXXXX\uXXXX and length/index over-counting. I am told by folks on the ground that real sites there use JS heavily to process char data and would break with a by-fiat change. Could try, but opt-in seems better at this point. The Lars "Darwin sort it out" idea did not fly in TC39.
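
The coping looks like this (nothing hypothetical, just today's UCS-2 behavior):

var s = "I \uD83D\uDC96 JS"    // "I 💖 JS", the heart written as \uXXXX\uXXXX
console.log(s.length)          // 7, though a reader sees 6 characters
console.log(s.charAt(2))       // a lone high surrogate, half a character
console.log(s.slice(0, 3))     // "I " plus that broken half

Sites written against these numbers would see every index and length shift under a by-fiat change.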

More recently, Allen Wirfs-Brock tried a "21-bit char" approach that did not fly in es-discuss or TC39. See https://mail.mozilla.org/pipermail/es-discuss/2011-May/014252.html et seq.

/be

@BrendanEich

@mranney: rather than dump node.js why wouldn't you make a fork that does the 21-bit char thing?

/be

@BrendanEich

@isaacs: "I don't care about the pragma, or even really .length being "accurate". I care about being able to properly handle unicode characters outside the BMP." -- I get what you mean by not caring about the pragma (but anyone writing the spec has to), but .length has to be "accurate" for you to be able to "properly handle" non-BMP chars. For any consistent values of "accurate" and "properly handle".

The Big Red Switch on global objects may be the best compromise. It wouldn't be a |use unicode;| pragma, rather a <meta> tag or some such. You would opt in early and all same-global strings would be UTF-16 not UCS-2. Any cross-global string passing would copy to the other window's format. No duplicated string library methods a la Java, no observable different representations. Embeddings such as Node could pre-opt-in at startup. Comments?
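
Sketching the observable effect (hypothetical, and assuming the switch means per-character indexing, which is what this thread is after):

// In a global that threw the switch:
//   "😁".length === 1          // indexed by character
// In a global that did not:
//   "😁".length === 2          // indexed by uint16 code unit, as today
// A string passed from one window to the other is copied into the receiving
// window's format, so no single string object ever exposes both views.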

/be

@isaacs
Author

isaacs commented Feb 17, 2012

Who are these people on the ground? Hypothetically, flip a magic switch, everyone wakes up tomorrow and their browsers are all doing UTF-16 instead of UCS-2. What websites break? Can you give me a url? I've heard a lot of hearsay about how this would break the web in some crucial ways, but frankly, it sounds rather specious.

Node is not a VM. V8 follows the spec, and node uses V8. "Forking node" is really forking V8. So that's kind of a gfy answer. Nevertheless, if it comes to bolting on some functionality to v8 to get the unicode support we need, or doing other hacks outside the VM to interpret the bytes sensibly, then that's what we'll have to do.

Here's some JSON that is extremely problematic: https://raw.github.com/gist/1850768/aed7ca8042100b54d90e0d6bb1f8294249c7a1ca/unicode.json

@mranney If you could provide some (de-identified) real-world data or comments about your use case as well, that'd be great.

A big red switch somewhere would be perfectly fine. (We already have those for many other harmony and es-next features.) But without a spec to point at, it's very hard to make a compelling case to v8 that this should be done in the VM.

I don't mean to claim that this is easy. But it should be easy to see it ought to be a priority, since it's actively harmful, not well-solved already by API or frameworks, and not something where you can just learn how it works and get by. (Unlike, say, module systems, breakable for-each loops, or ASI rules.)

TC-39 should not spend its time on problems that aren't incredibly hard until the incredibly hard ones, like this one, are solved.

@mranney

mranney commented Feb 17, 2012

@BrendanEich For my purposes, a BRS on global objects seems great. Seems like that would be good for the web as well.

I'll dig up some real world examples. The most common one we have is emoji characters, used these days even by US-based iOS users. The other is people's names, which isn't that common of a problem for us since we are mostly popular in the US right now. I expect that this will change if we start getting popular in Asia.

@BrendanEich

@isaacs: before you throw out "specious" please consider the technical facts: you're talking about every JS run over a string today changing indexing from uint16 elements to character elements. I'm not concerned about O(1) vs. O(whatever) here, just the actual change in index and length. Yes, it might all work - that was the hope when Allen proposed 21-bit chars for all - but we don't know and it certainly could break if anyone is doing indexOf, e.g. looking for a certain uint16 value, then looking at the next uint16 to interpret a surrogate pair. Any such hard-coding will break.
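
The kind of hard-coding I mean looks roughly like this (a representative sketch, not code from any particular site):

// Scan for a specific emoji by matching its high surrogate, then read the
// next uint16 to decide which character follows.
var HI_SURROGATE = 0xD83D
function findEmoji(s) {
  for (var i = 0; i < s.length; i++) {
    if (s.charCodeAt(i) === HI_SURROGATE) {
      return { index: i, lo: s.charCodeAt(i + 1) }  // assumes the pair sits at i, i+1
    }
  }
  return null
}
// If indexing silently became per-character, charCodeAt(i) would return whole
// code points and this code would never match -- or match the wrong thing.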

So "specious" is a low blow: Allen already proposed to optimize as you glibly assert will work. But browser game theory is heinous. First one to try may lose big time and lose market share, which will force it to go back to the bad old ways.

Thus my advice in favor of opt-in. Is that "specious" too? Yeesh!

specious (spē'shəs), adj.

  1. Having the ring of truth or plausibility but actually fallacious: a specious argument.
  2. Deceptively attractive.

Rather my argument is based on getting burned many times by breaking changes. Every damn testable feature in the hyperspace gets tiled by usage. Call me conservative on this, but don't call me a deceiver.

I know Node is not a VM but that's not the point. Rather, Node can be all-or-nothing. Web browsers cannot. Some windows may load old content with hard-coded surrogate matching. Some may load new content and opt in. That's all.

/be

@BrendanEich

Here's one example (there are many) of "people on the ground" (Microsoft's I18N expert, in this case) raising the same "breaks hardcoding" objection I mentioned:

https://mail.mozilla.org/pipermail/es-discuss/2011-May/014256.html

/be

@allenwb

allenwb commented Feb 17, 2012

I still think my proposal is technically sound and would support converting to 21-bit characters in a backwards compatible manner (at least from a pure JS perspective; DOM interaction may be another matter). However, there are probably some rough edges to work out. I'd need to go back and review the es-discuss thread to refresh my memory. Existing code that uses UTF-16 encoding within JS strings would still use UTF-16. The only difference would be that their 16-bit code units would be stored within (logically) 21-bit string elements instead of 16-bit string elements. Of course, the actual physical storage could be optimized to only use >16-bit cells when actually needed. I think that the biggest compatibility issues probably relate to the DOM usage of JS strings.

The issue that I think stalled the proposal (over and beyond schedule pressure) was feedback from the Unicode experts that they really didn't care about uniform 21-bit characters. They seem to be perfectly happy with strings that always manifest variable-length UTF-16 encodings. The argument was that even if you had 21-bit uniform characters, the "semantics" of Unicode still require parsing for multi-codepoint control sequences. The parsing that needs to be done to recognize UTF-16 encodings supposedly fits naturally into that sort of string processing.
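
Concretely, the model is roughly this (a sketch of the intent, not spec text; the long literal form is hypothetical):

// A string becomes a sequence of (logically) 21-bit elements.
// Existing code that stores UTF-16 code units keeps exactly the same indices:
var old = "\uD83D\uDE01"   // two elements holding 0xD83D and 0xDE01, as today
console.log(old.length)    // 2, unchanged
// New code could put one full code point in one element instead:
//   var neu = "\u{1F601}"  // hypothetical literal: one element, value 0x1F601
//   neu.length === 1
// Engines would only widen the physical storage past 16 bits when needed.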

Allen

@BrendanEich

@allenwb: hardcoding breaks, though: anyone matching uint16 parts of a surrogate pair will mismatch or fail to match. Right?

Yeah, DOM interfacing might suck. Or mostly work -- your proposal has appeal to me because Gecko (and WebKit and Trident and ...) have to respect non-BMP characters and find the right glyphs for them. But it could be a big change for all browsers.

What do you think of the opt-in via per-global "Big Red Switch" idea?

/be

@allenwb

allenwb commented Feb 17, 2012

Regarding Shawn's "breaks hardcoding" objection, I think he misunderstood the difference between JS Strings with 21-bit (or 32-bit) elements and UTF-32 strings. The strings I proposed were the former, not the latter. With 21-bit cells you are not limited to using only valid UTF-32 codepoints. It is fine to have 0xD800-DFFF values in such a string. If your code only puts 16-bit values into the string elements and does UTF-16 processing of them, under my proposal the string indices would be exactly the same as with JS today.

Allen

@BrendanEich

@allenwb: there's no UTF-32 confusion. The problem is hard-coding halves of a surrogate pair into indexOf or similar, reading data that was not likewise generated. It's not a symmetric GIGO situation, the 21-bit data will be full Unicode characters while the old JS will be matching uint16 pieces in vain.

/be

@allenwb

allenwb commented Feb 17, 2012

@BrendanEich Yes, you need to know what you are doing if you are mixing UTF-16 and UTF-32 strings. My premise is that if you have existing code that is explicitly dealing with UTF-16 encodings, it would continue to work just fine as long as you don't unknowingly insert UTF-32 data into the mix. I guess that could be an issue if I/O (e.g., XDR) automatically started populating strings using UTF-32. Maybe that's where the "Big Red Switch" would come into play.

@piscisaureus

To me, the most important thing is to get the round trip between utf8 and JavaScript right:
assert(JSON.parse(JSON.stringify(x)) === x)

Changing the interpretation of "\uxxxx" literals is not desirable imho as it creates ambiguity. If we do this: var x = "\u1F638" // > 2 byte unicode code point, then how does one write this as a single string literal: var x = "\u1F63" + "8" ?
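
Spelled out under today's parsing rules:

// Today "\u1F638" is the 4-digit escape \u1F63 followed by the digit 8:
var a = "\u1F638"
var b = "\u1F63" + "8"
console.log(a === b)   // true -- both are "ὣ8"
// If the longer escape were adopted without any framing, there would no longer
// be a way to write the two-character string b as a single literal.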

@isaacs
Author

isaacs commented Feb 17, 2012

Hm, github ate my reply. I'll try to do it justice. I'm sure the original was much more compelling :)

@BrendanEich I wasn't glibly asserting it'd work. I was glibly asking what specifically would break. I've heard several claims about how it would break things. Those claims have the ring of truth, but I've grown skeptical of rings, even truthy ones. I'd like to see a program in the wild that'll actually be affected, especially one that isn't using strings as a make-shift binary array, or doing fancy byte-shuffling in order to work around this very issue.

Skepticism aside, I'm not completely insane. Of course this would have to be opt-in. If it can't be a pragma, fine; a BRS, or even a separate special type would be perfectly acceptable, as long as it would enable us to serialize and deserialize the string faithfully, and know what the characters should be rather than rely on the dom to sort it out for us.

If we do this: var x = "\u1F638" // > 2 byte unicode code point, then how does one write this as a single string literal: var x = "\u1F63" + "8" ?

Yes, that's true. It'd have to either be somehow framed, like \U+[1F638] or something, or we just bite the bullet and write out the surrogates.
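
Sketching both options (the framed syntax is purely illustrative; only the last line runs today):

// Framed: the brackets mark where the code point ends, so there is no ambiguity.
//   var cat = "\U+[1F638]"   // exactly U+1F638, one character
//   var mix = "\U+[1F63]8"   // U+1F63 followed by the digit 8
// Biting the bullet: write the surrogate pair out by hand, which works now.
var cat2 = "\uD83D\uDE38"     // U+1F638 as today's two uint16 code units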

@BrendanEich

@izs: ok, that helps -- @allenwb or I will restart, I think with a BRS-per-global, on es-discuss and get it on the next tc39 meeting's agenda.

/be

@allenwb

allenwb commented Feb 19, 2012

@mranney @piscisaureus

Please see Gist 1861530

For some reason, I couldn't post it as a comment here.

@isaacs
Author

isaacs commented Feb 21, 2012

It appears that, in node at least, we're being bitten by http://code.google.com/p/v8/issues/detail?id=761. We will work with v8 to figure out the best solution there for getting from utf8 bytes into a JavaScript string without arbitrarily trashing non-BMP characters. I apologize for misunderstanding the issue and impugning the good name of JavaScript. (In my defense, it's a particularly complicated issue, and JavaScript's name isn't really all that good ;)
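
The round trip in question is easy to check against the Buffer API (nothing hypothetical here):

// Decode UTF-8 bytes into a JS string and back; a correct decoder keeps the
// non-BMP character as a surrogate pair instead of trashing it.
var bytes = new Buffer([0xF0, 0x9F, 0x98, 0x81])      // UTF-8 for U+1F601 😁
var str = bytes.toString('utf8')
console.log(str.length)                               // should be 2: one surrogate pair
console.log(new Buffer(str, 'utf8').toString('hex'))  // should be "f09f9881" again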

Nevertheless, I think that clearly the long-term correct fix is for JavaScript to handle unicode intelligently (albeit with the presence of big red switches), so I'm very happy to see your proposal.
