@isaacs
Created February 17, 2012 04:50
{ "inline":
{ "unicode-support-in-js-today":"💩"
, "unicode-support-in-js-someday":"😁" }
, "surrogates":
{ "unicode-support-in-js-today":"\ud83d\udca9"
, "unicode-support-in-js-someday":"\ud83d\ude01" }
}
function assert(x) {
if (!x) console.error("assertion failed")
else console.error("assertion passed")
}
{ "use unicode" // opt-in so we don't break the web
var x = "\u1F638" // > 2 byte unicode code point
var y = "😁" // face with open mouth and smiling eyes
assert(x.length === 1) // less important, but ideal
assert(y.length === 1) // less important, but ideal
assert(x === y) // unicode code points should match literals
console.log(x) // <-- should output a smiley, not "ὣ8"
console.log(y) // <-- should output a smiley, not mojibake
assert(JSON.stringify(y) === JSON.stringify(x))
assert(JSON.parse(JSON.stringify(y)) === y)
assert(JSON.parse(JSON.stringify(x)) === x)
assert(x.indexOf(y) === 0)
assert(y.indexOf(x) === 0)
var arr = ["a", "b", "c"]
var axbxc = arr.join(x)
var aybyc = arr.join(y)
assert(axbxc.split(x)[1] === arr[1])
assert(axbxc.split(y)[1] === arr[1])
// etc.
// They're just characters, and just strings.
// No special anything, just treat it like any other character.
}
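
For comparison, here is a minimal sketch of how an engine with UTF-16-based strings treats the same character without any opt-in. The variable names are made up for illustration, and `Array.from` and `codePointAt` come from ES2015, which postdates this gist. Most of the wish list above already holds; only the `length` expectation and the `"\u1F638"` escape do not.

```javascript
// Behaviour of ordinary (UTF-16 based) JavaScript strings, no pragma involved.
const literal = "😁";                // U+1F601, a code point outside the BMP
const surrogates = "\uD83D\uDE01";   // the same character written as a surrogate pair

console.log(literal === surrogates);                          // true
console.log(literal.length);                                  // 2 (code units, not characters)
console.log(Array.from(literal).length);                      // 1 (iteration is by code point)
console.log(literal.codePointAt(0).toString(16));             // "1f601"
console.log(JSON.parse(JSON.stringify(literal)) === literal); // true

// join/split also round-trip, because both sides use the same code units.
const parts = ["a", "b", "c"];
console.log(parts.join(literal).split(surrogates)[1] === parts[1]); // true
```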
@allenwb

allenwb commented Feb 17, 2012

Regarding Shawn's "breaks hardcoding" objection, I think he misunderstood the difference between JS Strings with 21-bit (or 32-bit) elements and UTF-32 strings. The strings I proposed were the former, not the latter. With 21-bit cells you are not limited to using only valid UTF-32 codepoints. It is fine to have 0xD800-DFFF values in such a string. If your code only puts 16-bit values into the string elements and does UTF-16 processing of them, under my proposal the string indices would be exactly the same as with JS today.
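
As a rough illustration of the "UTF-16 processing" of 16-bit values mentioned above, the sketch below converts between a supplementary code point and its surrogate pair by hand. The helper names `toSurrogatePair` and `fromSurrogatePair` are hypothetical and not part of any proposal in this thread; the arithmetic itself is the standard UTF-16 encoding.

```javascript
// Hypothetical helpers: manual UTF-16 encoding and decoding of one code point.
function toSurrogatePair(codePoint) {
  if (codePoint < 0x10000 || codePoint > 0x10FFFF) {
    throw new RangeError("not a supplementary code point");
  }
  const offset = codePoint - 0x10000;   // 20 significant bits remain
  const high = 0xD800 + (offset >> 10); // top 10 bits
  const low = 0xDC00 + (offset & 0x3FF); // bottom 10 bits
  return [high, low];
}

function fromSurrogatePair(high, low) {
  return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00);
}

const pair = toSurrogatePair(0x1F638);
const asString = String.fromCharCode(pair[0], pair[1]);

console.log(pair.map((unit) => unit.toString(16)));     // [ 'd83d', 'de38' ]
console.log(asString.length);                           // 2
console.log(fromSurrogatePair(pair[0], pair[1]) === 0x1F638); // true
```

Code written this way only ever stores values in the range 0x0000-0xFFFF, which is the case the comment above says would keep the same string indices.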

Allen

@BrendanEich

@allenwb: there's no UTF-32 confusion. The problem is hard-coding halves of a surrogate pair into indexOf or similar, reading data that was not likewise generated. It's not a symmetric GIGO situation: the 21-bit data will be full Unicode characters, while the old JS will be matching uint16 pieces in vain.
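
A small sketch of the asymmetry described above. Modelling a "full Unicode" string as an array of code points is purely an assumption made for illustration; it is not how any proposal in this thread would be implemented.

```javascript
// Old-style code that hard-codes the high half of a surrogate pair.
const text = "a😁b";        // stored as the code units: a, D83D, DE01, b
const highHalf = "\uD83D";

// With 16-bit string elements the lone half is found inside the pair.
console.log(text.indexOf(highHalf)); // 1

// Assumed stand-in for a string with one cell per code point.
const cells = Array.from(text, (ch) => ch.codePointAt(0)); // [0x61, 0x1F601, 0x62]

// The same hard-coded half now matches nothing: the data holds whole characters.
console.log(cells.indexOf(0xD83D)); // -1
```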

/be

@allenwb

allenwb commented Feb 17, 2012

@BrendanEich Yes, you need to know what you are doing if you are mixing UTF-16 and UTF-32 strings. My premise is that if you have existing code that is explicitly dealing with UTF-16 encodings, it would continue to work just fine as long as you don't unknowingly insert UTF-32 data into the mix. I guess that could be an issue if I/O (e.g., XDR) automatically started populating strings using UTF-32. Maybe that's where the "Big Red Switch" would come into play.

@piscisaureus

To me, the most important thing is that the round trip between UTF-8 and JavaScript works correctly:
assert(JSON.parse(JSON.stringify(x)) === x)

Changing the interpretation of "\uxxxx" literals is not desirable imho as it creates ambiguity. If we do this: var x = "\u1F638" // > 2 byte unicode code point, then how does one write this as a single string literal: var x = "\u1F63" + "8" ?
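
The ambiguity can be checked directly. In JavaScript as it stands, a `\u` escape consumes exactly four hex digits, so the literal in question is already the two-character string. A minimal sketch:

```javascript
// "\u1F638" is parsed as the escape \u1F63 followed by the character "8".
const x = "\u1F638";

console.log(x.length);                      // 2
console.log(x.charCodeAt(0).toString(16));  // "1f63"
console.log(x.charAt(1));                   // "8"
console.log(x === "\u1F63" + "8");          // true
```

Reinterpreting the five-digit form as one code point would leave no way to spell this existing string in a single literal.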

@isaacs
Author

isaacs commented Feb 17, 2012

Hm, github ate my reply. I'll try to do it justice. I'm sure the original was much more compelling :)

@BrendanEich I wasn't glibly asserting it'd work. I was glibly asking what specifically would break. I've heard several claims about how it would break things. Those claims have the ring of truth, but I've grown skeptical of rings, even truthy ones. I'd like to see a program in the wild that'll actually be affected, especially one that isn't using strings as a makeshift binary array, or doing fancy byte-shuffling in order to work around this very issue.

Skepticism aside, I'm not completely insane. Of course this would have to be opt-in. If it can't be a pragma, fine; a BRS, or even a separate special type would be perfectly acceptable, as long as it would enable us to serialize and deserialize the string faithfully, and know what the characters should be rather than rely on the DOM to sort it out for us.

> If we do this: var x = "\u1F638" // > 2 byte unicode code point, then how does one write this as a single string literal: var x = "\u1F63" + "8" ?

Yes, that's true. It'd have to either be somehow framed, like \U+[1F638] or something, or we just bite the bullet and write out the surrogates.
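
For reference, both options can be sketched in current JavaScript. The framed form shown here is the `\u{...}` escape that ES2015 later standardized; it is not the `\U+[1F638]` spelling floated above, and it postdates this thread.

```javascript
// Option 1: a framed escape (ES2015 syntax), delimited so the digit count is explicit.
const framed = "\u{1F638}";

// Option 2: bite the bullet and write out the surrogates.
const spelledOut = "\uD83D\uDE38";

console.log(framed === spelledOut);                    // true
console.log(framed === String.fromCodePoint(0x1F638)); // true

// The framing removes the ambiguity: this is still U+1F63 followed by "8".
const twoChars = "\u{1F63}8";
console.log(twoChars === "\u1F638");                   // true
console.log(twoChars === framed);                      // false
```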

@BrendanEich

@izs: ok, that helps -- @allenwb or I will restart, I think with a BRS-per-global, on es-discuss and get it on the next tc39 meeting's agenda.

/be

@allenwb

allenwb commented Feb 19, 2012

@mranney @piscisaureus

Please see Gist 1861530

For some reason, I couldn't post it as a comment here.

@isaacs
Author

isaacs commented Feb 21, 2012

It appears that, in node at least, we're being bitten by http://code.google.com/p/v8/issues/detail?id=761. We will work with V8 to figure out the best way to get from UTF-8 bytes to a JavaScript string without arbitrarily trashing non-BMP characters. I apologize for misunderstanding the issue and impugning the good name of JavaScript. (In my defense, it's a particularly complicated issue, and JavaScript's name isn't really all that good ;)

Nevertheless, I think that clearly the long-term correct fix is for JavaScript to handle Unicode intelligently (albeit with the presence of big red switches), so I'm very happy to see your proposal.
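
A minimal sketch of the UTF-8 round trip at stake, using the `TextEncoder` and `TextDecoder` globals found in current browsers and Node. These APIs postdate this thread, and the snippet does not reproduce the V8 issue linked above; it only shows the behaviour being asked for.

```javascript
// Round trip: JavaScript string -> UTF-8 bytes -> JavaScript string.
const original = "💩 and 😁";

const bytes = new TextEncoder().encode(original);       // Uint8Array of UTF-8 bytes
const decoded = new TextDecoder("utf-8").decode(bytes);

console.log(original.length);      // 9 UTF-16 code units
console.log(bytes.length);         // 13 bytes (4 + 5 + 4)
console.log(decoded === original); // true, non-BMP characters survive

// Each non-BMP character is four UTF-8 bytes but two UTF-16 code units.
const smileyBytes = Array.from(new TextEncoder().encode("😁"), (b) => b.toString(16));
console.log(smileyBytes);          // [ 'f0', '9f', '98', '81' ]
```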
