From: Chris DeSalvo <chris.desalvo@voxer.com>
Subject: Why we can't process Emoji anymore
Date: Thu, 12 Jan 2012 18:49:20 -0800
If you are not interested in the technical details of why Emoji currently do not work in our iOS client, you can stop reading now.
Many many years ago a Japanese cell phone carrier called SoftBank came up with the idea for emoji and built it into the cell phones that it sold for their network. The problem they had was in deciding how to represent the characters in electronic form. They decided to use Unicode code points in the private use areas. This is a perfectly valid thing to do as long as your data stays completely within your product. However, with text messages the data has to interoperate with other carriers' phones.
Unfortunately SoftBank decided to copyright their entire set of images, their encoding, etc etc etc and refused to license them to anyone. So, when NTT and KDDI (two other Japanese carriers) decided that they wanted emoji they had to do their own implementations. To make things even more sad they decided not to work with each other and gang up on SoftBank. So, in Japan, there were three competing emoji standards that did not interoperate.
In late 2008 Apple released iPhone OS 2.2 and added support for the SoftBank implementation of emoji. Since SoftBank would not license their emoji out for use on networks other than their own, Apple agreed to only make the emoji keyboard visible on iPhones that were on the SoftBank network. That's why you used to have to run an ad-ware app to make that keyboard visible.
In late 2010 the Unicode Consortium released version 6.0 of the Unicode standard. (In case anyone cares, Unicode originated in 1987 as a joint research project between Xerox and Apple.) The smart Unicode folks added all of emoji (about 740 glyphs) plus the new Indian Rupee sign, more symbols needed for several African languages, and hundreds more CJK symbols for, well, Chinese/Japanese/Korean (CJK also covers Vietnamese, but now, like then, nobody gives Vietnam any credit).
With iOS 5.0 Apple (wisely) decided to adopt Unicode 6.0. The emoji keyboard was made available to all users and generates code points from their new Unicode 6.0 locations. Apple also added this support to OS X Lion.
You may be asking, "So this all sounds great. Why can't I type a smiley in Voxer and have the damn thing show up?" Glad you asked. Consider the following glyph:
😄
SMILING FACE WITH OPEN MOUTH AND SMILING EYES
Unicode: U+1F604 (U+D83D U+DE04), UTF-8: F0 9F 98 84
You can get this info for any character that OS X can render by bringing up the Character Viewer panel and right-clicking on a glyph and selecting "Copy Character Info". So, what this shows us is that for this smiley face the Unicode code point is 0x1F604. For those of you who are not hex-savvy that is the decimal number 128,516. That's a pretty big number.
The code point that SoftBank had used was 0xFB55 (or 64,341 decimal). That's a pretty tiny number. You can represent 64,341 with just 16 bits. Dealing with 16 bits is something computers do really well. To represent 0x1F604 you need 17 bits. Since bits come in 8-packs you end up using 24 total. Computers hate odd numbers and dealing with a group of 3 bytes is a real pain.
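If you want to check those numbers yourself, here is a throwaway C snippet (my own illustration, nothing from our codebase) that prints both values in decimal and counts how many bits each one actually needs:

#include <stdio.h>

int main(void) {
    unsigned int softbank = 0xFB55;   /* the SoftBank private-use value */
    unsigned int unicode6 = 0x1F604;  /* the Unicode 6.0 smiley         */
    int bits;

    for (bits = 0; (softbank >> bits) != 0; bits++) { }
    printf("0xFB55  = %u decimal, needs %d bits\n", softbank, bits);  /* 64341, 16 */

    for (bits = 0; (unicode6 >> bits) != 0; bits++) { }
    printf("0x1F604 = %u decimal, needs %d bits\n", unicode6, bits);  /* 128516, 17 */
    return 0;
}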
I have to make a side-trip now and explain Unicode character encodings. Different kinds of computer systems, and the networks that connect them, think of data in different ways. Inside of the computer the processor thinks of data in terms defined by its physical properties. An old Commodore 64 operated on one byte, 8 bits, at a time. Later computers had 16-bit hardware, then 32, and now most of the computers you will encounter on your desk prefer to operate on data 64 bits (8 bytes) at a time. Networks still like to think of data as a string of individual bytes and try to ignore any such logical groupings. To represent the entire Unicode code space you need 21 bits. That is a frustrating size. Also, if you tend to work in Latin script (English, French, Italian, etc.) where all of the codes you'll ever use fit neatly in 8 bits (the ISO Latin-1 set) then it is wasteful to have to use 24 bits (21 rounded up to the next byte boundary) because those top 16 bits will always be unused. So what do you do? You make alternate encodings.
There are many encodings, the most common being UTF-8 and UTF-16. There is also a UTF-32, but it isn't very popular since it's not space-friendly. UTF-8 has the nice property that all of the original ASCII characters preserve their encoding. So far in this email every single character I've typed (other than the smiley) has been an ASCII character and fits neatly in 7 bits. One byte per character is really friendly to work with, fits nicely in memory, and doesn't take much space on disk. If you sometimes need to represent a big character, like that smiley up there, then you do that with a multi-byte sequence. As we can see in the info above, the UTF-8 for that smiley is the 4-byte sequence [F0 9F 98 84]. Make a file with those four bytes in it and open it in any editor that is UTF-8 aware and you'll get that smiley.
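In case you are curious where those four bytes come from, here is a minimal sketch of the standard UTF-8 bit-packing for a code point in the 4-byte range (again, just an illustration):

#include <stdio.h>

int main(void) {
    unsigned int cp = 0x1F604;  /* the smiley's code point */
    unsigned char utf8[4];

    /* 4-byte UTF-8 layout: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx */
    utf8[0] = 0xF0 | (cp >> 18);
    utf8[1] = 0x80 | ((cp >> 12) & 0x3F);
    utf8[2] = 0x80 | ((cp >> 6)  & 0x3F);
    utf8[3] = 0x80 | (cp & 0x3F);

    printf("%02X %02X %02X %02X\n", utf8[0], utf8[1], utf8[2], utf8[3]);  /* F0 9F 98 84 */
    return 0;
}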
Some Unicode-aware programming languages such as Java, Objective-C, and (most) JavaScript systems use the UTF-16 encoding internally. UTF-16 has some really good properties of its own that I won't digress into here. The thing to note is that it uses 16 bits for most characters. So, whereas a small letter 'a' would be the single byte 0x61 in ASCII or UTF-8, in UTF-16 it is the 2-byte 0x0061. Note that the SoftBank 0xFB55 fits nicely in that 16-bit space. Hmm, but our smiley has a Unicode value of U+1F604 (we use U+ when throwing Unicode values around in hexadecimal) and that will NOT fit in 16 bits. Remember, we need 17. So what do we do? Well, the Unicode guys are really smart (UTF-8 is fucking brilliant, no, really!) and they invented a thing called a "surrogate pair". With a surrogate pair you can use two 16-bit values to encode a code point that is too big to fit into a single 16-bit field. Surrogate pairs have a specific bit pattern in their top bits that lets UTF-16 compliant systems know that they are a surrogate pair representing a single code point and not two separate UTF-16 code points. In the example smiley above we find that the UTF-16 surrogate pair that encodes U+1F604 is [U+D83D U+DE04]. Put those four bytes into a file and open it in any program that understands UTF-16 and you'll see that smiley. He really is quite cheery.
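The surrogate-pair math itself fits in a few lines. Here is a sketch (my example of the rule from the Unicode spec) that splits U+1F604 into its two 16-bit halves:

#include <stdio.h>

int main(void) {
    unsigned int cp = 0x1F604;
    unsigned int v  = cp - 0x10000;          /* leaves a 20-bit value  */
    unsigned int hi = 0xD800 | (v >> 10);    /* high (lead) surrogate  */
    unsigned int lo = 0xDC00 | (v & 0x3FF);  /* low (trail) surrogate  */

    printf("U+%X -> U+%04X U+%04X\n", cp, hi, lo);  /* U+1F604 -> U+D83D U+DE04 */
    return 0;
}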
So, I've already said that Objective-C and Java and (most) JavaScript systems use UTF-16 internally so we should be all cool, right? Well, see, it was that "(most)" that is the problem.
Before there was UTF-16 there was another encoding used by Java and JavaScript called UCS-2. UCS-2 is a strict 16-bit encoding. You get 16 bits per character and no more. So how do you represent U+1F604, which needs 17 bits? You don't. Period. UCS-2 has no notion of surrogate pairs. For most of that time this was OK because the Unicode Consortium hadn't defined many code points beyond the 16-bit range, so there was nothing out there to encode. But in 1996 it was clear that to encode all the CJK languages (and Vietnamese!) we'd start needing those 17+ bit code points. Sun updated Java to stop using UCS-2 as its default encoding and switched to UTF-16. NeXT did the same thing with NeXTSTEP (the precursor to iOS). Many JavaScript systems updated as well.
Now, here's what you've all been waiting for: the V8 runtime for JavaScript, which is what our node.js servers are built on, uses UCS-2 internally as its encoding and is not capable of handling any code point outside the base 16-bit range (we call that the BMP, or Basic Multilingual Plane). V8 fundamentally has no ability to represent the U+1F604 that we need to make that smiley.
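To make that concrete, here is a toy sketch (my own illustration, emphatically not V8 source) of what happens when a strict 16-bit-per-character string type is handed a supplementary code point: there is no slot big enough, so the value gets truncated or misread.

#include <stdio.h>

int main(void) {
    unsigned int   cp   = 0x1F604;               /* needs 17 bits             */
    unsigned short ucs2 = (unsigned short)cp;    /* forced into a 16-bit unit */

    printf("stored 0x%04X instead of U+%X\n", ucs2, cp);
    /* prints: stored 0xF604 instead of U+1F604 -- the smiley is lost */
    return 0;
}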
Danny confirmed this with the node guys today. Matt Ranney is going to talk to the V8 guys and see what they want to do about it.
Wow, you read through all of that? You rock. I'm humbled that you gave me so much of your attention. I feel that we've accomplished something together. Together we are now part of the tiny community of people who actually know anything about Unicode. You may have guessed by now that I am a text geek. I have had to implement java.lang.String for three separate projects. I love this stuff. If you have any questions about anything I've written here, or want more info so that you don't have to read the 670-page Unicode 6.0 core specification (there are many, many addenda as well) then please don't hesitate to hit me up.
Love,
Chris
p.s. Remember that this narrative is almost all ASCII characters, and ASCII is a subset of UTF-8. That smiley is the only non-ASCII character. In UTF-8 this email (everything up to, but not including, my signature) is 8,553 bytes. In UTF-16 it is 17,102 bytes. In UTF-32 it would be 34,204 bytes. These space considerations are one of the many reasons we have multiple encodings.
Sorry, but that’s complete rubbish. UTF-16 is the preferred representation for Unicode text for a variety of very good reasons, which is why it’s used by the canonical reference implementation, the ICU project, as well as the Java and Objective-C runtimes.
Can you name any of them? UCS-2 may have been justifiable at some point, but I can't think of any good reason for UTF-16 to exist anywhere.
@isaacs you are right, this issue has actually been solved with the migration from node 0.6.x to 0.8.x.
The only reason people are using UTF-16 (especially as a programmer-visible internal representation) is that it started out as UCS-2 in the same languages, and we are stuck with the strange Java codepoint/index APIs that most people forget to use properly, because the resulting bugs only appear in a few fringe languages (from a Western-centric viewpoint), as opposed to UTF-8, whose effects show up practically everywhere. Ironic that the emojis bring that problem back to the Western world. :-)
So psyched to learn about the history of Unicode and character encodings, especially the historical anomaly of the battle between competing Japanese wireless providers!
Hey Chris, I just stumbled upon this page while trying to understand the UTF-8 encoding.
I'm trying to write a basic UTF-8 string handling library in C (my idea is to basically define utf8_t as an unsigned char pointer).
Anyway, I've skimmed through the Unicode PDF a few times, and tried googling it about a dozen times, and I can't find any good information on how the more complex features of this encoding are represented.
So, here goes.
How are emoji (flag emoji especially) represented? Are they 2 code points? How do you know there's a following code point? I know that usually the leading byte of a code point will set the top 4 bits depending on how many bytes are in the sequence; does this work for emoji flags too?
Also, why is there sometimes a leading flag byte? How do you know when the flag byte will be separate, or part of the first coding byte?
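(For what it's worth, the sequence length in UTF-8 is determined entirely by the lead byte; here is a rough C sketch of that rule, just as an illustration with a hypothetical helper name. Flag emoji are a separate wrinkle: each flag is two regional-indicator code points, and each of those gets its own full 4-byte UTF-8 sequence.)

#include <stdio.h>

/* Hypothetical helper: how many bytes a UTF-8 sequence has, judging only by
 * its lead byte. Returns 0 for a continuation byte or an invalid lead byte. */
static int utf8_seq_len(unsigned char lead) {
    if (lead < 0x80)           return 1; /* 0xxxxxxx: plain ASCII */
    if ((lead & 0xE0) == 0xC0) return 2; /* 110xxxxx              */
    if ((lead & 0xF0) == 0xE0) return 3; /* 1110xxxx              */
    if ((lead & 0xF8) == 0xF0) return 4; /* 11110xxx: emoji land  */
    return 0;                            /* 10xxxxxx or invalid   */
}

int main(void) {
    printf("%d\n", utf8_seq_len(0xF0));  /* 4: lead byte of the smiley above */
    return 0;
}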
As for C, the wide characters (and the associated wide and multibyte string routines) in C were never intended for use with Unicode; they were intended for use in East Asian countries with pre-existing standards. They were designed with the intent that a single wide character represented something that the end user would regard as a character (i.e. something that could be processed as an individual unit); this is not true even with UCS-4, and so using the wide character routines and wchar_t for Unicode (whether your wchar_t is 16 or 32-bit) is and always has been a mistake.