From: Chris DeSalvo <chris.desalvo@voxer.com>
Subject: Why we can't process Emoji anymore
Date: Thu, 12 Jan 2012 18:49:20 -0800
Message-Id: <AE459007-DF2E-4E41-B7A4-FA5C2A83025F@voxer.com>
If you are not interested in the technical details of why Emoji currently do not work in our iOS client, you can stop reading now.
Many many years ago a Japanese cell phone carrier called SoftBank came up with the idea for emoji and built it into the cell phones that it sold for their network. The problem they had was in deciding how to represent the characters in electronic form. They decided to use Unicode code points in the private use areas. This is a perfectly valid thing to do as long as your data stays completely within your product. However, with text messages the data has to interoperate with other carriers' phones.
Unfortunately SoftBank decided to copyright their entire set of images, their encoding, etc etc etc and refused to license them to anyone. So, when NTT and KDDI (two other Japanese carriers) decided that they wanted emoji, they had to do their own implementations. To make things even more sad, they decided not to work with each other and gang up on SoftBank. So, in Japan, there were three competing emoji standards that did not interoperate.
In 2008 Apple released iOS 2.2 and added support for the SoftBank implementation of emoji. Since SoftBank would not license their emoji out for use on networks other than their own, Apple agreed to only make the emoji keyboard visible on iPhones that were on the SoftBank network. That's why you used to have to run an ad-ware app to make that keyboard visible.
Later, in 2010, the Unicode consortium released version 6.0 of the Unicode standard. (In case anyone cares, Unicode originated in 1987 as a joint research project between Xerox and Apple.) The smart Unicode folks added all of emoji (about 740 glyphs) plus the new Indian Rupee sign, more symbols needed for several African languages, and hundreds more CJK symbols for, well, Chinese/Japanese/Korean (CJK also covers Vietnamese, but now, like then, nobody gives Vietnam any credit).
With iOS 5.0, Apple (wisely) decided to adopt Unicode 6.0. The emoji keyboard was made available to all users and generates code points from their new Unicode 6.0 locations. Apple also added this support to OS X Lion.
You may be asking, "So this all sounds great. Why can't I type a smiley in Voxer and have the damn thing show up?" Glad you asked. Consider the following glyph:
😄
SMILING FACE WITH OPEN MOUTH AND SMILING EYES
Unicode: U+1F604 (U+D83D U+DE04), UTF-8: F0 9F 98 84
You can get this info for any character that OS X can render by bringing up the Character Viewer panel, right-clicking on a glyph, and selecting "Copy Character Info". So, what this shows us is that for this smiley face the Unicode code point is 0x1F604. For those of you who are not hex-savvy, that is the decimal number 128,516. That's a pretty big number.
The code point that SoftBank had used was 0xFB55 (or 64,341 decimal). That's a pretty tiny number. You can represent 64,341 with just 16 bits. Dealing with 16 bits is something computers do really well. To represent 0x1F604 you need 17 bits. Since bits come in 8-packs you end up using 24 total. Computers hate odd numbers, and dealing with a group of 3 bytes is a real pain.
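If you want to check that arithmetic yourself, here's a quick Python sketch (Python is just for illustration here; none of this is Voxer code):

    # SoftBank's old private-use code point vs. the Unicode 6.0 one.
    softbank_cp = 0xFB55   # 64,341 decimal
    unicode_cp = 0x1F604   # 128,516 decimal

    print(softbank_cp, softbank_cp.bit_length())  # 64341 16
    print(unicode_cp, unicode_cp.bit_length())    # 128516 17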
I have to make a side trip now and explain Unicode character encodings. Different kinds of computer systems, and the networks that connect them, think of data in different ways. Inside of the computer the processor thinks of data in terms defined by its physical properties. An old Commodore 64 operated on one byte, 8 bits, at a time. Later computers had 16-bit hardware, then 32, and now most of the computers you will encounter on your desk prefer to operate on data 64 bits (8 bytes) at a time. Networks still like to think of data as a string of individual bytes and try to ignore any such logical groupings. To represent the entire Unicode code space you need 21 bits. That is a frustrating size. Also, if you tend to work in Latin script (English, French, Italian, etc.) where all of the codes you'll ever use fit neatly in 8 bits (the ISO Latin-1 set), then it is wasteful to have to use 24 bits (21 rounded up to the next byte boundary) because those top 16 bits will always be unused. So what do you do? You make alternate encodings.
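(You can check that 21-bit claim the same way, since the highest Unicode code point is U+10FFFF:)

    print((0x10FFFF).bit_length())  # 21 bits covers the full Unicode code space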
There are many encodings, the most common being UTF-8 and UTF-16. There is also a UTF-32, but it isn't very popular since it's not space-friendly. UTF-8 has the nice property that all of the original ASCII characters preserve their encoding. So far in this email every single character I've typed (other than the smiley) has been an ASCII character and fits neatly in 7 bits. One byte per character is really friendly to work with, fits nicely in memory, and doesn't take much space on disk. If you sometimes need to represent a big character, like that smiley up there, then you do that with a multi-byte sequence. As we can see in the info above, the UTF-8 for that smiley is the 4-byte sequence [F0 9F 98 84]. Make a file with those four bytes in it and open it in any editor that is UTF-8 aware and you'll get that smiley.
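Here's that experiment as a Python sketch (any UTF-8-aware language would show you the same bytes):

    # Encode the smiley to UTF-8 and confirm the 4-byte sequence.
    smiley = "\U0001F604"
    data = smiley.encode("utf-8")
    print(data.hex(" "))  # f0 9f 98 84

    # Write those four bytes to a file; a UTF-8-aware editor shows the smiley.
    with open("smiley.txt", "wb") as f:
        f.write(data)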
Some Unicode-aware programming languages such as Java, Objective-C, and (most) JavaScript systems use the UTF-16 encoding internally. UTF-16 has some really good properties of its own that I won't digress into here. The thing to note is that it uses 16 bits for most characters. So, whereas a small letter 'a' would be the single byte 0x61 in ASCII or UTF-8, in UTF-16 it is the 2-byte 0x0061. Note that the SoftBank 0xFB55 fits nicely in that 16-bit space. Hmm, but our smiley has a Unicode value of U+1F604 (we use U+ when throwing Unicode values around in hexadecimal) and that will NOT fit in 16 bits. Remember, we need 17. So what do we do? Well, the Unicode guys are really smart (UTF-8 is fucking brilliant, no, really!) and they invented a thing called a "surrogate pair". With a surrogate pair you can use two 16-bit values to encode a code point that is too big to fit into a single 16-bit field. Surrogate pairs have a specific bit pattern in their top bits that lets UTF-16 compliant systems know that they are a surrogate pair representing a single code point, and not two separate UTF-16 code points. In the example smiley above we find that the UTF-16 surrogate pair that encodes U+1F604 is [U+D83D U+DE04]. Put those four bytes into a file and open it in any program that understands UTF-16 and you'll see that smiley. He really is quite cheery.
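If you want to see the bits move, here's the pairing math in Python (this is just the standard UTF-16 algorithm, nothing product-specific):

    # Derive the UTF-16 surrogate pair for U+1F604 by hand.
    cp = 0x1F604
    v = cp - 0x10000               # offset above the BMP, fits in 20 bits
    high = 0xD800 + (v >> 10)      # top 10 bits    -> 0xD83D
    low = 0xDC00 + (v & 0x3FF)     # bottom 10 bits -> 0xDE04
    print(hex(high), hex(low))     # 0xd83d 0xde04

    # Or just ask the codec (big-endian so no byte-order mark sneaks in):
    print("\U0001F604".encode("utf-16-be").hex(" "))  # d8 3d de 04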
So, I've already said that Objective-C and Java and (most) JavaScript systems use UTF-16 internally, so we should be all cool, right? Well, see, it was that "(most)" that is the problem.
Before there was UTF-16 there was another encoding used by Java and JavaScript called UCS-2. UCS-2 is a strict 16-bit encoding. You get 16 bits per character and no more. So how do you represent U+1F604, which needs 17 bits? You don't. Period. UCS-2 has no notion of surrogate pairs. For most of Unicode's history this was OK because the Unicode consortium hadn't defined many code points beyond the 16-bit range, so there was nothing out there to encode. But by 1996 it was clear that to encode all the CJK languages (and Vietnamese!) we'd start needing those 17+ bit code points. Sun updated Java to stop using UCS-2 as its default encoding and switched to UTF-16. NeXT did the same thing with NeXTSTEP (the precursor to iOS). Many JavaScript systems updated as well.
Now, here's what you've all been waiting for: the V8 runtime for JavaScript, which is what our node.js servers are built on, uses UCS-2 internally as its encoding and is not capable of handling any code point outside the base 16-bit range (we call that the BMP, or Basic Multilingual Plane). V8 fundamentally has no ability to represent the U+1F604 that we need to make that smiley.
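Python makes a handy stand-in to show what that limit looks like (the actual problem lives inside V8's string representation, not in anything we wrote):

    # A strict 16-bit (BMP-only) system has no slot for U+1F604.
    cp = 0x1F604
    print(cp <= 0xFFFF)  # False: outside the Basic Multilingual Plane

    # UTF-16 spends two code units (a surrogate pair) on it; a UCS-2 system
    # would read those same units as two separate, meaningless characters.
    units = "\U0001F604".encode("utf-16-be")
    print(len(units) // 2)  # 2 code units for a single smiley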
Danny confirmed this with the node guys today. Matt Ranney is going to talk to the V8 guys about it and see what they want to do about it.
Wow, you read through all of that? You rock. I'm humbled that you gave me so much of your attention. I feel that we've accomplished something together. Together we are now part of the tiny community of people who actually know anything about Unicode. You may have guessed by now that I am a text geek. I have had to implement java.lang.String for three separate projects. I love this stuff. If you have any questions about anything I've written here, or want more info so that you don't have to read the 670-page Unicode 6.0 core specification (there are many, many addenda as well), then please don't hesitate to hit me up.
Love,
Chris
p.s. Remember that this narrative is almost all ASCII characters, and ASCII is a subset of UTF-8. That smiley is the only non-ASCII character. In UTF-8 this email (everything up to, but not including, my signature) is 8,553 bytes. In UTF-16 it is 17,102 bytes. In UTF-32 it would be 34,204 bytes. These space considerations are one of the many reasons we have multiple encodings.
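p.p.s. If you want to reproduce those kinds of byte counts, a Python one-off does it (with a stand-in string here, since pasting the whole email into itself seemed excessive):

    # Byte counts scale differently per encoding; the -be variants skip the BOM.
    text = "smiley: \U0001F604"  # 8 ASCII characters plus one astral smiley
    for enc in ("utf-8", "utf-16-be", "utf-32-be"):
        print(enc, len(text.encode(enc)))  # utf-8 12, utf-16-be 20, utf-32-be 36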