## emoji_sad.txt
From: Chris DeSalvo <chris.desalvo@voxer.com>
Subject: Why we can't process Emoji anymore
Date: Thu, 12 Jan 2012 18:49:20 -0800
Message-Id: <AE459007-DF2E-4E41-B7A4-FA5C2A83025F@voxer.com>

--Apple-Mail=_6DEAA046-886A-4A03-8508-6FD077D18F8B
Content-Transfer-Encoding: quoted-printable
Content-Type: text/plain;
	charset=utf-8

If you are not interested in the technical details of why Emoji currently =
do not work in our iOS client, you can stop reading now.

Many many years ago a Japanese cell phone carrier called SoftBank came =
up with the idea for emoji and built it into the cell phones that it =
sold for their network.  The problem they had was in deciding how to =
represent the characters in electronic form.  They decided to use =
Unicode code points in the private use areas.  This is a perfectly valid =
thing to do as long as your data stays completely within your product.  =
However, with text messages the data has to interoperate with other =
carriers' phones.

Unfortunately SoftBank decided to copyright their entire set of images, =
their encoding, etc etc etc and refused to license them to anyone.  So, =
when NTT and KDDI (two other Japanese carriers) decided that they wanted =
emoji they had to do their own implementations.  To make things even =
more sad they decided not to work with each other and gang up on =
SoftBank.  So, in Japan, there were three competing emoji standards that =
did not interoperate.

In 2008 Apple released iOS 2.2 and added support for the SoftBank =
implementation of emoji.  Since SoftBank would not license their emoji =
out for use on networks other than their own Apple agreed to only make =
the emoji keyboard visible on iPhones that were on the SoftBank network. =
 That's why you used to have to run an ad-ware app to make that keyboard =
visible.

Later in 2010 the Unicode consortium released version 6.0 of the Unicode =
standard.  (In case anyone cares, Unicode originated in 1987 as a joint =
research project between Xerox and Apple.)  The smart Unicode folks =
added all of emoji (about 740 glyphs) plus the new Indian Rupee sign, =
more symbols needed for several African languages, and hundreds more CJK =
symbols for, well, Chinese/Japanese/Korean (CJK also covers Vietnamese, =
but now, like then, nobody gives Vietnam any credit).

With iOS 5.0 Apple (wisely) decided to adopt Unicode 6.0.  The emoji =
keyboard was made available to all users and generates code points from =
their new Unicode 6.0 locations.  Apple also added this support to OS X =
Lion.

You may be asking, "So this all sounds great.  Why can't I type a smiley =
in Voxer and have the damn thing show up?"  Glad you asked.  Consider =
the following glyph:

=F0=9F=98=84
SMILING FACE WITH OPEN MOUTH AND SMILING EYES
Unicode: U+1F604 (U+D83D U+DE04), UTF-8: F0 9F 98 84

You can get this info for any character that OS X can render by bringing =
up the Character Viewer panel and right-clicking on a glyph and =
selecting "Copy Character Info".  So, what this shows us is that for =
this smiley face the Unicode code point is 0x1F604.  For those of you =
who are not hex-savvy that is the decimal number 128,516.  That's a =
pretty big number.
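
As a rough sketch of the same lookup without the Character Viewer, the glyph line above can be decoded in Python. This assumes only what the message headers state (quoted-printable transfer encoding, UTF-8 charset); the variable names are illustrative.

```python
import quopri
import unicodedata

# The glyph line exactly as it appears in the quoted-printable body.
raw_line = b"=F0=9F=98=84"

# Undo the =XX escapes, then interpret the resulting bytes as UTF-8.
utf8_bytes = quopri.decodestring(raw_line)
smiley = utf8_bytes.decode("utf-8")

code_point = ord(smiley)

print(unicodedata.name(smiley))                 # official character name
print(f"U+{code_point:04X} = {code_point:,}")   # hex and decimal forms
print(" ".join(f"{b:02X}" for b in utf8_bytes)) # the UTF-8 bytes
```

The name, code point, and byte sequence printed here should match the three lines of character info quoted above.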

The code point that SoftBank had used was 0xFB55 (or 64,341 decimal).  =
That's a pretty tiny number.  You can represent 64,341 with just 16 =
bits.  Dealing with 16 bits is something computers do really well.  To =
represent 0x1F604 you need 17 bits.  Since bits come in 8-packs you end =
up using 24 total.  Computers hate odd numbers and dealing with a group =
of 3 bytes is a real pain.
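
The bit counts in this paragraph can be checked with a few lines of Python. This is only a sketch; `bits_and_bytes` is a hypothetical helper name, and it treats each code point as a plain unsigned integer rather than as any particular encoding.

```python
def bits_and_bytes(code_point: int) -> tuple[int, int]:
    """Return (bits needed, whole bytes needed) for a plain unsigned integer."""
    bits = code_point.bit_length()
    whole_bytes = (bits + 7) // 8   # round up to the next byte boundary
    return bits, whole_bytes

for label, cp in (("SoftBank", 0xFB55), ("Unicode 6.0", 0x1F604)):
    bits, whole_bytes = bits_and_bytes(cp)
    print(f"{label}: 0x{cp:X} = {cp:,} needs {bits} bits ({whole_bytes} bytes)")
```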

I have to make a side-trip now and explain Unicode character encodings.  =
Different kinds of computer systems, and the networks that connect them, =
think of data in different ways.  Inside of the computer the processor =
thinks of data in terms defined by its physical properties.  An old =
Commodore 64 operated on one byte, 8 bits, at a time.  Later computers =
had 16-bit hardware, then 32, and now most of the computers you will =
encounter on your desk prefer to operate on data 64-bits (8 bytes) at a =
time.  Networks still like to think of data as a string of individual =
bytes and try to ignore any such logical groupings.  To represent the =
entire Unicode code space you need 21 bits.  That is a frustrating size. =
 Also, if you tend to work in Latin script (English, French, Italian, =
etc) where all of the codes you'll ever use fit neatly in 8 bits (the =
ISO Latin-1 set) then it is wasteful to have to use 24 bits (21 rounded =
up to the next byte boundary) because those top 16 bits will always be =
unused.  So what do you do?  You make alternate encodings.

There are many encodings, the most common being UTF-8 and UTF-16.  There =
is also a UTF-32, but it isn't very popular since it's not =
space-friendly.  UTF-8 has the nice property that all of the original =
ASCII characters preserve their encoding.  So far in this email every =
single character I've typed (other than the smiley) has been an ASCII =
character and fits neatly in 7 bits.  One byte per character is really =
friendly to work with, fits nicely in memory, and doesn't take much =
space on disk.  If you sometimes need to represent a big character, like =
that smiley up there, then you do that with a multi-byte sequence.  As =
we can see in the info above the UTF-8 for that smiley is the 4-byte =
sequence [F0 9F 98 84].  Make a file with those four bytes in it and open =
it in any editor that is UTF-8 aware and you'll get that smiley.
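
To make the multi-byte sequence less mysterious, here is a minimal hand-rolled UTF-8 encoder, written as a sketch for illustration only (real code should just call `str.encode`). The function name `utf8_encode` is an assumption, not something from the original message.

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one Unicode code point as UTF-8 by hand."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError(f"not an encodable code point: {cp:#x}")
    if cp < 0x80:                      # 7 bits: ASCII, unchanged
        return bytes([cp])
    if cp < 0x800:                     # up to 11 bits: 2 bytes
        return bytes([0xC0 | (cp >> 6),
                      0x80 | (cp & 0x3F)])
    if cp < 0x10000:                   # up to 16 bits: 3 bytes
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    return bytes([0xF0 | (cp >> 18),   # up to 21 bits: 4 bytes
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])

print(utf8_encode(ord("a")).hex())   # ASCII stays a single byte
print(utf8_encode(0x1F604).hex())    # the smiley takes four bytes
```

The lead byte announces how many bytes follow, and every continuation byte starts with the bits `10`, which is what lets a decoder resynchronise in the middle of a stream.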

Some Unicode-aware programming languages such as Java, Objective-C, and =
(most) JavaScript systems use the UTF-16 encoding internally.  UTF-16 =
has some really good properties of its own that I won't digress into =
here.  The thing to note is that it uses 16 bits for most characters.  =
So, whereas a small letter 'a' would be the single byte 0x61 in ASCII or =
UTF-8, in UTF-16 it is the 2-byte 0x0061.  Note that the SoftBank 0xFB55 =
fits nicely in that 16-bit space.  Hmm, but our smiley has a Unicode =
value of U+1F604 (we use U+ when throwing Unicode values around in =
hexadecimal) and that will NOT fit in 16 bits.  Remember, we need 17.  =
So what do we do?  Well, the Unicode guys are really smart (UTF-8 is =
fucking brilliant, no, really!) and they invented a thing called a =
"surrogate pair".  With a surrogate pair you can use two 16-bit values =
to encode that code point that is too big to fit into a single 16-bit =
field.  Surrogate pairs have a specific bit pattern in their top bits =
that lets UTF-16 compliant systems know that they are a surrogate pair =
that represents a single code point and not two separate 16-bit =
characters.  In the example smiley above we find that the UTF-16 surrogate =
pair that encodes U+1F604 is [U+D83D U+DE04].  Put those four bytes into =
a file and open it in any program that understands UTF-16 and you'll see =
that smiley.  He really is quite cheery.
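
The surrogate pair arithmetic can be sketched as follows. The helper names `to_surrogate_pair` and `from_surrogate_pair` are hypothetical; the constants are the ones fixed by UTF-16 (high surrogates start at 0xD800, low surrogates at 0xDC00, ten payload bits each).

```python
def to_surrogate_pair(cp: int) -> tuple[int, int]:
    """Split a code point above the 16-bit range into (high, low) surrogates."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("only code points above 0xFFFF need a surrogate pair")
    offset = cp - 0x10000              # leaves a 20-bit value
    high = 0xD800 + (offset >> 10)     # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits
    return high, low

def from_surrogate_pair(high: int, low: int) -> int:
    """Recombine a (high, low) surrogate pair into the original code point."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

high, low = to_surrogate_pair(0x1F604)
print(f"{high:04X} {low:04X}")

# The four bytes you would put in a (big-endian, BOM-less) UTF-16 file.
utf16_bytes = high.to_bytes(2, "big") + low.to_bytes(2, "big")
print(utf16_bytes == "\U0001F604".encode("utf-16-be"))
```

The top bits 110110 and 110111 are the "specific bit pattern" mentioned above: no ordinary 16-bit character is assigned in the range 0xD800 to 0xDFFF, so a decoder can always tell a surrogate from a stand-alone character.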

So, I've already said that Objective-C and Java and (most) JavaScript =
systems use UTF-16 internally so we should be all cool, right?  Well, =
see, it was that "(most)" that is the problem.

Before there was UTF-16 there was another encoding used by Java and =
JavaScript called UCS-2.  UCS-2 is a strict 16-bit encoding.  You get 16 =
bits per character and no more.  So how do you represent U+1F604 which =
needs 17 bits?  You don't.  Period.  UCS-2 has no notion of surrogate =
pairs.  For a long time this was ok because the Unicode consortium =
hadn't defined many code points beyond the 16 bit range so there was =
nothing out there to encode.  But in 1996 it was clear that to encode =
all the CJK languages (and Vietnamese!) we'd start needing those =
17+ bit code points.  SUN updated Java to stop using UCS-2 as its =
default encoding and switched to UTF-16.  NeXT did the same thing with =
NeXTSTEP (the precursor to iOS).  Many JavaScript systems updated as =
well.

Now, here's what you've all been waiting for:  the V8 runtime for =
JavaScript, which is what our node.js servers are built on, uses UCS-2 =
internally as its encoding and is not capable of handling any code =
point outside the base 16 bit range (we call that the BMP, or Basic =
Multilingual Plane).  V8 fundamentally has no ability to represent the =
U+1F604 that we need to make that smiley.
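
As a toy model of the limitation described here (and emphatically not V8's actual internals), a strict UCS-2 encoder can be sketched in Python. The function name `ucs2_encode` is hypothetical; the point is only that a fixed 16-bit-per-character scheme has nowhere to put a code point above 0xFFFF.

```python
def ucs2_encode(text: str) -> bytes:
    """Encode text as strict big-endian UCS-2: exactly 16 bits per character."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp > 0xFFFF:
            # Outside the Basic Multilingual Plane: no 16-bit slot exists.
            raise ValueError(f"U+{cp:04X} does not fit in UCS-2")
        out += cp.to_bytes(2, "big")
    return bytes(out)

print(ucs2_encode("a").hex())        # fits: one 16-bit unit
print(ucs2_encode("\uFB55").hex())   # the SoftBank-era value also fits

try:
    ucs2_encode("\U0001F604")        # the Unicode 6.0 smiley does not
except ValueError as err:
    print("rejected:", err)
```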

Danny confirmed this with the node guys today.  Matt Ranney is going to =
talk to the V8 guys about it and see what they want to do about it.

Wow, you read through all of that?  You rock.  I'm humbled that you gave =
me so much of your attention.  I feel that we've accomplished something =
together.  Together we are now part of the tiny community of people who =
actually know anything about Unicode.  You may have guessed by now that =
I am a text geek.  I have had to implement java.lang.String for three =
separate projects.  I love this stuff.  If you have any questions about =
anything I've written here, or want more info so that you don't have to =
read the 670 page Unicode 6.0 core specification (there are many, many =
addenda as well) then please don't hesitate to hit me up.

Love,
Chris

p.s.  Remember that this narrative is almost all ASCII characters, and =
ASCII is a subset of UTF-8.  That smiley is the only non-ASCII =
character.  In UTF-8 this email (everything up to, but not including my =
signature) is 8,553 bytes.  In UTF-16 it is 17,102 bytes.  In UTF-32 it =
would be 34,204 bytes.  These space considerations are one of the many =
reasons we have multiple encodings.=
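
The same kind of size comparison can be sketched for any text. The sample string below is just the subject line plus the smiley, not the full email, and `encoded_sizes` is a hypothetical helper. The `-le` codec variants are used so that Python does not prepend a byte order mark, which would otherwise add 2 bytes (UTF-16) or 4 bytes (UTF-32) to the totals.

```python
def encoded_sizes(text: str) -> dict[str, int]:
    """Return the byte length of text in UTF-8, UTF-16 and UTF-32 (no BOM)."""
    return {
        "UTF-8": len(text.encode("utf-8")),
        "UTF-16": len(text.encode("utf-16-le")),
        "UTF-32": len(text.encode("utf-32-le")),
    }

sample = "Why we can't process Emoji anymore \U0001F604"
sizes = encoded_sizes(sample)

for name, size in sizes.items():
    print(f"{name}: {size} bytes")
```

For mostly-ASCII text like this, UTF-16 comes out at roughly twice the UTF-8 size and UTF-32 at roughly four times.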