ahw/what-is-utf-8

## what-is-utf-8
What is UTF-8?
==============
UCS and Unicode are first of all just code tables that assign integer numbers
to characters. There are several ways to represent a sequence of such
characters, or their integer values, as a sequence of bytes. The two most
obvious encodings store Unicode text as sequences of 2-byte or 4-byte units.
The official terms for these encodings are UCS-2 and UCS-4, respectively.
Unless otherwise specified, the most significant byte comes first
(big-endian convention). An ASCII or Latin-1 file can be transformed into a
UCS-2 file by simply inserting a 0x00 byte in front of every byte. For a
UCS-4 file, three 0x00 bytes are inserted before every byte instead.
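
The zero-padding described above can be checked with a small sketch. Python
has no codec named UCS-2 or UCS-4, so this example assumes that "utf-16-be"
and "utf-32-be" are acceptable stand-ins; they produce the same bytes for the
ASCII and Latin-1 characters used here. The helper `pad_with_zeros` is a
hypothetical name introduced only for illustration.

```python
def pad_with_zeros(data: bytes, width: int) -> bytes:
    """Prefix every input byte with (width - 1) zero bytes."""
    return b"".join(b"\x00" * (width - 1) + bytes([b]) for b in data)


text = "Hi!"
ascii_bytes = text.encode("ascii")
ucs2_bytes = text.encode("utf-16-be")  # stand-in for big-endian UCS-2
ucs4_bytes = text.encode("utf-32-be")  # stand-in for big-endian UCS-4

print(ascii_bytes.hex(" "))  # 48 69 21
print(ucs2_bytes.hex(" "))   # 00 48 00 69 00 21
print(ucs4_bytes.hex(" "))   # 00 00 00 48 00 00 00 69 00 00 00 21

# Manual padding gives the same result as the codecs.
print(pad_with_zeros(ascii_bytes, 2) == ucs2_bytes)  # True
print(pad_with_zeros(ascii_bytes, 4) == ucs4_bytes)  # True
```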

Using UCS-2 (or UCS-4) under Unix would lead to very severe problems. Strings
in these encodings can contain, as parts of many wide characters, bytes such
as “\0” or “/”, which have a special meaning in filenames and other C library
function parameters. In addition, the majority of Unix tools expect ASCII
files and cannot read 16-bit words as characters without major modifications.
For these reasons, UCS-2 is not a suitable external encoding of Unicode in
filenames, text files, environment variables, etc.
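
As a rough illustration of the problem, the sketch below looks for NUL and
“/” bytes inside the encoded form of a few characters. It assumes
"utf-16-be" as a stand-in for big-endian UCS-2 (the two agree for these
characters), and `problem_bytes` is a hypothetical helper name.

```python
def problem_bytes(data: bytes) -> list:
    """List the bytes in data that Unix file APIs treat specially."""
    found = []
    if 0x00 in data:
        found.append("NUL")
    if 0x2F in data:
        found.append("'/'")
    return found


# U+0041 (A), U+012F (i with ogonek), U+2F00 (a Kangxi radical)
for ch in ["A", "\u012f", "\u2f00"]:
    ucs2 = ch.encode("utf-16-be")
    utf8 = ch.encode("utf-8")
    print(f"U+{ord(ch):04X}  UCS-2: {ucs2.hex(' ')} {problem_bytes(ucs2)}"
          f"  UTF-8: {utf8.hex(' ')} {problem_bytes(utf8)}")
# U+0041  UCS-2: 00 41 ['NUL']  UTF-8: 41 []
# U+012F  UCS-2: 01 2f ["'/'"]  UTF-8: c4 af []
# U+2F00  UCS-2: 2f 00 ['NUL', "'/'"]  UTF-8: e2 bc 80 []
```

None of these characters is a slash or a NUL, yet their UCS-2 forms contain
those byte values; the UTF-8 forms do not.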

The UTF-8 encoding, defined in ISO 10646-1:2000 Annex D and also described in
RFC 3629 as well as section 3.9 of the Unicode 4.0 standard, does not have
these problems. It is clearly the way to go for using Unicode under Unix-style
operating systems.

UTF-8 has the following properties:

- UCS characters U+0000 to U+007F (ASCII) are encoded simply as bytes 0x00
  to 0x7F (ASCII compatibility). This means that files and strings which
  contain only 7-bit ASCII characters have the same encoding under both
  ASCII and UTF-8.

- All UCS characters greater than U+007F are encoded as a sequence of
  several bytes, each of which has the most significant bit set. Therefore,
  no ASCII byte (0x00-0x7F) can appear as part of any other character.

- The first byte of a multibyte sequence that represents a non-ASCII
  character is always in the range 0xC0 to 0xFD and it indicates how many
  bytes follow for this character. All further bytes in a multibyte sequence
  are in the range 0x80 to 0xBF. This allows easy resynchronization and
  makes the encoding stateless and robust against missing bytes.

- All possible 2^31 UCS codes can be encoded.

- UTF-8 encoded characters may theoretically be up to six bytes long;
  however, 16-bit BMP characters are at most three bytes long. RFC 3629
  restricts UTF-8 to U+0000 through U+10FFFF, which needs at most four
  bytes.

- The sorting order of big-endian UCS-4 byte strings is preserved.

- The bytes 0xFE and 0xFF are never used in the UTF-8 encoding.
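
Several of the properties listed above can be checked with a short sketch.
The encoder below is an illustration written for this note, not a reference
implementation: it follows the original one-to-six-byte layout covering all
31-bit values, and it does not reject surrogates or other values that a
strict encoder would refuse. Python's built-in codec follows RFC 3629 and
stops at U+10FFFF, so the largest sample can only be produced by the sketch.
The names `utf8_encode` and `next_char_start` are hypothetical.

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one code point with the original 1-to-6-byte UTF-8 layout."""
    if not 0 <= cp < 2**31:
        raise ValueError("code point outside the 31-bit UCS range")
    if cp < 0x80:
        return bytes([cp])  # ASCII: a single byte, unchanged
    # (sequence length, exclusive upper bound, lead-byte marker)
    layouts = (
        (2, 0x800, 0xC0),
        (3, 0x10000, 0xE0),
        (4, 0x200000, 0xF0),
        (5, 0x4000000, 0xF8),
        (6, 0x80000000, 0xFC),
    )
    length, marker = next((n, m) for n, limit, m in layouts if cp < limit)
    tail = []
    for _ in range(length - 1):
        tail.append(0x80 | (cp & 0x3F))  # continuation byte: 10xxxxxx
        cp >>= 6
    return bytes([marker | cp] + tail[::-1])


def next_char_start(data: bytes, pos: int) -> int:
    """Skip continuation bytes (0x80-0xBF) to find the next character start."""
    while pos < len(data) and 0x80 <= data[pos] <= 0xBF:
        pos += 1
    return pos


samples = [0x41, 0xE9, 0x20AC, 0x1F600, 0x7FFFFFFF]
for cp in samples:
    print(f"U+{cp:04X} -> {utf8_encode(cp).hex(' ')}")
# U+0041 -> 41
# U+00E9 -> c3 a9
# U+20AC -> e2 82 ac
# U+1F600 -> f0 9f 98 80
# U+7FFFFFFF -> fd bf bf bf bf bf

# Sorting by UTF-8 bytes gives the same order as sorting by big-endian UCS-4.
by_utf8 = sorted(reversed(samples), key=utf8_encode)
by_ucs4 = sorted(reversed(samples), key=lambda cp: cp.to_bytes(4, "big"))
print(by_utf8 == by_ucs4)  # True

# Resynchronization: starting in the middle of the euro sign, skip ahead
# to the next byte that begins a character.
stream = "a\u20acb".encode("utf-8")  # 61 e2 82 ac 62
print(next_char_start(stream, 2))    # 4
```

Because continuation bytes are recognizable on their own, a reader that
lands in the middle of a character needs no earlier state to find the next
character boundary.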

The official name and spelling of this encoding is UTF-8, where UTF stands
for UCS Transformation Format. Please do not write UTF-8 in any
documentation text in other ways (such as utf8 or UTF_8), unless of
course you refer to a variable name and not the encoding itself.

Source: http://www.cl.cam.ac.uk/~mgk25/unicode.html#utf-8
See also: http://www.joelonsoftware.com/articles/Unicode.html