The core unit of Unicode is the Unicode Scalar Value. It is the basic building block of Unicode text and the most reliable unit for processing text data.
A Unicode Scalar Value is any Code Point except the surrogate code points (U+D800 to U+DFFF).
A Code Unit is the basic unit of an encoding. Each encoding defines it differently: 8 bits in UTF-8, 16 bits in UTF-16, 32 bits in UTF-32.
A Grapheme Cluster is the smallest unit that represents a human-recognizable symbol.
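
The difference between these units can be made concrete with a small Python sketch. It is only an illustration: `is_scalar_value` and `code_units` are hypothetical helper names, not part of any library, and the standard library has no grapheme cluster segmentation, so the cluster count is stated in a comment rather than computed.

```python
def is_scalar_value(code_point: int) -> bool:
    """True for any code point outside the surrogate range U+D800..U+DFFF."""
    return 0 <= code_point <= 0x10FFFF and not (0xD800 <= code_point <= 0xDFFF)


def code_units(text: str, encoding: str, unit_size: int) -> list:
    """Encode `text` and split the bytes into code units of `unit_size` bytes."""
    data = text.encode(encoding)
    return [
        int.from_bytes(data[i:i + unit_size], "little")
        for i in range(0, len(data), unit_size)
    ]


# "e" + COMBINING ACUTE ACCENT + GRINNING FACE: two grapheme clusters,
# three code points (all of them scalar values).
sample = "e\u0301\U0001F600"

code_points = [ord(ch) for ch in sample]
utf8_units = code_units(sample, "utf-8", 1)       # 8-bit code units
utf16_units = code_units(sample, "utf-16-le", 2)  # 16-bit code units
utf32_units = code_units(sample, "utf-32-le", 4)  # 32-bit code units

print(len(code_points), len(utf8_units), len(utf16_units), len(utf32_units))
# prints: 3 7 4 3
print([hex(u) for u in utf16_units])
# prints: ['0x65', '0x301', '0xd83d', '0xde00']
```

The last line shows the surrogate pair `0xd83d 0xde00`: two UTF-16 Code Units that together encode the single Scalar Value U+1F600.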
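- A Grapheme Cluster is built from one or more Scalar Values.
- A Scalar Value is a single Code Point. Surrogate code points are the exception: they are Code Points but not Scalar Values, and exist only so that UTF-16 can encode a supplementary Code Point as a surrogate pair.
- A Code Point is encoded as one or more Code Units. How Code Units are composed is defined by each transformation format (UTF).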
- UTF-8 uses a single Code Unit (one byte) for each ASCII character.
- UTF-8 is the most stable and reliable encoding because its Code Units are single bytes, so there is no byte-order ambiguity.
- UTF-8 provides maximum Unix compatibility.
- Use UTF-8 everywhere when you write new software. It is the best encoding ever invented.
- UTF-16 is ambiguous and needlessly complex. Avoid it as much as possible.
- UTF-16 is divided into UTF-16LE and UTF-16BE, and there is no reliable way to tell them apart when the data carries no byte order mark (BOM).
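- UCS-2 is a fixed-width 16-bit encoding that covers only the code points defined in the BMP.
- UCS-2 is roughly equal to UTF-16 without surrogate pairs.
- UCS-2 is not a reliable encoding, because it cannot represent code points outside the BMP.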
- Currently (2014), UTF-32 is exactly equal to UCS-4, and maps to code points one-to-one.
- UTF-32's 1:1 mapping holds because the Unicode codespace ends at U+10FFFF, which fits in a single 32-bit Code Unit. The mapping is to Code Points only, not to Grapheme Clusters.
- Win32 and Cocoa (`NSString`) both use UTF-16 as their default internal representation, because both are old designs that predate Unicode's growth beyond 16 bits.
- `NSString` is UTF-16 encoded by default, but this is not guaranteed.
- It is almost impossible to reliably convert an `NSRange` based on UTF-16 code-unit counts into Swift's `String.Index`, because there is no good way to select a grapheme cluster range.
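
Part of the difficulty in the last point can be shown outside Swift. The Python sketch below maps a UTF-16 code-unit offset to a code-point index. `utf16_offset_to_index` is a hypothetical helper written for this note, not an existing API. It works at Code Point granularity only, so it rejects offsets that split a surrogate pair but says nothing about whether the result lands on a Grapheme Cluster boundary, which is the harder problem.

```python
def utf16_offset_to_index(text: str, offset: int) -> int:
    """Map a UTF-16 code-unit offset to a code-point index in `text`.

    Raises ValueError if the offset is negative, past the end of the text,
    or falls between the two halves of a surrogate pair.
    """
    if offset < 0:
        raise ValueError("offset must not be negative")
    units = 0  # UTF-16 code units consumed so far
    for index, ch in enumerate(text):
        if units == offset:
            return index
        # Code points above U+FFFF take two UTF-16 code units.
        units += 2 if ord(ch) > 0xFFFF else 1
        if units > offset:
            raise ValueError("offset splits a surrogate pair")
    if units == offset:
        return len(text)
    raise ValueError("offset is past the end of the text")


text = "a\U0001F600b"  # "a", GRINNING FACE, "b": 3 code points, 4 UTF-16 units

print(utf16_offset_to_index(text, 1))  # prints: 1
print(utf16_offset_to_index(text, 3))  # prints: 2

try:
    utf16_offset_to_index(text, 2)     # between the two surrogates
except ValueError as error:
    print(error)                       # prints: offset splits a surrogate pair
```

Even when this conversion succeeds, the resulting index may still fall inside a Grapheme Cluster (for example between a base letter and a combining mark), so a range converted this way can cut a human-recognizable symbol in half.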