eonil/gist:4a457009cc8ac2ca9cb7

## gistfile1.md

      
Unicode Note

The core unit of Unicode is the Unicode Scalar Value. It is the basic component from which Unicode text is built, and it is the most reliable unit for processing text data.
A Unicode Scalar Value is any Code Point except the surrogate code points (U+D800 to U+DFFF).
A Code Unit is the fixed-size unit an encoding uses to store text. Each encoding defines its own: 8 bits in UTF-8, 16 bits in UTF-16, 32 bits in UTF-32.
A Grapheme Cluster is the smallest unit that represents a human-recognizable symbol.
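
The difference between a Code Point and a Scalar Value can be written as a pair of range checks. This is a minimal sketch; the function and constant names are made up for illustration and do not come from any library.

```python
# Hypothetical helper names, chosen for this note only.
SURROGATE_FIRST = 0xD800
SURROGATE_LAST = 0xDFFF
MAX_CODE_POINT = 0x10FFFF


def is_code_point(value):
    """Any integer in the Unicode codespace is a code point."""
    return 0 <= value <= MAX_CODE_POINT


def is_scalar_value(value):
    """A scalar value is a code point that is not a surrogate."""
    return is_code_point(value) and not (SURROGATE_FIRST <= value <= SURROGATE_LAST)


print(is_scalar_value(0x41))     # True: 'A'
print(is_scalar_value(0xD800))   # False: a surrogate code point
print(is_scalar_value(0x1F600))  # True: outside the BMP, still a scalar value
```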


A Grapheme Cluster is built from one or more Scalar Values.


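A Scalar Value is always exactly one Code Point. Surrogates are the exception case: a surrogate Code Point is not a Scalar Value, and in UTF-16 a pair of surrogate Code Units encodes a single Scalar Value above U+FFFF.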


A Code Point is encoded as one or more Code Units. How Code Units are composed is defined by each transformation format (UTF-8, UTF-16, UTF-32).
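
The layering above can be observed by counting units at each level. The sketch below assumes Python 3, where `len()` on a `str` counts code points; `code_unit_counts` is a hypothetical helper. Python's standard library has no grapheme cluster segmentation, so the cluster count is stated only in the comments.

```python
def code_unit_counts(text):
    """Count scalar values and code units of `text` in each encoding form."""
    return {
        "scalar_values": len(text),
        "utf8_units": len(text.encode("utf-8")),          # 8-bit code units
        "utf16_units": len(text.encode("utf-16-le")) // 2,  # 16-bit code units
        "utf32_units": len(text.encode("utf-32-le")) // 4,  # 32-bit code units
    }


# One grapheme cluster built from two scalar values: 'e' + COMBINING ACUTE ACCENT.
print(code_unit_counts("e\u0301"))
# {'scalar_values': 2, 'utf8_units': 3, 'utf16_units': 2, 'utf32_units': 2}

# One grapheme cluster built from one scalar value outside the BMP.
print(code_unit_counts("\U0001F600"))
# {'scalar_values': 1, 'utf8_units': 4, 'utf16_units': 2, 'utf32_units': 1}
```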


UTF-8 uses one Code Unit for ASCII characters.


UTF-8 is the most stable and reliable encoding because it has no byte-order ambiguity: its Code Units are single bytes.


UTF-8 provides maximum Unix compatibility.
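
The UTF-8 properties listed above can be checked directly. This is a rough sketch with hypothetical helper names. It shows that ASCII text is byte-for-byte identical in UTF-8, and that a non-ASCII character never produces a byte in the ASCII range, which is why bytes such as `/` or NUL cannot appear by accident inside a multi-byte sequence.

```python
def utf8_lengths(text):
    """Number of UTF-8 code units (bytes) used by each character."""
    return [len(ch.encode("utf-8")) for ch in text]


def non_ascii_never_yields_ascii_bytes(text):
    """True if no non-ASCII character encodes to any byte below 0x80."""
    for ch in text:
        if ord(ch) >= 0x80 and any(b < 0x80 for b in ch.encode("utf-8")):
            return False
    return True


sample = "A\u00e9\u20ac\U0001F600"
print(utf8_lengths(sample))  # [1, 2, 3, 4]

command = "ls -l /tmp"
print(command.encode("utf-8") == command.encode("ascii"))  # True

print(non_ascii_never_yields_ascii_bytes("na\u00efve/\u8def\u5f84/\U0001F600"))  # True
```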


Use UTF-8 wherever you write new software. It is the best encoding ever invented.


UTF-16 is ambiguous and needlessly complex. It should be avoided as much as possible.


UTF-16 is divided into UTF-16LE and UTF-16BE, and without a byte order mark (BOM) or out-of-band information there is no reliable way to tell them apart.
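
The byte-order problem is easy to reproduce. The sketch below uses only Python's built-in codecs: the same two bytes decode to different text depending on the assumed byte order, and only a leading BOM lets the generic `utf-16` decoder pick the right one.

```python
import codecs

big_endian = "A".encode("utf-16-be")     # b'\x00A'
little_endian = "A".encode("utf-16-le")  # b'A\x00'

# Decoding with the wrong byte order raises no error; it silently yields other text.
misread = big_endian.decode("utf-16-le")
print(hex(ord(misread)))  # 0x4100, a CJK ideograph instead of 'A'

# With a BOM in front, the generic decoder detects the byte order and strips the BOM.
with_bom = codecs.BOM_UTF16_BE + big_endian
print(with_bom.decode("utf-16"))  # A
```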


UCS-2 is a fixed-width 16-bit encoding that covers only the code points defined in the BMP (U+0000 to U+FFFF).


UCS-2 is roughly equal to UTF-16 without surrogate pairs.


UCS-2 is not a reliable encoding, because it cannot represent code points above U+FFFF.
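
The gap between UCS-2 and UTF-16 is the surrogate pair arithmetic. The following sketch uses illustrative helper names, not a library API. It splits a scalar value into UTF-16 code units by hand and compares the result with Python's own encoder.

```python
import struct


def fits_in_ucs2(code_point):
    """UCS-2 has one 16-bit unit per character, so it stops at U+FFFF."""
    return 0 <= code_point <= 0xFFFF


def utf16_code_units(code_point):
    """Encode one scalar value as a list of UTF-16 code units."""
    if not (0 <= code_point <= 0x10FFFF) or 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("not a Unicode scalar value")
    if code_point <= 0xFFFF:
        return [code_point]
    offset = code_point - 0x10000        # 20 bits remain
    high = 0xD800 + (offset >> 10)       # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits
    return [high, low]


print(fits_in_ucs2(0x20AC))    # True: EURO SIGN is in the BMP
print(fits_in_ucs2(0x1F600))   # False: needs a surrogate pair in UTF-16

units = utf16_code_units(0x1F600)
print([hex(u) for u in units])  # ['0xd83d', '0xde00']

# Cross-check against the built-in encoder.
encoded = "\U0001F600".encode("utf-16-le")
print(list(struct.unpack("<2H", encoded)) == units)  # True
```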


Currently (2014), UTF-32 is exactly equal to UCS-4, and its code units map 1:1 to code points.


UTF-32's 1:1 mapping applies to code points only, not to Grapheme Clusters. It holds because the Unicode codespace is limited to U+10FFFF, which fits in a single 32-bit Code Unit.
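
The relationship between UTF-32 code units and code points can be checked by unpacking the encoded bytes as 32-bit integers. This is a small sketch; `utf32_code_units` is a hypothetical helper name.

```python
import struct


def utf32_code_units(text):
    """Return the UTF-32 code units of `text` as a list of integers."""
    data = text.encode("utf-32-le")
    return list(struct.unpack("<%dI" % (len(data) // 4), data))


sample = "A\u20ac\U0001F600"
print(utf32_code_units(sample) == [ord(ch) for ch in sample])  # True

# One code unit per code point is not one code unit per grapheme cluster:
# 'e' + COMBINING ACUTE ACCENT still takes two UTF-32 code units.
print(len(utf32_code_units("e\u0301")))  # 2
```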


Win32 and Cocoa (NSString) both use UTF-16 as their default internal representation, because they are old: they were designed when Unicode still fit in 16 bits.


NSString exposes UTF-16 code units by default, but its internal storage is not guaranteed to be UTF-16.


It is almost impossible (as of 2014) to reliably convert an NSRange, which counts UTF-16 code units, into Swift's String.Index, because there is no good way to select a grapheme cluster range from code unit offsets.
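
The difficulty can be illustrated outside Swift. The sketch below is a Python analogue, not the Swift or Foundation API. It maps a UTF-16 code unit offset to a code point index and fails when the offset lands inside a surrogate pair. The helper name is hypothetical, and the result is a code point index only. Aligning it to grapheme cluster boundaries would additionally need Unicode segmentation rules, which Python's standard library does not provide.

```python
def utf16_offset_to_index(text, utf16_offset):
    """Map a UTF-16 code unit offset to a code point index in `text`."""
    if utf16_offset < 0:
        raise ValueError("offset must not be negative")
    units_seen = 0
    for index, ch in enumerate(text):
        if units_seen == utf16_offset:
            return index
        # Characters outside the BMP take two UTF-16 code units.
        units_seen += 2 if ord(ch) > 0xFFFF else 1
        if units_seen > utf16_offset:
            raise ValueError("offset splits a surrogate pair")
    if units_seen == utf16_offset:
        return len(text)
    raise ValueError("offset is past the end of the text")


text = "a\U0001F600b"  # UTF-16 code units: 1 + 2 + 1
print(utf16_offset_to_index(text, 1))  # 1
print(utf16_offset_to_index(text, 3))  # 2
try:
    utf16_offset_to_index(text, 2)
except ValueError as error:
    print(error)  # offset splits a surrogate pair
```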