@eonil
Last active August 29, 2015 14:12
Unicode Note

The core unit of Unicode is the Unicode Scalar Value. It is the basic component from which Unicode text is built, and it is the most reliable unit for processing text data.

A Unicode Scalar Value is any Code Point except the surrogate Code Points (U+D800 to U+DFFF), which are reserved for UTF-16 surrogate pairs.
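
The distinction between Code Points and Scalar Values can be sketched in Python. This is a minimal illustration, not a library API; the function name `is_scalar_value` is made up for this note.

```python
def is_scalar_value(code_point: int) -> bool:
    """Return True if code_point is a Unicode Scalar Value.

    Scalar Values are all Code Points in U+0000..U+10FFFF
    except the surrogate range U+D800..U+DFFF.
    """
    in_code_space = 0x0000 <= code_point <= 0x10FFFF
    is_surrogate = 0xD800 <= code_point <= 0xDFFF
    return in_code_space and not is_surrogate


print(is_scalar_value(0x0041))   # True:  'A'
print(is_scalar_value(0x1F600))  # True:  an emoji outside the BMP
print(is_scalar_value(0xD83D))   # False: a high surrogate
```

Python's own `str` can hold a lone surrogate such as `"\ud83d"`, but encoding it with `.encode("utf-8")` raises `UnicodeEncodeError`, which reflects the same rule.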

A Code Unit is the basic unit of an encoding, and each encoding defines it differently: 8 bits in UTF-8, 16 bits in UTF-16, 32 bits in UTF-32.
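
To make Code Units concrete, here is a small Python sketch that counts how many Code Units one string needs in each encoding. The helper name `code_unit_counts` is an assumption of this note; the byte lengths are divided by the Code Unit size of each encoding.

```python
def code_unit_counts(text: str) -> dict:
    """Count the Code Units needed to encode text in each encoding."""
    return {
        "UTF-8": len(text.encode("utf-8")),            # 8-bit Code Units
        "UTF-16": len(text.encode("utf-16-le")) // 2,  # 16-bit Code Units
        "UTF-32": len(text.encode("utf-32-le")) // 4,  # 32-bit Code Units
    }


print(code_unit_counts("A"))           # UTF-8: 1, UTF-16: 1, UTF-32: 1
print(code_unit_counts("\u00e9"))      # UTF-8: 2, UTF-16: 1, UTF-32: 1
print(code_unit_counts("\ud55c"))      # UTF-8: 3, UTF-16: 1, UTF-32: 1
print(code_unit_counts("\U0001F600"))  # UTF-8: 4, UTF-16: 2, UTF-32: 1
```

Each sample is a single Code Point; only the number of Code Units changes with the encoding.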

A Grapheme Cluster is the smallest unit that represents a human-recognizable symbol.
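
A short Python sketch shows why Grapheme Clusters are a separate level. Python's `len` counts Code Points, and the standard library has no grapheme segmentation, so the example only demonstrates that one visible symbol can hold more than one Code Point.

```python
import unicodedata

decomposed = "e\u0301"         # 'e' + COMBINING ACUTE ACCENT, displayed as one symbol
flag = "\U0001F1F0\U0001F1F7"  # REGIONAL INDICATOR K + R, displayed as one flag

print(len(decomposed))                                # 2
print(len(unicodedata.normalize("NFC", decomposed)))  # 1 (composes to U+00E9)
print(len(flag))                                      # 2
print(len(unicodedata.normalize("NFC", flag)))        # 2 (no composed form exists)
```

Normalization helps in the first case only. Counting user-perceived characters in general requires the grapheme segmentation rules, which are outside this sketch.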

  • A Grapheme Cluster is built from one or more Scalar Values.

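  • A Scalar Value is exactly one Code Point. The surrogate Code Points are the exception: they are not Scalar Values, and exist only so that UTF-16 can encode a Code Point outside the BMP as a pair of Code Units.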

  • A Code Point is encoded as one or more Code Units. How Code Units are composed is defined by each transform (UTF-8, UTF-16, UTF-32).

  • UTF-8 uses one Code Unit for ASCII characters.

  • UTF-8 is the most stable and reliable encoding because it has no byte-order ambiguity: its Code Units are single bytes.

  • UTF-8 provides maximum Unix compatibility.

  • Use UTF-8 everywhere when you write new software. It is the best encoding ever invented.

  • UTF-16 is ambiguous and needlessly complex. It should be avoided as much as possible.

  • UTF-16 is divided into UTF-16LE and UTF-16BE, and there's no reliable way to differentiate them unless the data starts with a byte order mark (BOM).

  • UCS-2 covers only the Code Points defined in the BMP (Basic Multilingual Plane, U+0000 to U+FFFF).

  • UCS-2 is roughly equal to UTF-16 without surrogate pairs.

  • UCS-2 is not a reliable encoding, because it cannot represent Code Points outside the BMP.

  • Currently (2014), UTF-32 is exactly equal to UCS-4, and maps to Code Points one-to-one.

  • UTF-32's 1:1 mapping follows from its definition as a fixed-width encoding: the Unicode code space ends at U+10FFFF, so every Code Point fits in one 32-bit Code Unit.

  • Win32 and Cocoa (NSString) both use UTF-16 as their default internal representation, because they were designed when Unicode was expected to fit in 16 bits.

  • NSString is UTF-16 encoded by default, but its internal storage is not guaranteed to be UTF-16.

  • It is almost impossible to reliably convert an NSRange, which counts UTF-16 Code Units, into Swift's String.Index, because there's no good way to select a Grapheme Cluster range from Code Unit offsets.
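
Two of the UTF-16 points in the list can be sketched in Python: the byte-order ambiguity, and the trouble with offsets counted in UTF-16 Code Units. The helper `utf16_offset_to_index` is hypothetical and written for this note. It maps an offset to a Code Point index only, and does not address Grapheme Cluster boundaries.

```python
def utf16_offset_to_index(text: str, utf16_offset: int) -> int:
    """Map a UTF-16 Code Unit offset to a Code Point index in text.

    Raises ValueError if the offset is out of range or falls between
    the two halves of a surrogate pair.
    """
    if utf16_offset < 0:
        raise ValueError("offset out of range")
    units = 0
    for index, ch in enumerate(text):
        if units == utf16_offset:
            return index
        # Code Points above U+FFFF take two UTF-16 Code Units.
        units += 2 if ord(ch) > 0xFFFF else 1
        if units > utf16_offset:
            raise ValueError("offset splits a surrogate pair")
    if units == utf16_offset:
        return len(text)
    raise ValueError("offset out of range")


text = "a\U0001F600b"  # 3 Code Points, 4 UTF-16 Code Units
print(utf16_offset_to_index(text, 1))  # 1
print(utf16_offset_to_index(text, 3))  # 2
try:
    utf16_offset_to_index(text, 2)
except ValueError as error:
    print(error)  # offset splits a surrogate pair

# Byte-order ambiguity: the same bytes decode without error either way.
raw = b"\x00\x41\x00\x42"
print(ascii(raw.decode("utf-16-be")))  # 'AB'
print(ascii(raw.decode("utf-16-le")))  # '\u4100\u4200'
```

Even when the offset maps cleanly to a Code Point index, that index can still fall inside a Grapheme Cluster, for example between a base letter and its combining mark.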
