@mzsanford
Last active January 26, 2016 17:57
FB31 HEBREW LETTER BET WITH DAGESH
2.1.1 :006 > combined = "\uFB31"
=> "בּ"
2.1.1 :007 > combined.codepoints
=> [64305]
2.1.1 :008 > combined.to_nfc.codepoints
=> [1489, 1468]
So, to_nfc (from [unf](https://github.com/knu/ruby-unf)) changes it to two codepoints. What about other normalization forms?
2.1.1 :020 > combined.to_nfd.codepoints
=> [1489, 1468]
2.1.1 :021 > combined.to_nfc.codepoints
=> [1489, 1468]
2.1.1 :022 > combined.to_nfd.codepoints
=> [1489, 1468]
2.1.1 :023 > combined.to_nfkc.codepoints
=> [1489, 1468]
2.1.1 :024 > combined.to_nfkd.codepoints
=> [1489, 1468]
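So, all four normalization forms (NFC, NFD, NFKC and NFKD) lead to the same two codepoints: U+05D1 (1489, BET) followed by U+05BC (1468, DAGESH).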
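If you don't have unf installed, the same check can be sketched with the standard library. This assumes Ruby 2.2 or later, where `String#unicode_normalize` is built in; it is a stand-in for unf's `to_nfc` and friends, not what the session above actually ran.

```ruby
# Assumption: Ruby 2.2+, which ships String#unicode_normalize in the stdlib.
combined = "\uFB31" # HEBREW LETTER BET WITH DAGESH

forms = [:nfc, :nfd, :nfkc, :nfkd]

# Map each normalization form to the codepoints it produces.
results = forms.map { |form| [form, combined.unicode_normalize(form).codepoints] }.to_h

results.each do |form, cps|
  hex = cps.map { |cp| format("U+%04X", cp) }.join(" ")
  puts "#{form}: #{hex}"
end
```

Every form should report the same BET + DAGESH pair that unf produced above.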

Root Cause

The NFC used by Twitter goes through Canonical Decomposition followed by Canonical Composition¹. This implies to me that decomposition splits the character and the result is then ineligible for canonical composition. The best way to track this behavior is the Unicode Character Database (UCD). Searching the UCD, I can see U+FB31 is listed in the Composition Exclusions file under the Script Specifics heading. That heading² is described as:

canonically decomposable characters that are generally not the preferred form for particular scripts.
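To see the exclusion in action, here is a small sketch comparing a sequence that does recompose under NFC with the BET + DAGESH pair that does not. It again assumes Ruby 2.2+ and the stdlib `String#unicode_normalize` rather than unf; the Latin example is mine and is only there for contrast.

```ruby
# Assumption: Ruby 2.2+ with the stdlib String#unicode_normalize.

# e + COMBINING ACUTE ACCENT: U+00E9 is not a composition exclusion,
# so NFC recomposes the pair into a single codepoint.
latin = "e\u0301"
latin_nfc = latin.unicode_normalize(:nfc)

# BET + DAGESH: the precomposed U+FB31 is a composition exclusion,
# so NFC leaves the pair decomposed.
hebrew = "\u05D1\u05BC"
hebrew_nfc = hebrew.unicode_normalize(:nfc)

puts "latin:  #{latin.codepoints.length} -> #{latin_nfc.codepoints.length} codepoint(s)"
puts "hebrew: #{hebrew.codepoints.length} -> #{hebrew_nfc.codepoints.length} codepoint(s)"
```

The Latin pair collapses to one codepoint while the Hebrew pair stays at two, which is why no normalization form ever gives U+FB31 back.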

My Summary

So, sorry to say, this is pretty much "by design". Unicode works hard to maintain the ability to represent scholarly/academic texts³, but normalization can't always preserve that data. When you normalize, you sometimes have to pick one form over another, and when you pick a side it makes more sense to favor common usage over scholarly usage. In Twitter's attempt to represent a "character" in some understandable and relatable way, it sometimes has to favor the common forms like this. At least those were my thoughts when I wrote it. I might be wrong.
