Gist cpurdy/0ee3925aff4605b29af2e7aa64b8c1f4, created May 17, 2020 18:28
Unicode GeneralCategory look-up from prepared CharCats.dat binary file
/**
 * The Unicode General Category of the character.
 *
 * This information is field 2 in the `UnicodeData.txt` data file from the Unicode Consortium.
 * From [https://www.unicode.org/reports/tr44/#General_Category_Values]:
 *
 * > This is a useful breakdown into various character types which can be used as a default
 * > categorization in implementations. For the property values, see
 * > [General Category Values](https://www.unicode.org/reports/tr44/#General_Category_Values).
 *
 * This information is stored in the binary file "CharCats.dat" in this package. For a codepoint
 * `n`, the n-th byte of the file is the ordinal of the `GeneralCategory` enum value for the
 * character.
 */
GeneralCategory unicodeCategory.get()
{
    Byte[] categoriesByCodepoint = #./CharCats.dat;
    return codepoint < categoriesByCodepoint.size
            ? GeneralCategory.values[categoriesByCodepoint[codepoint].toInt()]
            : Unassigned;
}
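The gist does not show how `CharCats.dat` is prepared. As a rough illustration of the "one byte per codepoint" layout described in the comment above, here is a minimal Python sketch that builds such a table from Python's own `unicodedata` module and performs the same bounds-checked lookup. The names `CATEGORIES`, `build_table`, and `category_of` are hypothetical, and the ordinal order used here (the order in which UAX #44 lists the two-letter codes) is an assumption; the actual `GeneralCategory` enum order may differ.

```python
import unicodedata

# ASSUMPTION: ordinals follow the order in which UAX #44 lists the
# two-letter General Category codes. The real enum order may differ.
CATEGORIES = [
    "Lu", "Ll", "Lt", "Lm", "Lo",
    "Mn", "Mc", "Me",
    "Nd", "Nl", "No",
    "Pc", "Pd", "Ps", "Pe", "Pi", "Pf", "Po",
    "Sm", "Sc", "Sk", "So",
    "Zs", "Zl", "Zp",
    "Cc", "Cf", "Cs", "Co", "Cn",
]
ORDINAL = {code: i for i, code in enumerate(CATEGORIES)}


def build_table(limit: int) -> bytes:
    """Return one byte per codepoint in range(limit): the category ordinal."""
    return bytes(ORDINAL[unicodedata.category(chr(cp))] for cp in range(limit))


def category_of(table: bytes, codepoint: int) -> str:
    """Index the table by codepoint; fall back to Cn (Unassigned) past the end."""
    if codepoint < len(table):
        return CATEGORIES[table[codepoint]]
    return "Cn"


if __name__ == "__main__":
    table = build_table(0x80)  # ASCII only, to keep the demo small
    print(category_of(table, ord("A")))   # uppercase letter
    print(category_of(table, 0x10FFFF))   # beyond the table
```

Because each entry is a single byte, the lookup is a plain array index with no searching, and any codepoint beyond the end of the table is reported as unassigned, which mirrors the ternary in the property getter.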