@cpurdy
Created May 17, 2020 18:28
Unicode GeneralCategory look-up from prepared CharCats.dat binary file
/**
 * The Unicode General Category of the character.
 *
 * This information is field 2 in the `UnicodeData.txt` data file from the Unicode Consortium.
 * From [https://www.unicode.org/reports/tr44/#General_Category_Values]:
 *
 * > This is a useful breakdown into various character types which can be used as a default
 * > categorization in implementations. For the property values, see
 * > [General Category Values](https://www.unicode.org/reports/tr44/#General_Category_Values).
 *
 * This information is stored in the binary file "CharCats.dat" in this package. For a codepoint
 * `n`, the byte at zero-based offset `n` in the file is the ordinal of the `GeneralCategory`
 * enum value for the character. Codepoints at or beyond the end of the file are reported as
 * `Unassigned`.
 */
GeneralCategory unicodeCategory.get()
    {
    Byte[] categoriesByCodepoint = #./CharCats.dat;
    return codepoint < categoriesByCodepoint.size
            ? GeneralCategory.values[categoriesByCodepoint[codepoint].toInt()]
            : Unassigned;
    }
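
The byte-per-codepoint layout described in the doc comment can be illustrated outside the gist's own language. The following is a minimal Python sketch, not the tool that produced "CharCats.dat": the names `CATEGORIES`, `build_table`, and `category_of` are hypothetical, the ordering of the category codes is an assumption (the real `GeneralCategory` enum may order its values differently), and the data comes from Python's bundled `unicodedata` module rather than from `UnicodeData.txt` directly.

```python
import unicodedata

# ASSUMPTION: this ordering of the two-letter General Category codes is
# illustrative only; the ordinals of the gist's GeneralCategory enum may differ.
CATEGORIES = [
    "Lu", "Ll", "Lt", "Lm", "Lo", "Mn", "Mc", "Me", "Nd", "Nl",
    "No", "Pc", "Pd", "Ps", "Pe", "Pi", "Pf", "Po", "Sm", "Sc",
    "Sk", "So", "Zs", "Zl", "Zp", "Cc", "Cf", "Cs", "Co", "Cn",
]
ORDINAL = {code: index for index, code in enumerate(CATEGORIES)}


def build_table(limit: int = 0x110000) -> bytes:
    """Return one byte per codepoint in [0, limit): the category ordinal."""
    return bytes(ORDINAL[unicodedata.category(chr(cp))] for cp in range(limit))


def category_of(table: bytes, codepoint: int) -> str:
    """Look up a codepoint; anything outside the table is unassigned (Cn)."""
    if 0 <= codepoint < len(table):
        return CATEGORIES[table[codepoint]]
    return "Cn"


if __name__ == "__main__":
    table = build_table(0x300)           # a small prefix keeps the demo quick
    print(category_of(table, ord("A")))  # uppercase letter
    print(category_of(table, 0x10FFFF))  # beyond the table, treated as Cn
```

Writing the result of `build_table()` to disk would give a file with the same shape as the one the gist reads: a lookup is a single bounds check plus one array index, at the cost of roughly one megabyte for the full codepoint range.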