@andjc
Last active December 5, 2023 04:14
Get skeleton for confusable characters

Exemplars for confusable characters (normalising confusable data)

Normally, when preprocessing text, we want to normalise our data. Unicode Normalisation Forms KC and KD can be used to convert compatibility characters during normalisation. This will handle some confusable characters, but not all. The function below attempts to normalise confusable characters.
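
As a rough illustration of that limitation, the sketch below uses only the standard library's unicodedata module. The helper name nfkc is introduced here for illustration and is not part of the gist. It shows that NFKC folds a mathematical sans-serif e to Basic Latin e, but leaves the Cyrillic letter ie untouched.

```python
import unicodedata

def nfkc(text: str) -> str:
    """Apply Unicode Normalisation Form KC (helper name is illustrative)."""
    return unicodedata.normalize("NFKC", text)

math_e = "\U0001d5be"   # MATHEMATICAL SANS-SERIF SMALL E
cyrillic_ie = "\u0435"  # CYRILLIC SMALL LETTER IE

# The compatibility character folds to Basic Latin e.
print(nfkc(math_e) == "e")       # True
# The Cyrillic letter has no compatibility mapping, so it is unchanged.
print(nfkc(cyrillic_ie) == "e")  # False
```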

In is_confusable() we parse a string using icu.SpoofChecker, which is based on Unicode Technical Report #36 and Unicode Technical Standard #39.

UTS 39 defines two strings to be confusable if they map to the same skeleton. A skeleton is a sequence of families of confusable characters, where each family has a single exemplar character.
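
The skeleton idea can be sketched without ICU, assuming a tiny hand-picked mapping table. TOY_PROTOTYPES, toy_skeleton and toy_confusable are hypothetical names used only for this illustration; the real data is far larger and the real algorithm has further details, so this is a simplified sketch: decompose with NFD, replace each character with its exemplar, then decompose again.

```python
import unicodedata

# Hypothetical, hand-picked excerpt of a confusables table.
# This is NOT the full UTS 39 data, only enough for the demonstration.
TOY_PROTOTYPES = {
    "\u0435": "e",      # CYRILLIC SMALL LETTER IE
    "\U0001d5be": "e",  # MATHEMATICAL SANS-SERIF SMALL E
    "\u0430": "a",      # CYRILLIC SMALL LETTER A
    "\u0440": "p",      # CYRILLIC SMALL LETTER ER
}

def toy_skeleton(text: str) -> str:
    """Simplified skeleton: NFD, map to exemplars, NFD again."""
    decomposed = unicodedata.normalize("NFD", text)
    mapped = "".join(TOY_PROTOTYPES.get(ch, ch) for ch in decomposed)
    return unicodedata.normalize("NFD", mapped)

def toy_confusable(a: str, b: str) -> bool:
    """Two strings are treated as confusable if their skeletons match."""
    return toy_skeleton(a) == toy_skeleton(b)

# Latin "pea" versus the same word spelled with Cyrillic letters.
print(toy_confusable("pea", "\u0440\u0435\u0430"))  # True
print(toy_confusable("pea", "pet"))                 # False
```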

The function returns a tuple of a status ("ASCII", True, or False) and the exemplar (or skeleton) representation of the string. Strings that contain only Basic Latin characters are returned unchanged with the status "ASCII" rather than True or False: Basic Latin characters can be confusable, but those that are confusables are also the exemplars for the sequence they belong to.

Exemplar sequences will be decomposed, so the exemplar of á (U+00E1) will be <U+0061, U+0301>.
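
The decomposed form mentioned above can be checked with the standard library alone. This is only a sketch of the NFD step, not of the ICU skeleton itself, and code_points is a helper name made up for this example.

```python
import unicodedata

def code_points(text: str) -> list[str]:
    """List the code points of a string in U+XXXX notation (illustrative helper)."""
    return [f"U+{ord(ch):04X}" for ch in text]

# NFD splits the precomposed letter into a base letter plus a combining accent.
decomposed = unicodedata.normalize("NFD", "\u00e1")
print(code_points(decomposed))  # ['U+0061', 'U+0301']
```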

import icu

def is_confusable(text):
    # Pure ASCII input is returned unchanged with the status "ASCII".
    if text.isascii():
        return ("ASCII", text)
    checker = icu.SpoofChecker()
    checker.setRestrictionLevel(icu.URestrictionLevel.HIGHLY_RESTRICTIVE)
    # Compute the skeleton once and reuse it for the comparison.
    skeleton = checker.getSkeleton(icu.USpoofChecks.ALL_CHECKS, text)
    # The text is flagged as confusable if it differs from its own skeleton.
    status = text != skeleton
    return (status, skeleton)

The characters e, \U0001d5be, and \u0435 all return the exemplar e (U+0065):

is_confusable('e')
# ('ASCII', 'e')
is_confusable('\U0001d5be')
# (True, 'e')
is_confusable('\u0435')
# (True, 'e')