Skip to content

Instantly share code, notes, and snippets.

@andjc
Last active January 28, 2024 05:13
Show Gist options
  • Star 5 You must be signed in to star a gist
  • Fork 1 You must be signed in to fork a gist
  • Save andjc/6040ed6cce385270ade251e6c2c1661f to your computer and use it in GitHub Desktop.
Save andjc/6040ed6cce385270ade251e6c2c1661f to your computer and use it in GitHub Desktop.
Unicode casefolding and matching

Casefolding and matching

Default Case Folding

It is common to see the str.lower() method used in Python code when the developer wants to compare or match strings written in bicameral scripts. But it is not universal. For instance, the default case for Cherokee is uppercase instead of lowercase.

>>> s = "Ꮒꭶꮣ ꭰꮒᏼꮻ ꭴꮎꮥꮕꭲ ꭴꮎꮪꮣꮄꮣ ꭰꮄ ꭱꮷꮃꭽꮙ ꮎꭲ ꭰꮲꮙꮩꮧ ꭰꮄ ꭴꮒꮂ ꭲᏻꮎꮫꮧꭲ. Ꮎꮝꭹꮎꮓ ꭴꮅꮝꭺꮈꮤꮕꭹ ꭴꮰꮿꮝꮧ ꮕᏸꮅꮫꭹ ꭰꮄ ꭰꮣꮕꮦꮯꮣꮝꮧ ꭰꮄ ꭱꮅꮝꮧ ꮟᏼꮻꭽ ꮒꮪꮎꮣꮫꮎꮥꭼꭹ ꮎ ꮧꮎꮣꮕꮯ ꭰꮣꮕꮩ ꭼꮧ."
>>> sl = s.lower()
>>> su = s.upper()
>>> scf = s.casefold()
>>> sl == scf
False
>>> su == scf
True

It can be seen that casefolding Cherokee yields a string that matches the string produced by the uppercase function, rather than the lowercase function.

Case conversion (str.lower(), str.upper() and str.title() methods) are used to ensure a string is in the specified case. Casefolding on the other hand, is to allow caseless matching of strings. A casefolded string is often lowercase for bicameral scripts, although Cherokee is casefolded to uppercase.

The string operation str.casefold() is Python's implementation of Unicode’s toCasefold().

The formal definition of toCasefold() is:

toCasefold(X): Map each character C in X to Case_Folding(C).

Casefolding can be either invariant (language insensitive) or it can use locale specific tailorings. The python implementation of str.casefold() is invariant. In order to use full casefolding that responds to locale tailorings, use icu.UnicodeString.foldCase().

Full casefolding, can make irreversable changes to the text, and is most useful when comparing or matching text. If you wish to normalise text, then it is better to use a specific Unicode normalisation form, and an appropriate case conversion or mapping.

Casefolding for identifiers: toNFKC_Casefold()

When the string is an identifier, a modified form of caseless matching is used: toNFKC_Casefold(). The Unicode defintion of toNFKC_Casefold() is:

toNFKC_Casefold( X ): Map each character C in X to NFKC_Casefold(C) and then normalize the resulting string to NFC.

Python does not implement this version of casefolding, an alternative would be:

import unicodedata as ud
def NFKC_Casefold(s):
  return ud.normalize("NFC", ud.normalize('NFKC', s).casefold())

Default caseless matching

Default caseless matching is defined as:

A string X is a caseless match for a string Y if and only if:
toCasefold(X) = toCasefold(Y)

In python default caseless matching would be:

X.casefold() == Y.casefold()

Canonical caseless matching

Ideally strings should also be normalised before comparing the strings. Canonical caseless matching is defined as:

A string X is a canonical caseless match for a string Y if and only if:
NFD(toCasefold(NFD(X))) = NFD(toCasefold(NFD(Y)))

This could be expressed in Python as:

ud.normalize("NFD", ud.normalize("NFD", X).casefold()) == ud.normalize("NFD", ud.normalize("NFD", Y).casefold())

Compatibility caseless match

If you wish to ignore compatability differences between characters when comparing strings, you would use compatibility caseless matching, which is defined as:

A string X is a compatibility caseless match for a string Y if and only if:
NFKD(toCasefold(NFKD(toCasefold(NFD(X))))) = NFKD(toCasefold(NFKD(toCasefold(NFD(Y)))))

This could be expressed in python as:

ud.normalize("NFKD", ud.normalize("NFKD", ud.normalize("NFD", X).casefold()).casefold()) == ud.normalize("NFKD", ud.normalize("NFKD", ud.normalize("NFD", Y).casefold()).casefold())

Identifier caseless matching

Caseless matching for identifiers uses the NFKC_Casefold mapping discussed above, and is defined as:

A string X is an identifier caseless match for a string Y if and only if:
toNFKC_Casefold(NFD(X)) = toNFKC_Casefold(NFD(Y))

This can be expressed in python as:

NFKC_Casefold(ud.normalize("NFD", x)) == NFKC_Casefold(ud.normalize("NFD", y))
@andjc
Copy link
Author

andjc commented Mar 13, 2022

import unicodedata as ud

def caseless_match(x, y):
  return x.casefold() == y.casefold()

def canonical_caseless_match(x, y):
  return ud.normalize("NFD", ud.normalize("NFD", x).casefold()) == ud.normalize("NFD", ud.normalize("NFD", y).casefold())

def compatibility_caseless_match(x, y):
  return ud.normalize("NFKD", ud.normalize("NFKD", ud.normalize("NFD", x).casefold()).casefold()) == ud.normalize("NFKD", ud.normalize("NFKD", ud.normalize("NFD", y).casefold()).casefold())

def NFKC_Casefold(s):
  return ud.normalize("NFC", ud.normalize('NFKC', s).casefold())

def identifier_caseless_match(x, y):
  return NFKC_Casefold(ud.normalize("NFD", x)) == NFKC_Casefold(ud.normalize("NFD", y))

@andjc
Copy link
Author

andjc commented Mar 15, 2022

One useful application of casefolding is cleaning up scraped data that has presentation forms in the text.

I will start by defining a couple of helper functions:

import regex as re

def has_presentation_forms(text):
    pattern = r'([\p{InAlphabetic_Presentation_Forms}\p{InArabic_Presentation_Forms-A}\p{InArabic_Presentation_Forms-B}]+)'
    return bool(re.findall(pattern, text))

def clean_presentation_forms(text, folding=False):
    def clean_pf(match):
        return match.group(1).casefold()
    def clean_nfkc(match):
        return ud.normalize("NFKC", match.group(1))
    pattern = r'([\p{InAlphabetic_Presentation_Forms}\p{InArabic_Presentation_Forms-A}\p{InArabic_Presentation_Forms-B}]+)'
    result = re.sub(pattern, clean_pf, text) if folding else re.sub(pattern, clean_nfkc, text)
    return result

Alternatively, the clean_presentation_forms function could be rewritten as:

def clean_presentation_forms(text, folding=False):
    def clean_pf(match, folding):
        return  match.group(1).casefold() if folding else ud.normalize("NFKC", match.group(1))
    pattern = r'([\p{InAlphabetic_Presentation_Forms}\p{InArabic_Presentation_Forms-A}\p{InArabic_Presentation_Forms-B}]+)'
    return re.sub(pattern, lambda match, folding=folding: clean_pf(match, folding), text)

We then run the functions on a string "find an office":

>>> s = "find an office"
>>> elu.cp(s)
'U+FB01 (fi) U+006E (n) U+0064 (d) U+0020 ( ) U+0061 (a) U+006E (n) U+0020 ( ) U+006F (o) U+FB03 (ffi) U+0063 (c) U+0065 (e)'
>>> print(has_presentation_forms(s))
True
>>> if has_presentation_forms(s):
    r = clean_presentation_forms(s)
... 
>>> print(elu.cp(r))
U+0066 (f) U+0069 (i) U+006E (n) U+0064 (d) U+0020 ( ) U+0061 (a) U+006E (n) U+0020 ( ) U+006F (o) U+0066 (f) U+0066 (f) U+0069 (i) U+0063 (c) U+0065 (e)
>>> 

@andjc
Copy link
Author

andjc commented Mar 16, 2022

A more streamlined version of clean_presentation_forms():

def clean_presentation_forms(text, folding=False):
    def clean_pf(match, folding):
        return  match.group(1).casefold() if folding else ud.normalize("NFKC", match.group(1))
    pattern = r'([\p{InAlphabetic_Presentation_Forms}\p{InArabic_Presentation_Forms-A}\p{InArabic_Presentation_Forms-B}]+)'
    return re.sub(pattern, lambda match, folding=folding: clean_pf(match, folding), text)

For Arabic and Hebrew scripts, use folding=False.

For Latin and Armenian scripts, use either folding=True or folding=False.

Default value is folding=False

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment