andjc/casefolding_matching.md

## casefolding_matching.md

      
    Casefolding and matching

Default Case Folding

It is common to see the str.lower() method used in Python code when the developer wants to compare or match strings written in bicameral scripts. But lowercasing is not a universally correct basis for comparison. For instance, the default case for Cherokee is uppercase rather than lowercase.
>>> s = "Ꮒꭶꮣ ꭰꮒᏼꮻ ꭴꮎꮥꮕꭲ ꭴꮎꮪꮣꮄꮣ ꭰꮄ ꭱꮷꮃꭽꮙ ꮎꭲ ꭰꮲꮙꮩꮧ ꭰꮄ ꭴꮒꮂ ꭲᏻꮎꮫꮧꭲ. Ꮎꮝꭹꮎꮓ ꭴꮅꮝꭺꮈꮤꮕꭹ ꭴꮰꮿꮝꮧ ꮕᏸꮅꮫꭹ ꭰꮄ ꭰꮣꮕꮦꮯꮣꮝꮧ ꭰꮄ ꭱꮅꮝꮧ ꮟᏼꮻꭽ ꮒꮪꮎꮣꮫꮎꮥꭼꭹ ꮎ ꮧꮎꮣꮕꮯ ꭰꮣꮕꮩ ꭼꮧ."
>>> sl = s.lower()
>>> su = s.upper()
>>> scf = s.casefold()
>>> sl == scf
False
>>> su == scf
True
It can be seen that casefolding Cherokee yields a string that matches the string produced by the uppercase function, rather than the lowercase function.
The case conversion methods (str.lower(), str.upper() and str.title()) are used to ensure a string is in the specified case. Casefolding, on the other hand, exists to allow caseless matching of strings. A casefolded string is usually lowercase for bicameral scripts, although Cherokee is casefolded to uppercase.
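As a small illustration of the difference between case conversion and casefolding outside Cherokee, the sketch below compares the two on German sharp s. The sample words are chosen here for illustration only.

```python
# Lowercasing keeps the sharp s, while casefolding expands it to "ss".
word = "Straße"
lowered = word.lower()      # case conversion
folded = word.casefold()    # casefolding for caseless matching

print(lowered)  # straße
print(folded)   # strasse

# Comparing via lower() misses the match; comparing via casefold() finds it.
match_with_lower = word.lower() == "STRASSE".lower()
match_with_casefold = word.casefold() == "STRASSE".casefold()
print(match_with_lower)     # False
print(match_with_casefold)  # True
```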
The string operation str.casefold() is Python's implementation of Unicode’s toCasefold().
The formal definition of toCasefold() is:

toCasefold(X): Map each character C in X to Case_Folding(C).

Casefolding can be either invariant (language-insensitive) or it can use locale-specific tailorings. The Python implementation of str.casefold() performs full casefolding and is invariant. In order to use casefolding that responds to tailorings, such as the Turkic handling of dotted and dotless i, use icu.UnicodeString.foldCase() from PyICU.
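The effect of invariant folding can be seen with the Turkic dotted and dotless i, using only the standard library. This is a minimal sketch; the place name is an arbitrary example, and the Turkish expectation described in the comments is not something str.casefold() can be configured to honour.

```python
# Invariant folding treats "I" as the uppercase of "i" in every language.
folded_capital_i = "I".casefold()          # 'i'
# U+0130 LATIN CAPITAL LETTER I WITH DOT ABOVE folds to "i" + U+0307.
folded_dotted_i = "\u0130".casefold()      # 'i\u0307'
# U+0131 LATIN SMALL LETTER DOTLESS I is left unchanged.
folded_dotless_i = "\u0131".casefold()     # '\u0131'

# In Turkish, "I" pairs with dotless "ı", so a Turkish reader would expect
# these two spellings to match. Invariant folding does not match them.
upper_name = "DIYARBAKIR"
lower_name = "diyarbak\u0131r"
print(upper_name.casefold() == lower_name.casefold())  # False
```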
Full casefolding can make irreversible changes to the text, and is most useful when comparing or matching text. If you wish to normalise text, it is better to use a specific Unicode normalisation form together with an appropriate case conversion or mapping.
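A few characters show why folded text cannot be converted back to the original. The sample characters below are picked for illustration: some expand to two characters, and several distinct inputs collapse to the same folded form.

```python
# U+FB01 (fi ligature), U+03C2 (Greek final sigma), U+1E9E (capital sharp s)
samples = ["\uFB01", "\u03C2", "\u1E9E"]
for ch in samples:
    folded = ch.casefold()
    print(f"U+{ord(ch):04X} {ch!r} -> {folded!r} (length {len(ch)} -> {len(folded)})")

# Capital sigma, small sigma and final sigma all fold to the same character,
# so the original form cannot be recovered from the folded string.
sigma_forms = {"\u03A3".casefold(), "\u03C3".casefold(), "\u03C2".casefold()}
print(sigma_forms)
```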
Casefolding for identifiers: toNFKC_Casefold()

When the string is an identifier, a modified form of caseless matching is used: toNFKC_Casefold(). The Unicode definition of toNFKC_Casefold() is:

toNFKC_Casefold( X ): Map each character C in X to NFKC_Casefold(C) and then normalize the resulting string to NFC.

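Python does not implement this version of casefolding. An approximate alternative would be:
import unicodedata as ud

def NFKC_Casefold(s):
    # Approximation: unlike Unicode's NFKC_Casefold mapping, this does not
    # remove default ignorable code points.
    return ud.normalize("NFC", ud.normalize("NFKC", s).casefold())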
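A short usage sketch of the helper above, repeated here so the block is self-contained. It shows the compatibility mapping at work on fullwidth letters, and one place where this approach differs from Unicode's NFKC_Casefold mapping, which removes default ignorable code points such as the soft hyphen. The sample strings are illustrative.

```python
import unicodedata as ud

def NFKC_Casefold(s):
    # Same approach as in the text: NFKC, then casefold, then NFC.
    return ud.normalize("NFC", ud.normalize("NFKC", s).casefold())

# Fullwidth "FULL" (U+FF26 U+FF35 U+FF2C U+FF2C) maps to plain ASCII "full".
fullwidth_result = NFKC_Casefold("\uFF26\uFF35\uFF2C\uFF2C")
print(fullwidth_result)  # full

# U+00AD SOFT HYPHEN is a default ignorable code point. Unicode's mapping
# would drop it; this helper leaves it in place.
soft_hyphen_result = NFKC_Casefold("co\u00ADop")
print("\u00AD" in soft_hyphen_result)  # True
```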
Default caseless matching

Default caseless matching is defined as:

A string X is a caseless match for a string Y if and only if: 

toCasefold(X) = toCasefold(Y)

In Python, default caseless matching would be:
X.casefold() == Y.casefold()
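Wrapped as a function, this might look like the sketch below. The name default_caseless_match is a hypothetical helper, not part of any library. The second call shows a limitation: precomposed and decomposed forms of the same letter do not match, because no normalisation is applied.

```python
def default_caseless_match(x: str, y: str) -> bool:
    """Return True if x and y are equal after casefolding."""
    return x.casefold() == y.casefold()

print(default_caseless_match("Straße", "STRASSE"))  # True

# "É" as one code point (U+00C9) versus "e" + U+0301 COMBINING ACUTE ACCENT
print(default_caseless_match("\u00C9cole", "e\u0301cole"))  # False
```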
Canonical caseless matching

Ideally, strings should also be normalised before they are compared. Canonical caseless matching is defined as:

A string X is a canonical caseless match for a string Y if and only if: 

NFD(toCasefold(NFD(X))) = NFD(toCasefold(NFD(Y)))

This could be expressed in Python as:
ud.normalize("NFD", ud.normalize("NFD", X).casefold()) == ud.normalize("NFD", ud.normalize("NFD", Y).casefold())
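The same expression can be factored into a reusable function. This is a sketch with a hypothetical name, canonical_caseless_match. It matches precomposed and decomposed spellings, but still distinguishes compatibility variants such as fullwidth letters.

```python
import unicodedata as ud

def canonical_caseless_match(x: str, y: str) -> bool:
    """Compare NFD(toCasefold(NFD(s))) for both strings."""
    def fold(s: str) -> str:
        return ud.normalize("NFD", ud.normalize("NFD", s).casefold())
    return fold(x) == fold(y)

# Precomposed U+00C9 versus "e" + U+0301: equal under canonical matching.
print(canonical_caseless_match("\u00C9cole", "e\u0301cole"))  # True

# Fullwidth "E" (U+FF25) versus ASCII "e": a compatibility difference remains.
print(canonical_caseless_match("\uFF25", "e"))  # False
```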
Compatibility caseless matching

If you wish to ignore compatibility differences between characters when comparing strings, you would use compatibility caseless matching, which is defined as:

A string X is a compatibility caseless match for a string Y if and only if: 

NFKD(toCasefold(NFKD(toCasefold(NFD(X))))) = NFKD(toCasefold(NFKD(toCasefold(NFD(Y)))))

This could be expressed in Python as:
ud.normalize("NFKD", ud.normalize("NFKD", ud.normalize("NFD", X).casefold()).casefold()) == ud.normalize("NFKD", ud.normalize("NFKD", ud.normalize("NFD", Y).casefold()).casefold())
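Factoring the expression into a function makes the nested steps easier to follow. The sketch below uses a hypothetical helper name, and also illustrates why casefolding is applied a second time: a compatibility decomposition can introduce new cased letters, as with the trade mark sign.

```python
import unicodedata as ud

def compatibility_caseless_match(x: str, y: str) -> bool:
    """Compare NFKD(toCasefold(NFKD(toCasefold(NFD(s))))) for both strings."""
    def fold(s: str) -> str:
        step1 = ud.normalize("NFD", s).casefold()
        step2 = ud.normalize("NFKD", step1).casefold()
        return ud.normalize("NFKD", step2)
    return fold(x) == fold(y)

# U+2122 TRADE MARK SIGN has no case folding, but decomposes to "TM",
# which the second casefold turns into "tm".
single_pass = ud.normalize("NFKD", ud.normalize("NFD", "\u2122").casefold())
print(single_pass)                                   # TM
print(compatibility_caseless_match("\u2122", "tm"))  # True

# Fullwidth "F" (U+FF26) matches ASCII "f".
print(compatibility_caseless_match("\uFF26", "f"))   # True
```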
Identifier caseless matching

Caseless matching for identifiers uses the NFKC_Casefold mapping discussed above, and is defined as:

A string X is an identifier caseless match for a string Y if and only if: 

toNFKC_Casefold(NFD(X)) = toNFKC_Casefold(NFD(Y))

This can be expressed in Python as:
NFKC_Casefold(ud.normalize("NFD", X)) == NFKC_Casefold(ud.normalize("NFD", Y))
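Put together as a self-contained sketch, identifier caseless matching might look like this. The function identifier_caseless_match is a hypothetical name, and NFKC_Casefold is the approximation used in this document rather than Unicode's exact mapping, so default ignorable code points are not removed.

```python
import unicodedata as ud

def NFKC_Casefold(s: str) -> str:
    # Approximation of Unicode's toNFKC_Casefold(): NFKC, casefold, then NFC.
    return ud.normalize("NFC", ud.normalize("NFKC", s).casefold())

def identifier_caseless_match(x: str, y: str) -> bool:
    """Compare toNFKC_Casefold(NFD(s)) for both strings."""
    return NFKC_Casefold(ud.normalize("NFD", x)) == NFKC_Casefold(ud.normalize("NFD", y))

# The fi ligature (U+FB01) and mixed case are both neutralised.
print(identifier_caseless_match("\uFB01le_Name", "FILE_NAME"))  # True

# Fullwidth "MAX" (U+FF2D U+FF21 U+FF38) matches ASCII "max".
print(identifier_caseless_match("\uFF2D\uFF21\uFF38", "max"))   # True
```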