andjc/normalisation_sorting.md

## normalisation_sorting.md

      
    Snippet at https://github.com/enabling-languages/python-i18n/blob/main/snippets/sort_key_normalise.py
Default Python sorting

If we take two strings that differ only in the Unicode normalisation form they use, would Python sort them the same way? The strings éa (U+00E9 U+0061) and éa (U+0065 U+0301 U+0061) are canonically equivalent, but when we sort two lists that differ only in the normalisation form of this string, we find that the sort order is different.
>>> lc = ["za", "éa", "eb", "ba"]
>>> sorted(lc)
['ba', 'eb', 'za', 'éa']
>>> ld = ['za', 'éa', 'eb', 'ba']
>>> sorted(ld)
['ba', 'eb', 'éa', 'za']
What if both forms occur in the same list?
>>> lz = ["éa", "za", "éa", "eb", "ba"]
>>> sorted(lz)
['ba', 'eb', 'éa', 'za', 'éa']
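The results above follow from the fact that Python's default string comparison works code point by code point. A minimal sketch that makes this visible; the helper name `code_points` is introduced here purely for illustration:

```python
import unicodedata as ud

precomposed = "\u00e9a"   # é as a single code point, followed by a
decomposed = "e\u0301a"   # e + combining acute accent, followed by a


def code_points(s):
    """Return the code points of s in U+XXXX notation."""
    return [f"U+{ord(ch):04X}" for ch in s]


print(code_points(precomposed))
print(code_points(decomposed))

# Canonically equivalent, yet unequal as Python strings.
print(precomposed == decomposed)

# Normalising both to the same form removes the difference.
print(ud.normalize("NFC", decomposed) == precomposed)
print(ud.normalize("NFD", precomposed) == decomposed)
```

Because U+00E9 is greater than the code point of `z`, the precomposed string sorts after `za`, while the decomposed string starts with a plain `e` and so sorts among the other `e` entries.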
Locale-aware sorting

We sort the same two lists again, this time using locale-aware sorting:
>>> import locale
>>> locale.setlocale(locale.LC_ALL, "en_US")
'en_US'
>>> sorted(lc, key=locale.strxfrm)
['ba', 'éa', 'eb', 'za']
>>> sorted(ld, key=locale.strxfrm)
['ba', 'eb', 'éa', 'za']
The precomposed list sorts as required for the locale, but the decomposed list still sorts differently.
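One practical caveat: `locale.setlocale` changes process-wide state. The following is a hedged sketch of a context manager that switches the collation locale temporarily and restores it afterwards. The name `collation_locale` is hypothetical, and the `"C"` locale is used in the usage lines only because it should be available everywhere; substitute a locale such as `en_US.UTF-8` if it is installed on your system.

```python
import locale
from contextlib import contextmanager


@contextmanager
def collation_locale(name):
    """Temporarily set LC_COLLATE to `name`, then restore the old setting."""
    saved = locale.setlocale(locale.LC_COLLATE)  # query the current setting
    try:
        # Raises locale.Error if `name` is not installed on this system.
        locale.setlocale(locale.LC_COLLATE, name)
        yield
    finally:
        locale.setlocale(locale.LC_COLLATE, saved)


with collation_locale("C"):
    print(sorted(["eb", "ba", "za"], key=locale.strxfrm))
```

This does not make `locale.strxfrm` normalisation-aware; it only limits how long the changed locale stays in effect.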
Towards a solution

If your data is drawn from disparate sources and may differ in normalisation form, the best approach is either to normalise all your data before sorting, or to:

use a key function that removes the unwanted distinctions during sorting, or
use a PyICU collator instance.
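The first option, normalising the data itself before sorting, can be sketched as follows. The list `mixed` and the choice of NFC are assumptions made for illustration; any single normalisation form applied consistently would serve.

```python
import unicodedata as ud

# A list mixing precomposed (U+00E9) and decomposed (U+0065 U+0301) forms.
mixed = ["\u00e9a", "za", "e\u0301a", "eb", "ba"]

# Normalise every string once, up front, then sort as usual.
normalised = [ud.normalize("NFC", s) for s in mixed]
result = sorted(normalised)

print(result)
print(result[-1] == result[-2])  # the two equivalent strings are now identical
```

Unlike a key function, this changes the stored strings, which is usually what you want if the data will also be compared or searched later.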

Key function

>>> import unicodedata as ud
>>> import locale
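>>> def normalised_sort(s, nf="NFC", loc=False):
...     nf = nf.upper()  # ud.normalize() only accepts upper-case form names
...     if nf in ["NFC", "NFKC", "NFD", "NFKD"]:
...         s = ud.normalize(nf, s).lower()
...         if loc:
...             s = locale.strxfrm(s)  # locale-aware key built from the normalised string
...     return s
...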
>>> sorted(lc, key=normalised_sort)
['ba', 'eb', 'za', 'éa']
>>> sorted(ld, key=normalised_sort)
['ba', 'eb', 'za', 'éa']
>>> sorted(lc, key=lambda x: normalised_sort(x, "NFC"))
['ba', 'eb', 'za', 'éa']
>>> sorted(ld, key=lambda x: normalised_sort(x, "NFC"))
['ba', 'eb', 'za', 'éa']
>>> sorted(lc, key=lambda x: normalised_sort(x, "NFD"))
['ba', 'eb', 'éa', 'za']
>>> sorted(ld, key=lambda x: normalised_sort(x, "NFD"))
['ba', 'eb', 'éa', 'za']
>>> locale.setlocale(locale.LC_ALL, "en_AU.UTF-8")
'en_AU.UTF-8'
>>> sorted(lc, key=lambda x: normalised_sort(x, "NFC", loc=True))
['ba', 'éa', 'eb', 'za']
>>> sorted(ld, key=lambda x: normalised_sort(x, "NFC", loc=True))
['ba', 'éa', 'eb', 'za']
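The same idea applied to a list that mixes both forms can be sketched as below. The helper `nfc_key` is a simplified, hypothetical stand-in for the key function above (NFC only, no locale support).

```python
import unicodedata as ud


def nfc_key(s):
    """Sort key: compare strings by their lower-cased NFC form."""
    return ud.normalize("NFC", s).lower()


# Precomposed first, decomposed third.
lz = ["\u00e9a", "za", "e\u0301a", "eb", "ba"]
result = sorted(lz, key=nfc_key)

print([ascii(s) for s in result])
```

The key only affects comparison, so the original strings are returned unchanged. Both forms now produce the same key and end up adjacent; because Python's sort is stable, they keep their relative input order.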
Using PyICU

The most robust solution is to use either the Collator or RuleBasedCollator mechanisms in PyICU:
>>> from icu import Locale, Collator
>>> collator = Collator.createInstance(Locale.getRoot())
>>> sorted(lc, key=collator.getSortKey)
['ba', 'éa', 'eb', 'za']
>>> sorted(ld, key=collator.getSortKey)
['ba', 'éa', 'eb', 'za']