Snippet at https://github.com/enabling-languages/python-i18n/blob/main/snippets/sort_key_normalise.py
Default Python sorting
Would Python sort two strings the same way if they differ only in their Unicode normalisation form? The strings éa (00E9 0061) and éa (0065 0301 0061) are canonically equivalent, but when we sort two lists that differ only in the normalisation form of this string, the resulting order is different.
>>> lc = ["za", "éa", "eb", "ba"]
>>> sorted(lc)
['ba', 'eb', 'za', 'éa']
>>> ld = ['za', 'éa', 'eb', 'ba']
>>> sorted(ld)
['ba', 'eb', 'éa', 'za']
What if both forms occur in the same list?
>>> lz = ["éa", "za", "éa", "eb", "ba"]
>>> sorted(lz)
['ba', 'eb', 'éa', 'za', 'éa']
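The results above follow from the fact that Python compares str values code point by code point, with no normalisation step. The following is a minimal sketch of that behaviour; the strings are written with explicit escapes so that the normalisation form cannot be lost when the text is copied, and the variable names are illustrative only.

```python
import unicodedata as ud

precomposed = "\u00e9a"    # U+00E9 (e with acute) followed by "a"
decomposed = "e\u0301a"    # "e", U+0301 (combining acute accent), then "a"

# Canonically equivalent, yet not equal as Python strings.
print(precomposed == decomposed)                        # False
print(ud.normalize("NFC", decomposed) == precomposed)   # True

# The code points that sorted() actually compares.
print([f"{ord(c):04X}" for c in precomposed])   # ['00E9', '0061']
print([f"{ord(c):04X}" for c in decomposed])    # ['0065', '0301', '0061']

# U+00E9 is greater than "z" (U+007A), so the precomposed form sorts last.
# The decomposed form starts with "e", so it lands between "eb" and "za".
mixed = sorted([decomposed, "za", precomposed, "eb", "ba"])
```

Here mixed ends up as ba, eb, the decomposed form, za, and finally the precomposed form, which matches the mixed-list result shown above.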
Locale-aware sorting
We sort the same two lists again, this time using locale-aware sorting:
>>> import locale
>>> locale.setlocale(locale.LC_ALL, "en_US")
'en_US'
>>> sorted(lc, key=locale.strxfrm)
['ba', 'éa', 'eb', 'za']
>>> sorted(ld, key=locale.strxfrm)
['ba', 'eb', 'éa', 'za']
The precomposed list sorts as required for the locale, while the sort for the decomposed version differs again.
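Locale names such as "en_US" are platform dependent, and locale.setlocale() changes state for the whole process. The sketch below shows one way to limit both problems. It is an assumption-laden illustration, not part of the original snippet: collation_locale and locale_sorted are hypothetical helper names, and the sketch sets only LC_COLLATE rather than LC_ALL.

```python
import locale
from contextlib import contextmanager


@contextmanager
def collation_locale(name):
    """Temporarily switch LC_COLLATE to `name`, then restore the old value."""
    previous = locale.setlocale(locale.LC_COLLATE)  # query only, no change
    try:
        locale.setlocale(locale.LC_COLLATE, name)
        yield
    finally:
        locale.setlocale(locale.LC_COLLATE, previous)


def locale_sorted(items, name):
    """Sort with locale.strxfrm under `name`; fall back to code point order."""
    try:
        with collation_locale(name):
            return sorted(items, key=locale.strxfrm)
    except locale.Error:
        # The requested locale is not installed on this system.
        return sorted(items)


print(locale_sorted(["za", "eb", "ba"], "C"))
```

The helper does not normalise its input, so decomposed and precomposed strings can still sort differently, as the example above shows.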
Towards a solution
If your data is drawn from disparate sources and may differ in normalisation form, the best approach is to normalise all of it before sorting. Alternatively:
- use a key function that removes the unwanted distinctions during sorting, or
- use a PyICU collator instance.
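The first option, normalising everything up front, can be sketched as follows. This is a minimal illustration that assumes NFC is the target form; normalise_all is a hypothetical helper name, and unicodedata.is_normalized requires Python 3.8 or later.

```python
import unicodedata as ud


def normalise_all(strings, form="NFC"):
    """Return a new list with every string converted to one normalisation form."""
    return [ud.normalize(form, s) for s in strings]


# Mixed input: one decomposed and one precomposed spelling of the same word.
mixed = ["e\u0301a", "za", "\u00e9a", "eb", "ba"]

cleaned = normalise_all(mixed)
print(all(ud.is_normalized("NFC", s) for s in cleaned))   # True

# After normalisation the two spellings are identical and sort together.
result = sorted(cleaned)
```

Unlike a key function, this changes the stored data itself, which is usually what you want when the data will be compared or searched later as well as sorted.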
Key function
>>> import unicodedata as ud
>>> import locale
>>> def normalised_sort(s, nf="NFC", loc=False):
...     nf = nf.upper()  # ud.normalize() only accepts upper-case form names
...     if nf in ("NFC", "NFKC", "NFD", "NFKD"):
...         s = ud.normalize(nf, s).lower()
...         s = locale.strxfrm(s) if loc else s
...     return s
...
>>> sorted(lc, key=normalised_sort)
['ba', 'eb', 'za', 'éa']
>>> sorted(ld, key=normalised_sort)
['ba', 'eb', 'za', 'éa']
>>> sorted(lc, key=lambda x: normalised_sort(x, "NFC"))
['ba', 'eb', 'za', 'éa']
>>> sorted(ld, key=lambda x: normalised_sort(x, "NFC"))
['ba', 'eb', 'za', 'éa']
>>> sorted(lc, key=lambda x: normalised_sort(x, "NFD"))
['ba', 'eb', 'éa', 'za']
>>> sorted(ld, key=lambda x: normalised_sort(x, "NFD"))
['ba', 'eb', 'éa', 'za']
>>> locale.setlocale(locale.LC_ALL, "en_AU.UTF-8")
'en_AU.UTF-8'
>>> sorted(lc, key=lambda x: normalised_sort(x, "NFC", loc=True))
['ba', 'éa', 'eb', 'za']
>>> sorted(ld, key=lambda x: normalised_sort(x, "NFC", loc=True))
['ba', 'éa', 'eb', 'za']
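The lambda wrappers above can be avoided with a small factory that returns a ready-made key function. This is a sketch of an alternative design, not the original snippet: make_sort_key is a hypothetical name, and unlike normalised_sort it raises ValueError for an unknown normalisation form instead of returning the string unchanged.

```python
import locale
import unicodedata as ud

VALID_FORMS = ("NFC", "NFKC", "NFD", "NFKD")


def make_sort_key(nf="NFC", loc=False):
    """Return a one-argument key function for sorted() or list.sort()."""
    nf = nf.upper()
    if nf not in VALID_FORMS:
        raise ValueError(f"unknown normalisation form: {nf!r}")

    def key(s):
        s = ud.normalize(nf, s).lower()
        # locale.strxfrm uses whatever locale is active when key() is called.
        return locale.strxfrm(s) if loc else s

    return key


lc = ["za", "\u00e9a", "eb", "ba"]     # precomposed
ld = ["za", "e\u0301a", "eb", "ba"]    # decomposed

nfd_key = make_sort_key("NFD")
sorted_lc = sorted(lc, key=nfd_key)
sorted_ld = sorted(ld, key=nfd_key)
```

Both sorted_lc and sorted_ld place the accented word between eb and za, in line with the NFD results above, while each list keeps its original spelling of that word.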
Using PyICU
The most robust solution is to use either the Collator or RuleBasedCollator mechanisms in PyICU:
>>> from icu import Locale, Collator
>>> collator = Collator.createInstance(Locale.getRoot())
>>> sorted(lc, key=collator.getSortKey)
['ba', 'éa', 'eb', 'za']
>>> sorted(ld, key=collator.getSortKey)
['ba', 'éa', 'eb', 'za']
Likewise with pandas dataframes:
This would yield:
One possible method of locale specific sorting would involve reindexing the dataframe:
This would yield:
See https://github.com/enabling-languages/python-i18n/blob/main/notebooks/Collation.ipynb
Alternatively:
Note: df_sort() is based on a gist by A. Sean Pue, which in turn was based on Michael Delgado's response to a Stack Overflow question.