@andjc
Last active July 12, 2022 01:36
Unicode normalisation and default Python sorting

Snippet at https://github.com/enabling-languages/python-i18n/blob/main/snippets/sort_key_normalise.py

Default Python sorting

If two strings differ only in the Unicode normalisation form they use, does Python sort them the same way? The strings éa (00E9 0061) and éa (0065 0301 0061) are canonically equivalent, but when we sort two lists that differ only in the normalisation form of this string, the resulting order differs.

>>> lc = ["za", "éa", "eb", "ba"]
>>> sorted(lc)
['ba', 'eb', 'za', 'éa']
>>> ld = ['za', 'éa', 'eb', 'ba']
>>> sorted(ld)
['ba', 'eb', 'éa', 'za']

What if both forms occur in the same list?

>>> lz = ["éa", "za", "éa", "eb", "ba"]
>>> sorted(lz)
['ba', 'eb', 'éa', 'za', 'éa']
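The difference arises because Python's default string comparison works code point by code point, with no knowledge of canonical equivalence. A minimal sketch follows, using escape sequences so the two forms stay distinct however the page is rendered; the variable names are illustrative only and are not part of the original snippet.

```python
import unicodedata as ud

precomposed = "\u00e9a"   # U+00E9 (é) followed by U+0061 (a)
decomposed = "e\u0301a"   # U+0065 (e), U+0301 (combining acute accent), U+0061 (a)

# The strings render identically but are different sequences of code points.
print(precomposed == decomposed)                        # False
print(ud.normalize("NFC", decomposed) == precomposed)   # True
print(ud.normalize("NFD", precomposed) == decomposed)   # True

# Default sorting compares code points: "e" (U+0065) < "z" (U+007A) < "é" (U+00E9),
# so the decomposed form sorts before "za" and the precomposed form after it.
mixed = sorted(["za", precomposed, decomposed])
print([ascii(s) for s in mixed])
```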

Locale aware sorting

We sort the same two lists again, this time using locale-aware sorting:

>>> import locale
>>> locale.setlocale(locale.LC_ALL, "en_US")
'en_US'
>>> sorted(lc, key=locale.strxfrm)
['ba', 'éa', 'eb', 'za']
>>> sorted(ld, key=locale.strxfrm)
['ba', 'eb', 'éa', 'za']

The precomposed list sorts as required for the locale, while the sort for the decomposed version differs again.
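Locale names, and the collation data behind them, are platform-dependent, so a requested locale may not be installed. The sketch below assumes a hypothetical helper named locale_sorted (not part of the original snippet). It limits the change to LC_COLLATE, restores the previous setting afterwards, and falls back to code point order when the locale is unavailable.

```python
import locale

def locale_sorted(items, loc_name):
    """Sort items with loc_name's collation rules, or by code point if unavailable."""
    previous = locale.setlocale(locale.LC_COLLATE)  # no second argument: query only
    try:
        locale.setlocale(locale.LC_COLLATE, loc_name)
    except locale.Error:
        # The requested locale is not installed on this system.
        return sorted(items)
    try:
        return sorted(items, key=locale.strxfrm)
    finally:
        # Put the collation locale back the way we found it.
        locale.setlocale(locale.LC_COLLATE, previous)

# The result depends on the platform's collation data, so no output is shown.
print(locale_sorted(["za", "\u00e9a", "eb", "ba"], "en_US.UTF-8"))
```

The fallback is a design choice for illustration; in some applications it may be preferable to let locale.Error propagate.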

Towards a solution

If your data is drawn from disparate sources and may differ in normalisation form, the best approach is to either normalise all your data before sorting, or to:

  • use a key function that removes the unwanted distinctions during sorting, or
  • use a PyICU collator instance.
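The first option, normalising everything up front, can be sketched as follows. The helper name normalise_all is an assumption made for illustration, and the two lists are written with escape sequences to make their normalisation forms explicit.

```python
import unicodedata as ud

lc = ["za", "\u00e9a", "eb", "ba"]    # precomposed é
ld = ["za", "e\u0301a", "eb", "ba"]   # decomposed e + combining acute accent

def normalise_all(strings, nf="NFC"):
    """Return a new list with every string converted to the same normalisation form."""
    return [ud.normalize(nf, s) for s in strings]

# Once both lists use the same form, they sort identically.
print(sorted(normalise_all(lc)) == sorted(normalise_all(ld)))  # True
```

Unlike a key function, this changes the stored data itself, which may or may not be acceptable depending on where the strings are used afterwards.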

Key function

>>> import unicodedata as ud
>>> import locale
>>> def normalised_sort(s, nf="NFC", loc=False):
...     nf = nf.upper()  # unicodedata.normalize() only accepts upper-case form names
...     if nf in ["NFC", "NFKC", "NFD", "NFKD"]:
...         s = ud.normalize(nf, s).lower()
...         s = locale.strxfrm(s) if loc else s
...     return s
...
>>> sorted(lc, key=normalised_sort)
['ba', 'eb', 'za', 'éa']
>>> sorted(ld, key=normalised_sort)
['ba', 'eb', 'za', 'éa']
>>> sorted(lc, key=lambda x: normalised_sort(x, "NFC"))
['ba', 'eb', 'za', 'éa']
>>> sorted(ld, key=lambda x: normalised_sort(x, "NFC"))
['ba', 'eb', 'za', 'éa']
>>> sorted(lc, key=lambda x: normalised_sort(x, "NFD"))
['ba', 'eb', 'éa', 'za']
>>> sorted(ld, key=lambda x: normalised_sort(x, "NFD"))
['ba', 'eb', 'éa', 'za']
>>> locale.setlocale(locale.LC_ALL, "en_AU.UTF-8")
'en_AU.UTF-8'
>>> sorted(lc, key=lambda x: normalised_sort(x, "NFC", loc=True))
['ba', 'éa', 'eb', 'za']
>>> sorted(ld, key=lambda x: normalised_sort(x, "NFC", loc=True))
['ba', 'éa', 'eb', 'za']
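A key function also handles the mixed case, where both forms occur in one list. The sketch below uses a simplified stand-in key, nfc_key, which is an assumption for illustration. Because the key only affects comparison, the original strings are returned unchanged, and the two equivalent strings end up next to each other.

```python
import unicodedata as ud

def nfc_key(s):
    """Compare strings by their lower-cased NFC form."""
    return ud.normalize("NFC", s).lower()

# Precomposed and decomposed forms of the same string in one list.
lz = ["\u00e9a", "za", "e\u0301a", "eb", "ba"]
result = sorted(lz, key=nfc_key)

# The two forms have equal keys, so they are adjacent, and the stable sort
# keeps them in their original relative order (precomposed first).
print([ascii(s) for s in result])
```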

Using PyICU

The most robust solution is to use either the Collator or RuleBasedCollator mechanisms in PyICU:

>>> from icu import Locale, Collator
>>> collator = Collator.createInstance(Locale.getRoot())
>>> sorted(lc, key=collator.getSortKey)
['ba', 'éa', 'eb', 'za']
>>> sorted(ld, key=collator.getSortKey)
['ba', 'éa', 'eb', 'za']

andjc commented Mar 23, 2022

Likewise with pandas dataframes:

import pandas as pd
data = {'Status':["za", "éa", "eb", "ba"], 'Value':[20, 21, 19, 18]}
df = pd.DataFrame(data)
sorted_df = df.sort_values(by=['Status'], ascending=True)
sorted_df

This would yield:

  Status  Value
3     ba     18
2     eb     19
0     za     20
1     éa     21

One possible method of locale-specific sorting involves reindexing the dataframe:

import locale
locale.setlocale(locale.LC_COLLATE, "en_AU.UTF-8")
alt_df = df.set_index('Status')
alt_df_sort = alt_df.reindex(sorted(alt_df.index, key=locale.strxfrm)).reset_index()
alt_df_sort

This would yield:

  Status  Value
0     ba     18
1     éa     21
2     eb     19
3     za     20

See https://github.com/enabling-languages/python-i18n/blob/main/notebooks/Collation.ipynb

Alternatively:

def df_sort(df, series, key=None, reverse=False):
    # Return the rows of df ordered by the values in series, sorted with an optional key.
    if not isinstance(series, pd.Series):
        raise TypeError("series must be a pandas Series")
    values = list(series)
    # Sort row positions rather than values, so duplicate values map to distinct rows.
    sort_key = (lambda i: values[i]) if key is None else (lambda i: key(values[i]))
    positions = sorted(range(len(values)), key=sort_key, reverse=reverse)
    return df.iloc[positions]

import locale
locale.setlocale(locale.LC_COLLATE, "en_AU.UTF-8")
new_sorted_df = df_sort(df, df['Status'], key=locale.strxfrm)
new_sorted_df
  Status  Value
3     ba     18
1     éa     21
2     eb     19
0     za     20

Note: df_sort() is based on a gist by A. Sean Pue, which in turn was based on Michael Delgado's response to a Stack Overflow question.
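Another possibility that avoids reindexing, assuming pandas 1.1.0 or later, is the key parameter of DataFrame.sort_values. The sketch below uses a normalisation key so the result does not depend on installed locales. The helper name nfc_key is hypothetical, and the data uses a decomposed é written as an escape sequence.

```python
import unicodedata as ud
import pandas as pd

data = {"Status": ["za", "e\u0301a", "eb", "ba"], "Value": [20, 21, 19, 18]}
df = pd.DataFrame(data)

def nfc_key(column):
    # sort_values passes the whole column (a Series) to key and expects
    # a Series of the same length back.
    return column.map(lambda s: ud.normalize("NFC", s).lower())

# Rows are ordered by the transformed values; the stored strings are unchanged.
sorted_df = df.sort_values(by="Status", key=nfc_key)
print(sorted_df)
```

For locale-aware ordering, a key such as lambda column: column.map(locale.strxfrm) could be substituted, with the result depending on the locale set via locale.setlocale().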


andjc commented Jul 12, 2022

This solution is obsolete; refer to the pandas sort notebook.
