Andj (andjc), Melbourne, Australia
andjc / docstrings.py
Created August 31, 2022 09:35 — forked from redlotus/docstrings.py
Google Style Python Docstrings
# -*- coding: utf-8 -*-
"""Example Google style docstrings.
This module demonstrates documentation as specified by the `Google Python
Style Guide`_. Docstrings may extend over multiple lines. Sections are created
with a section header and a colon followed by a block of indented text.
Example:
Examples can be given using either the ``Example`` or ``Examples``
sections. Sections support any reStructuredText formatting, including
andjc / macos_lc_collate.md
Last active February 26, 2023 11:44
LC_COLLATE on macOS
$ ls -al /usr/share/locale/*/LC_COLLATE
lrwxr-xr-x  1 root  wheel    29 11 Jan 18:03 /usr/share/locale/af_ZA.ISO8859-1/LC_COLLATE -> ../la_LN.ISO8859-1/LC_COLLATE
lrwxr-xr-x  1 root  wheel    30 11 Jan 18:03 /usr/share/locale/af_ZA.ISO8859-15/LC_COLLATE -> ../la_LN.ISO8859-15/LC_COLLATE
lrwxr-xr-x  1 root  wheel    29 11 Jan 18:03 /usr/share/locale/af_ZA.UTF-8/LC_COLLATE -> ../la_LN.ISO8859-1/LC_COLLATE
lrwxr-xr-x  1 root  wheel    29 11 Jan 18:03 /usr/share/locale/af_ZA/LC_COLLATE -> ../la_LN.ISO8859-1/LC_COLLATE
lrwxr-xr-x  1 root  wheel    28 11 Jan 18:03 /usr/share/locale/am_ET.UTF-8/LC_COLLATE -> ../la_LN.US-ASCII/LC_COLLATE
lrwxr-xr-x  1 root  wheel    28 11 Jan 18:03 /usr/share/locale/am_ET/LC_COLLATE -> ../la_LN.US-ASCII/LC_COLLATE
-r--r--r--  1 root  wheel  2086 11 Jan 18:03 /usr/share/locale/be_BY.CP1131/LC_COLLATE
-r--r--r--  1 root  wheel  2086 11 Jan 18:03 /usr/share/locale/be_BY.CP1251/LC_COLLATE
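
A note on what this listing shows (my reading, with a hedged sketch): nearly every locale's LC_COLLATE is a symlink to the generic la_LN tables, so BSD collation on macOS is not tailored per language. Assuming a macOS system with the Slovak locale installed, the standard locale module would be expected to reflect this:

>>> import locale
>>> locale.setlocale(locale.LC_COLLATE, "sk_SK.UTF-8")
'sk_SK.UTF-8'
>>> sorted(["čučoriedka", "cesta", "drevo"], key=locale.strxfrm)
['cesta', 'drevo', 'čučoriedka']
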
andjc / normalisation_sorting.md
Last active July 12, 2022 01:36
Unicode normalisation and default Python sorting

Snippet at https://github.com/enabling-languages/python-i18n/blob/main/snippets/sort_key_normalise.py

Default Python sorting

If we take two strings that differ only in their Unicode normalisation form, will Python sort them the same way? The strings éa (U+00E9 U+0061) and éa (U+0065 U+0301 U+0061) are canonically equivalent, but when we sort lists that differ only in the normalisation form of these two strings, we find the sort order differs.

>>> lc = ["za", "éa", "eb", "ba"]
>>> sorted(lc)
['ba', 'eb', 'za', 'éa']
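
To make the comparison explicit, here is a minimal sketch of my own, using the standard unicodedata module to build the two lists from NFC and NFD forms of the same string:

>>> import unicodedata
>>> lc_nfc = ["za", unicodedata.normalize("NFC", "éa"), "eb", "ba"]
>>> lc_nfd = ["za", unicodedata.normalize("NFD", "éa"), "eb", "ba"]
>>> sorted(lc_nfc)
['ba', 'eb', 'za', 'éa']
>>> sorted(lc_nfd)
['ba', 'eb', 'éa', 'za']

The default sort compares codepoint by codepoint: NFC é (U+00E9) is greater than z, so the string sorts last, while NFD e + U+0301 shares its first codepoint with eb and sorts between eb and za.
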
andjc / python_sorting.md
Last active March 19, 2022 04:38
Language-tailored sorting in Python

For more detailed information refer to Introduction to collation.

Python's list.sort() and sorted() functions are language-invariant and cannot be tailored: they give the same results regardless of the collation required by the language of the text. Both functions take a key parameter that can be used to transform the strings before sorting, or to target a particular component of an object to sort by. The sort results will be consistent across platforms.

The following examples use a random selection of Slovak words.

>>> words = ['zem', 'čučoriedka', 'drevo', 'štebot', 'cesta', 'černice', 'ďateľ', 'rum', 'železo', 'prameň', 'sob']
>>> sorted(words)
['cesta', 'drevo', 'prameň', 'rum', 'sob', 'zem', 'černice', 'čučoriedka', 'ďateľ', 'štebot', 'železo']
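
Tailoring the sort requires an external library. As a sketch (my example; PyICU, which appears in a later gist, is one option), an ICU collator for Slovak gives the tailored order, with č after c, ď after d, š after s, and ž after z:

>>> from icu import Collator, Locale
>>> collator = Collator.createInstance(Locale("sk"))
>>> sorted(words, key=collator.getSortKey)
['cesta', 'černice', 'čučoriedka', 'drevo', 'ďateľ', 'prameň', 'rum', 'sob', 'štebot', 'zem', 'železo']
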
andjc / letter_frequency.md
Last active March 17, 2022 10:58
Letter frequency of text
andjc / casefolding_matching.md
Last active January 28, 2024 05:13
Unicode casefolding and matching

Casefolding and matching

Default Case Folding

It is common to see the str.lower() method used in Python code when the developer wants to compare or match strings written in bicameral scripts. But lowercase is not a universal default: the default case for Cherokee, for instance, is uppercase rather than lowercase.

>>> s = "Ꮒꭶꮣ ꭰꮒᏼꮻ ꭴꮎꮥꮕꭲ ꭴꮎꮪꮣꮄꮣ ꭰꮄ ꭱꮷꮃꭽꮙ ꮎꭲ ꭰꮲꮙꮩꮧ ꭰꮄ ꭴꮒꮂ ꭲᏻꮎꮫꮧꭲ. Ꮎꮝꭹꮎꮓ ꭴꮅꮝꭺꮈꮤꮕꭹ ꭴꮰꮿꮝꮧ ꮕᏸꮅꮫꭹ ꭰꮄ ꭰꮣꮕꮦꮯꮣꮝꮧ ꭰꮄ ꭱꮅꮝꮧ ꮟᏼꮻꭽ ꮒꮪꮎꮣꮫꮎꮥꭼꭹ ꮎ ꮧꮎꮣꮕꮯ ꭰꮣꮕꮩ ꭼꮧ."
>>> sl = s.lower()
>>> su = s.upper()
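
For matching, str.casefold(), which implements Unicode default case folding, is generally a safer choice than str.lower(). A minimal illustration of my own, not from the gist preview, is German ß:

>>> "Straße".lower() == "STRASSE".lower()
False
>>> "Straße".casefold() == "STRASSE".casefold()
True
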
andjc / isalpha.md
Last active December 22, 2022 00:18
Python's str.isalpha()

The Python string method str.isalpha() is sometimes used as a constraint or validator. But how useful is it in code that needs to support multiple languages?

The Python documentation indicates that isalpha() matches any Unicode character whose general category is Lu, Ll, Lt, Lm, or Lo.

Unicode, by contrast, defines an alphabetic character as any character in Ll + Other_Lowercase + Lu + Other_Uppercase + Lt + Lm + Lo + Nl + Other_Alphabetic.

So a Unicode regex using \p{Alphabetic} can match characters that isalpha() does not, although in most practical cases the results will be the same.

It is interesting to note that the general categories Mn and Mc are not, as categories, part of either definition, even though many combining marks are alphabetic via the Other_Alphabetic property. What does this mean in practice?
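
One concrete consequence, using my own example and the third-party regex module that also appears in a later gist: a Devanagari dependent vowel sign such as U+093F has general category Mc but carries Other_Alphabetic, so the two definitions disagree about it:

>>> import regex
>>> ch = "\u093f"  # DEVANAGARI VOWEL SIGN I, general category Mc
>>> ch.isalpha()
False
>>> bool(regex.match(r"\p{Alphabetic}", ch))
True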

andjc / convert_digits.py
Last active April 7, 2022 23:03
Convert digits (as string) to int or float as appropriate. Currently does not support ideographic or algorithmic numbers.
import unicodedataplus as ud
import regex as re

def convert_digits(s, sep=(",", ".")):
    # Optional minus sign, a decimal digit, then any mix of digits and separators.
    nd = re.compile(r'^-?\p{Nd}[,.\u066B\u066C\u0020\u2009\p{Nd}]*$')
    tsep, dsep = sep
    if nd.match(s):
        s = s.replace(tsep, "")                          # strip grouping separators
        s = ''.join([str(ud.decimal(c, c)) for c in s])  # map each digit to its decimal value
        if dsep in s:
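
The preview cuts off mid-function, but going by the description above, usage would look something like this (the expected return values assume the missing tail converts the cleaned string to an int or a float):

>>> convert_digits("١٢٣٤٥")  # Arabic-Indic digits
12345
>>> convert_digits("1,234.5")
1234.5
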
andjc / format-numbers-spellout.md
Created February 27, 2022 06:54
Using PyICU to format and spellout numbers

Spellout numbers

from icu import RuleBasedNumberFormat, URBNFRuleSetTag, Locale
n = 111000
formatter = RuleBasedNumberFormat(URBNFRuleSetTag.SPELLOUT, Locale("hi"))
formatter.format(n)   # 'एक लाख ग्यारह हज़ार'
formatter2 = RuleBasedNumberFormat(URBNFRuleSetTag.SPELLOUT, Locale("en_IN"))
formatter2.format(n)  # 'one lakh eleven thousand'
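
The same API covers other rule sets. As a sketch of mine, not from the gist, the ordinal rules:

formatter3 = RuleBasedNumberFormat(URBNFRuleSetTag.ORDINAL, Locale("en"))
formatter3.format(3)   # '3rd'
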
andjc / graphemes_python.md
Last active May 20, 2024 01:59
Grapheme tokenisation in Python

When working with tokenisation and break iterators, it is sometimes necessary to work at the character, syllable, line, or sentence level. Character-level tokenisation is an interesting case. By character, I mean a user-perceivable unit of text, which the Unicode standard calls a grapheme. The usual way I see developers handle character-level tokenisation of English is a list comprehension, or typecasting a string to a list:

>>> t1 = "transformation"
>>> [char for char in t1]
['t', 'r', 'a', 'n', 's', 'f', 'o', 'r', 'm', 'a', 't', 'i', 'o', 'n']

This gives you discrete codepoints. But the approach doesn't work as well for many other languages. Let's take a Dinka string as an example:
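
The preview ends here, but the direction is clear. As a sketch (my example string; the gist's actual Dinka text is not shown), a string with combining diacritics splits into bare codepoints under a list comprehension, while the regex module's \X pattern yields grapheme clusters:

>>> import regex
>>> t2 = "ɛ̈ɛ̈c"  # ɛ + combining diaeresis (U+0308), twice, then c
>>> [char for char in t2]
['ɛ', '̈', 'ɛ', '̈', 'c']
>>> regex.findall(r'\X', t2)
['ɛ̈', 'ɛ̈', 'c']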