Andj (andjc), Melbourne, Australia
andjc / docstrings.py
Created August 31, 2022 09:35 — forked from redlotus/docstrings.py
Google Style Python Docstrings
# -*- coding: utf-8 -*-
"""Example Google style docstrings.
This module demonstrates documentation as specified by the `Google Python
Style Guide`_. Docstrings may extend over multiple lines. Sections are created
with a section header and a colon followed by a block of indented text.
Example:
Examples can be given using either the ``Example`` or ``Examples``
sections. Sections support any reStructuredText formatting, including
andjc / macos_lc_collate.md
Last active February 26, 2023 11:44
LC_COLLATE on macOS
$ ls -al /usr/share/locale/*/LC_COLLATE
lrwxr-xr-x  1 root  wheel    29 11 Jan 18:03 /usr/share/locale/af_ZA.ISO8859-1/LC_COLLATE -> ../la_LN.ISO8859-1/LC_COLLATE
lrwxr-xr-x  1 root  wheel    30 11 Jan 18:03 /usr/share/locale/af_ZA.ISO8859-15/LC_COLLATE -> ../la_LN.ISO8859-15/LC_COLLATE
lrwxr-xr-x  1 root  wheel    29 11 Jan 18:03 /usr/share/locale/af_ZA.UTF-8/LC_COLLATE -> ../la_LN.ISO8859-1/LC_COLLATE
lrwxr-xr-x  1 root  wheel    29 11 Jan 18:03 /usr/share/locale/af_ZA/LC_COLLATE -> ../la_LN.ISO8859-1/LC_COLLATE
lrwxr-xr-x  1 root  wheel    28 11 Jan 18:03 /usr/share/locale/am_ET.UTF-8/LC_COLLATE -> ../la_LN.US-ASCII/LC_COLLATE
lrwxr-xr-x  1 root  wheel    28 11 Jan 18:03 /usr/share/locale/am_ET/LC_COLLATE -> ../la_LN.US-ASCII/LC_COLLATE
-r--r--r--  1 root  wheel  2086 11 Jan 18:03 /usr/share/locale/be_BY.CP1131/LC_COLLATE
-r--r--r--  1 root  wheel  2086 11 Jan 18:03 /usr/share/locale/be_BY.CP1251/LC_COLLATE
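
A note on what this listing shows (my reading, with a hedged sketch): nearly every locale's LC_COLLATE is a symlink to the generic la_LN tables, so BSD collation on macOS is not tailored per language. Assuming a macOS system with the Slovak locale installed, the standard locale module would be expected to reflect this:

>>> import locale
>>> locale.setlocale(locale.LC_COLLATE, "sk_SK.UTF-8")
'sk_SK.UTF-8'
>>> sorted(["čučoriedka", "cesta", "drevo"], key=locale.strxfrm)
['cesta', 'drevo', 'čučoriedka']
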
andjc / normalisation_sorting.md
Last active July 12, 2022 01:36
Unicode normalisation and default Python sorting

Snippet at https://github.com/enabling-languages/python-i18n/blob/main/snippets/sort_key_normalise.py

Default Python sorting

If we take two strings that differ only in their Unicode normalisation form, will Python sort them the same way? The strings éa (U+00E9 U+0061) and éa (U+0065 U+0301 U+0061) are canonically equivalent, but when we sort lists that differ only in the normalisation form of these two strings, we find the sort order differs.

>>> lc = ["za", "éa", "eb", "ba"]
>>> sorted(lc)
['ba', 'eb', 'za', 'éa']
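
To make the comparison explicit, here is a minimal sketch of my own, using the standard unicodedata module to build the two lists from NFC and NFD forms of the same string:

>>> import unicodedata
>>> lc_nfc = ["za", unicodedata.normalize("NFC", "éa"), "eb", "ba"]
>>> lc_nfd = ["za", unicodedata.normalize("NFD", "éa"), "eb", "ba"]
>>> sorted(lc_nfc)
['ba', 'eb', 'za', 'éa']
>>> sorted(lc_nfd)
['ba', 'eb', 'éa', 'za']

The default sort compares codepoint by codepoint: NFC é (U+00E9) is greater than z, so the string sorts last, while NFD e + U+0301 shares its first codepoint with eb and sorts between eb and za.
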
andjc / python_sorting.md
Last active March 19, 2022 04:38
Language-tailored sorting in Python

For more detailed information refer to Introduction to collation.

Python's list.sort() and sorted() functions are language-invariant and cannot be tailored: they give the same results regardless of the collation required by the language of the text. Both functions take a key parameter that can be used to transform the strings before sorting, or to target a particular component of an object to sort by. The sort results will be consistent across platforms.

The following examples use a random selection of Slovak words.

>>> words = ['zem', 'čučoriedka', 'drevo', 'štebot', 'cesta', 'černice', 'ďateľ', 'rum', 'železo', 'prameň', 'sob']
>>> sorted(words)
['cesta', 'drevo', 'prameň', 'rum', 'sob', 'zem', 'černice', 'čučoriedka', 'ďateľ', 'štebot', 'železo']
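
Tailoring the sort requires an external library. As a sketch (my example; PyICU, which appears in a later gist, is one option), an ICU collator for Slovak gives the tailored order, with č after c, ď after d, š after s, and ž after z:

>>> from icu import Collator, Locale
>>> collator = Collator.createInstance(Locale("sk"))
>>> sorted(words, key=collator.getSortKey)
['cesta', 'černice', 'čučoriedka', 'drevo', 'ďateľ', 'prameň', 'rum', 'sob', 'štebot', 'zem', 'železo']
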
andjc / letter_frequency.md
Last active March 17, 2022 10:58
Letter frequency of text
andjc / casefolding_matching.md
Last active January 28, 2024 05:13
Unicode casefolding and matching

Casefolding and matching

Default Case Folding

It is common to see the str.lower() method used in Python code when the developer wants to compare or match strings written in bicameral scripts. But lowercase is not a universal default: the default case for Cherokee, for instance, is uppercase rather than lowercase.

>>> s = "Ꮒꭶꮣ ꭰꮒᏼꮻ ꭴꮎꮥꮕꭲ ꭴꮎꮪꮣꮄꮣ ꭰꮄ ꭱꮷꮃꭽꮙ ꮎꭲ ꭰꮲꮙꮩꮧ ꭰꮄ ꭴꮒꮂ ꭲᏻꮎꮫꮧꭲ. Ꮎꮝꭹꮎꮓ ꭴꮅꮝꭺꮈꮤꮕꭹ ꭴꮰꮿꮝꮧ ꮕᏸꮅꮫꭹ ꭰꮄ ꭰꮣꮕꮦꮯꮣꮝꮧ ꭰꮄ ꭱꮅꮝꮧ ꮟᏼꮻꭽ ꮒꮪꮎꮣꮫꮎꮥꭼꭹ ꮎ ꮧꮎꮣꮕꮯ ꭰꮣꮕꮩ ꭼꮧ."
>>> sl = s.lower()
>>> su = s.upper()
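
For matching, str.casefold(), which implements Unicode default case folding, is generally a safer choice than str.lower(). A minimal illustration of my own, not from the gist preview, is German ß:

>>> "Straße".lower() == "STRASSE".lower()
False
>>> "Straße".casefold() == "STRASSE".casefold()
True
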
andjc / isalpha.md
Last active December 22, 2022 00:18
Python's str.isalpha()

The Python string method str.isalpha() is sometimes used as a constraint or validator. But how useful is it in code that needs to support multiple languages?

The Python documentation indicates that isalpha() matches any Unicode character whose general category is Lu, Ll, Lt, Lm, or Lo.

Unicode, by contrast, defines an alphabetic character as any character in Ll + Other_Lowercase + Lu + Other_Uppercase + Lt + Lm + Lo + Nl + Other_Alphabetic.

So a Unicode regex using \p{Alphabetic} can match characters that isalpha() does not, although in most practical cases the results will be the same.

It is interesting to note that the general categories Mn and Mc are not, as categories, part of either definition, even though many combining marks are alphabetic via the Other_Alphabetic property. What does this mean in practice?
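
One concrete consequence, using my own example and the third-party regex module that also appears in a later gist: a Devanagari dependent vowel sign such as U+093F has general category Mc but carries Other_Alphabetic, so the two definitions disagree about it:

>>> import regex
>>> ch = "\u093f"  # DEVANAGARI VOWEL SIGN I, general category Mc
>>> ch.isalpha()
False
>>> bool(regex.match(r"\p{Alphabetic}", ch))
True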

andjc / convert_digits.py
Last active April 7, 2022 23:03
Convert digits (as string) to int or float as appropriate. Currently does not support ideographic or algorithmic numbers.
import unicodedataplus as ud
import regex as re

def convert_digits(s, sep=(",", ".")):
    # Optional minus sign, a decimal digit, then any mix of digits and separators.
    nd = re.compile(r'^-?\p{Nd}[,.\u066B\u066C\u0020\u2009\p{Nd}]*$')
    tsep, dsep = sep
    if nd.match(s):
        s = s.replace(tsep, "")                          # strip grouping separators
        s = ''.join([str(ud.decimal(c, c)) for c in s])  # map each digit to its decimal value
        if dsep in s:
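
The preview cuts off mid-function, but going by the description above, usage would look something like this (the expected return values assume the missing tail converts the cleaned string to an int or a float):

>>> convert_digits("١٢٣٤٥")  # Arabic-Indic digits
12345
>>> convert_digits("1,234.5")
1234.5
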
andjc / format-numbers-spellout.md
Created February 27, 2022 06:54
Using PyICU to format and spellout numbers

Spellout numbers

from icu import RuleBasedNumberFormat, URBNFRuleSetTag, Locale
n = 111000
formatter = RuleBasedNumberFormat(URBNFRuleSetTag.SPELLOUT, Locale("hi"))
formatter.format(n)   # 'एक लाख ग्यारह हज़ार'
formatter2 = RuleBasedNumberFormat(URBNFRuleSetTag.SPELLOUT, Locale("en_IN"))
formatter2.format(n)  # 'one lakh eleven thousand'
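
The same API covers other rule sets. As a sketch of mine, not from the gist, the ordinal rules:

formatter3 = RuleBasedNumberFormat(URBNFRuleSetTag.ORDINAL, Locale("en"))
formatter3.format(3)   # '3rd'
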
andjc / graphemes_python.md
Last active May 20, 2024 01:59
Grapheme tokenisation in Python

When working with tokenisation and break iterators, it is sometimes necessary to work at the character, syllable, line, or sentence level. Character-level tokenisation is an interesting case. By character, I mean a user-perceivable unit of text, which the Unicode standard calls a grapheme. The usual way I see developers handle character-level tokenisation of English is a list comprehension, or typecasting a string to a list:

>>> t1 = "transformation"
>>> [char for char in t1]
['t', 'r', 'a', 'n', 's', 'f', 'o', 'r', 'm', 'a', 't', 'i', 'o', 'n']

This gives you discrete codepoints. But the approach doesn't work as well for many other languages. Let's take a Dinka string as an example:
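
The preview ends here, but the direction is clear. As a sketch (my example string; the gist's actual Dinka text is not shown), a string with combining diacritics splits into bare codepoints under a list comprehension, while the regex module's \X pattern yields grapheme clusters:

>>> import regex
>>> t2 = "ɛ̈ɛ̈c"  # ɛ + combining diaeresis (U+0308), twice, then c
>>> [char for char in t2]
['ɛ', '̈', 'ɛ', '̈', 'c']
>>> regex.findall(r'\X', t2)
['ɛ̈', 'ɛ̈', 'c']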