Navigation Menu

Skip to content

Instantly share code, notes, and snippets.

View andjc's full-sized avatar

Andj andjc

  • Melbourne, Australia
View GitHub Profile
@andjc
andjc / installation_commands.txt
Last active April 21, 2024 13:40
Setup instructions for a Ubuntu 23.10 hashicorp vagrant box
sudo apt-get update
sudo apt-get install -y unzip git cmake python3-pip python3.11-venv libfreetype6-dev libharfbuzz-dev libfribidi-dev meson gtk-doc-tools libcairo2-dev libfontconfig-dev libjpeg-dev zlib1g-dev libpng-dev libtiff5-dev libfreetype6-dev liblcms2-dev libwebp-dev libxcb1-dev
mkdir ~/tmp
cd tmp
git clone https://github.com/HOST-Oman/libraqm.git
git clone https://github.com/ninja-build/ninja.git
cd ninja
./configure.py --bootstrap
@andjc
andjc / localised_dataframe_persian.ipynb
Created April 14, 2024 20:52
localised_dataframe_persian.ipynb
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
@andjc
andjc / localised_dataframe_persian.ipynb
Created April 13, 2024 12:46
localised_dataframe_persian.ipynb
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
@andjc
andjc / rbbi.json
Last active March 22, 2024 01:05
Rules sets for custom break iterators
{
"din": "!!quoted_literals_only; $CR = [\\p{Grapheme_Cluster_Break = CR}]; $LF = [\\p{Grapheme_Cluster_Break = LF}]; $Control = [[\\p{Grapheme_Cluster_Break = Control}]]; $Extend = [[\\p{Grapheme_Cluster_Break = Extend}]]; $ZWJ = [\\p{Grapheme_Cluster_Break = ZWJ}]; $Regional_Indicator = [\\p{Grapheme_Cluster_Break = Regional_Indicator}]; $Prepend = [\\p{Grapheme_Cluster_Break = Prepend}]; $SpacingMark = [\\p{Grapheme_Cluster_Break = SpacingMark}]; $Virama = [\\p{Gujr}\\p{sc=Telu}\\p{sc=Mlym}\\p{sc=Orya}\\p{sc=Beng}\\p{sc=Deva}&\\p{Indic_Syllabic_Category=Virama}]; $LinkingConsonant = [\\p{Gujr}\\p{sc=Telu}\\p{sc=Mlym}\\p{sc=Orya}\\p{sc=Beng}\\p{sc=Deva}&\\p{Indic_Syllabic_Category=Consonant}]; $ExtCccZwj = [[\\p{gcb=Extend}-\\p{ccc=0}] \\p{gcb=ZWJ}]; $L = [\\p{Grapheme_Cluster_Break = L}]; $V = [\\p{Grapheme_Cluster_Break = V}]; $T = [\\p{Grapheme_Cluster_Break = T}]; $LV = [\\p{Grapheme_Cluster_Break = LV}]; $LVT = [\\p{Grapheme_Cluster_Break = LVT}]; $Extended_Pict = [:ExtPict:]; !!chain; 'AA'|'Aa'|
@andjc
andjc / UAX_29.py
Created February 28, 2024 03:07 — forked from HughP/UAX_29.py
PyICU
# We start by loading up PyICU.
import PyICU as icu
# Let's create a test text. Notice it contains some punctuation.
test = u"This is (\"a\") test!"
# We create a wordbreak iterator. All break iterators in ICU are really RuleBasedBreakIterators, and we need to tell it which locale to take the word break rules from. Most locales have the same rules for UAX#29 so we will use English.
wb = icu.BreakIterator.createWordInstance(icu.Locale.getEnglish())
# An iterator is just that. It contains state and then we iterate over it. The state in this case is the text we want to break. So we set that.
@andjc
andjc / casefolding_matching.md
Last active January 28, 2024 05:13
Unicode casefolding and matching

Casefolding and matching

Default Case Folding

It is common to see the str.lower() method used in Python code when the developer wants to compare or match strings written in bicameral scripts. But it is not universal. For instance, the default case for Cherokee is uppercase instead of lowercase.

>>> s = "Ꮒꭶꮣ ꭰꮒᏼꮻ ꭴꮎꮥꮕꭲ ꭴꮎꮪꮣꮄꮣ ꭰꮄ ꭱꮷꮃꭽꮙ ꮎꭲ ꭰꮲꮙꮩꮧ ꭰꮄ ꭴꮒꮂ ꭲᏻꮎꮫꮧꭲ. Ꮎꮝꭹꮎꮓ ꭴꮅꮝꭺꮈꮤꮕꭹ ꭴꮰꮿꮝꮧ ꮕᏸꮅꮫꭹ ꭰꮄ ꭰꮣꮕꮦꮯꮣꮝꮧ ꭰꮄ ꭱꮅꮝꮧ ꮟᏼꮻꭽ ꮒꮪꮎꮣꮫꮎꮥꭼꭹ ꮎ ꮧꮎꮣꮕꮯ ꭰꮣꮕꮩ ꭼꮧ."
>>> sl = s.lower()
>>> su = s.upper()
@andjc
andjc / african_script_fonts.md
Created December 6, 2023 05:29
List of fonts supporting African scripts.

African Script fonts

Adlam

  • ADLaM Display – OFL 1.1; 1 file (Regular)
  • Ebrima – Commercial; 2 files (Regular, Bold)
  • Kigelia – Commercial; 6 files (Light, Light Italic, Regular, Italic, Bold, Bold Italic)
  • Noto Sans Adlam – OFL 1.1; 4 files (Regular, Medium, SemiBold, Bold); 1 variable font.
  • Noto Sans Adlam Unjoined – OFL 1.1; 4 files (Regular, Medium, SemiBold, Bold); 1 variable font.
@andjc
andjc / is_confusable.md
Last active December 5, 2023 04:14
Get skeleton for confusable characters

Exemplars for confusable characters (normalising confusable data)

Normally we preprocessing text, we want to normalise our data. Unicode Normalisation Forms KC and KD can be used for converting compatibility characters during normalisation. This will handle soem confusable characters, but not all. The function below attempts to normalise confusable characters.

In is_confusable() we parse a string using icu.SpoofChecker, which is based on Unicode Technical Report #36 and Unicode Technical Standard #39.

UTS 39 defines two strings to be confusable if they map to the same skeleton.

@andjc
andjc / ngraphs.py
Last active August 8, 2023 07:06
For a specific text (string) identify and count occurances of (character or grapheme based) ngraphs in text
from collections import Counter
import regex
class ngraphs:
"""Calculate ngraph occurrences for target string
Attributes
----------
text: str
A plain text string to be analysed. Specific to ngraph instance.
@andjc
andjc / work-with-multiple-github-accounts.md
Created August 7, 2023 15:33 — forked from rahularity/work-with-multiple-github-accounts.md
How To Work With Multiple Github Accounts on your PC

How To Work With Multiple Github Accounts on a single Machine

Let suppose I have two github accounts, https://github.com/rahul-office and https://github.com/rahul-personal. Now i want to setup my mac to easily talk to both the github accounts.

NOTE: This logic can be extended to more than two accounts also. :)

The setup can be done in 5 easy steps:

Steps:

  • Step 1 : Create SSH keys for all accounts
  • Step 2 : Add SSH keys to SSH Agent