Andj andjc

  • Melbourne, Australia
@andjc
andjc / japanese-font-family.md
Created December 1, 2019 00:22 — forked from prantlf/japanese-font-family.md
Japanese default css font family

Most Japanese websites rely on the default fonts shipped with Windows, Mac, or Ubuntu. The latest ones are Meiryo, Hiragino Kaku Gothic Pro, and Noto. For older systems such as Windows XP, it is worth adding the former default fonts, MS Gothic (or MS Mincho) and Osaka. Older Linux versions may include the Takao fonts.

Some old browsers do not recognize the English font names, and others do not recognize the Japanese names, so it is safest to list each font in both Japanese and English.

Hiragino is listed before Meiryo because Mac users may have Meiryo installed through MS Office, while Hiragino is more familiar and renders better on a Mac, so the list starts with the Hiragino series.

So the currently recommended practice looks like this:

font-family: "ヒラギノ角ゴ Pro W3", "Hiragino Kaku Gothic Pro", Osaka, メイリオ, Meiryo, "MS Pゴシック", "MS PGothic", "MS ゴシック", "MS Gothic", "Noto Sans CJK JP", TakaoPGothic, sans-serif;
andjc / gist:37d734339e010e4be6a86771ae76cde0
Created June 22, 2020 02:05 — forked from dpk/gist:8325992
PyICU cheat sheet

PyICU cheat sheet

Because you can't get the docs.

Transliteration

Create a transliterator:

greek2latin = icu.Transliterator.createInstance('Greek-Latin')
andjc / Installing PyICU, libpostal, pypostal on Mac OS X 10.14+.md
Created June 22, 2021 13:56 — forked from ddelange/Installing PyICU, libpostal, pypostal on Mac OS X 10.14+.md
Installation instructions for libicu-dev, PyICU, libpostal, pypostal on Mac OS X 10.14+

Installing PyICU, libpostal, pypostal on Mac OS X 10.14+

libicu-dev (PyICU dependency)

brew uninstall --ignore-dependencies icu4c
brew install pkg-config icu4c  # keg-only
andjc / gist:d28411019877d6f5d79811263f886b4d
Created August 31, 2021 01:44 — forked from seanpue/gist:e1cb846f676194ae77eb
Sort pandas dataframe using icu locale
import icu  # PyICU is imported as `icu`; the code below refers to icu.Collator and icu.Locale
# below from http://stackoverflow.com/questions/13838405/custom-sorting-in-pandas-dataframe
df = token_count
locale = 'UR.UTF-8'
collator = icu.Collator.createInstance(icu.Locale(locale))
def sort_pd(key=None, reverse=False, cmp=None):
    def sorter(series):
andjc / osx_pyicu.md
Created September 3, 2021 05:25 — forked from frytoli/osx_pyicu.md

Installing PyICU on Mac

Install icu4c and get the version

$ brew install icu4c

Do what Homebrew tells you to do: set the necessary environment variables

andjc / thai_font.py
Created September 20, 2021 08:13 — forked from korakot/thai_font.py
Add a Thai font to a Google Colaboratory notebook
!wget https://github.com/Phonbopit/sarabun-webfont/raw/master/fonts/thsarabunnew-webfont.ttf
# !pip install -U --pre matplotlib
import matplotlib as mpl
import matplotlib.font_manager  # load the submodule explicitly so mpl.font_manager is available
mpl.font_manager.fontManager.addfont('thsarabunnew-webfont.ttf') # 3.2+
mpl.rc('font', family='TH Sarabun New')
andjc / gist:54a1a4a1e6441471fa436140fee080a6
Created September 25, 2021 04:43 — forked from timhodson/gist:033f635421628cc9361f
Install yaz-client on a Mac
ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)"

brew install yaz

yaz-client will then be available and you can use it like this:

Run the program and you get a Z> prompt.

andjc / graphemes_python.md
Last active March 6, 2022 23:07
Grapheme tokenisation in Python

When working with tokenisation and break iterators, it is sometimes necessary to work at the character, syllable, line, or sentence level. Character-level tokenisation is an interesting case. By character, I mean a user-perceived unit of text, which the Unicode Standard refers to as a grapheme. The usual way I see developers handle character-level tokenisation of English is with a list comprehension or by converting the string to a list:

>>> t1 = "transformation"
>>> [char for char in t1]
['t', 'r', 'a', 'n', 's', 'f', 'o', 'r', 'm', 'a', 't', 'i', 'o', 'n']

This gives you discrete code points, which for English usually coincide with characters. But the approach does not work as well for other languages. Let's take a Dinka string as an example:
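
As a minimal sketch of the problem: the Dinka letters ɛ̈ and ɔ̈ have no precomposed forms, so each is stored as a base letter followed by U+0308 COMBINING DIAERESIS. The two-letter sample below is an illustrative assumption, not necessarily the string the author uses, and the helper `simple_clusters` is hypothetical. It only attaches combining marks to the preceding base character, which approximates grapheme clusters without implementing the full UAX #29 rules.

```python
import unicodedata


def simple_clusters(text):
    """Group each base character with the combining marks that follow it.

    This is an approximation only; it does not implement UAX #29.
    """
    clusters = []
    for ch in text:
        # A non-zero canonical combining class marks a combining character.
        if clusters and unicodedata.combining(ch):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


# Illustrative sample: two Dinka letters, each a base letter + U+0308.
sample = "\u025b\u0308\u0254\u0308"

codepoints = list(sample)            # splits the marks from their base letters
clusters = simple_clusters(sample)   # keeps each letter together

print(len(codepoints), len(clusters))
```

Splitting with `list()` separates each diaeresis from its base letter, while the cluster-based grouping keeps the two user-perceived characters intact.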

andjc / format-numbers-spellout.md
Created February 27, 2022 06:54
Using PyICU to format and spellout numbers

Spellout numbers

from icu import RuleBasedNumberFormat, URBNFRuleSetTag, Locale
n = 111000
formatter = RuleBasedNumberFormat(URBNFRuleSetTag.SPELLOUT, Locale("hi"))
formatter.format(n)   # 'एक लाख ग्यारह हज़ार'
formatter2 = RuleBasedNumberFormat(URBNFRuleSetTag.SPELLOUT, Locale("en_IN"))
formatter2.format(n)  # 'lakh eleven thousand'
andjc / convert_digits.py
Last active April 7, 2022 23:03
Convert digits (as a string) to int or float as appropriate. Currently does not support ideographic numbers or algorithmic numbers.
import unicodedataplus as ud
import regex as re
def convert_digits(s, sep=(",", ".")):
    nd = re.compile(r'^-?\p{Nd}[,.\u066B\u066C\u0020\u2009\p{Nd}]*$')
    tsep, dsep = sep
    if nd.match(s):
        s = s.replace(tsep, "")
        s = ''.join([str(ud.decimal(c, c)) for c in s])
        if dsep in s:
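
A self-contained sketch of the same idea, using only the standard library `unicodedata` module in place of `unicodedataplus` and `regex`. The function name `to_number` and its `tsep`/`dsep` parameters are hypothetical, input validation is left out, and this is an assumption about the general approach rather than the author's implementation.

```python
import unicodedata


def to_number(s, tsep=",", dsep="."):
    """Convert a string of Unicode decimal digits to int or float.

    Hypothetical helper: no validation is performed on the input.
    """
    # Drop the thousands separator first.
    s = s.replace(tsep, "")
    # Map each decimal digit (in any script) to its ASCII value;
    # characters that are not decimal digits are passed through unchanged.
    ascii_s = "".join(str(unicodedata.decimal(c, c)) for c in s)
    # Normalise the decimal separator to "." so float() can parse it.
    ascii_s = ascii_s.replace(dsep, ".")
    return float(ascii_s) if "." in ascii_s else int(ascii_s)


print(to_number("1,234"))
print(to_number("\u0967\u0968\u0969"))  # Devanagari digits
print(to_number("\u0661\u0662\u0663\u066b\u0665", tsep="\u066c", dsep="\u066b"))  # Arabic-Indic digits
```

Passing the separators as arguments keeps the function independent of any one locale's conventions, at the cost of requiring the caller to know which separators the input uses.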