Andj andjc

## japanese-font-family.md

      
andjc / japanese-font-family.md
Created December 1, 2019 00:22, forked from prantlf/japanese-font-family.md

Japanese default CSS font family

Most Japanese websites rely on the default fonts shipped with Windows, Mac or Ubuntu. The latest ones are Meiryo, Hiragino Kaku Gothic Pro and Noto. For older systems such as Windows XP, it is worth adding the former default fonts MS Gothic (or MS Mincho) and Osaka. Older Linux versions may include the Takao fonts.
Some old browsers do not understand these font names in English, and others do not recognize the names in Japanese, so it is safest to write each name in both Japanese and English.
As for the order of Meiryo and Hiragino: Mac users may have Meiryo installed by MS Office, but Hiragino is more familiar and looks better on a Mac, so it is better to list the Hiragino series first.
So the current recommended practice looks like this:
font-family: "ヒラギノ角ゴ Pro W3", "Hiragino Kaku Gothic Pro", Osaka, メイリオ, Meiryo, "ＭＳ Ｐゴシック", "MS PGothic", "ＭＳ ゴシック", "MS Gothic", "Noto Sans CJK JP", TakaoPGothic, sans-serif;
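
As a rough illustration of the "write both names" advice, the following Python sketch assembles such a declaration from an ordered list of family names. The helper `css_font_family` and its quoting rule are assumptions made for this example and are not part of the original gist. It quotes every name that is not a plain ASCII identifier, which is stricter than CSS requires.

```python
import re

# Generic CSS families must never be quoted.
GENERIC_FAMILIES = {"serif", "sans-serif", "monospace", "cursive", "fantasy"}


def css_font_family(names):
    """Build a font-family declaration (hypothetical helper for illustration)."""
    parts = []
    for name in names:
        is_plain = re.fullmatch(r"[A-Za-z][A-Za-z0-9-]*", name) is not None
        if name in GENERIC_FAMILIES or is_plain:
            parts.append(name)
        else:
            # Names with spaces or non-ASCII characters are quoted to be safe.
            parts.append('"' + name + '"')
    return "font-family: " + ", ".join(parts) + ";"


# Hiragino first, then Meiryo, each in Japanese and English, generic fallback last.
stack = [
    "ヒラギノ角ゴ Pro W3", "Hiragino Kaku Gothic Pro",
    "Osaka",
    "メイリオ", "Meiryo",
    "Noto Sans CJK JP",
    "TakaoPGothic",
    "sans-serif",
]
declaration = css_font_family(stack)
print(declaration)
```

Keeping the stack as data makes it easy to reorder the Mac and Windows entries without retyping the quoting by hand.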


## gist:37d734339e010e4be6a86771ae76cde0

      
andjc / gist:37d734339e010e4be6a86771ae76cde0
Created June 22, 2020 02:05, forked from dpk/gist:8325992

PyICU cheat sheet

Because you can't get the docs.
Transliteration

Create a transliterator:
greek2latin = icu.Transliterator.createInstance('Greek-Latin')
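
A hedged usage sketch: once created, the transliterator can be applied to a string with its `transliterate` method. The wrapper `to_latin` and the sample text are assumptions for illustration, and the import is guarded so the snippet still loads where PyICU is not installed.

```python
try:
    import icu  # PyICU is imported under the name `icu`
except ImportError:  # PyICU not available in this environment
    icu = None


def to_latin(text, transform_id="Greek-Latin"):
    """Transliterate text with ICU, or return None when PyICU is missing.

    `to_latin` is a hypothetical helper, not part of PyICU.
    """
    if icu is None:
        return None
    transliterator = icu.Transliterator.createInstance(transform_id)
    return transliterator.transliterate(text)


latin = to_latin("Ελληνικά")
print(latin)
```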

  
## Installing PyICU, libpostal, pypostal on Mac OS X 10.14+.md

      
andjc / Installing PyICU, libpostal, pypostal on Mac OS X 10.14+.md
Created June 22, 2021 13:56, forked from ddelange/Installing PyICU, libpostal, pypostal on Mac OS X 10.14+.md

Installation instructions for libicu-dev, PyICU, libpostal, pypostal on Mac OS X 10.14+

Installing PyICU, libpostal, pypostal on Mac OS X 10.14+

libicu-dev (PyICU dependency)

brew uninstall --ignore-dependencies icu4c
brew install pkg-config icu4c  # keg-only

  
## gist:d28411019877d6f5d79811263f886b4d
import icu  # PyICU is imported as `icu`, which the code below uses

# below from http://stackoverflow.com/questions/13838405/custom-sorting-in-pandas-dataframe

df = token_count
locale = 'UR.UTF-8'
collator = icu.Collator.createInstance(icu.Locale(locale))

def sort_pd(key=None,reverse=False,cmp=None):
    def sorter(series):
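
The listing above breaks off inside `sort_pd`. As a sketch of the same idea under stated assumptions, a collator's `getSortKey` method can serve as a sort key directly. The helper `collation_key`, the `"ur"` locale identifier, and the sample words are illustrative choices. When PyICU is not installed, the helper falls back to `str.casefold`, which is not locale-aware.

```python
try:
    import icu  # PyICU
except ImportError:  # PyICU not available in this environment
    icu = None


def collation_key(locale_id="ur"):
    """Return a key function for sorted() (hypothetical helper).

    Uses ICU collation when PyICU is installed; otherwise falls back to
    str.casefold, which ignores locale-specific ordering.
    """
    if icu is None:
        return str.casefold
    collator = icu.Collator.createInstance(icu.Locale(locale_id))
    return collator.getSortKey


# Sample data for illustration only.
words = ["پاکستان", "ایران", "بھارت"]
sorted_words = sorted(words, key=collation_key("ur"))
print(sorted_words)
```

With pandas 1.1 or later, the same key function could presumably be applied through `sort_values(key=lambda s: s.map(key_function))`, which avoids the Python 2 style `cmp` argument.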

## osx_pyicu.md

      
andjc / osx_pyicu.md
Created September 3, 2021 05:25, forked from frytoli/osx_pyicu.md

Installing PyICU on Mac

Install icu4c and get the version

$ brew install icu4c

Do what Homebrew tells you to do: set the necessary environment variables


## thai_font.py
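# Download the TH Sarabun New font file (notebook shell command)
!wget https://github.com/Phonbopit/sarabun-webfont/raw/master/fonts/thsarabunnew-webfont.ttf
# !pip install -U --pre matplotlib
import matplotlib as mpl
import matplotlib.font_manager  # make sure the font_manager submodule is loaded
# Register the downloaded font with matplotlib (addfont needs matplotlib 3.2+)
mpl.font_manager.fontManager.addfont('thsarabunnew-webfont.ttf')
# Use it as the default font family
mpl.rc('font', family='TH Sarabun New')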

## gist:54a1a4a1e6441471fa436140fee080a6

      
andjc / gist:54a1a4a1e6441471fa436140fee080a6
Created September 25, 2021 04:43, forked from timhodson/gist:033f635421628cc9361f

Install yaz-client on a Mac

    ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)"

brew install yaz

yaz-client will then be available. Run the program and you get a Z> prompt.


## graphemes_python.md

      
andjc / graphemes_python.md
Last active September 18, 2024 20:05

Grapheme tokenisation in Python

When working with tokenisation and break iterators, it is sometimes necessary to work at the character, syllable, line, or sentence level. Character-level tokenisation is an interesting case. By character, I mean a user-perceived unit of text, which the Unicode standard refers to as a grapheme. The usual way I see developers handle character-level tokenisation of English is with a list comprehension or by converting a string to a list:
>>> t1 = "transformation"
>>> [char for char in t1]
['t', 'r', 'a', 'n', 's', 'f', 'o', 'r', 'm', 'a', 't', 'i', 'o', 'n']
This will give you discrete characters or codepoints. But this approach doesn't work as well for other languages.
Let's take a Dinka string as an example:
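
A minimal sketch of the problem, using only the standard library. The sample string is a hypothetical stand-in (a vowel carrying a combining diaeresis, as Dinka orthography uses for breathy vowels, followed by `ŋ`), not the author's own example. The helper `simple_graphemes` is a simplified approximation that attaches combining marks to the preceding base character. It does not implement the full Unicode grapheme cluster rules.

```python
import unicodedata


def simple_graphemes(text):
    """Group each base character with the combining marks that follow it.

    Hypothetical helper: a rough approximation, not full UAX #29 segmentation.
    """
    clusters = []
    for ch in text:
        # General categories Mn, Mc and Me are combining marks.
        if clusters and unicodedata.category(ch).startswith("M"):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


# U+025B (ɛ) + U+0308 (combining diaeresis), then U+014B (ŋ); illustrative only.
sample = "\u025b\u0308\u014b"

code_points = list(sample)            # splits the diaeresis from its vowel
graphemes = simple_graphemes(sample)  # keeps the vowel and diaeresis together

print(len(code_points), len(graphemes))
```

Here the plain list has three items while the reader sees two characters, which is the mismatch that grapheme-aware tokenisation addresses.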

  
## format-numbers-spellout.md

      
andjc / format-numbers-spellout.md
Created February 27, 2022 06:54

Using PyICU to format and spell out numbers

Spellout numbers

from icu import RuleBasedNumberFormat, URBNFRuleSetTag, Locale
n = 111000
formatter = RuleBasedNumberFormat(URBNFRuleSetTag.SPELLOUT, Locale("hi"))
formatter.format(n)   # 'एक लाख ग्यारह हज़ार'
formatter2 = RuleBasedNumberFormat(URBNFRuleSetTag.SPELLOUT, Locale("en_IN"))
formatter2.format(n)  # 'one lakh eleven thousand'

  
## convert_digits.py
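import unicodedataplus as ud
import regex as re

def convert_digits(s, sep=(",", ".")):
    # Optional minus sign, a leading decimal digit from any script, then any
    # mix of digits, commas, full stops, Arabic separators and spaces.
    nd = re.compile(r'^-?\p{Nd}[,.\u066B\u066C\u0020\u2009\p{Nd}]*$')
    # sep holds the thousands separator and the decimal separator.
    tsep, dsep = sep
    if nd.match(s):
        # Drop the thousands separator, then map every digit to its ASCII form.
        s = s.replace(tsep, "")
        s = ''.join([str(ud.decimal(c, c)) for c in s])
        if dsep in s: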
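
The function above is cut short in this listing. As a separate sketch of its core step, under the assumption that only the standard library is available, digits from any script can be mapped to ASCII with `unicodedata.decimal`. The name `digits_to_ascii` is hypothetical, and this version leaves separators untouched and does not convert to a number.

```python
import unicodedata


def digits_to_ascii(text):
    """Replace every Unicode decimal digit with its ASCII equivalent.

    Hypothetical helper: characters that are not decimal digits
    (separators, signs, letters) are passed through unchanged.
    """
    return "".join(
        str(unicodedata.decimal(ch)) if ch.isdecimal() else ch
        for ch in text
    )


arabic_indic = "\u0662\u0660\u0662\u0664"  # Arabic-Indic digits 2, 0, 2, 4
devanagari = "\u0967\u0968\u0969"          # Devanagari digits 1, 2, 3

print(digits_to_ascii(arabic_indic))
print(digits_to_ascii(devanagari))
```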