@dpk
Last active February 27, 2024 05:08
PyICU cheat sheet

Because you can't get the docs.

Transliteration

Create a transliterator:

import icu

greek2latin = icu.Transliterator.createInstance('Greek-Latin')

Transliterate:

greek2latin.transliterate('Ψάπφω') # => 'Psápphō'

Inverse transformation:

latin2greek = icu.Transliterator.createInstance('Greek-Latin', icu.UTransDirection.REVERSE)
latin2greek.transliterate('Psápphō') # => 'Ψάπφω'

or

latin2greek = greek2latin.createInverse()
latin2greek.transliterate('Psápphō') # => 'Ψάπφω'

See http://demo.icu-project.org/icu-bin/translit and http://userguide.icu-project.org/transforms/general for an idea of what kind of transliteration is built in.
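
An inverse transform may not restore every input exactly, so a quick round-trip check can be handy before relying on one. The sketch below is an illustration only: `round_trips` and `failed_round_trips` are hypothetical helpers, not part of PyICU, and they assume nothing beyond objects that expose `transliterate()`, as the transliterators above do.

```python
def round_trips(text, forward, inverse):
    """Return True if inverse(forward(text)) gives back the original text."""
    transliterated = forward.transliterate(text)
    return inverse.transliterate(transliterated) == text


def failed_round_trips(texts, forward, inverse):
    """Collect the inputs that do not survive a forward-then-inverse pass."""
    return [t for t in texts if not round_trips(t, forward, inverse)]


# With PyICU installed, using the names from this section:
#   greek2latin = icu.Transliterator.createInstance('Greek-Latin')
#   latin2greek = greek2latin.createInverse()
#   failed_round_trips(['Ψάπφω'], greek2latin, latin2greek)
```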

Locales

Create a locale object:

britain = icu.Locale('en-GB')
french_ca = icu.Locale('fr_CA')
# etc.

There are also a few shortcuts:

icu.Locale.getFrance()
icu.Locale.getDefault()
# etc.

You can get a few bits of information, such as the display name, from each locale object:

britain.getDisplayName() # => 'English (United Kingdom)'
french_ca.getDisplayLanguage() # => 'French'
# etc.

Collation

See the bit above on Locales first; you'll need to understand locales in order to work the collator.

Create a collator for a particular Locale:

collator = icu.Collator.createInstance(icu.Locale('en_GB'))

Sort a list of strings, e.g.:

sorted(['sandwiches', 'angel delight', 'custard', 'éclairs', 'glühwein'], key=collator.getSortKey) #=> ['angel delight', 'custard', 'éclairs', 'glühwein', 'sandwiches']
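
Real data is more often a list of records than a list of bare strings. A minimal sketch, assuming each record is a dict: `sort_records` is a hypothetical helper, and the only ICU call it relies on is `getSortKey`, as in the line above.

```python
def sort_records(records, field, collator):
    """Sort a list of dicts by one text field, using the collator's sort keys."""
    # sorted() calls the key function once per record, so each sort key is
    # computed only once however many comparisons the sort needs.
    return sorted(records, key=lambda record: collator.getSortKey(record[field]))


# With PyICU installed:
#   collator = icu.Collator.createInstance(icu.Locale('en_GB'))
#   desserts = [{'name': 'éclairs'}, {'name': 'custard'}]
#   sort_records(desserts, 'name', collator)
```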

Rule-based collation (tailoring)

The following makes (or should make — tailoring is a bit of a black art) thorn (Þþ) sort in Old English order (see Michael Everson's article, Sorting the letter ÞORN):

collator = icu.RuleBasedCollator('[normalization on]\n&t<þ<u\n&T<Þ<U\n&Þ=þ')
sorted(['þinking', 'tweet', 'uppity', 'Typography', 'Þeology', 'Urology'], key=collator.getSortKey) # => ['tweet', 'Typography', 'Þeology', 'þinking', 'uppity', 'Urology']

Tweaking a locale-based collation with extra rules

Ignore word breaks in Welsh:

rules = icu.Collator.createInstance(icu.Locale('cy')).getRules()
rules = '[alternate shifted]' + rules
collator = icu.RuleBasedCollator(rules)

Date format

Date-time:

from datetime import datetime

formatter = icu.DateFormat.createDateTimeInstance(icu.DateFormat.LONG, icu.DateFormat.kDefault, icu.Locale('de_DE'))
formatter.format(datetime.now()) #=> '26. Juli 2014 14:57:22'

For a date-only or time-only formatter, replace the formatter line with one of the following:

formatter = icu.DateFormat.createDateInstance(icu.DateFormat.LONG, icu.Locale('de_DE'))
formatter = icu.DateFormat.createTimeInstance(icu.DateFormat.LONG, icu.Locale('de_DE'))

Break Iteration

Unfortunately this is even more of a pain than you’d hope.

de_words = icu.BreakIterator.createWordInstance(icu.Locale('de_DE'))
de_words.setText('Bist du in der U-Bahn geboren?')
de_words.nextBoundary() #=> 4
de_words.nextBoundary() #=> 5
# etc.

The following function might be useful:

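def iterate_breaks(text, break_iterator):
    """Yield the pieces of text between successive boundaries."""
    break_iterator.setText(text)
    lastpos = 0
    while True:
        next_boundary = break_iterator.nextBoundary()
        # nextBoundary() returns -1 once the end of the text is reached
        if next_boundary == -1:
            return
        yield text[lastpos:next_boundary]
        lastpos = next_boundary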

Usage:

de_words = icu.BreakIterator.createWordInstance(icu.Locale('de_DE'))
list(iterate_breaks('Bist du in der U-Bahn geboren?', de_words))
#=> ['Bist', ' ', 'du', ' ', 'in', ' ', 'der', ' ', 'U', '-', 'Bahn', ' ', 'geboren', '?']
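
A word iterator reports every boundary, so spaces and punctuation come back as pieces of their own. If only the words are wanted, one simple approach is to filter the pieces afterwards. A sketch using a hypothetical helper, `words_only`, in plain Python with no further ICU calls:

```python
def words_only(pieces):
    """Keep only the pieces that contain at least one letter or digit."""
    return [piece for piece in pieces if any(ch.isalnum() for ch in piece)]


pieces = ['Bist', ' ', 'du', ' ', 'in', ' ', 'der', ' ', 'U', '-', 'Bahn', ' ', 'geboren', '?']
print(words_only(pieces))
# ['Bist', 'du', 'in', 'der', 'U', 'Bahn', 'geboren']
```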
@ibraheem-moosa

Hi, thanks for this. Saved me a lot of time. 👍

@satchamo

In the last example, make sure the text argument is a UnicodeString object, and not a native Python string. If you use a native string, the indices for the slices won't line up because Python doesn't count characters the same way as the library. For example:

import icu
t = "hello 🤦🏼‍♂️ world" # native Python string
bi = icu.BreakIterator.createLineInstance(icu.Locale('en_US'))
bi.setText(t)
position = 0
while True:
    next_position = bi.nextBoundary()
    if next_position == -1:
        break
    print(t[position:next_position])
    position = next_position

If you use the UnicodeString, you won't have the issue:

import icu
t = icu.UnicodeString("hello 🤦🏼‍♂️ world") # icu string
bi = icu.BreakIterator.createLineInstance(icu.Locale('en_US'))
bi.setText(t)
position = 0
while True:
    next_position = bi.nextBoundary()
    if next_position == -1:
        break
    print(t[position:next_position])
    position = next_position
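
If converting to UnicodeString is not convenient, another option is to translate the boundaries into native Python string indices. This is a sketch under one assumption: that the offsets ICU reports count UTF-16 code units, so each character above U+FFFF counts as two. `utf16_to_str_indices` and `split_at` are hypothetical helpers, not PyICU functions, and the boundaries in the example are written by hand for illustration rather than produced by a break iterator.

```python
def utf16_to_str_indices(text, offsets):
    """Map UTF-16 code-unit offsets to indices into a native Python str."""
    table = {}
    units = 0
    for index, char in enumerate(text):
        table[units] = index
        # Characters outside the Basic Multilingual Plane take two code units.
        units += 2 if ord(char) > 0xFFFF else 1
    table[units] = len(text)
    # An offset that falls inside a surrogate pair raises KeyError.
    return [table[offset] for offset in offsets]


def split_at(text, utf16_boundaries):
    """Slice a native str at boundaries given in UTF-16 code units."""
    pieces, last = [], 0
    for index in utf16_to_str_indices(text, utf16_boundaries):
        pieces.append(text[last:index])
        last = index
    return pieces


text = 'hi 😀 you'
print(split_at(text, [3, 6, 9]))
# ['hi ', '😀 ', 'you']
```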

@alanorth

alanorth commented Jan 9, 2024

In 2024, with icu 74.2 and PyICU 2.12 I get:

AttributeError: module 'icu' has no attribute 'Collator'

@Hesam1991

I have a similar error.
When I run the code below:
collator = icu.Collator.createInstance(icu.Locale('fa_IR.UTF-8'))
I receive the error below:
AttributeError: module 'icu' has no attribute 'Collator'
