@santhoshtr
Created April 6, 2018 17:21
ICU-based string comparison using various collation strengths
from icu import Locale, Collator as ICUCollator
import locale
collator = ICUCollator.createInstance(Locale("ml_IN"))
word1="അവൻ"
word2="അവ‍ന്\u200d" # "അവന്"
# compare() returns 0 when the strings are equal at the current strength, else -1 or 1.
collator.setStrength(ICUCollator.PRIMARY)
print("[ICU] Are they primary equal? ", collator.compare(word1, word2))
collator.setStrength(ICUCollator.SECONDARY)
print("[ICU] Are they secondary equal? ", collator.compare(word1, word2))
collator.setStrength(ICUCollator.TERTIARY)
print("[ICU] Are they tertiary equal? ", collator.compare(word1, word2))
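# glibc: strcoll() returns 0 when equal, otherwise a negative or positive number.
# The ml_IN.UTF-8 locale must be installed, or setlocale() raises locale.Error.
locale.setlocale(locale.LC_ALL, "ml_IN.UTF-8")
print("[GLIBC] Are they equal? ", locale.strcoll(word1, word2))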
@santhoshtr
Author

Sample output

[ICU] Are they primary equal?  0
[ICU] Are they secondary equal?  0
[ICU] Are they tertiary equal?  1
[GLIBC] Are they equal?  955
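
The numbers above are comparator results, not booleans: 0 means "equal", anything else means the words sort differently. A minimal sketch of how to read them, using only the values printed above (the `describe` helper and the labels are my own, not part of ICU or glibc):

```python
def describe(result: int) -> str:
    """Turn a comparator result (negative, zero, positive) into words."""
    if result == 0:
        return "equal"
    if result < 0:
        return "first sorts before second"
    return "first sorts after second"


# Values copied from the sample output above.
sample = {
    "ICU primary": 0,
    "ICU secondary": 0,
    "ICU tertiary": 1,
    "glibc strcoll": 955,
}

for label, value in sample.items():
    print(f"{label}: {describe(value)}")
```

Only the sign of the glibc result matters; the magnitude 955 carries no portable meaning.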

@santhoshtr
Author

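Basically, this says that the word spelled with the atomic chillu and the one spelled with the ZWJ-based chillu differ only at the tertiary level: ICU reports them as equal (0) at primary and secondary strength. glibc's strcoll, in contrast, returns a nonzero value, so it treats them as different.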
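
To make the two spellings visible, here is a small sketch that lists their code points with the standard `unicodedata` module. I am assuming the ZWJ-based chillu is the sequence NA + VIRAMA + ZERO WIDTH JOINER; the literals are written with escapes so nothing invisible is lost, and they may not match the gist's `word2` byte for byte. The `codepoints` helper is hypothetical.

```python
import unicodedata

# Atomic chillu spelling: A, VA, CHILLU N (a single code point, U+0D7B).
atomic = "\u0d05\u0d35\u0d7b"
# ZWJ-based spelling (assumed): A, VA, NA, VIRAMA, ZERO WIDTH JOINER.
zwj_based = "\u0d05\u0d35\u0d28\u0d4d\u200d"


def codepoints(word: str) -> list[str]:
    """Return 'U+XXXX NAME' for every code point in the word."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch)}" for ch in word]


for label, word in (("atomic", atomic), ("zwj-based", zwj_based)):
    print(label)
    for line in codepoints(word):
        print("  ", line)

# Normalization does not unify the two spellings, so a collator has to.
same_after_nfc = (
    unicodedata.normalize("NFC", atomic) == unicodedata.normalize("NFC", zwj_based)
)
print("Equal after NFC?", same_after_nfc)
```

Since the code point sequences differ and NFC leaves both unchanged, a plain `==` will not treat the two words as equal, which is why the comparison goes through a collator.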
