TwitterCLDR notes.

Tailoring specs results (diff with ICU)

  1. JA:
tw:  failures: [["x ", "x"], ["X ", "X"], ["xゞx", "xヽ"]]
icu: failures: [["x ", "x"], ["X ", "X"]]

The character 'ゞ' (code point 0x309E) is not in NFD form: it decomposes to 0x309D 0x3099. The FCE table nevertheless has an entry for the non-normalized form: 309E; [0E 25, 05, 05][, DA 95, 05]. Because all strings are normalized first, we never use this entry and instead build the collation elements for this character from the CEs for 0x309D and 0x3099, which are [0E 25, 05, 05] and [, DA 95, 05]. This causes no issue in the default locale, because the results are identical. But when 'ゝ' (code point 0x309D) is tailored from [0E 25, 05, 05] to [0E 29, 05, 05] in the JA locale, we get the wrong collation elements [0E 29, 05, 05][, DA 95, 05] for 'ゞ'.

This is only one test failure, but in practice there might be more cases like it. The problem is that the FCE table contains non-normalized code points, and because we normalize all strings before collation, we never look up their entries. It's a bit unexpected and I'm not sure how we can fix it.
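
To make the mismatch concrete, here is a minimal Python sketch of the lookup described above. It is not TwitterCLDR code: the dict-based table layout, the `collation_elements` helper and the string form of the CEs are all assumptions made for illustration, and the toy JA table assumes the tailoring touches only 0x309D and leaves the 309E entry as in the default table. Only the code points and CE values come from the notes.

```python
import unicodedata

# Toy fragment of the FCE table, keyed by code point sequence.
# CE values are taken from the notes; the layout is hypothetical.
DEFAULT_TABLE = {
    (0x309D,): ["[0E 25, 05, 05]"],
    (0x3099,): ["[, DA 95, 05]"],
    (0x309E,): ["[0E 25, 05, 05]", "[, DA 95, 05]"],
}

# JA tailoring from the notes: only the entry for 0x309D changes
# (assumption: the 309E entry itself is left untouched).
JA_TABLE = dict(DEFAULT_TABLE)
JA_TABLE[(0x309D,)] = ["[0E 29, 05, 05]"]


def collation_elements(text, table):
    """Normalize to NFD, then look up every code point on its own."""
    elements = []
    for ch in unicodedata.normalize("NFD", text):
        elements.extend(table[(ord(ch),)])
    return elements


# U+309E decomposes into U+309D U+3099 under NFD.
nfd = [hex(ord(ch)) for ch in unicodedata.normalize("NFD", "\u309E")]
print(nfd)  # ['0x309d', '0x3099']

# Default locale: the result matches the direct 309E entry.
print(collation_elements("\u309E", DEFAULT_TABLE) == DEFAULT_TABLE[(0x309E,)])

# JA locale: the tailored CE for 0x309D leaks into the result for 0x309E,
# so it no longer matches the direct 309E entry.
print(collation_elements("\u309E", JA_TABLE))
print(JA_TABLE[(0x309E,)])
```

In the default table both paths give the same elements, which is why the problem only shows up once a tailoring changes one of the decomposed code points.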

Test failures for all other locales are identical to those of ICU, which might be considered a good result if we think of ICU as the reference implementation.

camertron Jul 14, 2012

Hey @KL-7, I've got a few small corrections for this (awesome) writeup:

  1. Under "Summary", #3 JS should be JA.
  2. Under "Summary", #4 should be prefixed with ZH-HANT like the other ones.
  3. The links to the CLDR Trac repo seem to be broken...

Otherwise, this rocks. Thanks!

KL-7 Jul 14, 2012

@camertron, I made the corrections, thanks. The links should be working, though. I believe they have some network issues today, because links from the official site are not opening either.

camertron Jul 25, 2012

Uppercase-first sorting for Danish is finished - can you update this gist?

KL-7 Jul 25, 2012

Thanks for mentioning that. I completely removed Danish from the list, because we have only three failures with it now and all of them are identical to the failures of ICU.
