class String
  # See https://www.unicode.org/versions/Unicode15.0.0/UnicodeStandard-15.0.pdf#page=355
  COMBINING_DIACRITICS = [*0x1DC0..0x1DFF, *0x0300..0x036F, *0xFE20..0xFE2F].pack('U*')

  def removeaccents
    self
      .unicode_normalize(:nfd) # Decompose characters
      .tr(COMBINING_DIACRITICS, '')
      .unicode_normalize(:nfc) # Recompose characters
  end
end
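A quick usage sketch, in case it helps to see the method in action. The patch is repeated (in condensed form) only so the snippet runs on its own; the sample strings are arbitrary examples, not taken from the gist.

```ruby
# Same monkey patch as above, condensed so this snippet is self-contained.
class String
  COMBINING_DIACRITICS = [*0x1DC0..0x1DFF, *0x0300..0x036F, *0xFE20..0xFE2F].pack('U*')

  def removeaccents
    unicode_normalize(:nfd)          # Decompose, e.g. "é" becomes "e" + U+0301
      .tr(COMBINING_DIACRITICS, '')  # An empty replacement makes tr delete the marks
      .unicode_normalize(:nfc)       # Recompose whatever is left
  end
end

puts 'Crème brûlée'.removeaccents # prints "Creme brulee"
puts 'Ångström'.removeaccents     # prints "Angstrom"
```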
@bkazez ł isn't replaced because it's not an accented character but a letter in its own right, part of the Polish alphabet. The stroke can't be “removed” because ł is a single atomic character, not a composed one. In fact, it doesn't have any decomposition in Unicode.
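One way to check this yourself is to print the code points of the NFD form. This is only an illustrative sketch; the helper name nfd_codepoints is made up for this example.

```ruby
# Hypothetical helper: code points of the NFD (decomposed) form of a string.
def nfd_codepoints(str)
  str.unicode_normalize(:nfd).codepoints.map { |cp| format('U+%04X', cp) }
end

# "é" (U+00E9) splits into a base letter plus a combining acute accent.
p nfd_codepoints("\u00E9") # prints ["U+0065", "U+0301"]

# "ł" (U+0142) has no decomposition, so NFD leaves it untouched.
p nfd_codepoints("\u0142") # prints ["U+0142"]
```

Because ł never yields a combining mark, there is nothing for the tr step to delete.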
You nonetheless make a very interesting point! Even though ł isn't a composed character per se, it can still be useful to replace it with l to account for e.g. forms that were filled in with an English keyboard (where l was used because ł wasn't available and it looked similar to it).
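If you want that behaviour, a minimal sketch could add an explicit mapping step for such letters. The method name remove_accents_and_strokes and the two mapping constants are hypothetical, and only ł/Ł are covered here; other stroked letters would need their own entries.

```ruby
COMBINING_DIACRITICS = [*0x1DC0..0x1DFF, *0x0300..0x036F, *0xFE20..0xFE2F].pack('U*')

# Assumption: a hand-written table for letters without a Unicode decomposition.
STROKED_LETTERS = "\u0142\u0141" # ł Ł
PLAIN_LETTERS   = 'lL'

def remove_accents_and_strokes(str)
  str
    .unicode_normalize(:nfd)            # Decompose characters
    .tr(COMBINING_DIACRITICS, '')       # Drop combining marks
    .tr(STROKED_LETTERS, PLAIN_LETTERS) # Map ł -> l and Ł -> L explicitly
    .unicode_normalize(:nfc)            # Recompose characters
end

puts remove_accents_and_strokes('Łódź')   # prints "Lodz"
puts remove_accents_and_strokes('Wałęsa') # prints "Walesa"
```

Unlike a transliteration to a target locale, characters that are not in the table are left as they are.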
I'm not enthusiastic about I18n.transliterate() however, because it replaces every character that can't be transliterated to the target locale with ?.
If you only use it to compare strings then I suppose it works (although with false positives, because the non-transliterable characters are all converted to ?). If you plan to store the result in a database or output it somewhere, then I18n.transliterate() is a no-go.
For me on ruby 2.6.10p210 (2022-04-12 revision 67958) [universal.arm64e-darwin22], this doesn't convert ł to l. I had to use this:
I did a benchmark on a string with 144525 characters and I18n appears faster:
On a string 1.46 GB long, the I18n gem had the clear advantage:
With a more real-world test (an array of 1275 strings, averaging 111 characters each), the I18n gem is 3x faster: