@devnoname120
Last active June 29, 2024 05:12
[Ruby] Remove accents from UTF-8 string
class String
  # See https://www.unicode.org/versions/Unicode15.0.0/UnicodeStandard-15.0.pdf#page=355
  COMBINING_DIACRITICS = [*0x1DC0..0x1DFF, *0x0300..0x036F, *0xFE20..0xFE2F].pack('U*')

  def removeaccents
    self
      .unicode_normalize(:nfd) # Decompose characters
      .tr(COMBINING_DIACRITICS, '')
      .unicode_normalize(:nfc) # Recompose characters
  end
end
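
A short usage sketch may help show what the snippet does and does not strip. It repeats the definition so it runs on its own; the sample strings are illustrative assumptions, not taken from the gist.

```ruby
class String
  # Same three combining-mark ranges as the gist, packed into one UTF-8 string.
  COMBINING_DIACRITICS = [*0x1DC0..0x1DFF, *0x0300..0x036F, *0xFE20..0xFE2F].pack('U*')

  def removeaccents
    unicode_normalize(:nfd)           # Decompose: é becomes e + U+0301
      .tr(COMBINING_DIACRITICS, '')   # An empty replacement makes tr delete the marks
      .unicode_normalize(:nfc)        # Recompose whatever is left
  end
end

puts 'Crème brûlée'.removeaccents # accents on è, û, é are removed
puts 'Łódź'.removeaccents         # ó and ź lose their accents, Ł stays as it is
```

Letters that carry no combining mark after decomposition, such as Ł, pass through unchanged.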
@devnoname120
Author

@bkazez ł isn't replaced because it's not an accented character but a standalone letter of the Polish alphabet. The stroke can't be “removed” because ł is a single precomposed character (U+0142), not a base letter plus a combining mark. In fact, it has no canonical decomposition in Unicode.
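
The difference can be checked directly by looking at the code points that NFD produces. This is only a quick illustration; é is used here as an arbitrary example of a letter that does decompose.

```ruby
# é (U+00E9) has a canonical decomposition: e (U+0065) + combining acute (U+0301).
e_acute = "\u00E9".unicode_normalize(:nfd)
p e_acute.codepoints.map { |cp| format('U+%04X', cp) }
# → ["U+0065", "U+0301"]

# ł (U+0142) has none, so NFD returns the same single code point.
l_stroke = "\u0142".unicode_normalize(:nfd)
p l_stroke.codepoints.map { |cp| format('U+%04X', cp) }
# → ["U+0142"]
```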

You nonetheless make a very interesting point! Even though ł isn't a composed character per se, it can still be useful to replace it with l, e.g. to account for forms that were filled in on an English keyboard (where l was typed because ł wasn't available and looks similar).
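
One possible way to handle such letters is an explicit lookup table applied alongside the diacritic stripping. The sketch below is an assumption, not part of the gist: the module name `AccentFolding`, the method `fold`, and the deliberately incomplete `STANDALONE_LETTERS` map are all hypothetical.

```ruby
module AccentFolding
  # Same combining-mark ranges as the gist.
  COMBINING_DIACRITICS = [*0x1DC0..0x1DFF, *0x0300..0x036F, *0xFE20..0xFE2F].pack('U*')

  # Hypothetical, incomplete map of letters that have no canonical decomposition.
  STANDALONE_LETTERS = {
    'ł' => 'l', 'Ł' => 'L',
    'ø' => 'o', 'Ø' => 'O',
    'đ' => 'd', 'Đ' => 'D'
  }.freeze

  STANDALONE_PATTERN = Regexp.union(STANDALONE_LETTERS.keys)

  def self.fold(text)
    text
      .unicode_normalize(:nfd)                       # Decompose characters
      .tr(COMBINING_DIACRITICS, '')                  # Drop combining marks
      .gsub(STANDALONE_PATTERN, STANDALONE_LETTERS)  # Replace mapped letters only
      .unicode_normalize(:nfc)                       # Recompose characters
  end
end

puts AccentFolding.fold('Łódź')   # Ł is mapped to L, accents are stripped
puts AccentFolding.fold('Jørgen') # ø is mapped to o
puts AccentFolding.fold('日本語')  # not in the map, left untouched
```

Characters outside the map pass through unchanged, so the result stays safe to store or display; the trade-off is that the table has to be maintained by hand.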

I'm not enthusiastic about I18n.transliterate(), however, because it replaces every character it can't transliterate for the target locale with ?.

If you only use it to compare strings then I suppose it works (although with false positives, because all the non-transliterable characters collapse to the same ?). If you plan to store the result in a database or output it somewhere, then I18n.transliterate() is a no-go.
