Last active May 17, 2018 11:57
Normalize a string to ASCII: convert accented characters to their unaccented equivalents.
# -*- coding: utf-8 -*-
import unicodedata


def normalize(string, encoding="utf-8"):
    u"""
    Normalize a string to ASCII: convert accented characters to their unaccented equivalents.

    >>> normalize(u"Dès Noël où un zéphyr haï me vêt de glaçons würmiens je dîne d'exquis rôtis de bœuf au kir à l'aÿ d'âge mûr & cætera")
    "Des Noel ou un zephyr hai me vet de glacons wurmiens je dine d'exquis rotis de boeuf au kir a l'ay d'age mur & caetera"

    :type string: str | bytes | unicode
    :param string: the unicode or binary string to normalize.
    :param str encoding: Encoding used to decode binary strings.
    :return: the normalized string, ASCII-encoded (``str`` on Python 2, ``bytes`` on Python 3).
    """
    # Decode binary input to text; text input is left unchanged.
    string = string.decode(encoding) if isinstance(string, type(b'')) else string
    # Replace the "oe" and "ae" ligatures explicitly: they have no NFKD
    # decomposition, so they would otherwise be dropped.
    string = string.replace(u"æ", u"ae").replace(u"Æ", u"AE")
    string = string.replace(u"œ", u"oe").replace(u"Œ", u"OE")
    # Decompose accented characters (NFKD), then discard the non-ASCII combining marks.
    return unicodedata.normalize('NFKD', string).encode('ascii', 'ignore')
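
The snippet above returns ASCII-encoded bytes, which on Python 3 is a `bytes` object, not a `str`. A minimal sketch of the same technique that returns text is shown below. The names `to_ascii_text` and `LIGATURES` are hypothetical and not part of the gist. It also shows why the ligatures need explicit handling: they have no NFKD decomposition.

```python
import unicodedata

# Hypothetical names, not part of the original gist.
# Ligatures that NFKD does not decompose and that would otherwise be dropped.
LIGATURES = {u"æ": u"ae", u"Æ": u"AE", u"œ": u"oe", u"Œ": u"OE"}


def to_ascii_text(text):
    """Return an ASCII-only text version of ``text`` (assumes text input, not bytes)."""
    for ligature, replacement in LIGATURES.items():
        text = text.replace(ligature, replacement)
    # NFKD splits "ë" into "e" plus a combining diaeresis.
    decomposed = unicodedata.normalize("NFKD", text)
    # Dropping non-ASCII code points removes the combining marks;
    # decoding again turns the ASCII bytes back into text.
    return decomposed.encode("ascii", "ignore").decode("ascii")


print(to_ascii_text(u"Noël, bœuf, cætera"))  # Noel, boeuf, caetera

# Without the explicit replacement, the ligature disappears entirely.
print(unicodedata.normalize("NFKD", u"œ").encode("ascii", "ignore"))  # b'' on Python 3
```

Decoding the result back to text is a design choice for Python 3 callers who want a `str`. Callers who need bytes can skip the final `.decode("ascii")`.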