@tantale
Last active May 17, 2018 11:57
Normalize a string to ASCII: replace accented characters with their unaccented equivalents.
import unicodedata


def normalize(string, encoding="utf-8"):
    u"""
    Normalize a string to ASCII: replace accented characters with their unaccented equivalents.

    >>> normalize(u"Dès Noël où un zéphyr haï me vêt de glaçons würmiens je dîne d'exquis rôtis de bœuf au kir à l'aÿ d'âge mûr & cætera")
    "Des Noel ou un zephyr hai me vet de glacons wurmiens je dine d'exquis rotis de boeuf au kir a l'ay d'age mur & caetera"

    :type string: str | bytes | unicode
    :param string: the unicode or binary string to normalize.
    :param str encoding: Encoding used to decode binary strings.
    :return: the normalized string, ASCII-encoded (``str`` on Python 2, ``bytes`` on Python 3).
    """
    # Decode binary input so the rest of the function works on text.
    string = string.decode(encoding) if isinstance(string, type(b'')) else string
    # Replace the "oe" and "ae" ligatures by hand: they have no Unicode
    # decomposition, so they would otherwise be dropped.
    string = string.replace(u"æ", u"ae").replace(u"Æ", u"AE")
    string = string.replace(u"œ", u"oe").replace(u"Œ", u"OE")
    # NFKD splits each accented letter into a base letter plus a combining mark;
    # encoding to ASCII with 'ignore' then discards the marks.
    return unicodedata.normalize('NFKD', string).encode('ascii', 'ignore')
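
On Python 3, `normalize` above returns `bytes`, because `.encode('ascii', 'ignore')` is its last step. The following is a minimal sketch of a Python 3 only variant that returns text instead. The name `normalize_to_text` is hypothetical and not part of the original gist. The approach is otherwise the same: manual ligature replacement, then NFKD decomposition, then an ASCII round trip.

```python
import unicodedata

# Ligatures that have no Unicode decomposition and must be mapped by hand.
LIGATURES = (("æ", "ae"), ("Æ", "AE"), ("œ", "oe"), ("Œ", "OE"))


def normalize_to_text(string, encoding="utf-8"):
    """Hypothetical Python 3 variant of normalize() that returns str, not bytes."""
    if isinstance(string, bytes):
        string = string.decode(encoding)
    for ligature, replacement in LIGATURES:
        string = string.replace(ligature, replacement)
    # NFKD separates base letters from their combining marks.
    decomposed = unicodedata.normalize("NFKD", string)
    # Encoding with "ignore" drops every non-ASCII code point (the marks);
    # decoding the ASCII bytes gives back a str.
    return decomposed.encode("ascii", "ignore").decode("ascii")


print(normalize_to_text("Dès Noël"))  # Des Noel
```

As in the original, any character that has neither an ASCII decomposition nor an entry in the ligature table (for example "ß" or "ø") is silently removed rather than transliterated.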