@lsdr
Last active July 19, 2019 14:44
Remove accents from a UTF-8/Latin-1 "string" (which should actually be a byte string) and return plain ASCII
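import unicodedata


def asciify(string, encoding='latin-1'):
    """Given a string with bytes coming from a DB or other ill-developed data
    extraction, clean it up and return an ASCII string free of accents.

    Raises UnicodeDecodeError if the recovered bytes are not valid UTF-8.

    >>> asciify('Ba\xc3\xba')
    'Bau'
    >>> asciify('Bau')
    'Bau'
    >>> asciify('Baú', encoding='utf-8')
    'Bau'
    """
    # Recover the raw bytes that were wrongly decoded with `encoding`.
    bs = bytes(string, encoding)
    # Those bytes are really UTF-8, so decode them as such.
    ns = bs.decode('utf-8')
    # NFKD splits accented letters into a base letter plus combining marks.
    ns = unicodedata.normalize('NFKD', ns)
    # Encoding to ASCII with 'ignore' drops the combining marks.
    ns = ns.encode('ascii', 'ignore')
    return str(ns, 'ascii')


if __name__ == "__main__":
    import doctest
    doctest.testmod(verbose=True)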
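
With the default `encoding='latin-1'`, `asciify` raises `UnicodeDecodeError` when the input was already decoded correctly (for example a real `'Baú'`), because the byte `0xFA` on its own is not valid UTF-8. A minimal sketch of a more tolerant variant follows. The name `asciify_lenient` and the fallback behaviour are assumptions for illustration, not part of the original gist.

```python
import unicodedata


def asciify_lenient(text, encoding='latin-1'):
    """Hypothetical variant of asciify that tolerates text that is not mojibake."""
    try:
        # Undo the wrong decoding: recover the raw bytes, then read them as UTF-8.
        fixed = text.encode(encoding).decode('utf-8')
    except (UnicodeEncodeError, UnicodeDecodeError):
        # The round trip failed, so assume the text was decoded correctly.
        fixed = text
    # NFKD splits 'ú' into 'u' plus a combining acute accent (U+0301).
    decomposed = unicodedata.normalize('NFKD', fixed)
    # Dropping every non-ASCII code point removes the combining marks.
    return decomposed.encode('ascii', 'ignore').decode('ascii')


print(asciify_lenient('Ba\xc3\xba'))  # mojibake form of 'Baú'
print(asciify_lenient('Ba\xfa'))      # correctly decoded 'Baú'
```

The fallback is a heuristic: a correctly decoded string that happens to round-trip as valid UTF-8 would still be reinterpreted, so this only suits data where mojibake is the expected case.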