Created
April 21, 2017 01:09
deaccent in python
import unicodedata


def deaccent(text):
    """
    Remove accents from the given string. Input text is either a unicode
    string (str) or a UTF-8 encoded bytestring (bytes).

    Return the input string with accents removed, as a unicode string.

    >>> deaccent("Šéf chomutovských komunistů dostal poštou bílý prášek")
    'Sef chomutovskych komunistu dostal postou bily prasek'
    """
    if isinstance(text, bytes):
        # assume utf8 for byte strings, use default (strict) error handling
        text = text.decode('utf8')
    # NFD splits each accented character into a base character plus combining marks
    norm = unicodedata.normalize("NFD", text)
    # drop the combining marks (Unicode category 'Mn': Mark, nonspacing)
    result = ''.join(ch for ch in norm if unicodedata.category(ch) != 'Mn')
    # recompose whatever is left into canonical composed form
    return unicodedata.normalize("NFC", result)
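
The function above works in three steps: decompose with NFD, drop the combining marks, recompose with NFC. The sketch below walks through those steps on small inputs, assuming Python 3. The helper name `strip_marks` is hypothetical and exists only so that the example is self-contained; it is not part of the gist. It also shows one limitation of this approach: letters such as "Ł" have no canonical decomposition, so they pass through unchanged.

```python
import unicodedata


def strip_marks(text):
    """Hypothetical stand-in for deaccent(), restricted to str input."""
    decomposed = unicodedata.normalize("NFD", text)
    kept = "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
    return unicodedata.normalize("NFC", kept)


# Step 1: NFD turns one precomposed character into base letter + combining mark.
decomposed = unicodedata.normalize("NFD", "é")
print([unicodedata.name(ch) for ch in decomposed])
# ['LATIN SMALL LETTER E', 'COMBINING ACUTE ACCENT']

# Step 2: the combining mark has category 'Mn', so the filter removes it.
print([unicodedata.category(ch) for ch in decomposed])
# ['Ll', 'Mn']

# Step 3: the full pipeline on a word.
print(strip_marks("Šéf"))   # Sef

# Limitation: "Ł" does not decompose under NFD, so it is kept as-is.
print(strip_marks("Łódź"))  # Łodz
```

If characters like "Ł", "ø" or "ß" also need to be mapped to ASCII, an explicit translation table on top of this function is one option.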