# -*- coding: utf-8 -*-
""" Normalise (normalize) unicode data in Python to remove umlauts, accents etc. """
import unicodedata

data = u'naïve café'
# NFKD splits each accented letter into a base letter plus a combining mark;
# encoding to ASCII with 'ignore' then drops the combining marks.
normal = unicodedata.normalize('NFKD', data).encode('ASCII', 'ignore')
print(normal.decode('ASCII'))
# prints "naive cafe"
ranvijay9286 commented Aug 21, 2014
Thanks, it's working.
renatofmartins commented Mar 4, 2015
Does it work with a list instead of a single string?
drathier commented May 12, 2015
@renatofmartins use a list comprehension.
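For anyone landing here with the same question, a minimal sketch of the list approach follows. The helper name `strip_accents` and the sample words are made up for illustration; the normalization step is the same one used in the gist, with a final decode so the result is a `str` on Python 3.

```python
import unicodedata

def strip_accents(text):
    """Return text with accents removed, as an ASCII-only str."""
    # Same technique as the gist: decompose, then drop non-ASCII marks.
    decomposed = unicodedata.normalize('NFKD', text)
    return decomposed.encode('ASCII', 'ignore').decode('ASCII')

# Hypothetical sample data.
words = ['naïve', 'café', 'Zürich']

# Apply the helper to every item with a list comprehension.
normal_words = [strip_accents(word) for word in words]
print(normal_words)
# prints ['naive', 'cafe', 'Zurich']
```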
frangeris commented Aug 31, 2016
How can I keep the letter
erm3nda commented Apr 9, 2018
@frangeris, that's a great question. This is what I ended up with. It works perfectly but ignores anything combined with the ~ tilde character ('COMBINING TILDE'). To be exact, it ONLY normalizes letters combined with ´ or ` and nothing else:
The docs don't say much about which combinations normalize() handles, but you can get the whole idea here: ftp://ftp.unicode.org/Public/9.0.0/ucd/NormalizationTest.txt (search for "COMBINING" near the bottom of the document to see all the options).
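The snippet from this comment did not survive on the page, so the following is not the commenter's code. It is only a sketch of the behaviour described: strip the grave and acute combining marks and leave everything else, including COMBINING TILDE, untouched. The names `STRIP_MARKS` and `strip_selected_accents` are invented here, and the sketch uses NFD/NFC rather than NFKD so that no other characters are rewritten.

```python
import unicodedata

# Combining marks to drop: COMBINING GRAVE ACCENT and COMBINING ACUTE ACCENT.
STRIP_MARKS = {'\u0300', '\u0301'}

def strip_selected_accents(text, marks=STRIP_MARKS):
    """Remove only the given combining marks; keep all other accents."""
    # Decompose so each accent becomes a separate combining character.
    decomposed = unicodedata.normalize('NFD', text)
    kept = ''.join(ch for ch in decomposed if ch not in marks)
    # Recompose so surviving marks (e.g. the tilde) rejoin their base letter.
    return unicodedata.normalize('NFC', kept)

print(strip_selected_accents('El niño está aquí'))
# prints "El niño esta aqui"
```

Adding more code points to `STRIP_MARKS` (for example `'\u0308'` for the diaeresis) would widen what gets removed.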
hhubers commented Aug 22, 2018
@frangeris a quick and probably non-Pythonic solution is as follows:
Replace -&- with some other random character combination that doesn't appear in your text.
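The code from this comment is also missing from the page. Going only by the placeholder hint, the idea can be sketched roughly as below; this is a guess at the approach, not the original snippet. The function name `strip_accents_keeping`, its parameters, and the choice of `ñ` as the letter to protect are all assumptions made for the example.

```python
import unicodedata

def strip_accents_keeping(text, keep, placeholder='-&-'):
    """Strip accents from text but leave the character `keep` intact."""
    if placeholder in text:
        # The trick only works if the placeholder is otherwise unused.
        raise ValueError('placeholder already occurs in the text')
    # 1. Hide the protected letter behind an ASCII-only placeholder.
    protected = text.replace(keep, placeholder)
    # 2. Normalize as in the gist; the placeholder survives untouched.
    decomposed = unicodedata.normalize('NFKD', protected)
    ascii_text = decomposed.encode('ASCII', 'ignore').decode('ASCII')
    # 3. Put the protected letter back.
    return ascii_text.replace(placeholder, keep)

# 'ñ' is only an illustrative choice of letter to keep.
print(strip_accents_keeping('mañana será', keep='ñ'))
# prints "mañana sera"
```

Note that this protects one exact character, so an uppercase variant would need its own call with a different placeholder.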
r3m0t commented Aug 14, 2013
LATIN SMALL LETTER O WITH STROKE becomes the empty string instead of LATIN SMALL LETTER O
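This happens because ø has no Unicode decomposition: NFKD leaves it as a single non-ASCII character, and `encode('ASCII', 'ignore')` then drops it entirely. One possible workaround, sketched here under the assumption that a small hand-written mapping is acceptable, is to translate such letters before normalizing. The table `EXTRA_MAP` is hypothetical and deliberately incomplete.

```python
import unicodedata

# Hand-picked replacements for letters that NFKD cannot decompose.
# Illustrative only; extend it for the characters in your own data.
EXTRA_MAP = str.maketrans({'ø': 'o', 'Ø': 'O', 'ł': 'l', 'Ł': 'L'})

def strip_accents(text):
    """Map undecomposable letters first, then strip accents as in the gist."""
    mapped = text.translate(EXTRA_MAP)
    decomposed = unicodedata.normalize('NFKD', mapped)
    return decomposed.encode('ASCII', 'ignore').decode('ASCII')

print(strip_accents('smørrebrød'))
# prints "smorrebrod"
```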