Normalise (normalize) unicode data in Python to remove umlauts, accents etc.
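# -*- coding: utf-8 -*-
""" Normalise (normalize) unicode data in Python to remove umlauts, accents etc. """
import unicodedata

data = u'naïve café'
# NFKD splits each accented letter into a base letter plus combining marks;
# encoding to ASCII with 'ignore' then drops the marks.
normal = unicodedata.normalize('NFKD', data).encode('ASCII', 'ignore').decode('ASCII')
print(normal)
# prints "naive cafe"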

r3m0t commented Aug 14, 2013

LATIN SMALL LETTER O WITH STROKE (ø) becomes the empty string instead of LATIN SMALL LETTER O. It has no decomposition, so NFKD leaves it unchanged and the ASCII encode step drops it.
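
One possible workaround, sketched here under assumptions: map letters that have no Unicode decomposition to an ASCII stand-in before normalizing. The `FALLBACKS` table and the `to_ascii` name are hypothetical and not part of the gist, and the table is deliberately incomplete.

```python
import unicodedata

# Hypothetical, deliberately incomplete table of letters that NFKD cannot
# decompose, mapped to a plain ASCII stand-in.
FALLBACKS = {'ø': 'o', 'Ø': 'O', 'đ': 'd', 'Đ': 'D', 'ł': 'l', 'Ł': 'L'}

def to_ascii(text):
    # Substitute the non-decomposable letters first.
    text = ''.join(FALLBACKS.get(ch, ch) for ch in text)
    # Then decompose and drop everything that is not ASCII, as in the gist.
    return unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('ascii')

print(to_ascii('Søren naïve café'))
# prints "Soren naive cafe"
```

Any character that is neither decomposable nor listed in the table is still silently dropped, so the table has to be extended for the alphabets you expect.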

ranvijay9286 commented Aug 21, 2014

Thanks, it's working.

renatofmartins commented Mar 4, 2015

Does it work with a list instead of a single string?

drathier commented May 12, 2015

@renatofmartins use a list comprehension: [unicodedata.normalize('NFKD', x).encode('ASCII', 'ignore') for x in my_list]
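
As a runnable sketch of that suggestion (the `strip_accents_all` name and the sample list are made up for illustration), the same expression can be wrapped in a small helper. The `.decode('ascii')` call is an addition so that Python 3 returns `str` values rather than `bytes`.

```python
import unicodedata

def strip_accents_all(strings):
    # Apply the gist's normalize/encode step to every item in the list.
    return [
        unicodedata.normalize('NFKD', s).encode('ascii', 'ignore').decode('ascii')
        for s in strings
    ]

my_list = ['naïve', 'café', 'über']
print(strip_accents_all(my_list))
# prints ['naive', 'cafe', 'uber']
```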

frangeris commented Aug 31, 2016

How can I keep the letter Ñ?

erm3nda commented Apr 9, 2018

@frangeris, that's a great question. I ended up with the function below. It works well, but it leaves alone anything combined with the ~ tilde char ('COMBINING TILDE'), so Ñ is kept. To be exact, it ONLY strips the acute (´) and grave (`) accents and nothing else:

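import unicodedata

def strip_accents_spain(string, accents=('COMBINING ACUTE ACCENT', 'COMBINING GRAVE ACCENT')):
    # Resolve the Unicode names to the actual combining characters.
    accents = set(map(unicodedata.lookup, accents))
    # Decompose (NFD), drop only the listed marks, then recompose (NFC) so Ñ survives.
    chars = [c for c in unicodedata.normalize('NFD', string) if c not in accents]
    return unicodedata.normalize('NFC', ''.join(chars))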

The docs don't say much about which combining characters normalize() can produce, but you can get the whole idea here: ftp://ftp.unicode.org/Public/9.0.0/ucd/NormalizationTest.txt (search for "COMBINING" at the bottom of the document to see all the options).
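
If you would rather see which combining marks occur in your own text than read the test file, a small helper can list them. This is only a sketch: `combining_marks` is a hypothetical name, and it relies on the standard `unicodedata.combining()` and `unicodedata.name()` functions.

```python
import unicodedata

def combining_marks(text):
    """Return the sorted Unicode names of the combining marks in text after NFD."""
    decomposed = unicodedata.normalize('NFD', text)
    # combining() returns a non-zero class for combining characters.
    return sorted({unicodedata.name(c) for c in decomposed if unicodedata.combining(c)})

print(combining_marks('El niño comió pingüino à la carte'))
# prints ['COMBINING ACUTE ACCENT', 'COMBINING DIAERESIS', 'COMBINING GRAVE ACCENT', 'COMBINING TILDE']
```

The names it prints are in the same form as the ones passed to unicodedata.lookup() in the accents tuple above.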

hhubers commented Aug 22, 2018

@frangeris a quick and probably non-pythonic solution is as follows:

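import unicodedata

line = "EL NIÑO"
line = line.replace('Ñ', '-&-')  # protect Ñ with an ASCII placeholder
# decode() turns the bytes back into str (cleaner than slicing str(bytes)[2:-1])
line = unicodedata.normalize('NFKD', line).encode('ascii', 'ignore').decode('ascii')
line = line.replace('-&-', 'Ñ')  # restore Ñ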

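Replace -&- with any other ASCII character combination that doesn't appear in your text (a non-ASCII placeholder would be stripped by the encode step).
This is also case-sensitive and character-specific: lowercase ñ needs its own replace call. You can always add more replace calls (not ideal).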
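
A placeholder-free variant of the same idea, offered only as a sketch: walk the text one character at a time and skip the ones you want to protect. `strip_accents_keep` and its `keep` parameter are hypothetical names. Unlike the ASCII-encode approach, this version keeps any other non-ASCII character that has no accent to strip.

```python
import unicodedata

def strip_accents_keep(text, keep='ñÑ'):
    out = []
    # Compose first, so a decomposed "n" + tilde is seen as the single letter ñ.
    for ch in unicodedata.normalize('NFC', text):
        if ch in keep:
            out.append(ch)  # protected character, pass through untouched
        else:
            decomposed = unicodedata.normalize('NFKD', ch)
            # Drop the combining marks, keep the base characters.
            out.append(''.join(c for c in decomposed if not unicodedata.combining(c)))
    return ''.join(out)

print(strip_accents_keep('EL NIÑO comió piñas'))
# prints "EL NIÑO comio piñas"
```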
