@j4mie
Created August 30, 2010 12:44
Normalise (normalize) unicode data in Python to remove umlauts, accents etc.
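# -*- coding: utf-8 -*-
""" Normalise (normalize) unicode data in Python to remove umlauts, accents etc. """
import unicodedata

data = u'naïve café'
# NFKD splits each accented letter into a base letter plus combining marks;
# encoding to ASCII with 'ignore' then drops the marks and any other non-ASCII character.
normal = unicodedata.normalize('NFKD', data).encode('ASCII', 'ignore').decode('ASCII')
print(normal)
# prints "naive cafe"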
@r3m0t

r3m0t commented Aug 14, 2013

LATIN SMALL LETTER O WITH STROKE becomes the empty string instead of LATIN SMALL LETTER O
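
The reason is that ø has no Unicode decomposition, so NFKD leaves it as a single non-ASCII character and the 'ignore' step drops it. One possible workaround, as a minimal sketch assuming Python 3, is to translate such characters by hand before normalising. The names FALLBACKS and to_ascii are hypothetical, and the table is illustrative rather than exhaustive:

```python
import unicodedata

# Hypothetical fallback table for letters that have no Unicode decomposition.
# Extend it with whatever characters matter for your data.
FALLBACKS = str.maketrans({'ø': 'o', 'Ø': 'O', 'ł': 'l', 'Ł': 'L', 'ß': 'ss'})

def to_ascii(text):
    # Replace the non-decomposable letters first, then strip the remaining accents.
    text = text.translate(FALLBACKS)
    return unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('ascii')

print(to_ascii('Søren'))  # Soren
```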

@ranvijay9286

thanks it's working

@renatofmartins

Does it work with a list instead of a single string?

@drathier

@renatofmartins use a list comprehension: [unicodedata.normalize('NFKD', x).encode('ASCII', 'ignore') for x in my_list]
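
For anyone on Python 3, here is a rough sketch of the same idea. It assumes you want str results rather than bytes, hence the extra decode(). The helper name strip_accents and the sample list are made up for illustration:

```python
import unicodedata

def strip_accents(text):
    # encode() returns bytes on Python 3, so decode back to str afterwards.
    return unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('ascii')

my_list = ['naïve', 'café', 'Zürich']  # sample data, not from the gist
cleaned = [strip_accents(x) for x in my_list]
print(cleaned)  # ['naive', 'cafe', 'Zurich']
```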

@frangeris

How can I keep the letter Ñ?

@erm3nda

erm3nda commented Apr 9, 2018

@frangeris, that's a great question. This is what I ended up with. It works perfectly, and it leaves anything combined with the ~ tilde char ('COMBINING TILDE') untouched. To be exact, it will ONLY strip the ´ and ` accents from letters and nothing else:

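import unicodedata

def strip_accents_spain(string, accents=('COMBINING ACUTE ACCENT', 'COMBINING GRAVE ACCENT')):
    # Resolve the Unicode names to the actual combining characters.
    accents = set(map(unicodedata.lookup, accents))
    # Decompose (NFD), drop only the listed marks, then recompose (NFC) so Ñ and ü survive.
    chars = [c for c in unicodedata.normalize('NFD', string) if c not in accents]
    return unicodedata.normalize('NFC', ''.join(chars))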

The docs don't say much about which combining characters normalize() handles, but you can get the whole idea here: ftp://ftp.unicode.org/Public/9.0.0/ucd/NormalizationTest.txt (search for "COMBINING" near the bottom of the document to see all the options).
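
If you would rather see which combining marks actually occur in your own text than read through that file, something like the following sketch may help (Python 3 assumed). The function name combining_marks is hypothetical. It relies on unicodedata.combining(), which returns a non-zero class for combining characters:

```python
import unicodedata

def combining_marks(text):
    """Return the sorted names of the combining marks found after NFD decomposition."""
    decomposed = unicodedata.normalize('NFD', text)
    return sorted({unicodedata.name(c) for c in decomposed if unicodedata.combining(c)})

print(combining_marks('ñáü'))
# ['COMBINING ACUTE ACCENT', 'COMBINING DIAERESIS', 'COMBINING TILDE']
```

Any of the names it prints can be passed in the accents tuple of a function like strip_accents_spain.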

ghost commented Aug 22, 2018

@frangeris a quick and probably non-pythonic solution is as follows:

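import unicodedata

line = "EL NIÑO"
# Swap Ñ for an ASCII placeholder so the 'ignore' step cannot drop its tilde.
line = line.replace('Ñ', '-&-')
# Strip accents, then decode the ASCII bytes back into a str.
line = unicodedata.normalize('NFKD', line).encode('ascii', 'ignore').decode('ascii')
line = line.replace('-&-', 'Ñ')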

Replace -&- with some other character combination that doesn't appear in your text.
This is also case-sensitive and character-specific: lowercase ñ needs its own pair of replace calls. You can always add more replace calls (not ideal).
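
A variation that avoids the placeholder altogether, sketched under the assumption of Python 3: walk the string one character at a time and only normalise the characters that are not in a keep-list. The name strip_accents_except and its keep parameter are hypothetical:

```python
import unicodedata

def strip_accents_except(text, keep='Ññ'):
    out = []
    # NFC first, so a decomposed "N + combining tilde" is seen as a single Ñ.
    for ch in unicodedata.normalize('NFC', text):
        if ch in keep:
            out.append(ch)
        else:
            # Same NFKD + ASCII trick as above, applied to one character.
            out.append(unicodedata.normalize('NFKD', ch).encode('ascii', 'ignore').decode('ascii'))
    return ''.join(out)

print(strip_accents_except('EL NIÑO está aquí'))  # EL NIÑO esta aqui
```

Both cases of the letter are covered by the default keep string, and nothing has to be guaranteed absent from the input.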

@skierpage

Nifty, but note that it doesn't change Unicode punctuation such as left and right quotation marks and en-, em-, figure, and horizontal dashes (‘ ’, “ ”, – — ‒ ―) to their ASCII equivalents; it just strips them. I tried fiddling with the unicodedata.normalize options without success. FWIW, these punctuation characters are missing from the table in @erm3nda's link to ftp://ftp.unicode.org/Public/9.0.0/ucd/NormalizationTest.txt.
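
One way around this, as a sketch only (Python 3 assumed), is to map the punctuation by hand before normalising, since these characters have no decomposition for normalize() to work with. The table PUNCTUATION and the function to_ascii_punct are hypothetical names, and the ASCII equivalents chosen here are just one reasonable choice:

```python
import unicodedata

# Hypothetical mapping from the punctuation listed above to ASCII stand-ins.
PUNCTUATION = str.maketrans({
    '\u2018': "'", '\u2019': "'",   # left / right single quotation marks
    '\u201c': '"', '\u201d': '"',   # left / right double quotation marks
    '\u2012': '-', '\u2013': '-',   # figure dash, en dash
    '\u2014': '-', '\u2015': '-',   # em dash, horizontal bar
})

def to_ascii_punct(text):
    # Translate punctuation first so the ASCII 'ignore' step has nothing to drop.
    text = text.translate(PUNCTUATION)
    return unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('ascii')

print(to_ascii_punct('\u201cna\u00efve\u201d \u2013 caf\u00e9\u2019s'))  # "naive" - cafe's
```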
