@j4mie
Created August 30, 2010 12:44
Normalise (normalize) unicode data in Python to remove umlauts, accents etc.
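# -*- coding: utf-8 -*-
""" Normalise (normalize) unicode data in Python to remove umlauts, accents etc. """
import unicodedata

data = u'naïve café'
# NFKD splits each accented letter into a base letter plus combining marks;
# encoding to ASCII with 'ignore' then drops the marks (and any other non-ASCII character).
normal = unicodedata.normalize('NFKD', data).encode('ASCII', 'ignore').decode('ASCII')
print(normal)
# prints "naive cafe"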
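
To see why the one-liner works, here is a minimal sketch that wraps it in a function and prints what NFKD actually produces for a single accented letter. The helper name `strip_accents` is my own and not part of the gist; the sketch assumes Python 3.

```python
import unicodedata


def strip_accents(text):
    """Hypothetical helper: decompose with NFKD, then drop non-ASCII code points."""
    decomposed = unicodedata.normalize('NFKD', text)
    return decomposed.encode('ascii', 'ignore').decode('ascii')


# NFKD turns 'é' (U+00E9) into 'e' followed by U+0301 COMBINING ACUTE ACCENT.
decomposed = unicodedata.normalize('NFKD', 'é')
print([unicodedata.name(ch) for ch in decomposed])
# prints ['LATIN SMALL LETTER E', 'COMBINING ACUTE ACCENT']

print(strip_accents('naïve café'))
# prints "naive cafe"
```

Note that letters with no decomposition, such as ø or ß, are not transliterated by this approach; the `'ignore'` error handler simply drops them.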

ghost commented Aug 22, 2018

@frangeris a quick and probably non-pythonic solution is as follows:

import unicodedata

line = "EL NIÑO"
line = line.replace('Ñ', '-&-')  # protect Ñ behind an ASCII placeholder
line = unicodedata.normalize('NFKD', line).encode('ascii', 'ignore').decode('ascii')
line = line.replace('-&-', 'Ñ')  # restore Ñ

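If -&- can appear in your text, replace it with some other ASCII character combination that cannot (the placeholder must be ASCII, or it will be stripped too).
This is also case-sensitive and character-specific: a lowercase ñ, for example, needs its own pair of replace calls. You can always add more replace calls (not ideal).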
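
One way to avoid the placeholder and the growing list of replace calls is to walk the string one character at a time and skip normalization for anything you want to keep. This is only a sketch under my own assumptions: the function name `strip_accents_except` and its `keep` parameter are hypothetical, and it targets Python 3.

```python
import unicodedata


def strip_accents_except(text, keep='Ññ'):
    """Hypothetical helper: strip accents but leave characters in `keep` untouched."""
    # Compose first so that 'N' + combining tilde is seen as the single character 'Ñ'.
    text = unicodedata.normalize('NFC', text)
    pieces = []
    for ch in text:
        if ch in keep:
            pieces.append(ch)
        else:
            decomposed = unicodedata.normalize('NFKD', ch)
            pieces.append(decomposed.encode('ascii', 'ignore').decode('ascii'))
    return ''.join(pieces)


print(strip_accents_except('EL NIÑO está aquí'))
# prints "EL NIÑO esta aqui"
```

The result is no longer pure ASCII, of course, since the kept characters survive as they are.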

@skierpage

Nifty, but note that it doesn't convert Unicode punctuation such as left and right quotation marks and en, em, figure, and horizontal dashes (‘ ’, “ ”, – — ‒ ―) to their ASCII equivalents; it just strips them. I tried fiddling with the unicodedata.normalize options without success. FWIW, these punctuation characters are missing from the table in @erm3nda's link to ftp://ftp.unicode.org/Public/9.0.0/ucd/NormalizationTest.txt.
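
Since these characters have no decomposition for normalize to work with, one possible workaround is to translate them explicitly before normalizing. The sketch below assumes Python 3 (it relies on `str.maketrans`); the mapping table and the name `to_ascii` are my own, and the table is deliberately not exhaustive.

```python
import unicodedata

# Assumed mapping, not exhaustive: extend it for whatever punctuation your data contains.
PUNCTUATION = str.maketrans({
    '\u2018': "'",   # left single quotation mark
    '\u2019': "'",   # right single quotation mark
    '\u201c': '"',   # left double quotation mark
    '\u201d': '"',   # right double quotation mark
    '\u2012': '-',   # figure dash
    '\u2013': '-',   # en dash
    '\u2014': '-',   # em dash
    '\u2015': '-',   # horizontal bar
})


def to_ascii(text):
    """Hypothetical helper: map punctuation to ASCII, then strip accents."""
    text = text.translate(PUNCTUATION)
    return unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('ascii')


print(to_ascii('“naïve” café – it’s fine'))
# prints: "naive" cafe - it's fine
```

Any punctuation missing from the table is still silently dropped by the `'ignore'` error handler, so it is worth checking the output against your own data.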
