@j4mie
Created August 30, 2010 12:44
Normalise (normalize) unicode data in Python to remove umlauts, accents etc.
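# -*- coding: utf-8 -*-
"""Normalise (normalize) unicode data in Python to remove umlauts, accents etc."""
import unicodedata

data = u'naïve café'
# NFKD splits each accented letter into a base letter plus combining marks;
# encoding to ASCII with 'ignore' then drops the marks (and any other non-ASCII character).
normal = unicodedata.normalize('NFKD', data).encode('ASCII', 'ignore').decode('ASCII')
print(normal)
# prints "naive cafe"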
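
To see why the snippet above works, it can help to look at the intermediate NFKD string before it is encoded. The following is a minimal Python 3 sketch; the helper name `describe` is made up for illustration and is not part of the gist.

```python
import unicodedata


def describe(text):
    """Return (character, Unicode name) pairs for the NFKD form of text."""
    decomposed = unicodedata.normalize('NFKD', text)
    return [(ch, unicodedata.name(ch)) for ch in decomposed]


# Each accented letter becomes a base letter followed by a combining mark.
for ch, name in describe('ïé'):
    print(name)
```

The combining marks are not ASCII, so `encode('ASCII', 'ignore')` silently discards them and only the base letters remain.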
@skierpage

Nifty, but note that it doesn't convert Unicode punctuation such as left and right quotation marks and en-, em-, figure, and horizontal dashes (‘ ’, “ ”, – — ‒ ―) to their ASCII equivalents; it just strips them. I tried fiddling with the unicodedata.normalize options without success. FWIW, these punctuation characters are missing from the table in @erm3nda's link to ftp://ftp.unicode.org/Public/9.0.0/ucd/NormalizationTest.txt.
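
One possible workaround for the punctuation issue, sketched under assumptions: these quotation marks and dashes appear to have no decomposition mapping, so NFKD leaves them unchanged and the ASCII encode step drops them. Translating them by hand before normalizing avoids that. The names `PUNCTUATION_MAP` and `to_ascii` are hypothetical, and the choice of replacements is a judgment call, not something defined by Unicode or by the gist.

```python
import unicodedata

# Hand-picked replacements (an assumption, not a standard Unicode table).
PUNCTUATION_MAP = str.maketrans({
    '\u2018': "'",  # left single quotation mark
    '\u2019': "'",  # right single quotation mark
    '\u201c': '"',  # left double quotation mark
    '\u201d': '"',  # right double quotation mark
    '\u2012': '-',  # figure dash
    '\u2013': '-',  # en dash
    '\u2014': '-',  # em dash
    '\u2015': '-',  # horizontal bar
})


def to_ascii(text):
    """Replace known punctuation, then strip accents as in the gist."""
    replaced = text.translate(PUNCTUATION_MAP)
    decomposed = unicodedata.normalize('NFKD', replaced)
    return decomposed.encode('ascii', 'ignore').decode('ascii')


print(to_ascii('\u201cnaïve\u201d café \u2013 it\u2019s'))
# prints: "naive" cafe - it's
```

Any character that is neither in the table nor decomposable to ASCII is still dropped, so the table would need extending for other punctuation.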
