@jhorneman
Created July 6, 2012 10:26
How to filter out common unwanted characters in Python
# Each pair maps an unwanted Unicode character to a plain ASCII substitute.
character_replacements = [
    (u'\u2018', u"'"),    # LEFT SINGLE QUOTATION MARK
    (u'\u2019', u"'"),    # RIGHT SINGLE QUOTATION MARK
    (u'\u201c', u'"'),    # LEFT DOUBLE QUOTATION MARK
    (u'\u201d', u'"'),    # RIGHT DOUBLE QUOTATION MARK
    (u'\u201e', u'"'),    # DOUBLE LOW-9 QUOTATION MARK
    (u'\u2013', u'-'),    # EN DASH
    (u'\u2026', u'...'),  # HORIZONTAL ELLIPSIS
    (u'\u0152', u'OE'),   # LATIN CAPITAL LIGATURE OE
    (u'\u0153', u'oe'),   # LATIN SMALL LIGATURE OE
]

# `text` must already hold the Unicode string to clean.
for undesired_character, safe_character in character_replacements:
    text = text.replace(undesired_character, safe_character)
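The replacement loop above can also be written as a single pass with a translation table. This is a minimal sketch for Python 3, not part of the original gist. The names `CHARACTER_REPLACEMENTS`, `filter_unwanted_characters`, and `sample` are hypothetical and chosen only for illustration. It relies on `str.maketrans` accepting a dict whose keys are single characters and whose values may be strings of any length.

```python
# Hypothetical names; the gist itself only shows the list and the loop.
CHARACTER_REPLACEMENTS = {
    '\u2018': "'",    # LEFT SINGLE QUOTATION MARK
    '\u2019': "'",    # RIGHT SINGLE QUOTATION MARK
    '\u201c': '"',    # LEFT DOUBLE QUOTATION MARK
    '\u201d': '"',    # RIGHT DOUBLE QUOTATION MARK
    '\u201e': '"',    # DOUBLE LOW-9 QUOTATION MARK
    '\u2013': '-',    # EN DASH
    '\u2026': '...',  # HORIZONTAL ELLIPSIS
    '\u0152': 'OE',   # LATIN CAPITAL LIGATURE OE
    '\u0153': 'oe',   # LATIN SMALL LIGATURE OE
}

# Build the table once; keys must be single characters.
_TRANSLATION_TABLE = str.maketrans(CHARACTER_REPLACEMENTS)


def filter_unwanted_characters(text):
    """Return a copy of text with the mapped characters replaced."""
    return text.translate(_TRANSLATION_TABLE)


sample = '\u201cIt\u2019s done\u2026\u201d \u2013 \u0152uvre'
print(filter_unwanted_characters(sample))  # prints: "It's done..." - OEuvre
```

The result is the same as the loop for this mapping. The table version scans the string once instead of once per replacement pair, which may matter for long texts.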
@jhorneman (author) commented:

I know 'unwanted characters' can be controversial and the use case may be unclear. But this code was useful to me, in game development and when working with relatively primitive font systems.

See also my blog post: http://www.intelligent-artifice.com/2010/02/how-to-filter-out-common-unwanted-characters-in-python.html
