@Jonty
Last active October 29, 2018 15:06
Unicode printable character filter
import unicodedata

def strip_string(self, string):
    """Clean a string using a whitelist of printable Unicode categories.

    You can find a full list of categories here:
    http://www.fileformat.info/info/unicode/category/index.htm
    """
    # 'LC' is a grouping alias for Ll/Lt/Lu; unicodedata.category() never returns it.
    letters = ('Ll', 'Lm', 'Lo', 'Lt', 'Lu')
    numbers = ('Nd', 'Nl', 'No')
    marks = ('Mc', 'Me', 'Mn')
    punctuation = ('Pc', 'Pd', 'Pe', 'Pf', 'Pi', 'Po', 'Ps')
    symbol = ('Sc', 'Sk', 'Sm', 'So')
    space = ('Zs',)
    allowed_categories = letters + numbers + marks + punctuation + symbol + space
    # Keep only characters whose general category is whitelisted; control (Cc),
    # format (Cf), and line/paragraph separators (Zl, Zp) are dropped.
    return u''.join(c for c in string if unicodedata.category(c) in allowed_categories)
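
The filter above is written as a method, so it can be hard to try on its own. Here is a minimal standalone sketch of the same whitelist approach, assuming Python 3. The names `ALLOWED_CATEGORIES` and `strip_unprintable` are hypothetical and not part of the original gist.

```python
import unicodedata

# Hypothetical module-level whitelist: the same categories as the method above,
# stored in a frozenset for constant-time membership checks.
ALLOWED_CATEGORIES = frozenset({
    'Ll', 'Lm', 'Lo', 'Lt', 'Lu',                # letters
    'Nd', 'Nl', 'No',                            # numbers
    'Mc', 'Me', 'Mn',                            # marks
    'Pc', 'Pd', 'Pe', 'Pf', 'Pi', 'Po', 'Ps',    # punctuation
    'Sc', 'Sk', 'Sm', 'So',                      # symbols
    'Zs',                                        # space separators
})


def strip_unprintable(text):
    """Return text with every character outside the whitelist removed."""
    return ''.join(c for c in text if unicodedata.category(c) in ALLOWED_CATEGORIES)


if __name__ == "__main__":
    # Tab, NUL, and newline are category Cc; U+200B (zero width space) is Cf.
    sample = "tab\there\x00, caf\u00e9 \u200bok\n"
    print(strip_unprintable(sample))  # prints: tabhere, café ok
```

Note that tabs and newlines are removed along with other control characters, because only `Zs` is whitelisted for whitespace. If line structure matters, those characters would need to be allowed explicitly.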