@acdha
Created March 7, 2013 20:52
Fun with regular expressions in Python and Unicode
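I noticed an interesting failure while using re.match / re.sub to look for non-Cyrillic characters in allegedly Russian text. With re.IGNORECASE set, the Cyrillic range no longer matches and only the whitespace is replaced; without the flag, the whole string matches: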
>>> re.sub(r'[\s\u0400-\u0527]+', ' ', 'Архангельская губерния', flags=re.IGNORECASE)
'Архангельская губерния'
>>> re.sub(r'[\s\u0400-\u0527]+', '', 'Архангельская губерния', flags=0)
''
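Since the U+0400-U+0527 range already contains both upper- and lowercase Cyrillic letters, one way to sidestep the problem is to drop the flag entirely. The following is a minimal sketch, assuming Python 3.3 or later (where re understands \u escapes in raw patterns); the name find_non_cyrillic is hypothetical and only there for illustration:

```python
import re

# Negated version of the character class used above: anything that is
# neither whitespace nor in the U+0400-U+0527 range. The range covers both
# cases, so re.IGNORECASE is deliberately not used.
NON_CYRILLIC = re.compile(r'[^\s\u0400-\u0527]+')


def find_non_cyrillic(text):
    """Return the runs of characters that are neither whitespace nor Cyrillic."""
    return NON_CYRILLIC.findall(text)


print(find_non_cyrillic('Архангельская губерния'))   # → []
print(find_non_cyrillic('Архангельская gubernia'))   # → ['gubernia']
```

An empty list means the text contains only whitespace and characters from that range.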
The same failure occurs in Python 2.7, although you need ur'' patterns for the \u escapes to be expanded:
>>> re.sub(ur'[\s\u0400-\u0527]+', '', u'Архангельская губерния', flags=re.IGNORECASE|re.UNICODE)
u'\u0410\u0440\u0445\u0430\u043d\u0433\u0435\u043b\u044c\u0441\u043a\u0430\u044f\u0433\u0443\u0431\u0435\u0440\u043d\u0438\u044f'
In contrast, the third-party regex module behaves as expected:
>>> regex.sub(ur'[\s\u0400-\u0527]+', '', u'Архангельская губерния', flags=regex.IGNORECASE|regex.UNICODE)
u''
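To check whether a particular interpreter shows this behavior, the two re.sub calls can be compared directly. This is a rough sketch for Python 3 using only the standard re module; the function name ignorecase_range_is_affected is made up for this example:

```python
import re

PATTERN = r'[\s\u0400-\u0527]+'
TEXT = 'Архангельская губерния'


def ignorecase_range_is_affected():
    """Return True if re.IGNORECASE changes what the Cyrillic range matches."""
    plain = re.sub(PATTERN, '', TEXT)                        # no flags
    folded = re.sub(PATTERN, '', TEXT, flags=re.IGNORECASE)  # case-insensitive
    # Every character in TEXT is whitespace or inside the range, so both
    # calls should strip the string completely; a difference means the flag
    # altered the range.
    return plain != folded


if __name__ == '__main__':
    print('affected' if ignorecase_range_is_affected() else 'not affected')
```

On an interpreter that behaves like the session above, this reports "affected".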