@pombredanne
Last active January 7, 2020 16:26
Unicode re split issues
$ python
Python 3.6.8 (default, Dec 20 2019, 11:17:32)
[GCC 5.4.0 20160609] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> a='İrəli'
>>> len(a)
5
>>> len(a.lower())
6
>>> import re
>>> re.split('\\W', a)
['İrəli']
>>> re.split('\\W', a.lower())
['i', 'rəli']
>>> [len(x) for x in re.split('\\W', a.lower())]
[1, 4]
>>> sum(len(x.lower()) for x in a)
6
>>> len(a.lower())
6
>>> list(a)
['İ', 'r', 'ə', 'l', 'i']
>>> list(a.lower())
['i', '̇', 'r', 'ə', 'l', 'i']
>>> [x.isalpha() for x in a.lower()]
[True, False, True, True, True, True]
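
The session above suggests that the extra character produced by `lower()` is what `\W` splits on. A minimal sketch to inspect this, using only the standard `re` and `unicodedata` modules (the variable names `word`, `lowered`, and `report` are illustrative, not part of the original session):

```python
import re
import unicodedata

word = '\u0130r\u0259li'   # 'İrəli', written with escapes to avoid encoding surprises
lowered = word.lower()     # U+0130 lowercases to 'i' followed by U+0307

# For each character, record its code point, Unicode name, general category,
# and whether the regex class \w matches it.
report = [
    (
        'U+%04X' % ord(ch),
        unicodedata.name(ch),
        unicodedata.category(ch),
        re.fullmatch(r'\w', ch) is not None,
    )
    for ch in lowered
]

for row in report:
    print(row)
```

The second row should show U+0307 (COMBINING DOT ABOVE) with category `Mn` (nonspacing mark) and no `\w` match, which is consistent with the `isalpha()` result above and with the split point seen in `re.split('\\W', a.lower())`.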
@pombredanne

>>> [ascii(x.lower()) for x in 'İ']
["'i\\u0307'"]
>>> [ascii(x) for x in 'İ'.lower()]
["'i'", "'\\u0307'"]
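
Two possible workarounds, sketched under the assumption that the goal is to keep 'İrəli' as a single lowercase token. The helper names `split_then_lower` and `lower_then_split` and the pattern `MARK_AWARE_NON_WORD` are hypothetical; the second approach only covers the Combining Diacritical Marks block (U+0300 to U+036F), not every combining character.

```python
import re

def split_then_lower(text):
    """Split on non-word characters first, then lowercase each token."""
    return [token.lower() for token in re.split(r'\W', text)]

# Assumption: treating U+0300-U+036F as part of a word is acceptable here.
# Other combining marks outside this block would still cause a split.
MARK_AWARE_NON_WORD = re.compile(r'[^\w\u0300-\u036F]')

def lower_then_split(text):
    """Lowercase first, then split without breaking on combining marks."""
    return MARK_AWARE_NON_WORD.split(text.lower())

word = '\u0130r\u0259li'   # 'İrəli'
tokens_a = split_then_lower(word)
tokens_b = lower_then_split(word)
print([ascii(t) for t in tokens_a])
print([ascii(t) for t in tokens_b])
```

Both variants should return one token of length 6 ('i', U+0307, 'r', 'ə', 'l', 'i'). Splitting before lowercasing is the simpler of the two, since it avoids having to decide which combining marks count as word characters.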
