Which Unicode characters does Python's regular expressions' \w escape match?
It appears that the intent of UNICODE \w in both Python 2 and 3 is
to match every character in Unicode general categories L* and N*,
plus U+005F ('_'). However, in 2.7 the re module's idea of the
Unicode database is a little bit out of sync with the unicodedata
module, such that four astral characters in category Nl are not
matched when they should be:
U+012432 CUNEIFORM NUMERIC SIGN SHAR2 TIMES GAL PLUS DISH
U+012433 CUNEIFORM NUMERIC SIGN SHAR2 TIMES GAL PLUS MIN
U+012456 CUNEIFORM NUMERIC SIGN NIGIDAMIN
U+012457 CUNEIFORM NUMERIC SIGN NIGIDAESH
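
One way to check this kind of claim on a given interpreter is to compare \w against unicodedata for every code point. The following is a minimal sketch for Python 3, assuming the hypothesis is "general category L* or N*, or U+005F"; the names WORD, expected_word_char and mismatches are made up for this illustration.

```python
import re
import sys
import unicodedata

# re.UNICODE is already the default for str patterns in Python 3;
# it is spelled out here only to make the intent explicit.
WORD = re.compile(r"\w", re.UNICODE)


def expected_word_char(ch):
    """Hypothesis under test: category L* or N*, or the underscore."""
    return ch == "_" or unicodedata.category(ch)[0] in ("L", "N")


def mismatches(start=0, stop=sys.maxunicode + 1):
    """Return the code points in [start, stop) where \\w and the hypothesis disagree."""
    found = []
    for cp in range(start, stop):
        ch = chr(cp)
        if bool(WORD.match(ch)) != expected_word_char(ch):
            found.append(cp)
    return found
```

Calling mismatches() with no arguments scans the whole code space; each returned code point can be printed with unicodedata.name(chr(cp), "<unnamed>") to see which characters disagree on that interpreter.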
Note that neither Python version is consistent with UTS#18 level 1,
which defines "word characters" as general categories M*, Nd, and Pc,
everything that is "Alphabetic" (which has a complicated definition,
not exactly corresponding to any set of general categories), plus
U+200C and U+200D (ZWNJ and ZWJ). Personally I think the Python
definition is more useful.
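
To make the difference concrete, here is a small Python 3 sketch that probes a few characters which UTS#18 counts as word characters (a combining mark, a connector punctuation character, and the two join controls) and reports whether Python's \w matches them. The sample list and the report helper are chosen for this illustration only.

```python
import re
import unicodedata

WORD = re.compile(r"\w", re.UNICODE)

# Word characters under UTS#18 that fall outside L*, N* and '_':
SAMPLES = [
    "\u0301",  # COMBINING ACUTE ACCENT, category Mn
    "\u203F",  # UNDERTIE, category Pc
    "\u200C",  # ZERO WIDTH NON-JOINER, category Cf
    "\u200D",  # ZERO WIDTH JOINER, category Cf
]


def report(chars):
    """Return (code point, general category, matched by \\w) for each character."""
    return [
        ("U+%04X" % ord(ch), unicodedata.category(ch), bool(WORD.match(ch)))
        for ch in chars
    ]


for row in report(SAMPLES):
    print(*row)
```

None of these four should be matched by \w in Python, which is where the two definitions part ways.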
Note also that unicodedata itself may be lagging substantially
behind Unicode. Python 2.7 has 5.2.0, 3.4 has 6.3.0, 3.5 has 8.0.0.
Unicode 9.0.0 is "scheduled for release in mid-2016".
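
The version of the Unicode database bundled with a particular interpreter can be read from unicodedata.unidata_version, as in this short sketch:

```python
import unicodedata

# Version string of the Unicode Character Database compiled into this interpreter.
version = unicodedata.unidata_version
print(version)
```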