It appears that the intent of UNICODE \w in both Python 2 and 3 is
to match every character in Unicode general categories L* and N*,
plus U+005F ('_'). However, in 2.7 the re module's idea of the
Unicode database is a little bit out of sync with the unicodedata
module, such that four astral characters in category Nl are not
matched when they should be:
U+012432  CUNEIFORM NUMERIC SIGN SHAR2 TIMES GAL PLUS DISH
U+012433  CUNEIFORM NUMERIC SIGN SHAR2 TIMES GAL PLUS MIN
U+012456  CUNEIFORM NUMERIC SIGN NIGIDAMIN
U+012457  CUNEIFORM NUMERIC SIGN NIGIDAESH
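
A comparison like this can be reproduced with a brute-force scan. The following is only a sketch, written for Python 3 (on 2.7 it would need unichr and a wide build to reach astral characters); the names expected_word_char and find_mismatches are mine, not part of any library. It reports every code point where \w disagrees with the "L*, N*, or underscore" rule as computed from unicodedata.

```python
import re
import sys
import unicodedata

# In Python 3, str patterns are Unicode-aware by default; the flag is spelled
# out only to mirror the UNICODE \w being discussed.
WORD = re.compile(r"\w", re.UNICODE)


def expected_word_char(ch):
    """Hypothetical helper: True for '_' or any general category L* or N*."""
    return ch == "_" or unicodedata.category(ch)[0] in ("L", "N")


def find_mismatches(limit=sys.maxunicode + 1):
    """Return (code point, category, name) wherever \\w and the rule disagree."""
    mismatches = []
    for cp in range(limit):
        ch = chr(cp)
        matched = WORD.match(ch) is not None
        if matched != expected_word_char(ch):
            mismatches.append(
                (cp, unicodedata.category(ch), unicodedata.name(ch, "<unnamed>"))
            )
    return mismatches
```

Calling find_mismatches() with no argument scans the whole code space, which takes a few seconds; what it reports depends on the interpreter version and its Unicode database.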
Note that neither is consistent with UTS#18 level 1, which defines
"word characters" as general category Nd (not Nl or No), plus
everything that is "Alphabetic" (which has a complicated definition,
not exactly corresponding to any set of general categories), plus
U+200C and U+200D (ZWNJ and ZWJ). Personally I think the Python
definition is more useful.
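
To make the difference concrete, here is a small Python 3 sketch (describe is a hypothetical helper of mine) that shows how \w treats a few characters on which the two definitions part ways. As far as I can tell, SUPERSCRIPT TWO (category No) would not count as a word character under the UTS#18 definition quoted above, while ZWNJ and ZWJ would; Python does the opposite in each case.

```python
import re
import unicodedata

WORD = re.compile(r"\w", re.UNICODE)


def describe(ch):
    """Hypothetical helper: (general category, whether \\w matches ch)."""
    return (unicodedata.category(ch), WORD.match(ch) is not None)


# U+00B2 SUPERSCRIPT TWO, U+200C ZWNJ, U+200D ZWJ
for ch in ("\u00b2", "\u200c", "\u200d"):
    category, matched = describe(ch)
    print("U+%04X %s matched_by_w=%s" % (ord(ch), category, matched))
```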
Note also that unicodedata itself may be lagging substantially
behind Unicode. Python 2.7 has 5.2.0, 3.4 has 6.3.0, 3.5 has 8.0.0.
Unicode 9.0.0 is "scheduled for release in mid-2016".
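
To see which Unicode version a given interpreter actually ships, unicodedata exposes it as unidata_version. A minimal sketch (unicode_db_version is a hypothetical helper name):

```python
import sys
import unicodedata


def unicode_db_version():
    """Hypothetical helper: the bundled Unicode version as a tuple of ints."""
    return tuple(int(part) for part in unicodedata.unidata_version.split("."))


print("Python %d.%d ships Unicode %s"
      % (sys.version_info[0], sys.version_info[1], unicodedata.unidata_version))
```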