@zackw
Last active November 27, 2015 15:43
Determine what, exactly, re.compile("\w", re.UNICODE) matches

Which Unicode characters does the \w escape in Python's regular expressions match?

It appears that the intent of UNICODE \w in both Python 2 and 3 is to match every character in Unicode general categories L* and N*, plus U+005F ('_'). However, in 2.7 the re module's idea of the Unicode database is a little bit out of sync with the unicodedata module, such that four astral characters in category Nl are not matched when they should be:

  • U+012432 CUNEIFORM NUMERIC SIGN SHAR2 TIMES GAL PLUS DISH
  • U+012433 CUNEIFORM NUMERIC SIGN SHAR2 TIMES GAL PLUS MIN
  • U+012456 CUNEIFORM NUMERIC SIGN NIGIDAMIN
  • U+012457 CUNEIFORM NUMERIC SIGN NIGIDAESH
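To see how a given interpreter treats the four code points listed above, a quick check along these lines may help. This is a minimal sketch written for Python 3; the names `SUSPECTS`, `WORD` and `check` are made up for illustration and are not part of the script in this gist.

```python
import re
import unicodedata

# The four cuneiform code points from the list above.
SUSPECTS = [0x12432, 0x12433, 0x12456, 0x12457]

# Same pattern the gist's script uses, written as a raw string.
WORD = re.compile(r"^\w$", re.UNICODE)


def check(codepoints):
    """Return (code point, general category, matched-by-\\w) for each input."""
    rows = []
    for cp in codepoints:
        c = chr(cp)
        rows.append((cp, unicodedata.category(c), WORD.match(c) is not None))
    return rows


for cp, cat, matched in check(SUSPECTS):
    status = "matched" if matched else "NOT matched"
    print("U+{:06X} {} {}".format(cp, cat, status))
```

If the category column says Nl but the last column says "NOT matched", the interpreter shows the discrepancy described above.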

Note that neither is consistent with UTS#18 level 1, which defines "word characters" as general category Nd (not Nl or No), plus everything that is "Alphabetic" (which has a complicated definition, not exactly corresponding to any set of general categories), plus U+200C and U+200D (ZWNJ and ZWJ). Personally I think the Python definition is more useful.
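A few spot checks make the difference from UTS#18 concrete. The sketch below assumes Python 3; the sample characters and the helper name `report` are my own choices for illustration. It tests one No character and one Nl character (which UTS#18 level 1 leaves out of the digit part of its definition) and the two joiners (which UTS#18 includes).

```python
import re
import unicodedata

WORD = re.compile(r"^\w$", re.UNICODE)

SAMPLES = [
    "\u00b2",  # SUPERSCRIPT TWO, category No
    "\u2160",  # ROMAN NUMERAL ONE, category Nl
    "\u200c",  # ZERO WIDTH NON-JOINER, category Cf
    "\u200d",  # ZERO WIDTH JOINER, category Cf
]


def report(chars):
    """Return (name, general category, matched-by-\\w) for each character."""
    return [
        (unicodedata.name(c), unicodedata.category(c), WORD.match(c) is not None)
        for c in chars
    ]


for name, cat, matched in report(SAMPLES):
    print("{:<24} {} {}".format(name, cat, "yes" if matched else "no"))
```

Under the Python behavior described here, the two numeric characters should come out as matches and the two joiners should not, which is the reverse of what UTS#18 level 1 asks for in all four cases.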

Note also that unicodedata itself may be lagging substantially behind Unicode. Python 2.7 has 5.2.0, 3.4 has 6.3.0, 3.5 has 8.0.0. Unicode 9.0.0 is "scheduled for release in mid-2016".
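Which Unicode version a particular interpreter carries can be read from `unicodedata.unidata_version`, the same attribute the script in this gist prints as its first output line. A small sketch (the function name `unicode_version_report` is hypothetical):

```python
import sys
import unicodedata


def unicode_version_report():
    """Describe the running interpreter and the Unicode data it bundles."""
    py = "{}.{}.{}".format(*sys.version_info[:3])
    return "Python {} bundles Unicode {}".format(py, unicodedata.unidata_version)


print(unicode_version_report())
```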

Unicode v5.2.0
Yes: Ll Lm Lo Lt Lu Nd No
No: Cc Cf Cn Co Cs Mc Me Mn Pd Pe Pf Pi Po Ps Sc Sk Sm So Zl Zp Zs
Amb: Nl Pc
y-Nl: 16ee 16ef 16f0 2160 2161 2162 2163 2164 2165 2166 2167 2168 2169 216a 216b 216c 216d 216e 216f 2170 2171 2172 2173 2174 2175 2176 2177 2178 2179 217a 217b 217c 217d 217e 217f 2180 2181 2182 2185 2186 2187 2188 3007 3021 3022 3023 3024 3025 3026 3027 3028 3029 3038 3039 303a a6e6 a6e7 a6e8 a6e9 a6ea a6eb a6ec a6ed a6ee a6ef 010140 010141 010142 010143 010144 010145 010146 010147 010148 010149 01014a 01014b 01014c 01014d 01014e 01014f 010150 010151 010152 010153 010154 010155 010156 010157 010158 010159 01015a 01015b 01015c 01015d 01015e 01015f 010160 010161 010162 010163 010164 010165 010166 010167 010168 010169 01016a 01016b 01016c 01016d 01016e 01016f 010170 010171 010172 010173 010174 010341 01034a 0103d1 0103d2 0103d3 0103d4 0103d5 012400 012401 012402 012403 012404 012405 012406 012407 012408 012409 01240a 01240b 01240c 01240d 01240e 01240f 012410 012411 012412 012413 012414 012415 012416 012417 012418 012419 01241a 01241b 01241c 01241d 01241e 01241f 012420 012421 012422 012423 012424 012425 012426 012427 012428 012429 01242a 01242b 01242c 01242d 01242e 01242f 012430 012431 012434 012435 012436 012437 012438 012439 01243a 01243b 01243c 01243d 01243e 01243f 012440 012441 012442 012443 012444 012445 012446 012447 012448 012449 01244a 01244b 01244c 01244d 01244e 01244f 012450 012451 012452 012453 012454 012455 012458 012459 01245a 01245b 01245c 01245d 01245e 01245f 012460 012461 012462
n-Nl: 012432 012433 012456 012457
y-Pc: 005f
n-Pc: 203f 2040 2054 fe33 fe34 fe4d fe4e fe4f ff3f
Unicode v6.3.0
Yes: Ll Lm Lo Lt Lu Nd Nl No
No: Cc Cf Cn Co Cs Mc Me Mn Pd Pe Pf Pi Po Ps Sc Sk Sm So Zl Zp Zs
Amb: Pc
y-Pc: 005f
n-Pc: 203f 2040 2054 fe33 fe34 fe4d fe4e fe4f ff3f
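import re
import sys
import unicodedata

# Python 2 calls it unichr(); Python 3 only has chr().
try:
    unichr
except NameError:
    unichr = chr


def unifmt(x):
    # Hex code point with a leading space: 4 digits in the BMP, 6 above it.
    if x <= 0xffff:
        return " {:04x}".format(x)
    else:
        return " {:06x}".format(x)


def main():
    out = sys.stdout.write
    # Escaped backslash: "\w" in a non-raw literal draws an invalid-escape
    # warning on newer Python 3, and the ur"" prefix does not exist there.
    w = re.compile(u"^\\w$", re.UNICODE)
    out("Unicode v{}\n".format(unicodedata.unidata_version))

    # Pass 1: record which general categories contain matching and
    # non-matching characters.
    yes = {}
    no = {}
    # range() excludes its end point, so 0x110000 is needed to reach U+10FFFF.
    for x in range(0x110000):
        c = unichr(x)
        cat = unicodedata.category(c)
        if w.match(c):
            yes[cat] = 1
        else:
            no[cat] = 1

    # A category seen on both sides is ambiguous; print it separately.
    ambig = {}
    out("Yes:")
    for cat in sorted(yes.keys()):
        if cat in no:
            ambig[cat] = 1
        else:
            out(" " + cat)
    out("\n No:")
    for cat in sorted(no.keys()):
        if cat in yes:
            ambig[cat] = 1
        else:
            out(" " + cat)
    out("\nAmb:")
    for cat in sorted(ambig.keys()):
        out(" " + cat)
    out("\n")
    if not ambig:
        return

    # Pass 2: for each ambiguous category, list the individual code points
    # that do (y-) and do not (n-) match.
    yes = {}
    no = {}
    for cat in ambig.keys():
        yes[cat] = []
        no[cat] = []
    for x in range(0x110000):
        c = unichr(x)
        cat = unicodedata.category(c)
        if cat not in ambig:
            continue
        if w.match(c):
            yes[cat].append(x)
        else:
            no[cat].append(x)

    for cat in sorted(ambig.keys()):
        out("y-{}:".format(cat))
        for y in sorted(yes[cat]):
            out(unifmt(y))
        out("\nn-{}:".format(cat))
        for n in sorted(no[cat]):
            out(unifmt(n))
        out("\n")


if __name__ == "__main__":
    main()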