Skip to content

Instantly share code, notes, and snippets.

@KevOrr
Created August 31, 2019 23:33
Show Gist options
  • Save KevOrr/a543daee927a4d8a19cba98a7bd4d5a8 to your computer and use it in GitHub Desktop.
Save KevOrr/a543daee927a4d8a19cba98a7bd4d5a8 to your computer and use it in GitHub Desktop.
CPpython re performance

Performace of various CPython regexpressions in searching for a substring close to the end of a string

What makes sense to me:

  • patX_1: They are all constant-time since the greedy .* will effectively make the search start at the end
  • patX_2: The non-greedy .*? causes search to start at the beginning of the strings, making this a linear-time operation
  • patX_3: With only a plain pattern, re.search is essentially a non-greedy search, making this also a linear-time operation

What doesn't make sense to me:

  • patX_2 vs patX_3: why is it so much (~45x-50x) worse to use patX_2.search?
>>> with open('/dev/urandom', 'rb') as f:
... s1 = f.read(10_000)
...
>>> with open('/dev/urandom', 'rb') as f:
... s2 = f.read(1_000_000)
...
>>> sub1 = s1[-6:]
>>> sub1
b'\xf3\xec>w]\xf2'
>>> s1.index(sub1)
9994
>>> sub2 = s2[-6:]
>>> sub2
b'#\xe6yp\x88\xa2'
>>> s2.index(sub2)
999994
>>> pat1_1 = re.compile(b'.*' + sub1, re.DOTALL)
>>> pat1_2 = re.compile(b'.*?' + sub1, re.DOTALL)
>>> pat1_3 = re.compile(sub1, re.DOTALL)
>>> pat2_1 = re.compile(b'.*' + sub2, re.DOTALL)
>>> pat2_2 = re.compile(b'.*?' + sub2, re.DOTALL)
>>> pat2_3 = re.compile(sub2, re.DOTALL)
>>> def test(f):
... '''time `f` and return microseconds per call'''
... n, t = timeit.Timer(f).autorange()
... return t / n * 1_000_000
...
>>> test(lambda: pat1_1.match(s1)) # greedy match 1e4
0.22497081799883745
>>> test(lambda: pat1_1.search(s1)) # greedy search 1e4
0.2253069000034884
>>> test(lambda: pat1_2.match(s1)) # non-greedy match 1e4
83.40240159959649
>>> test(lambda: pat1_2.search(s1)) # non-greedy search 1e4
89.8189025996544
>>> test(lambda: pat1_3.search(s1)) # plain search 1e4
1.7766958550055278
>>> test(lambda: pat2_1.match(s2)) # greedy match 1e6
0.22811782699864125
>>> test(lambda: pat2_1.search(s2)) # greedy search 1e6
0.22971309500280768
>>> test(lambda: pat2_2.match(s2)) # non-greedy match 1e6
8330.847860052017
>>> test(lambda: pat2_2.search(s2)) # non-greedy search 1e6
8910.73361999588
>>> test(lambda: pat2_3.search(s2)) # plain search 1e6
182.58752599831496
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment