Skip to content

Instantly share code, notes, and snippets.

@goodmami
Last active September 8, 2023 04:10
Show Gist options
  • Star 2 You must be signed in to star a gist
  • Fork 2 You must be signed in to fork a gist
  • Save goodmami/02f344e8c9a22fc9ac879230a9b9e071 to your computer and use it in GitHub Desktop.
Save goodmami/02f344e8c9a22fc9ac879230a9b9e071 to your computer and use it in GitHub Desktop.
Parsing JSON with regular expressions

Parsing JSON with Regular Expressions

When I learned of regular expression engines that support recursion I thought I could write a recursive-descent parser in regex. Since I've written JSON parsers a few times and it's a simple spec, I chose that as the test case. In the end I created two versions.

version 1

jsonre_v1 = regex.compile(
    r'''
    (?: (?P<object>   \{\s* (?&members)? \s*\})
        (?P<members>  (?&member) ( \s*,\s* (?&member) )*)
        (?P<member>   (?&string) \s*:\s* (?&value))
        (?P<array>    \[\s* ((?&value) (\s*,\s* (?&value))*)? \s*\])
        (?P<string>   " [^"\\]* (?: \\. | [^"\\]* )* ")
        (?P<number>   (?&integer) (?&fraction)? (?&exponent)?)
        (?P<integer>  -? (?: 0 | [1-9][0-9]* ))
        (?P<fraction> \. [0-9]*)
        (?P<exponent> [eE] [-+]? [0-9]+)
    ){0}
    (?P<value> (?&object) | (?&array) | (?&string) | (?&number) |
               true | false | null )
    ''',
    flags=regex.VERBOSE|regex.UNICODE)

The first version looks more like a PEG grammar, which I accomplished by putting all the "rules", except the starting one, in a group that repeats exactly 0 times, meaning its presence in the expression does not match something in the input but only establishes the recursive patterns.

version 2

jsonre_v2 = regex.compile(
    r'''
    (?P<value>
        (?P<object>   \{\s*
                      (?: (?P<member>(?&string) \s*:\s* (?&value))
                          ( \s*,\s* (?&member) )* )?
                      \s*\})
      | (?P<array>    \[\s* ((?&value) (\s*,\s* (?&value))*)? \s*\])
      | (?P<string>   " [^"\\]* (?: \\. | [^"\\]* )* ")
      | (?P<number>   (?P<integer> -? (?: 0 | [1-9][0-9]* ))
                      (?: \. [0-9]* )?
                      (?: [eE] [-+]? [0-9]+ )?)
      | true
      | false
      | null
    )
    ''',
    flags=regex.VERBOSE|regex.UNICODE)

The second version is defined more naturally as a regex, but it's a little harder to read as a grammar. It turns out this version is also more performant.

benchmarking

Below is the output of the program where I compare it to the built-in JSON parser written in C. First is a basic test for different kinds of JSON values:

======= FIRST TEST =======
{"key": [{"bool": true, "number": -3e-05}, {"string": "hello world", "unicode": "\u4e16\u754c\uff0c\u4f60\u597d"}]}
jsonre_v1 matches
jsonre_v1: 100x took 0.007197466999059543s
jsonre_v2 matches
jsonre_v2: 100x took 0.0032083129990496673s
json module: 100x took 0.0005846970016136765s

The v2 pattern is about twice as fast as v1, but both are an order of magnitude slower than the C parser (which isn't all that bad, really).

The second test has a very recursive structure: an array of arrays 100 levels deep:

======= SECOND TEST =======
[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]
jsonre_v1 matches
jsonre_v1: 100x took 0.05688649399962742s
jsonre_v2 matches
jsonre_v2: 100x took 0.01881064599729143s
json module: 100x took 0.0007663409996894188s

The v1 pattern is now about 3.5x slower than the v2 pattern, but both are more than two orders of magnitude slower than the C parser. Clearly recursion hurts the regular expression engine's performance more than non-recursive patterns.

The third test is to see the performance with a parse failure. I test this by prefixing the previous pattern with a object that is not closed at the end. I expect (and observe) that the regular expression parser, with it's backtracking, performs very poorly:

======= THIRD TEST =======
{"key":[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[[]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]
jsonre_v1 does not match
jsonre_v1: 100x took 1.6580609960001311s
jsonre_v2 does not match
jsonre_v2: 100x took 0.13697527600015746s
json module: 100x took 0.0015753240004414693s

The v1 parser is now about 30x slower than before, 12x slower than v2 for this test, and three orders of magnitude slower than the C parser for this test; the v2 parser is a little more than 7x slower than before and is still two orders of magnitude slower than the C parser. The 2x slowdown for the C parser is probably due to the exception handling I do rather than a slowdown of the parser itself.

related

After creating this I found that someone else had done it on StackOverflow: https://stackoverflow.com/a/3845829/1441112

And of course there's this (in)famous post about parsing HTML: https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags

Wikipedia has a nice table showing various regular expression engines that support recursion: https://en.wikipedia.org/wiki/Comparison_of_regular-expression_engines

import json
import regex
from time import perf_counter
jsonre_v1 = regex.compile(
r'''
(?: (?P<object> \{\s* (?&members)? \s*\})
(?P<members> (?&member) ( \s*,\s* (?&member) )*)
(?P<member> (?&string) \s*:\s* (?&value))
(?P<array> \[\s* ((?&value) (\s*,\s* (?&value))*)? \s*\])
(?P<string> " [^"\\]* (?: \\. | [^"\\]* )* ")
(?P<number> (?&integer) (?&fraction)? (?&exponent)?)
(?P<integer> -? (?: 0 | [1-9][0-9]* ))
(?P<fraction> \. [0-9]*)
(?P<exponent> [eE] [-+]? [0-9]+)
){0}
(?P<value> (?&object) | (?&array) | (?&string) | (?&number) |
true | false | null )
''',
flags=regex.VERBOSE|regex.UNICODE)
jsonre_v2 = regex.compile(
r'''
(?P<value>
(?P<object> \{\s*
(?: (?P<member>(?&string) \s*:\s* (?&value))
( \s*,\s* (?&member) )* )?
\s*\})
| (?P<array> \[\s* ((?&value) (\s*,\s* (?&value))*)? \s*\])
| (?P<string> " [^"\\]* (?: \\. | [^"\\]* )* ")
| (?P<number> (?P<integer> -? (?: 0 | [1-9][0-9]* ))
(?: \. [0-9]* )?
(?: [eE] [-+]? [0-9]+ )?)
| true
| false
| null
)
''',
flags=regex.VERBOSE|regex.UNICODE)
s1 = json.dumps(
{'key': [
{'bool': True, 'number': -0.00003},
{'string': 'hello world', 'unicode': '世界,你好'}]})
s2 = ('[' * 100) + (']' * 100)
print('======= FIRST TEST =======')
print(s1)
print('jsonre_v1', 'matches' if jsonre_v1.match(s1) else 'does not match')
t1 = perf_counter()
for i in range(100):
jsonre_v1.match(s1)
t2 = perf_counter()
print(f'jsonre_v1: 100x took {t2 - t1}s')
print('jsonre_v2', 'matches' if jsonre_v2.match(s1) else 'does not match')
t1 = perf_counter()
for i in range(100):
jsonre_v2.match(s1)
t2 = perf_counter()
print(f'jsonre_v2: 100x took {t2 - t1}s')
t1 = perf_counter()
for i in range(100):
json.loads(s1)
t2 = perf_counter()
print(f'json module: 100x took {t2 - t1}s')
print()
print('======= SECOND TEST =======')
print(s2)
print('jsonre_v1', 'matches' if jsonre_v1.match(s2) else 'does not match')
t1 = perf_counter()
for i in range(100):
jsonre_v1.match(s2)
t2 = perf_counter()
print(f'jsonre_v1: 100x took {t2 - t1}s')
print('jsonre_v2', 'matches' if jsonre_v2.match(s2) else 'does not match')
t1 = perf_counter()
for i in range(100):
jsonre_v2.match(s2)
t2 = perf_counter()
print(f'jsonre_v2: 100x took {t2 - t1}s')
t1 = perf_counter()
for i in range(100):
json.loads(s2)
t2 = perf_counter()
print(f'json module: 100x took {t2 - t1}s')
print()
print('======= THIRD TEST =======')
s3 = '{"key":' + s2
print(s3)
print('jsonre_v1', 'matches' if jsonre_v1.match(s3) else 'does not match')
t1 = perf_counter()
for i in range(100):
jsonre_v1.match(s3)
t2 = perf_counter()
print(f'jsonre_v1: 100x took {t2 - t1}s')
print('jsonre_v2', 'matches' if jsonre_v2.match(s3) else 'does not match')
t1 = perf_counter()
for i in range(100):
jsonre_v2.match(s3)
t2 = perf_counter()
print(f'jsonre_v2: 100x took {t2 - t1}s')
t1 = perf_counter()
for i in range(100):
try:
json.loads(s3)
except json.JSONDecodeError:
pass
t2 = perf_counter()
print(f'json module: 100x took {t2 - t1}s')
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment