
@washort
Last active December 11, 2015 08:58

Now that you are familiar with the basics of Parsley syntax, let's look at a more realistic example: a JSON parser.

The JSON spec on http://json.org/ describes the format, and we can adapt its description to a parser. We'll write the Parsley rules in the same order as the grammar rules in the right sidebar on the JSON site, starting with the top-level rule, 'object'. ::

    object = token('{') members:m token('}') -> dict(m)

Here we introduce the builtin rule token, which consumes all leading whitespace and then matches the string passed to it. Rules in Parsley can take arguments in parentheses. We'll look at an example of how to define parameterized rules later on.
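As a rough illustration of the behavior described above, here is a plain-Python sketch of a token-like matcher. It is not Parsley's implementation; the function name, its (text, pos, literal) signature, and the convention of returning None on failure are assumptions made for this example.

```python
def token(text, pos, literal):
    """Skip leading whitespace at pos, then match literal exactly.

    Returns the position just past the match, or None if the literal
    is not found there. (Hypothetical helper, for illustration only.)
    """
    # Consume any leading whitespace.
    while pos < len(text) and text[pos].isspace():
        pos += 1
    # Match the literal string at the current position.
    if text.startswith(literal, pos):
        return pos + len(literal)
    return None


print(token("   { }", 0, "{"))  # 4
print(token("   [ ]", 0, "{"))  # None
```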

Since JSON objects are represented in Python as dicts, and dict takes a list of pairs, we need a rule to collect name/value pairs inside an object expression. ::

    members = (pair:first (token(',') pair)*:rest -> [first] + rest)
            | -> []

This handles the three cases for object contents: one, multiple, or zero pairs. Within a pair, the name and value are separated by a colon. ::

    pair = string:k token(':') value:v -> (k, v)
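To make the three cases concrete, the following Python sketch reproduces only the values the rule actions would compute; it does no parsing. The helper name combine and the sample pairs are made up for illustration. It relies on the value of a sequence such as (token(',') pair) being the value of its last expression, so rest is a list of pairs.

```python
def combine(first, rest):
    """Mirror the action of the first alternative: -> [first] + rest."""
    return [first] + rest


# One pair: 'rest' is empty because the starred group matched zero times.
one = combine(("a", 1), [])
# Multiple pairs: each repetition of the starred group contributes one pair.
many = combine(("a", 1), [("b", 2), ("c", 3)])
# Zero pairs: the second alternative succeeds without consuming input.
empty = []

# The object rule then hands the list of pairs to dict().
for m in (one, many, empty):
    print(dict(m))
```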

Arrays, similarly, are sequences of array elements, and are represented as Python lists. ::

    array = token('[') elements:xs token(']') -> xs
    elements = (value:first (token(',') value)*:rest -> [first] + rest)
             | -> []

Values can be any JSON expression. ::

    value = (string | number | object | array
           | token('true')  -> True
           | token('false') -> False
           | token('null')  -> None)

Strings are sequences of zero or more characters between double quotes. Of course, we need to deal with escaped characters as well. This rule introduces the operator ~, which does negative lookahead: if the expression following it succeeds, the parse fails at that point; if the expression fails, the parse continues. Either way, no input is consumed. ::

    string = token('"') <(escapedChar | ~'"' anything)*>:c '"' -> c

This is a common pattern, so let's examine it step by step. This will match leading whitespace and then a double quote character. It then matches zero or more characters. If it's not an escapedChar (which will start with a backslash), we check to see if it's a double quote, in which case we want to end the loop. If it's not a double quote, we match it using the rule anything, which accepts a single character of any kind, and continue. Finally, we match the ending double quote and return the characters in the string.

It's very common to use ~ for "match until" situations where you want to keep parsing only until an end marker is found. Similarly, ~~ is positive lookahead: it succeeds if its expression succeeds, but does not consume any input.
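The "match until" loop can be sketched in plain Python as follows. This is an illustration of the control flow only, not how Parsley evaluates the rule: the function name and return convention are assumptions, leading whitespace is not skipped, and an escape is simplified to "a backslash plus the next character". Like the angle brackets in the string rule, it returns the raw slice of input that was consumed.

```python
def match_string(text, pos=0):
    """Recognise a double-quoted string starting at pos.

    Returns (raw_contents, next_pos); raises ValueError on failure.
    """
    if not text.startswith('"', pos):
        raise ValueError("expected opening quote")
    pos += 1
    start = pos
    while pos < len(text):
        if text[pos] == "\\" and pos + 1 < len(text):
            # escapedChar branch (simplified): skip backslash and code.
            pos += 2
        elif text[pos] == '"':
            # The negative lookahead fails here, so the loop ends
            # without consuming the quote.
            break
        else:
            # 'anything': accept a single character of any kind.
            pos += 1
    if pos >= len(text) or text[pos] != '"':
        raise ValueError("unterminated string")
    # Consume the closing quote and return what lay between the quotes.
    return text[start:pos], pos + 1


print(match_string('"a\\"b" rest'))
```

Note that the quote is tested before the catch-all branch, which is the role ~'"' plays in front of anything.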

The escapedChar rule should not be too surprising: we match a backslash, then whatever escape code is given. ::

    escapedChar = '\\' (('"' -> '"')    |('\\' -> '\\')
                       |('/' -> '/')    |('b' -> '\b')
                       |('f' -> '\f')   |('n' -> '\n')
                       |('r' -> '\r')   |('t' -> '\t')
                       |('\'' -> '\'')  | escapedUnicode)

Unicode escapes (of the form \u2603) require matching four hex digits, so we use the repetition operator {}, which works like + or * except that it takes either a {min, max} pair or simply a {number} indicating the exact number of repetitions. ::

    hexdigit = :x ?(x in '0123456789abcdefABCDEF') -> x
    escapedUnicode = 'u' <hexdigit{4}>:hs -> unichr(int(hs, 16))
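For comparison, here is a plain-Python sketch of what the exact-repetition match and its action amount to. The function name and the (value, next_pos) return convention are assumptions for this example, and it uses chr, the Python 3 equivalent of unichr.

```python
HEX_DIGITS = "0123456789abcdefABCDEF"


def escaped_unicode(text, pos):
    """Match 'u' followed by exactly four hex digits at pos.

    Returns (character, next_pos), or None if the input does not match.
    """
    if not text.startswith("u", pos):
        return None
    # Take the next four characters; fewer means the repetition fails.
    hs = text[pos + 1:pos + 5]
    if len(hs) != 4 or any(c not in HEX_DIGITS for c in hs):
        return None
    # Same conversion as the rule's action: hex string -> code point.
    return chr(int(hs, 16)), pos + 5


print(escaped_unicode("u2603", 0))
print(escaped_unicode("u26", 0))  # None
```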

With strings out of the way, we advance to numbers, both integer and floating-point. ::

    number = ('-' | -> ''):sign (intPart:ds (floatPart(sign ds)
                                            | -> int(sign + ds)))

Here we vary from the json.org description a little and move sign handling up into the number rule. We match either an intPart followed by a floatPart or just an intPart by itself. ::

    digit = :x ?(x in '0123456789') -> x
    digits = <digit*>
    digit1_9 = :x ?(x in '123456789') -> x

    intPart = (digit1_9:first digits:rest -> first + rest) | digit
    floatPart :sign :ds = <('.' digits exponent?) | exponent>:tail
                        -> float(sign + ds + tail)
    exponent = ('e' | 'E') ('+' | '-')? digits

In JSON, multi-digit numbers cannot start with 0 (since that is JavaScript's syntax for octal numbers), so intPart uses digit1_9 to exclude it in the first position.

The floatPart rule takes two parameters, sign and ds. Our number rule passes values for these when it invokes floatPart, letting us avoid duplication of work within the rule. Note that pattern matching on arguments to rules works the same as on the string input to the parser. In this case, we provide no pattern, just a name: :ds is the same as anything:ds.

(Also note that our float rule cheats a little: it does not really parse floating-point numbers, it merely recognizes them and passes them to Python's float builtin to actually produce the value.)
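The same "recognize, then delegate" idea can be sketched with Python's re module. This is an illustration under stated assumptions, not the Parsley parser: the names NUMBER and parse_number are invented here, and the pattern requires at least one digit after a decimal point or exponent marker, whereas the grammar's digits rule also accepts zero digits.

```python
import re

# Named groups correspond to the values passed around in the grammar.
NUMBER = re.compile(r"""
    (?P<sign>-?)
    (?P<ds>[1-9][0-9]*|[0-9])                 # intPart: no leading zero
    (?P<tail>
        \.[0-9]+(?:[eE][+-]?[0-9]+)?          # '.' digits exponent?
      | [eE][+-]?[0-9]+                       # exponent alone
    )?
""", re.VERBOSE)


def parse_number(text):
    """Recognise a JSON number, then let int() or float() build it."""
    m = NUMBER.fullmatch(text)
    if m is None:
        raise ValueError("not a JSON number: %r" % text)
    sign, ds, tail = m.group("sign"), m.group("ds"), m.group("tail")
    if tail is None:
        # No fractional part or exponent: the integer alternative.
        return int(sign + ds)
    # The matcher only recognised the text; float() does the conversion.
    return float(sign + ds + tail)


print(parse_number("-12"))     # -12
print(parse_number("3.25e2"))  # 325.0
```

A string such as "012" is rejected because the intPart group can only match the leading "0" on its own.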

The full version of this parser and its test cases can be found in the examples directory in the Parsley distribution.
