Skip to content

Instantly share code, notes, and snippets.

@latk
Created January 10, 2014 20:25
Show Gist options
  • Star 0 You must be signed in to star a gist
  • Fork 1 You must be signed in to fork a gist
  • Save latk/8361936 to your computer and use it in GitHub Desktop.
Save latk/8361936 to your computer and use it in GitHub Desktop.
MarpaX::Interface::Inline Specification

MarpaX::Interface::Inline Specification

The MarpaX::Interface::Inline module provides the Inline Rule Interface (IRIF) to write grammars for the Marpa parser generator. While Marpa already provides the Scanless Interface (SLIF), the IRIF tries to provide a more productive frontend:

  • Longest-Acceptable-Token Matching: The builtin lexer is smart and only tries to find tokens that can actually be used at the current position. This makes it easier to nest languages.
  • Inline Rules and Quantifiers: A regex like syntax allows you to specify rules where you need them – and not every rule has to be named. Quantifiers include support for Perl6's separation operator %.
  • Inline Actions: Each production is associated with an action that builds the AST. With the IRIF, these are specified inline, directly after the rule which they operate on.
  • Short Actions: Some abbreviations for often-used actions exist.
  • Captures: You can capture the values of rules by assigning them to variables. These variables are then automatically available in the action.
  • Macros: Generate similar rules from a template.
  • Grammar Imports: Import rules from another grammar.
  • Regex Tokens: The tokenizer can use Perl regexes, strings, and even Procedural Tokens where user code completely takes over parsing.
  • Hooks and Assertions: Execute user code before and/or after a rule derivation. Inside a rule: Check the surrounding input to assert constraints.
  • Offside Rule Matcher: Languages like Markdown, Python, and Haskell (and of course ISWIM) are not context-free languages because they use intendation to specify scopes. This means that Marpa can't parse them … unless we transform changes in line prefixes into separate tokens.
  • Grammar Class Generator: The grammar is made accessible as a Perl package.

Here is a small calculator language that showcases some basic features:

grammar Acme::Calculator

TOP:
    |   SKIP $value=Expression SKIP => $value

Expression:
    |   $x=THIS '+' $y=NEXT => {{ $x +  $y }}
    |   $x=THIS '-' $y=NEXT => {{ $x -  $y }}
    ||  $x=THIS '*' $y=NEXT => {{ $x *  $y }}
    |   $x=THIS '/' $y=NEXT => {{ $x /  $y }}
    ||  $x=NEXT~'^'~$y=THIS => {{ $x ** $y }}
    ||  $n=qr/\d+/ => $n
    |   '(' $value=Expression ')' => $value

What's this?

  • grammar Foo declares that this grammar generates the Perl package Foo.
  • TOP: declares the start rule.
  • Alternatives can optionally be introduced by a pipe |. Think “pattern matching” in Haskell.
  • The SKIP rule is predeclared to match whitespace. It is auto-inserted between other rules, so we only need to specify it at the beginning (and for symmetry: the end). The insertion can be suppressed with the ~ operator.
  • $value=Foo declares a capture of the Foo rule's value. It can be used in the action.
  • => $value is an action that simply evaluates to the contents of that capture.
  • THIS and NEXT are used for precedence control. While THIS matches the current precedence level, NEXT matches the more tightly binding levels. New precedence levels are declared with ||, while alternatives inside the same precedence level are separated with | (note that | is still an ordered choice).
  • => {{ ... }} is a custom action block which can include almost arbitrary Perl code. It will be compiled to a subroutine. Code will be injected so that the names of captures are available inside.
  • Foo~Bar means that no whitespace is allowed between the Foo and Bar rules. I.e., this supresses a SKIP insertion.
  • qr/\d+/ is a regex. This makes easy things (like lexing a number) actually easy.

Up for a more complex example? Let's look at a JSON parser:

# see "ECMA 404" for the JSON standard, or <json.org> for a summary

TOP: SKIP $=Value SKIP
    
SKIP: m/[\t\n\r\ ]+/

Value:
    |   "{" $data=($key=String ":" $val=Value => [$key, $val])* % "," "}" => {{
            my %hash = map { @$_ } @$data;
            return \%hash;
        }}
    |   "[" $=Value* % "," "]" => $
    |   $x=m/[-]? (?:0 | [1-9][0-9]*) (?:[.][0-9]+)? (?:[eE][+-]?[0-9]+)?/x => {{ 0+$x }}
    |   $=String => $
    |   "true"  => {{ 1 }}
    |   "false" => {{ 0 }}
    |   "null"  => undef
    
String: '"' ~ $chars=StringChar+  ~ '"' => {{ join "" => @$chars }}

StringChar:
    |   "\\u" ~ $hex=m/[0-9A-Fa-f]{4}/ => {{ chr hex $hex }} # technically, this is not correct outside the BMP
    |   "\\" ~ $=($escaped=m<[\"\/\\]> => $
                 | "b" => "\b" | "f" => "\f" | "n" => "\n"
                 | "r" => "\r" | "t" => "\t") => $
    |   $=m/[^\\"]+/ => $

Whats new here?

  • $= is a capture without a name.
  • The TOP doesn't have an action? Actually, it does, and the appropriate action is simply guessed based on the number of captures.
  • I thought SKIP was predeclared? Yes, it is, but we can re-declare it. This is useful when we want to match comments, or to control what precisely is considered whitespace. Only one modification is added to SKIP: It is always optional, and multiple SKIPs can be in sequence.
  • Regexes can be declared in a variety of ways: With m/ or qr/ or r/ or rx/. Delimiters include slashes and balanced parens. Modifiers like /x can also be used (/u is applied by default).
  • Foo* % "," is Foo repeated nonce or more, separated by commas. When a repetition is captured, the capture contains an array reference of the repeated rule's values.
  • ($key=String ":" $val=Value => [$key, $val]) is a nested rule. We could also name a nested rule like (Foo: ...).
  • => $ is a short action: Gimme a scalar. The => undef is another short action, as is => [$key, $val] or => "\n".

A JSON parser in 20 LOC is cool. Whipituptitude is cool. Of course that was a solved problem, but think about how easily you can parse $new_DSL with such a powerful parser interface.

Syntax & Semantics

Each grammar consists of the following parts:

  • A grammar name.
  • Optionally, an Init Block.
  • One or more rule declarations and other declarations which must include a TOP rule.

Any Unicode whitespace between items in the grammar is ignored.

Comments are as in Perl or sh: introduced by an octothorpe they go on until the end of the line. Additionally, ML-style nested comments (* ... *) can be used.

rule TOP:
    SKIP
    $=Grammar+
    SKIP;
rule Grammar:
    |   "grammar" $name=PackageName '{'  $imports=Import* $init=CodeBlock? $rules=Declaration+ '}'
    |   "grammar" $name=PackageName ';'? $imports=Import* $init=CodeBlock? $rules=Declaration+
rule Declaration:
    |   "rule"? $=RuleDeclaration
    |   "action"? $=ActionDeclaration
    |   "token"? $=ActionDeclaration
rule SKIP:
    |   (Whitespace:    m/\s+/)
    |   (LineComment:   m/\#[^\n]*/)
    |   (NestedComment: m/( [(][*] (?> [^(*]++ | [(](?![*]) | [*](?![)]) | (?-1) )* [*][)] )/x)

Rules and Macros

Rules and macros essentially share all their syntax, and a rule can be interpreted as an argument-less macro.

A macro is declared as a symbol with an argument list, followed by multiple alternatives. Each alternative is optionally introduced by | or || where | denotes a simple alternative, and || decreases the precedence level. Inside each precedence level, the special rules THIS and NEXT are available for precedence-aware recursion. Whereas THIS refers to the current precedence level, NEXT refers to the next-tighter one, which follows after the || operator. Besides from managing precedence, | and || are essentially the same operator and define an ordered choice. While all alternatives are matched equally, in case of an ambiguity only the first choice is evaluated. This is what's usually desirable in the context of programming languages.

Macro variables can be interpolated on an AST level with `name` in place of a rule, or "foo `name`"-style in strings, regexes, or actions.

RuleDeclaration:
    $name=SimpleName $signature=('(' $=SimpleName+ % ',' ')')? ':' $rule=Rules ';'?
InlineRuleDeclaration:
    $name=($=SimpleName ':')? $rule=Rules
Rules:
    ||  '||'? $=NEXT $=THIS?
    ||  '|'?  $=NEXT $=THIS?
    ||  $=Capture $=($='~'? $=Capture)* $=('=>' $=Action)?
Capture:
    |   $=($=Variable ~ '=')? ~ $=RuleReference
RuleReference:
    |   SequenceRuleReference
    |   ScalarRuleReference

Sequence Rules

Sequence rules are sequences with quantifiers ?, + and * which work as expected. The repetition forms can have a delimiter specified after % (strict separation operator) or %% (trailing separator allowed). A SKIP rule is automatically inserted to the left and right of the separator. To suppress this on either side, add a ~ directly adjacent to the % or %% operator. Here is an example of various combinations:

# "foo"+ ~%%~ "," ~ "bar
foo,foo,bar
# "foo"+ ~%%~ ","   "bar
foo,foo, bar
# "foo"+  %%~ "," ~ "bar
foo ,foo ,bar
# "foo"+ ~%%  "," ~ "bar
foo, foo, bar
# "foo"+  %%  "," ~ "bar
foo , foo , bar

Note that it is not currently possible to capture the separator's value.

SequenceRuleReference:
    |   $=ScalarRuleReference~'+' $=SequenceSeparator?
    |   $=ScalarRuleReference~'*' $=SequenceSeparator?
SequenceSeparator:
    |   '~'? ~ '%%' ~ '~'? ScalarRuleReference
    |   '~'? ~ '%'  ~ '~'? ScalarRuleReference

Scalar Rules are more simple rule references. For precedence reasons this includes the ? quantification. Otherwise, this can be separated into:

  • Tokens
  • References like rule names, macro variables, or inline rule declarations
  • Assertions
ScalarRuleReference:
    |   $=NEXT~'?'
    ||  '(' $= RuleReference ')'
    |   Assertion
    |   Reference
    |   Token

Tokens

A token is an atomic unit of a grammar that can be recognized by the lexer. There are three kinds of tokens: regexes, strings, and procedural tokens.

  • Regexes are of the Perl5 kind, but should be used sparingly. Embedded code is not available. A number of delimiters and modifiers can be used, and the /u Unicode semantics modifier is enabled by default. Regexes must be introduced with some regex operator, be it m, r, rx or qr. Strings can be interpolated into regexes by using macros, and using `name` interpolation.

  • Strings are either single-quoted or double-quoted strings. Single-quoted strings offer no escapes except for the backslash and the single quote. Lone backslashes that don't actually escape anything are treated as an error. Double-quoted strings offer more escapes:

    • \0, \a, \b, \t, \n, \v (vertical tab, not offered by Perl), \f, \r, \e (escape), \s (space), \\
    • \` to include backticks without triggering macro interpolation
    • \xFF or \x{FFEF} to include specific codepoints (not bytes). The braced form can use variable-length hex numbers, but currently no more than five digits make sense.
    • \cX where X is an ASCII printable character that is XORed with 64, e.g. \c@ = \0
    • \N{charname} for named Unicode characters. Perl's \N{U+XXXX} is served by the \x escapes.

    Not available are case modifications like \L or octal escapes (ewww).

  • Procedural Tokens are a special piece of code that can take over parsing at the current position. It gets passed:

    • A reference to the input string. The string must not be modified at all.
    • The current position

    On success, the code should return two items: The value of this token, and the length of the consumed substring. This is important for longest-token-matching.

    On error, nothing should be returned. This should be implemented as a bare return: return;.

    The code should not have any side effects, so it shouldn't matter if it get called repeatedly at the same position. Even if the code successfully parsed at this location, the result may be discarded if a longer token was found. Note that it's not possible to simply return a longer substring length to increase the “priority” of this token, as that number determines where the next match starts.

    If possible, the restriction on modifying the input string might be lifted in the future which would allow something like the Perl parser API's lex_stuff. One can dream.

Token:
    |   m/m|rx?|qr/ ~ $=RegexBody ~ $=m/[alupimsx]*/
    |   String
    |   ProceduralToken

Rule References

One can match a rule at a certain position by either providing that rule's name, or by specifying it in-place as an inline rule.

Names are either normal names that refer to this grammar's rules Foo or to another grammar's rules Other::Foo or to macro arguments `foo`.

A simple name is a sequence of letters and digits, where the first character is a letter. A package name is a sequence of simple names separated by a double colon. Macro names are simple names in backticks. Variable names and action names are simple names with a sigil. All names that are at least two characters long and only consist of uppercase characters (more precisely: non-lowercase letters) are reserved. As a naming convention, rules should use UpperCamelCase with $snake_case for captures and lowerCamelCase for macro variables. Acronyms should be lowercased: IdNumber and HtmlTag. This may be changed in future versions if hyphens become legal in identifiers.

Reference:
    |   PackageName
    |   '`' ~ SimpleName ~ '`'
    |   '(' $=InlineRuleDeclaration ')'

Assertions

This feature is still in design: Would assertions be more valuable during scanning (e.g. lookaheads, lookbehinds) or during evaluation where captures are available (e.g. checking that variables are declared or numbers are in a range)? Until then, the syntax is simply reserved.

Assertion:
    |   '!' ~ CodeBlock
    |   '?' ~ CodeBlock

Actions

A rule option can be followed by an action specification. We distinguish between:

  • Short Actions are shortcuts for common cases
  • Inline Actions are code blocks directly there
  • Action References refer to subroutines that are declared at another position.

An Inline Action is simply a standard code block. All captures declared in that option will be available as Perl variables. If captures are declared without a name, they will be automatically given a name, starting from $a and ending at $z, with more causing an error prompting you to refactor the grammar. Macro variables can be interpolated into the action before being compiled. If the action code would need to use the Perl backticks operator, it should usen qx// or a proper way to interface with other programs like open or system.

An action reference is just a name like &foo. The signature of the action and the option must be the same, where the signature is the number of captures.

Here is an overview over short actions:

  • Undef: undef or _ evaluate to a Perl undef.

  • Scalar: the $ short action returns the only capture's value. This will cause a compile-time error if more than one capture was used (also if no captures). The named version $foo returns that captures value. This won't warn with additional captures.

  • Array: the @ short action returns an array reference containing all captures in the order which they were declared. If no captures where specified this returns the empty array reference. With the explicit form [$foo, $bar] the captures may be re-ordered. The empty explicit form [] is also allowed.

  • Hash: % creates a hash of all named captures from their names to their values. The explicit form can also be used (JSON-style): {foo: $foo, bar: $bar}.

  • Object Constructor:

    • new($foo, $bar) calls the new method on the package name that is obtained by joining the grammar name, Ast and the LHS of the rule with double colons. For example, grammar Foo::Bar Baz: $x="foo" => new($x) is translated into the call Foo::Bar::Ast::Baz->new($x).
    • It is also possible to specify a package name, like new Other::Class($x).
    • If this package name begins with a double colon, the grammar's Ast namespace is prepended. E.g.: grammar Foo::Bar Baz: $x="foo" => new ::Qux($x) translates to Foo::Bar::Ast::Qux->new($x).

    The argument list can either be an array list $x, $y or a hash list foo: $x, bar: $y but mixing is not currently possible. For constructing the correct package name it is not relevant whether the given package name contains colons, only the start is considered for this.

  • String: A string as described for tokens can also be used.

  • Number: … as can numbers. Available are simple decimal or hexadecimal integers, plus floats with engineering notation.

There are possible future additions to this: aribitrary nesting of explicit short actions, hash and array subscripting, method and function calls, splat stars for flattening, slicing, …. However, that would result in a data language of its own.

What happens when no action was specified? The action is taken from the next alternative. If the next alternative has not specified or inherited an action, then the correct action is guessed from the signature: zero captures result in undef, one in $ and more in @ behaviour. A warning is emitted when actions are inherited across precedence boundaries.

Action declarations

Actions can be named in order to further re-use. These declarations always happen on the top level, and look similar to macro declarations, e.g.:

&some_action($foo, $bar, $) {{
    ... # normal perl code here
}}

The action signature can specify captures that are ignored. This is useful because the action signature must fit the option signature.

Incidentally the same syntax is used for named procedural tokens, but those don't have argument lists.

XXX

Predeclared Rules and Macros

Every all-uppercase identifier or identifier part is reserved for internal uses.

Precedence Control with NEXT and THIS

NEXT refers to the next tighter precedence level, while THIS refers to the current level.

ERROR() reporting

The ERROR builtin macro takes one string as an argument. When an ERROR is encountered in the tokenizer, the error message is registered at the current position. Only when no other tokens where matched, are the available error messages emitted instead of the default error message.

The symbol will never be made available to Marpa, thus aborting any rule where it occurs.

If the empty string is given to this macro, a default text will be substituted: “No expected tokens were found at this position”.

Lexer error format

foo.txt:10:8 Error while lexing
 1. Some user-provided explanation giving a description in the problem-domain.
The explanations have a first-line indent or are numbered and are wrapped at
column 78.
 2. Another explanation at the same position.

 9: .......
10:     foo qux
            ^
17: .......

Expected any of:
    BarRule starting with "bar"

The expected stuff is a top-down list of all predicted rules and expected tokens.

The code snippet numbers each line and prints it out verbatim. The line is separated from the line number by a simple colon followed by a space. Line numbers are right-justified. Tabs in the verbatim are translated to four spaces. The number of context lines can be changed and defaults to one.

If the verbosity is increased by some setting, then the current location in the grammar is emitted, with the current location(s) noted with an <-HERE--> marker.

Also, the characters before and after the current position are output like:

before error: \s.....\n\tfoo\s
after error:  qux\n\s....

with a length that defaults to 78 - length("before error: ") = 64 visible characters. In the displayed string, any special characters are shown in their escaped form. At the beginning and end, spaces are escaped to \s but not on the inside.

Grammar Inheritance: SUPER and INNER

unspecified for now.

Offside Rule: INDENT, KEEP_INDENT, DEDENT, and LEADING_SPACE

The INDENT(lead) and DEDENT tokens are offered by the lexer when leading whitespace changes. For example:

a:
  b
  c:
    d
e

could become the token stream "a", INDENT, "b", "c", INDENT, "d", DEDENT, DEDENT, "e".

The INDENT is actually a macro which takes a token as argument. After existing indents have been subtracted, this token is tried to match. Indent checking is only performed when the LEADING_SPACE pseudo-token is expected (which succeeds when all expected indents were provided). The KEEP_INDENT is a variant of INDENT that remembers the string for the new indent and requires a literal match of that instead of the token that was passed in – useful when using regexes.

The grammar for above language would be:

rule TOP:
    |   SKIP Statement+ SKIP
rule SKIP: m/[^\S\n]+/

rule Statement:
    |   LEADING_SPACE $name=m/\w+/ (":" KEEP_INDENT(m/[^\S\n]+/) Statement+ DEDENT)?

LEADING_SPACE, INDENT and KEEP_INDENT will also swallow a preceding newline, which defaults to \n.

Debugging Helpers: ONLY, DUMP_LEXER_STATE

ONLY codifies the assumption that no other terminals were requested at this position, and that this is thus the only alternative being considered by Marpa. Otherwise you'll get an exception.

DUMP_LEXER_STATE(msg) is similar to ERROR, more verbose, but nonfatal. It is intended for debugging only.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment