Skip to content

Instantly share code, notes, and snippets.

@latk
Last active June 20, 2021 14:08
Show Gist options
  • Star 1 You must be signed in to star a gist
  • Fork 2 You must be signed in to fork a gist
  • Save latk/8388985 to your computer and use it in GitHub Desktop.
Save latk/8388985 to your computer and use it in GitHub Desktop.

Certain languages like Python use leading indentation to denote blocks. Wouldn't it be nice if a Marpa frontend like the IRIF would have support for this? Well, the problem is that implementing the Off-side Rule is similarly difficult to implementing here-docs: both are non-context free constructs, but are easy to implement by maintaining state in the scanner.

Consider this piece of Python code:

def foo(x, y):
    return x + (2*
y)

foo(1, 2) #=> 5

A line ending with ":" expects the indentation level to increase, which it does for the "return" statement. The whole body of the "foo" function would then carry this indentation. Now for the tricky part: Inside parens, newlines count as normal whitespace.

In a procedural parser I would implement a grammar similar to the following BNF:

Block ::= INDENT Statements DEDENT
Statements ::= Statement+ separator => "\n"
Statement ::= LEADING_WHITESPACE StatementBody

The tokens "INDENT", "DEDENT" and "LEADING_WHITESPACE" are treated specially by the scanner.

  • "INDENT" checks the next line's indentation and puts it on a stack.
  • "LEADING_WHITESPACE" checks the current indentation level on the top of the stack.
  • "DEDENT" is lexed when the stack's indentation level is not found at this line. The level is popped of the stack, and the "LEADING_WHITESPACE" retried. So for the above Python example, we would expect this token stream:
LEADING_SPACE "def" "foo" "(" "x" "," "y" ")" ":" INDENT
    LEADING_SPACE "return" "x" "+" "(" "2" "*" "y" ")"
    DEDENT
LEADING_SPACE "foo" "(" ...

Can we do this with Marpa? No, because the indentation stack can't be translated to non-procedural parsers. If we use events on certain tokens like "INDENT" and "LEADING_SPACE" to push and pop indents on a stack, this stack can get corrupted if a dotted rule gets aborted (thus leaving a too deep indentation level on the stack, which creates bogus "DEDENT" tokens the next time the "LEADING_WHITESPACE" is checked).

Handling here-docs has a similar problem. A natural representation of expected here-docs is a stack of terminators found on this line that is worked through on the next newline. A pointer or equivalent is then used to make the heredoc contents available to the heredoc marker token. Now Peter Stuifzand has found an interesting solution for heredocs: We keep track of logical line ends. whenever a heredoc marker is found, we look ahead to that logical line end, parse the heredoc body, and push the logical end to our heredoc's end, so that the next heredoc starts correctly. Once lexing catches up with the physical line end, we skip forward to the logical line end.

This strategy could be translated to our Offside Rule problem: At each "INDENT" token, we check the "LEADING_WHITESPACE" for the whole block, and keep track of where each line actually starts. However, this still keeps a large amount of state that can be invalidated. But worse, it may terminate the block too early: In my above Python example, the "y)" is not a statement of its own, but rather a continuation of the previous line. This is impossible to detect for the scanner without parsing the whole statement.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment