latk/offside rule.md

## offside rule.md

      
    Raw
  

              offside rule.md
            
          
    Certain languages like Python use leading indentation to denote blocks. Wouldn't it be nice if a Marpa frontend like the IRIF would have support for this? Well, the problem is that implementing the Off-side Rule is similarly difficult to implementing here-docs: both are non-context free constructs, but are easy to implement by maintaining state in the scanner.
Consider this piece of Python code:
def foo(x, y):
    return x + (2*
y)

foo(1, 2) #=> 5

A line ending with ":" expects the indentation level to increase, which it does for the "return" statement. The whole body of the "foo" function would then carry this indentation. Now for the tricky part: Inside parens, newlines count as normal whitespace.
In a procedural parser I would implement a grammar similar to the following BNF:
Block ::= INDENT Statements DEDENT
Statements ::= Statement+ separator => "\n"
Statement ::= LEADING_WHITESPACE StatementBody

The tokens "INDENT", "DEDENT" and "LEADING_WHITESPACE" are treated specially by the scanner.

"INDENT" checks the next line's indentation and puts it on a stack.
"LEADING_WHITESPACE" checks the current indentation level on the top of the stack.
"DEDENT" is lexed when the stack's indentation level is not found at this line. The level is popped of the stack, and the "LEADING_WHITESPACE" retried.
So for the above Python example, we would expect this token stream:

LEADING_SPACE "def" "foo" "(" "x" "," "y" ")" ":" INDENT
    LEADING_SPACE "return" "x" "+" "(" "2" "*" "y" ")"
    DEDENT
LEADING_SPACE "foo" "(" ...

Can we do this with Marpa? No, because the indentation stack can't be translated to non-procedural parsers. If we use events on certain tokens like "INDENT" and "LEADING_SPACE" to push and pop indents on a stack, this stack can get corrupted if a dotted rule gets aborted (thus leaving a too deep indentation level on the stack, which creates bogus "DEDENT" tokens the next time the "LEADING_WHITESPACE" is checked).
Handling here-docs has a similar problem. A natural representation of expected here-docs is a stack of terminators found on this line that is worked through on the next newline. A pointer or equivalent is then used to make the heredoc contents available to the heredoc marker token. Now Peter Stuifzand has found an interesting solution for heredocs: We keep track of logical line ends. whenever a heredoc marker is found, we look ahead to that logical line end, parse the heredoc body, and push the logical end to our heredoc's end, so that the next heredoc starts correctly. Once lexing catches up with the physical line end, we skip forward to the logical line end.
This strategy could be translated to our Offside Rule problem: At each "INDENT" token, we check the "LEADING_WHITESPACE" for the whole block, and keep track of where each line actually starts. However, this still keeps a large amount of state that can be invalidated. But worse, it may terminate the block too early: In my above Python example, the "y)" is not a statement of its own, but rather a continuation of the previous line. This is impossible to detect for the scanner without parsing the whole statement.