-
-
Save anonymous/32ce47462616a6e99d2a to your computer and use it in GitHub Desktop.
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This problem involves parsing a *syntax* that defines a *block* of code | |
composed of two or more *statements*. The syntax supports 2 different ways to | |
define the block. The problem is how to determine when a statement ends. | |
Here is the example code written in Ruby: | |
def f(x, y) | |
a = x * y | |
a + 1 | |
end | |
This is a function named `f` that has a block containing 2 statements. | |
In the problem syntax there are 2 main ways to write this: | |
* Indentation-based: | |
\f(x, y) | |
a = x * y | |
a + 1 | |
* Bracket-based: | |
\f(x, y) { a = x * y; a + 1 } | |
There are many permutations of these. For example the bracket based one can be: | |
\f(x, y) { a | |
= x | |
* y | |
; a | |
+ 1 | |
} | |
The driving principle here is that when you have braces, statements are | |
terminated by semicolons, or the closing brace. This gives you lots of | |
formatting freedom. | |
In indentation based scoping, statements are terminated by an EOL *after* a | |
complete construct. This is to say, that you can split a statement into 2 | |
lines, if the first line cannot be valid on its own: | |
\f(x, y) | |
a = x * | |
y | |
a + 1 | |
Let's look at Ruby, which is a typical syntax that tries to avoid semicolons | |
but is forced to support backslash continuation. Note: this is getting close to | |
the heart of this problem. | |
This is valid Ruby: | |
def f(x, y) | |
a = x * | |
y | |
a + 1 | |
end | |
The first statement can be split because the first half is not complete. | |
This is invalid Ruby: | |
def f(x, y) | |
a = x | |
* y | |
a + 1 | |
end | |
The first half is valid. The line ends, so the statement ends. The second part | |
is invalid, so we have a syntax error. | |
The fix here is a continuation marker: | |
def f(x, y) | |
a = x \ | |
* y | |
a + 1 | |
end | |
This makes sense, because Ruby is a *bracketed* syntax with no mandatory | |
statement terminator (semicolon). | |
Here's the *new* idea that is not supported anywhere yet (afaik). If you are | |
using indentation scoping, you can also use indentation for statement | |
continuation: | |
\f(x, y) | |
a = x | |
* y | |
a + 1 | |
The parser can look 1 token ahead for an indent, and take that to mean a | |
continuation, if indentation is not valid for some other reason. This is a | |
really nice addition to indentation based scoping. Note: I've suggested it to | |
the CoffeeScript project, but they've not used it yet. | |
OK, that's the groundwork. The problem that vexes me is how to express this | |
cleanly in a grammar. | |
---- | |
= Strawman Solution #1 | |
Indentation has 4 distinct tokens: | |
- indent :: start an indentation block | |
- ondent :: match current indentation on a line | |
- undent :: end an indentation block | |
- andent :: a single line indented more than current indentation | |
So `andent` is what we use for continuation. If we do that, we simply make | |
`andent` a part of whitespace (token separation) while in the | |
indentation-honoring scope. | |
So then we have 3 grammars (classes): | |
- The main grammar | |
- The grammar for `{}` blocking | |
- The grammar for indent blocking | |
with possibly different whitespace meaning in each. Here's the grammar. It's | |
stripped down to the problem at hand. (Note: using Pegex syntax here, but should | |
be easily understandable): | |
%grammar main | |
code: func-def* | |
func-def: '\' func-name arg-list? code-body | |
code-body: bracket-code-body | indent-code-body | |
assignment: symbol + '=' + expression | |
expression: symbol + binary-operator + symbol | |
%grammar indentation | |
indent-code-body: indent statement+ undent | |
statement: ( assignment | expression ) EOL | |
ws: comment | SPACE | EOL andent | |
%grammar bracketed | |
bracket-code-body: '{' ( statement* %% ';' ) '}' | |
statement: assignment | expression | |
ws: comment | SPACE | EOL | |
Pegex Note: `+` means `ws+` (1 or more whitespace matches). Also `-` means | |
`ws*` but that's not used in theis example. | |
The important thing is that you don't want to duplicate every rule under | |
something as high level as a code block. You also don't want to have `andent` | |
as a token in the rules because it could be almost anywhere. By mixing it into | |
whitespace matching (if this is really possible), we win, because we have to | |
account for that anyway and it has the minimal syntax helpers of (`-` and `+`). |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment