/ParsingProblem.swim Secret

## ParsingProblem.swim
This problem involves parsing a *syntax* that defines a *block* of code
composed of two or more *statements*. The syntax supports 2 different ways to
define the block. The problem is how to determine when a statement ends.

Here is the example code written in Ruby:

  def f(x, y)
    a = x * y
    a + 1
  end

This is a function named `f` that has a block containing 2 statements.

In the problem syntax there are 2 main ways to write this:

* Indentation-based:

  \f(x, y)
    a = x * y
    a + 1

* Bracket-based:

  \f(x, y) { a = x * y; a + 1 }

There are many permutations of these. For example the bracket based one can be:

  \f(x, y) { a
  = x
  * y
  ; a
  + 1
  }

The driving principle here is that when you have braces, statements are
terminated by semicolons, or the closing brace. This gives you lots of
formatting freedom.

In indentation based scoping, statements are terminated by an EOL *after* a
complete construct. This is to say, that you can split a statement into 2
lines, if the first line cannot be valid on its own:

  \f(x, y)
    a = x *
      y
    a + 1

Let's look at Ruby, which is a typical syntax that tries to avoid semicolons
but is forced to support backslash continuation. Note: this is getting close to
the heart of this problem.

This is valid Ruby:

  def f(x, y)
    a = x *
      y
    a + 1
  end

The first statement can be split because the first half is not complete.

This is invalid Ruby:

  def f(x, y)
    a = x
      * y
    a + 1
  end

The first half is valid. The line ends, so the statement ends. The second part
is invalid, so we have a syntax error.

The fix here is a continuation marker:

  def f(x, y)
    a = x \
      * y
    a + 1
  end

This makes sense, because Ruby is a *bracketed* syntax with no mandatory
statement terminator (semicolon).

Here's the *new* idea that is not supported anywhere yet (afaik). If you are
using indentation scoping, you can also use indentation for statement
continuation:

  \f(x, y)
    a = x
      * y
    a + 1

The parser can look 1 token ahead for an indent, and take that to mean a
continuation, if indentation is not valid for some other reason. This is a
really nice addition to indentation based scoping. Note: I've suggested it to
the CoffeeScript project, but they've not used it yet.

OK, that's the groundwork. The problem that vexes me is how to express this
cleanly in a grammar.

----

= Strawman Solution #1

Indentation has 4 distinct tokens:

- indent :: start an indentation block
- ondent :: match current indentation on a line
- undent :: end an indentation block
- andent :: a single line indented more than current indentation

So `andent` is what we use for continuation. If we do that, we simply make
`andent` a part of whitespace (token separation) while in the
indentation-honoring scope.

So then we have 3 grammars (classes):

- The main grammar
- The grammar for `{}` blocking
- The grammar for indent blocking

with possibly different whitespace meaning in each. Here's the grammar. It's
stripped down to the problem at hand. (Note: using Pegex syntax here, but should
be easily understandable):

  %grammar main
  code: func-def*
  func-def: '\' func-name arg-list? code-body
  code-body: bracket-code-body | indent-code-body
  assignment: symbol + '=' + expression
  expression: symbol + binary-operator + symbol

  %grammar indentation
  indent-code-body: indent statement+ undent
  statement: ( assignment | expression ) EOL
  ws: comment | SPACE | EOL andent

  %grammar bracketed
  bracket-code-body: '{' ( statement* %% ';' ) '}'
  statement: assignment | expression
  ws: comment | SPACE | EOL

Pegex Note: `+` means `ws+` (1 or more whitespace matches). Also `-` means
`ws*` but that's not used in theis example.

The important thing is that you don't want to duplicate every rule under
something as high level as a code block. You also don't want to have `andent`
as a token in the rules because it could be almost anywhere. By mixing it into
whitespace matching (if this is really possible), we win, because we have to
account for that anyway and it has the minimal syntax helpers of (`-` and `+`).
	This problem involves parsing a syntax that defines a block of code
	composed of two or more statements. The syntax supports 2 different ways to
	define the block. The problem is how to determine when a statement ends.

	Here is the example code written in Ruby:

	def f(x, y)
	a = x * y
	a + 1
	end

	This is a function named `f` that has a block containing 2 statements.

	In the problem syntax there are 2 main ways to write this:

	* Indentation-based:

	\f(x, y)
	a = x * y
	a + 1

	* Bracket-based:

	\f(x, y) { a = x * y; a + 1 }

	There are many permutations of these. For example the bracket based one can be:

	\f(x, y) { a
	= x
	* y
	; a
	+ 1
	}

	The driving principle here is that when you have braces, statements are
	terminated by semicolons, or the closing brace. This gives you lots of
	formatting freedom.

	In indentation based scoping, statements are terminated by an EOL after a
	complete construct. This is to say, that you can split a statement into 2
	lines, if the first line cannot be valid on its own:

	\f(x, y)
	a = x *
	y
	a + 1

	Let's look at Ruby, which is a typical syntax that tries to avoid semicolons
	but is forced to support backslash continuation. Note: this is getting close to
	the heart of this problem.

	This is valid Ruby:

	def f(x, y)
	a = x *
	y
	a + 1
	end

	The first statement can be split because the first half is not complete.

	This is invalid Ruby:

	def f(x, y)
	a = x
	* y
	a + 1
	end

	The first half is valid. The line ends, so the statement ends. The second part
	is invalid, so we have a syntax error.

	The fix here is a continuation marker:

	def f(x, y)
	a = x \
	* y
	a + 1
	end

	This makes sense, because Ruby is a bracketed syntax with no mandatory
	statement terminator (semicolon).

	Here's the new idea that is not supported anywhere yet (afaik). If you are
	using indentation scoping, you can also use indentation for statement
	continuation:

	\f(x, y)
	a = x
	* y
	a + 1

	The parser can look 1 token ahead for an indent, and take that to mean a
	continuation, if indentation is not valid for some other reason. This is a
	really nice addition to indentation based scoping. Note: I've suggested it to
	the CoffeeScript project, but they've not used it yet.

	OK, that's the groundwork. The problem that vexes me is how to express this
	cleanly in a grammar.

	----

	= Strawman Solution #1

	Indentation has 4 distinct tokens:

	- indent :: start an indentation block
	- ondent :: match current indentation on a line
	- undent :: end an indentation block
	- andent :: a single line indented more than current indentation

	So `andent` is what we use for continuation. If we do that, we simply make
	`andent` a part of whitespace (token separation) while in the
	indentation-honoring scope.

	So then we have 3 grammars (classes):

	- The main grammar
	- The grammar for `{}` blocking
	- The grammar for indent blocking

	with possibly different whitespace meaning in each. Here's the grammar. It's
	stripped down to the problem at hand. (Note: using Pegex syntax here, but should
	be easily understandable):

	%grammar main
	code: func-def*
	func-def: '\' func-name arg-list? code-body
	code-body: bracket-code-body \| indent-code-body
	assignment: symbol + '=' + expression
	expression: symbol + binary-operator + symbol

	%grammar indentation
	indent-code-body: indent statement+ undent
	statement: ( assignment \| expression ) EOL
	ws: comment \| SPACE \| EOL andent

	%grammar bracketed
	bracket-code-body: '{' ( statement* %% ';' ) '}'
	statement: assignment \| expression
	ws: comment \| SPACE \| EOL

	Pegex Note: `+` means `ws+` (1 or more whitespace matches). Also `-` means
	`ws*` but that's not used in theis example.

	The important thing is that you don't want to duplicate every rule under
	something as high level as a code block. You also don't want to have `andent`
	as a token in the rules because it could be almost anywhere. By mixing it into
	whitespace matching (if this is really possible), we win, because we have to
	account for that anyway and it has the minimal syntax helpers of (`-` and `+`).