NamelessCoder/LexerBasedFluid.md

## LexerBasedFluid.md

      
The following are examples of new syntax capabilities that will or may become possible if Fluid switches
from regular expression parsing to a lexer that yields a stream of tokens, which can then be parsed
into a Fluid "syntax tree".

The lexer is a research project, in progress, but nearing completion. I plan to combine it with a
"streaming node parser" which only processes the tokens it must, as opposed to processing everything
like the current Fluid parser does.
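To make the idea concrete, here is a minimal Python sketch of a character-by-character lexer that yields tokens lazily, together with a consumer that stops pulling tokens as soon as it has what it needs. The token kinds (`TEXT`, `OPEN`, `CLOSE`) and the function names are invented for illustration; they are not the names used by the research lexer or by Fluid itself.

```python
from typing import Iterator, NamedTuple


class Token(NamedTuple):
    kind: str  # hypothetical token kinds: TEXT, OPEN, CLOSE
    text: str


def lex(source: str) -> Iterator[Token]:
    """Yield tokens one character at a time, without regular expressions."""
    buffer = ""
    for char in source:
        if char in "{}":
            if buffer:
                yield Token("TEXT", buffer)
                buffer = ""
            yield Token("OPEN" if char == "{" else "CLOSE", char)
        else:
            buffer += char
    if buffer:
        yield Token("TEXT", buffer)


def first_inline_expression(source: str) -> str:
    """Pull only as many tokens as needed to find the first {...} body."""
    tokens = lex(source)
    for token in tokens:
        if token.kind == "OPEN":
            body = next(tokens, None)
            return body.text if body and body.kind == "TEXT" else ""
    return ""
```

Because `lex` is a generator, `first_inline_expression` never lexes the part of the template that follows the first inline body, which is the kind of saving the "streaming node parser" is aiming for.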

Some of these already work; most of them are still being researched.

The examples

Recursive accessor nodes

Inline expressions, including ViewHelper calls, can be nested to any depth and do not require quoting
the variable accessor when using dynamic parts in ViewHelper arguments.
- {variable.{sub}}
- {variable.{v:h()}}
- {v:h(arg: variable.{sub})}
- {v:h('{v:key()}': value)}
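As a rough illustration of why nesting to any depth comes naturally with sequential reading, the following Python sketch resolves accessors such as `{variable.{sub}}` by recursing whenever it meets an opening brace. It only covers variable accessors (no ViewHelper calls), and the function names and the dictionary-based variable store are assumptions made for the example.

```python
def resolve(expression: str, variables: dict) -> object:
    """Resolve a dotted accessor whose parts may themselves be {accessors}."""
    value, _ = _resolve_from(expression, 0, variables)
    return value


def _resolve_from(source: str, position: int, variables: dict):
    """Read one '{...}' accessor at `position`; return (value, next position)."""
    if source[position] != "{":
        raise ValueError("an accessor must start with '{'")
    position += 1
    parts, current = [], ""
    while source[position] != "}":
        char = source[position]
        if char == "{":
            # Nested accessor: its value becomes part of the current path segment.
            nested, position = _resolve_from(source, position, variables)
            current += str(nested)
            continue
        if char == ".":
            parts.append(current)
            current = ""
        else:
            current += char
        position += 1
    parts.append(current)
    value = variables
    for part in parts:
        value = value[part]
    return value, position + 1
```

With `variables = {"variable": {"foo": "bar"}, "sub": "foo"}`, calling `resolve("{variable.{sub}}", variables)` first resolves `{sub}` to `foo` and then looks up `variable.foo`. An unterminated accessor raises `IndexError` in this sketch; a real lexer would report a proper syntax error instead.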

Arbitrary inline or tag mode arguments

Inline syntax supports the same method of passing arguments as tag mode and supports value-less referencing:
- {v:h(arg="value")}
- {v:h(arg="{value}" arg2="{value2}")}
- {v:h(arg=value arg2=value2)}
- {v:h(arg arg2)} // Passes variables "arg" and "arg2" as values for arguments "arg" and "arg2"

Optional arguments separator for arrays

The , argument separator can easily be made optional when using a lexer:
{v:h(arg: value arg2: value)} # ...but of course at this point it makes more sense to use:
{v:h(arg="{value}" arg2="{value}")} # since this is a bit more readable at the cost of needing quotes,
{v:h(arg=value arg2=value)} # or this variant which means "the value is a variable reference"
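The argument forms shown so far (`:` or `=` as assignment, optional commas, quoted or unquoted values, and value-less names) can all be handled by one small character-driven lexer. The Python sketch below is only an assumption about how that could look: the token kinds `NAME`, `STRING` and `OP` are made up, and nested ViewHelper calls inside argument values are deliberately not handled.

```python
def lex_arguments(source: str) -> list:
    """Split an argument list into (kind, text) pairs, one character at a time."""
    tokens, current, quote = [], "", None
    for char in source:
        if quote is not None:
            if char == quote:  # closing quote ends the string value
                tokens.append(("STRING", current))
                current, quote = "", None
            else:
                current += char
        elif char in "\"'":
            quote = char
        elif char in ":=, \t\n":
            if current:
                tokens.append(("NAME", current))
                current = ""
            if char in ":=":
                tokens.append(("OP", char))  # commas and whitespace only separate
        else:
            current += char
    if current:
        tokens.append(("NAME", current))
    return tokens


def parse_arguments(source: str) -> dict:
    """Map argument names to (kind, text) values; a bare name refers to itself."""
    tokens = lex_arguments(source)
    arguments, index = {}, 0
    while index < len(tokens):
        kind, name = tokens[index]
        if kind != "NAME":
            raise ValueError(f"expected an argument name, got {name!r}")
        if index + 2 < len(tokens) and tokens[index + 1][0] == "OP":
            arguments[name] = tokens[index + 2]
            index += 3
        else:
            arguments[name] = ("NAME", name)  # value-less: variable of the same name
            index += 1
    return arguments
```

Here `parse_arguments('arg: value arg2: value')` and `parse_arguments('arg: value, arg2: value')` give the same result, and `parse_arguments('arg arg2')` maps each argument to the variable of the same name.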

Using the same syntax for both inline and tag mode makes it much easier to convert a ViewHelper
call from inline to tag notation and vice versa, so this is also a goal of using the lexer.

New array syntaxes

New ways of passing an array of values can be added:
<v:h arr="[a, b, c]" /> // creates: ["a" => $a, "b" => $b, "c" => $c]
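The intended meaning of the shorthand can be spelled out in a few lines of Python. This is only a sketch of the expansion itself, assuming variables are held in a dictionary; the function name is hypothetical and it uses a plain string split rather than lexer tokens.

```python
def expand_shorthand_array(source: str, variables: dict) -> dict:
    """Turn "[a, b, c]" into {"a": variables["a"], "b": ..., "c": ...}."""
    if not (source.startswith("[") and source.endswith("]")):
        raise ValueError("expected a [...] list of variable names")
    names = [name.strip() for name in source[1:-1].split(",") if name.strip()]
    # Each listed name becomes both the key and the variable that is looked up.
    return {name: variables[name] for name in names}
```

For example, with variables `a = 1`, `b = 2` and `c = 3`, the string `[a, b, c]` expands to a mapping from each name to its current value.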

A new "inline pass" operator

Since lexers work best when analyzing a single character at a time, the current -> inline operator is less
than ideal. It can also be confused with part of an XML tag. Switching the inline pass operator to | solves
this:
{variable -> v:h()} same as {variable | v:h()} but aiming to deprecate the former.
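A short Python sketch shows why a one-character operator suits this style of lexing: every decision below is taken by looking at the current character only, whereas `->` spans two characters and would need a peek at the next one. The function name is hypothetical, and the handling of quotes and brackets is a simplifying assumption rather than Fluid's actual rules.

```python
def split_pass_chain(expression: str) -> list:
    """Split an inline expression body on top-level '|' pass operators."""
    segments, current, depth, quote = [], "", 0, None
    for char in expression:
        if quote is not None:
            current += char
            if char == quote:
                quote = None
        elif char in "\"'":
            quote = char
            current += char
        elif char in "({[":
            depth += 1
            current += char
        elif char in ")}]":
            depth -= 1
            current += char
        elif char == "|" and depth == 0:
            # A pass operator outside quotes and brackets ends the current segment.
            segments.append(current.strip())
            current = ""
        else:
            current += char
    segments.append(current.strip())
    return segments
```

A `|` inside quotes or inside a nested call is left alone, so `variable | v:h(arg: '{a | b}') | v:x()` splits into exactly three segments.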

Reducing need to quote ViewHelpers

Currently you have to quote (in single or double quotes) any calls to ViewHelpers when you use inline syntax
while building an array. Quoting, and in particular the need to escape quotes, can be reduced by not requiring
ViewHelper calls to be quoted; but only when building arrays (does not make sense elsewhere):
<v:h arr="{key: v:h()}" />

Forbidding tag mode in attributes

Using tag mode in attributes can be forbidden, ensuring developers will write valid templates.
<v:h arg="<v:h2 />" /> # throws error
<v:h arg="{v:h2()}" /> # does not
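One way such a rule could be enforced is by tracking whether the lexer is currently inside a quoted attribute value and rejecting any `<` it sees there. The Python sketch below assumes exactly that; `TemplateSyntaxError` and `check_tag` are names invented for the example and do not correspond to Fluid's real exception classes or API.

```python
class TemplateSyntaxError(Exception):
    """Hypothetical error type used only in this sketch."""


def check_tag(tag: str) -> None:
    """Raise if a '<' appears inside a quoted attribute value of `tag`."""
    quote = None
    for position, char in enumerate(tag):
        if quote is None:
            if char in "\"'":
                quote = char  # entering an attribute value
        elif char == quote:
            quote = None  # leaving the attribute value
        elif char == "<":
            raise TemplateSyntaxError(
                f"tag mode is not allowed inside attributes (offset {position})"
            )
```

Inline syntax inside the attribute passes through untouched, including single quotes nested inside a double-quoted value, because only the quote character that opened the value can close it.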

Reducing redundant syntax parts

It is possible to remove the need to add () to inline ViewHelper calls when said ViewHelper call has no
arguments. Instead, a shorter syntax can be made possible:
{v:h}
{v:h(arg: v:h)}
{variable -> v:h}
{variable | v:h} # see above!

Expression marker

In order to clearly identify expressions (such as {variable as array}) as different from normal variable
accessors or inline ViewHelper syntax, a small helper character can be used:
{@complex expression, (), [] etc. captured, only terminates by curly brace}
<v:h arr="{key: @complex-expression}" /> # whitespace-less expressions do not require quoting, see above.
<v:h arr="{key: '{@complex expression etc. with whitespace}'}" /> # but ones with whitespace of course do.

Mustache tolerance

By sacrificing a single and very rarely used Fluid syntax it is possible to ignore any Mustache syntax bits,
which has traditionally been quite challenging and has required syntax-breaking tricks to implement.
{{variable}} # Is currently the equivalent of PHP $$variable (if $variable = "foo", this references $foo)
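A minimal Python sketch of this trade-off: anything that starts with a double opening brace is copied through as plain text, while single-brace inline syntax still becomes its own token. The token kinds `TEXT` and `INLINE` and the function name are assumptions for the example, and the brace counting is deliberately naive (it ignores quotes).

```python
def lex_with_mustache(source: str) -> list:
    """Return (kind, text) pairs; '{{...}}' is kept verbatim as TEXT."""
    tokens, text, position = [], "", 0
    while position < len(source):
        if source.startswith("{{", position):
            end = source.find("}}", position)
            end = len(source) if end == -1 else end + 2
            text += source[position:end]  # Mustache block: copy untouched
            position = end
        elif source[position] == "{":
            if text:
                tokens.append(("TEXT", text))
                text = ""
            start, depth = position, 0
            while position < len(source):
                if source[position] == "{":
                    depth += 1
                elif source[position] == "}":
                    depth -= 1
                    if depth == 0:
                        break
                position += 1
            tokens.append(("INLINE", source[start:position + 1]))
            position += 1
        else:
            text += source[position]
            position += 1
    if text:
        tokens.append(("TEXT", text))
    return tokens
```

The cost is the one described above: a Fluid template can no longer use `{{variable}}` as a nested variable reference, because the lexer in this sketch hands it through as text.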

Solid CDATA escaping

Due to the sequential nature of a lexer, it is much, much easier to allow CDATA to be used for escaping any
amount of Fluid code. Once the beginning of a CDATA section is encountered, lexing can switch to a state where
everything is read as simple text until the end of the CDATA section.
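The state switch can be sketched in Python as a single boolean flag: outside CDATA the lexer looks for the opening marker, inside it looks only for the closing marker and treats everything else as raw text. The token kinds `TEMPLATE` and `RAW` are placeholders for the example; a real lexer would keep lexing the `TEMPLATE` parts into finer tokens.

```python
CDATA_OPEN, CDATA_CLOSE = "<![CDATA[", "]]>"


def lex_cdata(source: str) -> list:
    """Return (kind, text) pairs, switching state at CDATA boundaries."""
    tokens, buffer, position, in_cdata = [], "", 0, False
    while position < len(source):
        # Only one marker is relevant in each state.
        marker = CDATA_CLOSE if in_cdata else CDATA_OPEN
        if source.startswith(marker, position):
            if buffer:
                tokens.append(("RAW" if in_cdata else "TEMPLATE", buffer))
                buffer = ""
            in_cdata = not in_cdata
            position += len(marker)
        else:
            buffer += source[position]
            position += 1
    if buffer:
        tokens.append(("RAW" if in_cdata else "TEMPLATE", buffer))
    return tokens
```

Anything between the markers, including braces and ViewHelper tags, ends up in a single `RAW` token that the parser can output as-is.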

This also opens the possibility of allowing PCDATA to be semi-parsed, for example allowing the use of inline
syntax while ignoring tag mode, or only allowing variables and expressions but no ViewHelpers in such blocks.

All X(HT)ML tokenised

Even tags which are not Fluid ViewHelpers will yield a token, allowing the parser to choose what to do with
a given tag or the body of a given tag. This makes the parser aware of the HTML as well, so it can for example
throw errors if X(HT)ML tags are not properly closed. It can also be aware of the context, such as whether an
inline ViewHelper call was used as an attribute of an X(HT)ML tag or of a ViewHelper (and for example use
selective escaping of the output values in one case but not the other).

Conclusion

What is the point of all this? Well, it has for many years been a vision of mine to not use regular expressions
in Fluid when parsing. Regular expressions were never intended for parsing XML - and especially not intended
for parsing very big XML documents. They come with several limitations and problems:

- Because Fluid requires back-references and recursion, it is possible to write syntax that causes infinite loops.
- There is a limit to how many characters can be detected by an expression sub-part, which can for example cause
  big array syntax pieces to simply not be detected.
- It is excessively difficult to understand the complex expressions that are used for matching Fluid code, even
  with the very friendly annotations that are added.

In comparison, a lexer based solution has the benefit that it reads everything in sequence and yields a sequence
of tokens. The sequence then completely decides how the code gets processed. For example, a ViewHelper call in
which other ViewHelpers are called to create arguments yields a token sequence containing a sub-sequence that
the parser must then handle. This also means that the token sequence can be more precisely validated: each token
can only be followed by certain other tokens. The lexer knows which sequences are valid, and the parser can then
query the lexer to validate the sequence (or not, depending on use case; e.g. skipping sequence validation in
production environments for a speed boost).
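Sequence validation of this kind boils down to a table of which token kinds may follow which. The Python sketch below uses a made-up set of token kinds and follow rules purely for illustration (they are not the research lexer's real tokens), and an `enabled` flag stands in for skipping validation in production.

```python
# Hypothetical token kinds and follow sets, for illustration only.
ALLOWED_NEXT = {
    "START": {"TEXT", "INLINE_OPEN", "END"},
    "TEXT": {"TEXT", "INLINE_OPEN", "END"},
    "INLINE_OPEN": {"NAME"},
    "NAME": {"DOT", "PASS", "INLINE_CLOSE"},
    "DOT": {"NAME"},
    "PASS": {"NAME"},
    "INLINE_CLOSE": {"TEXT", "INLINE_OPEN", "END"},
}


def validate(kinds: list, enabled: bool = True) -> None:
    """Raise ValueError if a token kind follows one it is not allowed to follow."""
    if not enabled:
        return  # e.g. production: trust the template and skip the check
    previous = "START"
    for index, kind in enumerate([*kinds, "END"]):
        if kind not in ALLOWED_NEXT.get(previous, set()):
            raise ValueError(f"token {index}: {kind} may not follow {previous}")
        previous = kind
```

A sequence such as `INLINE_OPEN, NAME, DOT, NAME, INLINE_CLOSE` passes, while `INLINE_OPEN, DOT` is rejected at the second token unless validation is switched off.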

So this is, in my opinion, something worth researching. It may not be feasible to achieve all of the above, but
even the little research project is already capable of lexing all but a few of the described cases (in addition
to, obviously, being able to lex the normal Fluid syntax).

The really good part is that because the API of Fluid is open the way it is, switching to an improved lexer
(once it is stable enough!) may be as easy as adding a Composer dependency.

But for now, consider this a vision backed by research; a sign of what may come :)