Last active June 25, 2018 13:12
Research into Lexer-based Fluid tokenisation for improved syntax

The following are examples of new syntax capabilities that will/may become possible if switching Fluid away from regular expression parsing, to a lexer yielding a stream of tokens that can then be parsed to become a Fluid "syntax tree".

The lexer is a research project, in progress, but nearing completion. I plan to combine it with a "streaming node parser" which only processes the tokens it must, as opposed to processing everything like the current Fluid parser does.
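To illustrate the idea of a token stream that is only consumed as far as needed, here is a minimal sketch in Python (not the actual research lexer, which targets PHP). The token names `TEXT`, `OPEN` and `CLOSE` and both function names are assumptions made up for this example.

```python
from typing import Iterator, NamedTuple


class Token(NamedTuple):
    kind: str   # hypothetical token names: "TEXT", "OPEN", "CLOSE"
    value: str


def lex(source: str) -> Iterator[Token]:
    """Yield tokens one at a time, reading the source strictly left to right."""
    buffer = []
    for char in source:
        if char in "{}":
            # Flush any pending text before emitting the brace token.
            if buffer:
                yield Token("TEXT", "".join(buffer))
                buffer = []
            yield Token("OPEN" if char == "{" else "CLOSE", char)
        else:
            buffer.append(char)
    if buffer:
        yield Token("TEXT", "".join(buffer))


def first_expression(source: str) -> str:
    """Read the body of the first {...}, pulling no more tokens than needed."""
    tokens = lex(source)
    for token in tokens:
        if token.kind == "OPEN":
            body = next(tokens, None)
            if body is not None and body.kind == "TEXT":
                return body.value
            return ""
    return ""
```

Because `lex` is a generator, `first_expression` stops pulling tokens as soon as it has what it needs; the rest of the template is never lexed.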

Some of these already work; most of them are still being researched.

The examples

Recursive accessor nodes

Inline expressions, including ViewHelper calls, can be nested to any depth and do not require quoting the variable accessor when using dynamic parts in ViewHelper arguments.

- {variable.{sub}}
- {variable.{v:h()}}
- {v:h(arg: variable.{sub})}
- {v:h('{v:key()}': value)}
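The first example above, `{variable.{sub}}`, could be resolved roughly as follows. This is a Python sketch under assumptions: the function name `resolve` is hypothetical, it receives the accessor without its outer braces, and it handles only variable lookups, not ViewHelper calls.

```python
def resolve(expression: str, variables: dict):
    """Resolve a dotted accessor, evaluating nested {...} parts first."""
    # Repeatedly replace the innermost {...} with the value it resolves to.
    while "{" in expression:
        close = expression.index("}")
        open_ = expression.rindex("{", 0, close)
        inner = resolve(expression[open_ + 1:close], variables)
        expression = expression[:open_] + str(inner) + expression[close + 1:]
    # What remains is a plain dotted path such as "variable.name".
    value = variables
    for part in expression.split("."):
        value = value[part]
    return value


variables = {"sub": "name", "variable": {"name": "Fluid"}}
result = resolve("variable.{sub}", variables)  # looks up variable.name
```

The nested part is resolved before the outer path is walked, which is why no quoting is needed around the dynamic segment.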

Arbitrary inline or tag mode arguments

Inline syntax supports the same method of passing arguments as tag mode and supports value-less referencing:

- {v:h(arg="value")}
- {v:h(arg="{value}" arg2="{value2}")}
- {v:h(arg=value arg2=value2)}
- {v:h(arg arg2)} // Passes variables "arg" and "arg2" as values for arguments "arg" and "arg2"

Optional arguments separator for arrays

The , argument separator can easily be made optional when using a lexer:

{v:h(arg: value arg2: value)} # ...but of course at this point it makes more sense to use:
{v:h(arg="{value}" arg2="{value}")} # since this is a bit more readable at the cost of needing quotes,
{v:h(arg=value arg2=value)} # or this variant which means "the value is a variable reference"

Using the same syntax for both inline and tag mode makes it much easier to convert a ViewHelper call from inline to tag notation and vice versa, so this is also a goal of using the lexer.

New array syntaxes

New ways of passing an array of values can be added:

<v:h arr="[a, b, c]" /> // creates: ["a" => $a, "b" => $b, "c" => $c]
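The expansion described in the comment above can be sketched like this in Python, with a dict standing in for the PHP array and for the template variables. The function name `expand_shorthand` is an assumption for illustration only.

```python
def expand_shorthand(array_source: str, variables: dict) -> dict:
    """Expand "[a, b, c]" into {"a": variables["a"], "b": ..., "c": ...}."""
    # Drop the surrounding brackets, then split the names on commas.
    names = [name.strip() for name in array_source.strip("[] ").split(",")]
    return {name: variables[name] for name in names if name}


expanded = expand_shorthand("[a, b, c]", {"a": 1, "b": 2, "c": 3, "d": 4})
```

Each listed name becomes both the key and the variable that supplies the value; variables that are not listed (`d` here) are left out.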

A new "inline pass" operator

Since lexers work best when analyzing a single character at a time, the current -> inline operator is less than ideal. It also has the possibility to be confused with part of an XML tag. Switching the inline pass operator to | solves this:

{variable -> v:h()} is the same as {variable | v:h()}, with the aim of deprecating the former.
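Why a single-character operator suits a lexer better can be shown with a small Python sketch. The lookup table and token names below are assumptions for this example, not the research lexer's actual token set.

```python
# Hypothetical table: every entry is recognised from one character alone.
SINGLE_CHAR_TOKENS = {
    "{": "OPEN",
    "}": "CLOSE",
    "|": "PASS",
    ":": "COLON",
    "(": "PAREN_OPEN",
    ")": "PAREN_CLOSE",
}


def classify(source: str) -> list:
    """Tokenise by looking at exactly one character at a time."""
    tokens = []
    word = ""
    for char in source:
        kind = SINGLE_CHAR_TOKENS.get(char)
        if kind is None and not char.isspace():
            word += char  # part of a name or some other multi-character word
            continue
        if word:
            tokens.append(("WORD", word))
            word = ""
        if kind:
            tokens.append((kind, char))
    if word:
        tokens.append(("WORD", word))
    return tokens
```

With this table, `classify("{variable | v:h()}")` yields a `PASS` token directly, while `->` only comes out as a generic `("WORD", "->")` and would need lookahead or a separate rule to be recognised as an operator.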

Reducing need to quote ViewHelpers

Currently you have to quote (in single or double quotes) any calls to ViewHelpers when you use inline syntax while building an array. Quoting, and in particular the need to escape quotes, can be reduced by not requiring ViewHelper calls to be quoted; but only when building arrays (does not make sense elsewhere):

<v:h arr="{key: v:h()}" />

Forbidding tag mode in attributes

Using tag mode in attributes can be forbidden, ensuring developers will write valid templates.

<v:h arg="<v:h2 />" /> # throws error
<v:h arg="{v:h2()}" /> # does not

Reducing redundant syntax parts

It is possible to remove the need to add () to inline ViewHelper calls when said ViewHelper call has no arguments. Instead, a shorter syntax can be made possible:

{v:h(arg: v:h)}
{variable -> v:h}
{variable | v:h} # see above!

Expression marker

In order to clearly identify expressions (such as {variable as array}) as different from normal variable accessors or inline ViewHelper syntax, a small helper character can be used:

{@complex expression, (), [] etc. captured, only terminates by curly brace}
<v:h arr="{key: @complex-expression}" /> # whitespace-less expressions do not require quoting, see above.
<v:h arr="{key: '{@complex expression etc. with whitespace}'}" /> # but ones with whitespace of course do.

Mustache tolerance

By sacrificing a single and very rarely used Fluid syntax, it becomes possible to ignore any Mustache syntax. Supporting Mustache alongside Fluid has traditionally been quite challenging and has required syntax-breaking tricks.

{{variable}} # Is currently the equivalent of PHP $$variable (if $variable = "foo", this references $foo)

Solid CDATA escaping

Due to the sequential nature of a lexer, it is much, much easier to allow CDATA to be used for escaping any amount of Fluid code. Once the beginning of a CDATA section is encountered, lexing can switch to a state where everything is read as simple text until the end of the CDATA section.
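The state switch described above can be sketched as follows. This is a Python illustration under assumptions: the chunk labels `FLUID` and `TEXT` and the function name `split_cdata` are made up for the example, and an unterminated CDATA section is simply treated as text up to the end of the source.

```python
CDATA_START = "<![CDATA["
CDATA_END = "]]>"


def split_cdata(source: str):
    """Yield ("FLUID", chunk) for lexable code and ("TEXT", chunk) for CDATA bodies."""
    position = 0
    while position < len(source):
        start = source.find(CDATA_START, position)
        if start == -1:
            # No further CDATA section: the remainder is normal Fluid code.
            yield ("FLUID", source[position:])
            return
        if start > position:
            yield ("FLUID", source[position:start])
        body_start = start + len(CDATA_START)
        end = source.find(CDATA_END, body_start)
        if end == -1:
            # Unterminated section: treat everything that follows as plain text.
            yield ("TEXT", source[body_start:])
            return
        # Inside CDATA nothing is interpreted, whatever it looks like.
        yield ("TEXT", source[body_start:end])
        position = end + len(CDATA_END)


chunks = list(split_cdata("{a}<![CDATA[{b}]]>{c}"))
```

Only the `FLUID` chunks would be passed on to the normal lexing rules; `{b}` stays literal text even though it looks like a variable accessor.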

This also opens the possibility of allowing PCDATA to be semi-parsed, for example allowing the use of inline syntax while ignoring tag mode, or allowing only variables and expressions but no ViewHelpers in such blocks.

All X(HT)ML tokenised

Even tags which are not Fluid ViewHelpers will yield a token, allowing the parser to choose what to do with a given tag or the body of a given tag. This makes the parser aware of the HTML as well: it can, for example, throw errors if X(HT)ML tags are not properly closed, and it knows the context, such as whether an inline ViewHelper call was used as an attribute of an X(HT)ML tag or of a ViewHelper (and can, for example, selectively escape the output values in one case but not the other).


What is the point of all this? Well, it has for many years been a vision of mine to not use regular expressions in Fluid when parsing. Regular expressions were never intended for parsing XML - and especially not intended for parsing very big XML documents. They come with several limitations and problems:

  • Because Fluid requires back-references and recursion, it is possible to write syntax that causes infinite loops.
  • There is a limit to how many characters can be detected by an expression sub-part, which can for example cause big array syntax pieces to simply not be detected.
  • It is excessively difficult to understand the complex expressions that are used for matching Fluid code, even with the very friendly annotations that are added.

In comparison, a lexer-based solution has the benefit that it reads everything in sequence and yields a sequence of tokens. The sequence then completely decides how the code gets processed. For example, a ViewHelper call in which other ViewHelpers are called to create arguments yields a token sequence containing a sub-sequence that the parser must then handle. This also means that the token sequence can be validated more precisely: each token can only be followed by certain other tokens - the lexer knows which sequences are valid, and the parser can then query the lexer to validate the sequence (or not, depending on use case; e.g. skipping sequence validation in production environments for a speed boost).
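The "each token can only be followed by certain other tokens" rule can be sketched as a follow table. This Python example rests on assumptions: the token kinds and the table contents are invented for illustration, and the table lives in a plain dict rather than inside the lexer.

```python
# Hypothetical follow table: token kind -> kinds that may come directly after it.
ALLOWED_NEXT = {
    "START": {"TEXT", "OPEN", "END"},
    "TEXT": {"TEXT", "OPEN", "END"},
    "OPEN": {"NAME"},
    "NAME": {"DOT", "PASS", "CLOSE"},
    "DOT": {"NAME"},
    "PASS": {"NAME"},
    "CLOSE": {"TEXT", "OPEN", "END"},
}


def validate(kinds, enabled: bool = True) -> bool:
    """Check that every token kind is allowed to follow the one before it."""
    if not enabled:
        # E.g. in production: skip the check entirely for speed.
        return True
    previous = "START"
    for kind in list(kinds) + ["END"]:
        if kind not in ALLOWED_NEXT.get(previous, set()):
            return False
        previous = kind
    return True


valid = validate(["OPEN", "NAME", "DOT", "NAME", "CLOSE"])    # like {a.b}
invalid = validate(["OPEN", "DOT", "CLOSE"])                   # like {.}
skipped = validate(["OPEN", "DOT", "CLOSE"], enabled=False)    # check bypassed
```

The `enabled` flag mirrors the idea of skipping sequence validation where speed matters more than early error reporting.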

So this is, in my opinion, something worth researching. It may not be feasible to achieve all of the above, but even the little research project is already capable of lexing all but a few of the described cases (in addition to, obviously, being able to lex the normal Fluid syntax).

The really good part is that because the API of Fluid is open the way it is, switching to an improved lexer (once it is stable enough!) may be as easy as adding a composer dependency.

But for now, consider this a vision backed by research; a sign of what may come :)

masi commented Jan 12, 2018

<v:h arg="<v:h2 />" /> # throws error

This isn't valid HTML/XML, so to be valid it should be written like this:

<v:h arg="&lt;v:h2 /&gt;" />

Of course in this case the argument will be "<v:h2 />" as the entity decoding must be done as the very first step.

masi commented Jan 12, 2018

<v:h arr="[a, b, c]" /> // creates: ["a" => $a, "b" => $b, "c" => $c]

Why use square brackets? ECMA6 does well with curly brackets.

IMHO it would be more useful to allow an alternative {} syntax and use [] for lists (sparse arrays)

<v:h arr="{a, b, c}" /> // creates: ["a" => $a, "b" => $b, "c" => $c]
<v:h arr="{a, x:b, c}" /> // creates: ["a" => $a, "x" => $b, "c" => $c]
<v:h arr="[a, b, c]" /> // creates: [0 => $a, 1 => $b, 2 => $c]
<v:h arr="[a, x:b, c]" /> // creates: [0 => $a, 'x' => $b, 1 => $c] - or should the key for $c be 2 ???
