@GerHobbelt
Last active August 29, 2015
Response to a PEG.js question on the mailing list -- too big for a mailing-list response; blog-entry sized, but still lacking the context to stand on its own

Note: I'm not a native English speaker and don't have the natural-language-processing jargon down pat, so regarding 'preceding' vs. 'proceeding': I assume the PREceding text before 'x' in 'zzxz' is the leading 'zz' (the 'history') and the PROceeding text for the same is the trailing 'z'; hence 'proceeding text' would be similar to 'look-ahead' as it is called in (unambiguous) computer language parsing.

Anyway... I'm sure I don't get everything you say but there's at least one subject which is certainly relevant here:

handling/coping with history (preceding input) vs. look-ahead (future, incoming, proceeding input)

In your sample grammar and sample input 'zzxz' I assume each character is a single token. (Yes, PEG always processes characters as it blends lexer and parser into a single specification language; I'm just sufficiently dinosaur to appreciate the difference between 'character stream' and 'token stream': the token stream is where such nice horrors as left recursion are to be expected, in my mind.)

Now to your problem: what I observe is that you only look back, i.e. you only inspect your parse history. Let's go through it your way (or rather, what I believe your thought process is here):

When you look at 'x' and 'know' that your parse history is 'zz', you're spot on, but the problem starts with the question: how did you get to 'x' in the first place? How did your machine 'travel' through this input stream, guided by your grammar definition?

Recalling your sample grammar ruleset:

Let's give an example with three rules that could appear:

  1. Replace 'x' with 'y', after 'zz'
  2. Replace 'z' with 'q', after 'z'
  3. Replace 'z' with 'w'

we step through, one token (here also: character) at a time (please hold your breath if you notice that the next section isn't 'PEGgish behaviour'! I am going through this at a slow pace because this is very fundamental stuff: it is important to understand and recognize that both your (OP) way and PEG parsing are valid, but quite different!):

  1. we always start at the left edge, hence we first look at the initial 'z' and see if anything matches. If not, we push the token and fetch the next, again checking the ruleset in hopes of hitting a rule match.
    In the sample grammar case we immediately hit a rule: rule 3.
    Hence 'zzxz' -> 'z' --> 'w' (the mix would thus be: 'wzxz')

  2. we now look at the next 'z' and 'know' that our history is one 'z' already parsed and processed: does 'z', with history 'z' match any rule other than rule 3?
    Yep, rule 2!
    hence the new mix result goes from 'wzxz' --> 'wqxz'.

  3. now we inspect the third token: 'x', with history 'zz':
    same question applies: any rules match this one and if so, which is the first (highest priority) rule to match?
    rule 1.
    Thus a preliminary result mix goes from 'wqxz' --> 'wqyz'

  4. the last 'z' token/character. Same game as before ==> rule 3 ==> 'wqyz' --> 'wqyw'
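
To make that intended behaviour concrete, here is a plain JavaScript sketch of the walkthrough above (the function name and structure are mine, purely for illustration; this is not PEG): each character is visited exactly once and the highest-priority rule whose history check passes gets to rewrite it.

// Untested sketch: apply the first matching rule to every character,
// checking the *original* input history (not the rewritten output) each time.
function rewrite(input) {
  var out = "";
  for (var i = 0; i < input.length; i++) {
    var c = input[i];
    var history = input.slice(0, i);                // raw input already visited
    if (c === 'x' && /zz$/.test(history))      out += 'y';   // rule 1
    else if (c === 'z' && /z$/.test(history))  out += 'q';   // rule 2
    else if (c === 'z')                        out += 'w';   // rule 3
    else                                       out += c;     // rule 4: the copy rule
  }
  return out;
}

// rewrite('zzxz') === 'wqyw'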

And that MAY be the expected grammar transformation output IFF you coded the transformation process as a set of independent rules. (The grammar parsing process is another matter; the two are often mixed, as the transformations may be coded in the grammar rules' 'action block' code sections.) Coding all these rules in a single 'recursive PEG rule spec' does NOT make these rules independent per se:

// sample grammar; '//' is comment till end-of-line
//
// note that I call a rule 'transformation' here to emphasize the process; 
// implicitly I assume a few things:
//
// - every token is processed exactly once (hence I can say that a token 'is' a 
//   transformation and thus my brain picks this odd 'transformation' title as 
//   the choice for a rule)
//
// - next to the three sample rules, there's a 4th: the nil transformation: 
//   if you hit none of the above 3, you simply copy the token
//
// - we track our input history separately (PEG does NOT do that for us, hence 
//   we will need to provide this ability)

language: transformation+

transformation:
  // rule 1
  history('z' 'z') 'x'     { replace_with('y') }
/ 
  // rule 2
  history('z') 'z'     { replace_with('q') }
/ 
  // rule 3
  'z'     { replace_with('w') }
/ 
  // rule 4: the implicit copy rule
  .     { copy() }
;

Using a grammar like this (which is NOT A PEG GRAMMAR: we use the non-PEG 'history' preceding-input test here) will transform 'zzxz' to the above-constructed transform output: 'wqyw'.

What is not-PEG about the above grammar and/or way of thinking?

A few key things:

  • you look at the history, which is reasonable to do, but language parsers, and not just PEG grammars, are meant to look into the future (we'll get to that shortly)

    • You used the & predicate operator to check on history in your own example, but the PEG predicates are meant to look into the future -- I used the history operator, which is not part of the PEG language, as the intended match operation. If you want special history checking, you can emulate it using custom code and &{...} user-code predicates instead. It won't be nice 'n simple, though, as you will need to do some history-stream tracking of some sort (a rough sketch of this appears near the end of this note)... ;-)
  • I said the transforms as described would work that way if they were considered independently, in a way where only their priority in the specification matters. Regrettably it doesn't work like that in PEG (or similar parse engines):

    • The way described above re-considers the parsed history again and again: we inspected the history for each rule match attempt.

      PEG et al. consume/transform history, i.e. these parsers do not keep track of the historic token stream. Instead, when one or more tokens match a rule, they are 'transformed' into an instance of the rule itself.

      In other words, when you parse and transform the rule 3 single 'z', that 'z' is then gone as a token and from that point in time onward is thought of as being an instance of the rule 'transformation' itself. (You'll see what I mean when we re-run the example ruleset below.)

look into the future and forget the past

Let's convert this to something more suitable for PEG and friends and start with the first key thought process change: we now try to look into the future.

Regrettably we have a rule set which cannot be simply translated to an equivalent where we look at what will come after, so we're immediately stuck.

Hence we table this one for later and take on the other key item:

PEG et al. consume/transform input on a per-rule basis

We simply take the grammar above, throw away the bits which are non-PEG and look what happens:

Recalling your sample grammar ruleset:

Let's give an example with three rules that could appear:

  1. Replace 'x' with 'y', after 'zz'
  2. Replace 'z' with 'q', after 'z'
  3. Replace 'z' with 'w'

with our hacked-to-PEG grammar spec:

// sample grammar; '//' is comment till end-of-line
//
// `history()` is gone the way of the dodo: now we recognize = parse those tokens
// on the spot.
// 
// We'll see what gives...

language: transformation+

transformation:
  // rule 1
  'z' 'z' 'x'     { replace_with(3, 'y') }
/ 
  // rule 2
  'z' 'z'     { replace_with(2, 'q') }
/ 
  // rule 3
  'z'     { replace_with(1, 'w') }
/ 
  // rule 4: the implicit copy rule
  .     { copy() }
;

Using this grammar, where the user-specified replace_with() function now has an extra argument so we can specify which part of the rule match is going to be transformed, we'll observe this behaviour:

  1. we always start at the left edge, hence we first look at the initial 'z' and see if anything matches.
    In the sample grammar case we immediately hit a rule: rule 3.
    Hence 'zzxz' -> 'z' --> 'w'
    Assuming we perform the transformation in a copy of the input stream, that copy will transform thus: 'zzxz' --> 'wzxz'.

Also note that the transformation rule now terminates for the first time: the language: transformation+ rule above now receives the first (of many) transformation results, which are accessible in the language rule action block (there is none there, alas, but then again the transformation rule does not yet return any useful value so there's little to accomplish here anyway).

  2. we now look at the next 'z' and 'know' that our history is one transformation

**the 'z' got parsed by the transformation rule, hence the parser, looking to parse the entire language, has now received one transformation rule result for the top transformation+ rule -- but we're not done yet**

PEG only keeps this 'history' in the language rule as the previously matched 'z' became a transformation result and is now part of the transformation+ under construction. So in a sense one might say that PEG (and other systems similar to it) do not track history. Indeed there is no token trace; the only 'memory' you have is whatever information is stored, in some form, in the rule-match actions of the rule tree you are currently inside.

In short: forget about history!

Anyway, we still have that second 'z' to parse and the first hit this time around is: rule 3. So once again 'z' --> 'w'. Assuming we perform the transformation in a copy of the input stream, that copy will transform thus: 'wzxz' --> 'wwxz'. Another transformation rule result is collected by the transformation+ match in the language rule. We continue.

  3. now we inspect the third token: 'x':
    same question applies: do any rules match this one and if so, which is the first (highest priority) rule to match?
    rule 1 doesn't fly as that rule wants the two leading z's as well, and we already ate those in the previous transformation rule rounds! Ouch!

Thus we end up at the copy rule: rule 4

Notice: PEG does not have implicit rules of any kind; I added that 'copy' rule explicitly earlier to make sure we end up with something which might actually work out in PEG.

Assuming we perform the transformation in a copy of the input stream, that copy will 'transform' thus: 'wwxz' --> 'wwxz' (no change). Another transformation rule result is collected by the transformation+ match in the language rule. We continue.

  4. the last 'z' token/character. Same game as before ==> rule 3 ==> 'wwxz' --> 'wwxw'

After this we terminate. (You might want to look into ways of signaling EOF (End-Of-File) termination in PEG, as you need to specify that explicitly in your (top) grammar rule, but that's something to pay attention to another time; this is getting long and complicated enough...)
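
As a side note on that EOF remark: many PEG dialects mark end-of-input with a negative look-ahead on 'any character'; a minimal sketch in PEG.js-style syntax (untested) would be:

// '!.' only succeeds when no character is left, so the rule as a whole
// only matches when the transformations have consumed the entire input.
language
  = transformation+ !.

As far as I recall, PEG.js itself already reports an error when the start rule does not consume the whole input, so this matters mostly for other PEG tools or when such a rule is used deeper inside a grammar.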


Okay, that kinda works (we'll sure have to fiddle a bit here and there to make it actually compile in PEGjs but this is the gist of it).

Can't we improve a bit on this, just to strengthen the understanding of how a PEG top-down parser actually works? Yes, we can: we can use the fact that every rule match produces a result:

// sample grammar; '//' is comment till end-of-line
//
// We now *use* the rule-match-produces-a-rule-result in the grammar rule action code blocks.
// Otherwise no change from the previous incantation.
// 
// We'll see what gives...

language: a:(transformation+)   { return a.join('') }

transformation:
  // rule 1
  'z' 'z' 'x'     { return 'y' }
/ 
  // rule 2
  'z' 'z'     { return 'q' }
/ 
  // rule 3
  'z'     { return 'w' }
/ 
  // rule 4: the implicit copy rule
  token:.     { return token }
;

Using this grammar, where the user-specified replace_with() function is replaced by a simple return statement, we'll observe this behaviour and comment a bit on the semantic effect of this edit/change:

  1. we always start at the left edge, hence we first look at the initial 'z' and see if anything matches.
    In the sample grammar case we immediately hit a rule: rule 3.
    Hence 'zzxz' -> 'z' --> 'w' (fictional 'zzxz' copy becomes 'wzxz')

Also note that the transformation rule now terminates for the first time: the language: transformation+ rule above now receives the first (of many) transformation results (value: 'w'), which are accessible in the language rule action block where we concatenate the returned rule results into an output string (which would then be identical to our final 'fictional copy' mentioned above).

  2. we now look at the next 'z' in a new invocation of rule transformation, and since this is exactly as in the previous 1-2-3-4 test run further above, there's nothing different from back then and we thus end up at rule 3 matching for the second time here.

So once again 'z' --> 'w':
The fictional copy will transform thus: 'wzxz' --> 'wwxz'.
As before, this transformation rule's result is also collected by transformation+ in the top-most language rule and we thus end up with a preliminary array of ['w', 'w']. We continue.

  3. now we inspect the third token: 'x', and this becomes very boring indeed.
    There's no match except rule 4, the copy rule, as we forgot about our parse history (we cannot access the previous matches inside earlier transformation rule invocations: that invocation activity is simply calling, and returning from, functions on the regular call stack, so the internals of previous calls to the same rule are lost forever).

Thus we end up at the copy rule: rule 4, where we return the matched token (an arbitrary single character) which happens to be our 'x', thus the transformation+ collection in the parent rule now becomes the new preliminary set ['w', 'w', 'x'].

The fictional copy will transform thus: 'wwxz' --> 'wwxz'.

  4. the last 'z' token/character. Same game as before ==> rule 3 ==> fictional copy: 'wwxz' --> 'wwxw'

After this we terminate. (Again, we remind you to look into EOF signaling in the PEG grammar rules, but that's another (minor) subject.)
Thus transformation+ will carry the array ['w', 'w', 'x', 'w'] and the language action code block will produce the parser result via .join(''):

wwxw
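
Just to tie this back to actual usage: with the 2015-era PEG.js API, generating and running such a parser would look roughly like this (the file name is made up, and the grammar above would still need minor syntax fixes before it actually compiles):

// Hypothetical usage sketch.
var fs  = require("fs");
var PEG = require("pegjs");                        // PEG.js 0.8/0.9-era API

var parser = PEG.buildParser(fs.readFileSync("transform.pegjs", "utf8"));
console.log(parser.parse("zzxz"));                 // --> "wwxw", not the "wqyw" we want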

Before we go on with the next stage of our development, we do note that the new action code blocks are quite different from the previous version further above: the rule 1 alternative, for instance, now returns the single-character string 'y' as the transformation result representing the entire 'z' 'z' 'x' sequence it consumed, where the previous version used replace_with(3, 'y') to indicate which part of that match should be transformed.

Reconsidering the first item: how can looking into the future help us here? Predicates at work

Predicates in PEG (and ANTLR, etc.) are meant to test the future without consuming it; in other words a predicate MAY be thought of as a rule (section) which runs a parse on that predicate as if it were a 'subrule' and checks for a match, but otherwise does not 'include' the parse result in the rule match process itself.
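
A minimal illustration of such a non-consuming peek, in PEG.js-style syntax (hypothetical rule, untested):

// The &'b' look-ahead checks that a 'b' follows the 'a', but leaves that
// 'b' in the input stream for whatever rule gets to run next.
ab
  = a:'a' &'b' { return a; }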

The simplest way to transform your original way of 'look back into history' way of thinking into one where your rules 'look into the future' is to treat each entire rule as a predicate and take it from there:

// sample grammar; '//' is comment till end-of-line
//
// We now use predicates representing each original syntactic rule. 
// This grammar isn't otherwise ready yet as we have done something nasty to make 
// each rule match a single token so that we end up with a ruleset which is **safe** 
// as each rule will consume a single token.
//
// No further thought on the rules has been spent yet, this is just to show 
// the predicate alteration. We're *growing* a grammar spec here, folks...
//
// We'll see what gives...

language: a:(transformation+)   { return a.join('') }

transformation:
  // rule 1: match the predicate, eat one token, observe what happens. :-)
  &rule1 .        { return 'y' }
/
  // rule 2
  &rule2 .        { return 'q' }
/ 
  // rule 3
  &rule3 .        { return 'w' }
/ 
  // rule 4: the implicit copy rule
  token:.         { return token }
;

// our first stab at predicates for 'into-the-future-looking' rules which were 
// originally formulated as looking into history.
// Do note that we predicate on BOTH HISTORY PLUS PRESENT to construct a viable 
// FUTURE predicate here.
// Run this thought-process without the PRESENT (i.e. last token in each rule) 
// and see Chaos at close range: *fun* results will happen to y'all!  :-)

rule1:
  'z' 'z' 'x'     { return true }
;

rule2:
  'z' 'z'         { return true }
; 

rule3:
  'z'             { return true }
; 

Using this grammar we'll observe this behaviour and comment a bit on the semantic effect of this edit/change:

  1. we always start at the left edge, hence we first look at the initial 'z' and see if anything matches. In the sample grammar case we test the rule 1 predicate, pass the predicate check, so we consume a single character and return 'y', which is already not what we had intended :-((

  2. for the second 'z' we fail on the first predicate ('zzx') and also on the second predicate ('zz': the remaining input starts with 'zx'), but we pass rule3's 'z' predicate and thus, again wrongly, return 'w' upon consuming the single 'z' via the '.' dot rule bit.

  3. now we inspect the third token: 'x' and we fail both the 'zzx' and 'zz' predicates. We also fail the third rule's 'z' predicate check as we are at 'x', not 'z', hence we end up at the copy rule 4.

  4. the last 'z' token/character fails predicates 1 and 2 but passes rule3's predicate, producing 'w'.

    ywxw

which is again very wrong indeed as we want

wqyw

This does not lead to a good design. BAD BAD BAD

Hence my idea when I started writing this is not going to fly -- brain error there. :-(

So the next thing that comes to mind (or rather, the very first thing which came to mind before) is extremely ugly and nasty: fundamentally, it is adding token/parse-stream tracking in the action code blocks, plus using special &{...} code-based predicates where custom user code checks that history tape.

Something like that; but these are very rough notes only, as I now see that I must rethink my approach. !@#$%^*()!
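
For completeness, a very rough and untested sketch of that ugly approach, in PEG.js-flavoured syntax. The history buffer and the after() helper are names I just made up and have to be maintained by hand in the initializer and the action blocks; the &{...} semantic predicates themselves are a standard PEG.js facility:

// Untested sketch: emulate the non-PEG history() operator by tracking the
// consumed input ourselves and checking it in &{...} semantic predicates.
{
  var history = "";                        // raw input we have already consumed
  function after(s) {                      // does the history end in s?
    return history.slice(-s.length) === s;
  }
}

language
  = a:transformation+ { return a.join(''); }

transformation
  = &{ return after("zz"); } c:'x' { history += c; return 'y'; }   // rule 1
  / &{ return after("z");  } c:'z' { history += c; return 'q'; }   // rule 2
  / c:'z'                          { history += c; return 'w'; }   // rule 3
  / c:.                            { history += c; return c;   }   // rule 4: copy

A quick mental run of 'zzxz' through this does seem to produce the intended 'wqyw', but the side effects in the action blocks make it fragile as soon as any real backtracking happens above the transformation rule, so treat it as a sketch, not a solution.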

Another approach -- which is very prone to issues and wouldn't produce (tongue in cheek) the most maintainable of grammars anyway -- is to manually (or mechanically) deduce all the rule interrelations and encode them in a forward-looking grammar like this:

// sample grammar; '//' is comment till end-of-line
//
// forward-looking grammar, which should transform each character exactly once, 
// hence keep the same fundamental behaviour as the original non-PEG grammar.
//
// Not a happy maintainer this grammar will have. (When you get Yoda-speak in 
// a grammar file you know you're *toast* for real!   |:-((   )
//
// - every token is processed exactly once (hence I can say that a token 'is' a 
//   transformation and thus my brain picks this odd 'transformation' title as 
//   the choice for a rule)
//
// - next to the three sample rules, there's a 4th: the nil transformation: if 
//   you hit none of the above 3, you simply copy the token

language: a:(transformation+)   { return a.join('') }

transformation:
  // original rule 1 ('zz' then 'x') is rule3 + rule2 + rule1:
  a:rule3 b:rule2 c:rule1       { return a + b + c }
/
  // original rule 2 ('z' then 'z') is rule3 + rule2
  a:rule3 b:rule2               { return a + b }
/ 
  // original rule 3
  a:rule3                       { return a }
/ 
  // rule 4: the implicit copy rule
  token:.                       { return token }
;

// These rules are NASTY as they assume you, the programmer/architect, have made *sure* that
// the preconditions (old: history checks) are met before *entering* each and any of these
// rules!  EEEEEEEEEEEEEEEEEEEEEEEEEEK!

rule1:
  'x'     { return 'y' }
;

rule2:
  'z'         { return 'q' }
; 

rule3:
  'z'             { return 'w' }
; 

The above grammar should work (I think; it's not tested, and like the others it's a thought experiment only!) but you are in for some real trouble when you have history-dependent transformations which overlap one another, e.g.

Let's give an example with three rules that could appear:

  A. Replace 'x' with 'y', after 'zz'
  B. Replace 'z' with 'q', after 'zx'
  C. Replace 'z' with 'w', after 'xz'

Try that one on for size:

// sample grammar; '//' is comment till end-of-line
//
// forward-looking grammar, which should transform each character exactly once, 
// hence keep the same fundamental behaviour as the original non-PEG grammar.
//
// Not a happy maintainer this grammar will have. (When you get Yoda-speak in 
// a grammar file you know you're *toast* for real!   |:-((   )
//
// - every token is processed exactly once (hence I can say that a token 'is' a 
//   transformation and thus my brain picks this odd 'transformation' title as 
//   the choice for a rule)
//
// - next to the three sample rules, there's a 4th: the nil transformation: if 
//   you hit none of the above 3, you simply copy the token

language: transformation+

transformation:
  // rule A is zzx->y but did that first z come after another 'z' or did it follow an 'x'?
  // WE ARE TOAST! We HAVE INFINITE LOOK-BACK instead of infinite look-ahead (which is about
  // the same level of stress but at least PEG et al provide in-built means 
  // to handle the *latter* :-( ):
  . . .             { return ahh.... }
/
  // rule B: same story: cyclic history dependencies. I can't see a way out right now,
  // other than to dial down on the use of PEG-based rule testing and resort to user-code
  // custom checks.
  errr.........................
/ 
  // rule C
  ...ehhhhhhhhhhhh................'
/ 
  // rule 4: the implicit copy rule
  token:.                       { return token }
;

OP message:

On Tue, Aug 25, 2015 at 2:27 AM, Jadin wrote:
Ello!

I found this great tool while looking for language parsers (I'm working on a computational linguistics project for parsing & applying phonological rules), but I'm completely stumped on how to avoid my left-recursion cases. I was hoping someone could point out how to approach designing grammars while avoiding left recursion.   
Specifically, I am transforming text by passing the text through rules. The rules are all capable of having 4 components:

1. Old Text: The text to change (replace) in the input text
2. New Text: The text to insert in place of the old text, in the output text
3. Preceding Environment: The text that precedes the old text in the input text
4. Proceeding Environment: The text that proceeds the old text in the input text

These rules are all optional. If there is no Old Text, then the New Text is inserted between the preceding & proceeding environments. If there is no New Text, then the Old Text between the pre/pro environments are simply deleted.

These rules are user-generated. I actually use a static PEG.js parser to parse the user-supplied shorthand for these rules into a PEG.js grammar string, which I then use to generate the parser to parse text.

Usually things are straight forward - if each character/group of characters in the input string is always transformed into the same output character(s), everything works great. There is no need to examine the environment around the parser position.

However, I have not been able to get preceding environments to work, in a general sense.
Let's give an example with three rules that could appear:
1. Replace 'x' with 'y', after 'zz'
2. Replace 'z' with 'q', after 'z'
3. Replace 'z' with 'w'

And we are given the string 'zzxz'.

As I currently understand it, we try to match rule 1 first, because the preceding environment is longest. So, at the first 'z', we match for rule 1. We could return the New Text immediately. That gives us:
'zzyw'
Which does not conform to our rules! We want the 'zz' to be parsed as well. This is an example of the more general issue that these preceding environments are in danger of being unparsed.

The only fix I've found is to conditionally match [ "&('z' 'z')" for rule 1 ] the preceding environment, and then, inside of rule 1 before matching the Old Text, recursively try to match all other existing rules on characters before the Old Text. An example from a grammar:
`
rule1
= a:( &(( "s" "s" ) 'a' ) rule1x ) { return a.join(''); }

rule1x
 = a:(
 rule0
/ rule2
/ rule3
/ rule4 ) 'a' {return a + 'xx'; }
`
This rule in shorthand:
- Replace "a" with "xx" after "ss"

However, this still results in cases of left recursion that is caught by the parser. Going back to the 3 toy rules, suppose that we did something similar to the example rule and recursively called the other rules on the matched previous environments.
We would have an infinite loop, because the rule 1 would match rule 2, which would match rule 1, and etc.

How should I re-think all of this?
