@alabamenhu
Last active November 21, 2021 15:32
Grammar proposal

This is designed to be a working proposal. Comments/corrections/suggestions are welcome, as the first draft was written fairly hastily. I'm working on doing a rough implementation to play around with, beginning with the Binex proposal, which can be found here. It is not currently a full implementation of this proposal, but progressing rapidly.

Background

Grammars in Raku are awesome, and allow for some truly amazing text parsing.
Unfortunately, they are less than ideal for binary files, and there is no way to have them support matching objects, both of which would be very useful (imagine being able to pattern match on an AST!). Working around this requires writing complex and/or error-prone code.

Goal

Create an easy-to-use, Raku-ish way to handle grammars that are binary or objecty. Practically speaking, these are two separate proposals, and will likely involve different optimizations, but they are treated together so that their end-user solutions are as similar as possible, e.g., saying Grammar is binary or Grammar is objecty and then modifying the interpretation of the tokens to a regex-like slang.

Proposal

Binary

A basic binary grammar under this proposal would look something like this:

grammar UTF-8 is binary[8] {
  token TOP { <byte-order-mark>? <utf-8-glyph>* }

  token byte-order-mark        { xEF xBB xBF }
  
  proto token utf-8-glyph { * }
  
  token utf-8-glyph:single     { b0.......                              }
  token utf-8-glyph:double     { b110..... b10......                    }
  token utf-8-glyph:triple     { b1110.... b10...... b10......          }
  token utf-8-glyph:quadruple  { b11110... b10...... b10...... b10......}

  proto token utf-8-stream { <byte-order-mark> <utf-8-glyph> * }
}

Where x00 represents a byte in hexadecimal, o000 in octal, b00000000 in binary, etc. For simplicity's sake, each byte should be written out in full. The default unit would be a byte, but because some grammar definitions may benefit from it, it might be useful to base the grammar not on a byte-by-byte sequence but on words of 16, 32, or 64 bits, enabled via parameterization (is binary[16]). In such cases, an underscore may delineate groups but is otherwise ignored, e.g. 0xffff_ffff for a 32-bit hex value (although 0xf_f_f_f_f_f_f_f would be theoretically valid too).
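One way to read these literals is as a mask plus a value: each written digit contributes fixed bits, and each dot contributes don't-care bits. The following is a minimal Python sketch of that reading (Python only because the slang itself cannot be run yet). It is not part of the proposal or its test implementation, and the names compile_unit and unit_matches are hypothetical.

```python
RADIX_BITS = {"b": 1, "o": 3, "x": 4}  # bits contributed by one digit


def compile_unit(literal: str) -> tuple:
    """Turn a literal such as 'b110.....' or 'xE.' into (mask, value)."""
    digit_bits = RADIX_BITS[literal[0]]
    mask = value = 0
    for digit in literal[1:].replace("_", ""):  # underscores are ignored
        mask <<= digit_bits
        value <<= digit_bits
        if digit != ".":  # a dot leaves its bits as don't-care
            mask |= (1 << digit_bits) - 1
            value |= int(digit, 1 << digit_bits)
    return mask, value


def unit_matches(unit: int, literal: str) -> bool:
    """True when every fixed bit of the literal agrees with the unit."""
    mask, value = compile_unit(literal)
    return (unit & mask) == value


# The lead byte of a two-byte UTF-8 sequence has the form 110xxxxx.
print(unit_matches(0xC3, "b110....."))  # True
print(unit_matches(0x83, "b110....."))  # False (continuation byte)
```

The sketch does not check that a literal fills the unit width exactly, which a real implementation would presumably want to enforce.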

In a binary grammar, strings are considered invalid, either bare or quoted, although they could be included via a method that returns a Blob (similar to a method that returns a string).

Alternatives can be given like in regular Regex, using | (LTM) or || (short circuit).

For character classes, I see three useful ideas:

  • <[ x00 .. x1f ]> would match values from 0 to 31.
  • <[ b.......1 ]> would match odd numbers.
  • <[ b.......1 b00000000]> would match odd numbers or 0.

The middle one, of course, would seem pointless given that a bare b.......1 would be valid, but it becomes a fair bit more powerful when used as a negator: < +[x80 .. xff] -[b.......1]> would represent all even values in the upper half of the byte range (x80 through xff with the odd values removed). I think it would be optimal, and not particularly complex, to allow a construction like o00.0 .. o04.7 and treat it like a string range, e.g., 00.0, 00.1 … 00.7, 01.0, 01.1, with the dot preserved as a wildcard in each. An optimization stage can try to determine whether there's a compact representation (< +[x80 .. xff] -[b.......1]> becomes b1..._...0) and, if not, fall back to a sequential test.
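The optimization step can be illustrated with a small Python sketch. It assumes 8-bit units and treats a class as a plain set of byte values; expand and compact are hypothetical names, and the set-based approach is only one possible way to decide whether a compact wildcard form exists.

```python
from typing import Optional


def expand(literal: str) -> set:
    """All byte values matched by a binary literal such as 'b.......1'."""
    bits = literal[1:].replace("_", "")
    return {
        v for v in range(256)
        if all(p == "." or p == b for p, b in zip(bits, format(v, "08b")))
    }


def compact(values: set) -> Optional[str]:
    """Return a single wildcard literal equal to the set, or None."""
    if not values:
        return None
    all_and, all_or = 0xFF, 0
    for v in values:
        all_and &= v
        all_or |= v
    free = all_and ^ all_or  # bits on which the members disagree
    # The set is a pure wildcard pattern only if it contains every
    # combination of the free bits.
    if len(values) != 1 << bin(free).count("1"):
        return None
    return "b" + "".join(
        "." if (free >> i) & 1 else str((all_and >> i) & 1)
        for i in range(7, -1, -1)
    )


upper = set(range(0x80, 0x100))
print(compact(upper - expand("b.......1")))  # b1......0
print(compact(expand("b.......1") | {0}))    # None: needs a sequential test
```

Subtracting the odd values from x80 .. xff leaves the even ones, which is why the compact form ends in 0.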

Inline Binex

For use in inline situations, all of the // syntax would be available, with :bin added as an adverb:

  • match: m:bin:options/find/
  • substitution: s:bin:options/find/replace/
  • substitution (nondestructive): S:bin:options/find/replace/
  • transliteration: tr:bin:options/swap/swap/
  • transliteration (nondestructive): TR:bin:options/swap/swap/

Split bytes

One issue that seems odd, but has real-world use, is allowing captures/tokens betwixt bytes/words. In the aforementioned Zelda 3 article, the format would effectively be for us:

grammar is binary[8] {
  token TOP          { <chunk>+? <end>                       }
  token end          { xFF                                   }
  token chunk        { <header> <data: +$<header><length> >  }
  token header       { b..._.....                            }
  token data($count) { x.. ** {$count}                       } 
}

The catch, however, is how to handle the splitting up of header into the command (first three bits) and the length (latter five bits). I'm not sure what the best syntax to use here would be. No doubt there are other formats where a sub byte item might even be repeated. In this particular case, a work around could be to say

grammar is binary[8] {
  token TOP   { <chunk>+? <end> }
  token end   { xFF }
  token chunk {
    my $*cmd; 
    my $*length; 
    <header> 
    <data: $*cmd, $*length>
  }
  token header { 
    b..._..... { 
      $*cmd    = +$¢ +> 5;
      $*length = +$¢ +& 31 + 1; # length 0 is 1
    }
  }
  
  enum ( Copy => 0, ByteRept => 1, WordRept => 2, ByteIncr => 3, CopyExst => 4);
  
  multi token data(              Copy , $count) { x.. ** {$count} } 
  multi token data(ByteRept | ByteIncr, $count) { x..             } 
  multi token data(WordRept | CopyExst, $count) { x.. ** 2        } 
}

While that would work, it seems inelegant (and makes it impossible to handle a token that ends in the middle of a byte/word). Instead, we'll provide an additional option of X and Z, where X means “bit I don't care about, shove it out of the way” and Z means “bit I don't care about, but want it zeroed out”.
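To make the intended effect concrete, here is a Python sketch of X and Z applied to the chunk header. It only models the value transformation, under the assumption that X appears solely on the trailing edge (so it reduces to a right shift); it does not check literal 0/1 digits. The name scour is borrowed from the trait idea further down and is otherwise hypothetical.

```python
def scour(byte: int, pattern: str) -> int:
    """Apply X (shift out) and Z (zero out) to one byte; '.' bits are kept."""
    bits = pattern[1:].replace("_", "")
    if len(bits) != 8:
        raise ValueError("expected eight bit positions")
    kept = bits.rstrip("X")
    if "X" in kept:
        raise ValueError("this sketch only handles X on the trailing edge")
    x_count = len(bits) - len(kept)
    keep_mask = 0
    for i, c in enumerate(bits):
        if c not in "XZ":
            keep_mask |= 1 << (7 - i)
    return (byte & keep_mask) >> x_count


header = 0b011_00100
cmd = scour(header, "b..._XXXXX")         # same as header >> 5
length = scour(header, "bZZZ_.....") + 1  # same as (header & 31) + 1
print(cmd, length)                        # 3 5
```

The + 1 follows the "length 0 is 1" comment in the workaround grammar above.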

The &/&& conjunctions are not commonly used in string-based grammars, but this could be a great place to use them with regularity. Because you could do (at least if needing to split a single byte):

grammar is binary[8] {
  token TOP   { <chunk>+? <end> }
  token end   { xFF }
  token chunk {
    [<cmd> && <length>]
    <data: $<cmd>.head, $<length>.head>
  }
  token cmd    { b..._XXXXX }
  token length { bZZZ_..... }
  
  enum ( Copy => 0, ByteRept => 1, WordRept => 2, ByteIncr => 3, CopyExst => 4);
  
  multi token data(              Copy , $count) { x.. ** {$count} } 
  multi token data(ByteRept | ByteIncr, $count) { x..             } 
  multi token data(WordRept | CopyExst, $count) { x.. ** 2        } 
}
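For comparison, this is roughly the hand-written parsing that the grammar is meant to replace, sketched in Python. It assumes the layout described above: a header byte with a 3-bit command and a 5-bit length (stored minus one, as in the earlier workaround), payload sizes as given by the data multis, and xFF as the end marker. The function name parse_chunks and the tuple layout are hypothetical.

```python
from enum import IntEnum


class Cmd(IntEnum):
    Copy = 0
    ByteRept = 1
    WordRept = 2
    ByteIncr = 3
    CopyExst = 4


# Payload size in bytes for commands whose size does not depend on the count.
FIXED_SIZE = {Cmd.ByteRept: 1, Cmd.ByteIncr: 1, Cmd.WordRept: 2, Cmd.CopyExst: 2}


def parse_chunks(blob: bytes) -> list:
    """Return [(cmd, count, data), ...] up to the xFF end marker."""
    chunks = []
    pos = 0
    while pos < len(blob):
        header = blob[pos]
        if header == 0xFF:
            return chunks
        cmd = Cmd(header >> 5)          # raises ValueError for 5..7
        count = (header & 0b11111) + 1  # a stored length of 0 means 1
        size = count if cmd is Cmd.Copy else FIXED_SIZE[cmd]
        data = blob[pos + 1 : pos + 1 + size]
        if len(data) != size:
            raise ValueError("truncated chunk")
        chunks.append((cmd, count, data))
        pos += 1 + size
    raise ValueError("missing xFF end marker")


stream = bytes([0b000_00010, 0x41, 0x42, 0x43, 0b001_00100, 0x00, 0xFF])
for chunk in parse_chunks(stream):
    print(chunk)
```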

The open question with this approach is what the match value should be. To reduce the problem:

  token a { <x> & <y>     }
  token b { <x>   <x>     }
  token c { b....XXXX <x> }
  
  token x { b....XXXX     }
  token y { bZZZZ....     }

When matching b11110001 on token a, we'd want x to blobify to b00001111, and y to b00000001. But what would we want a to blobify to? We have three options: the original match (b11110001), and either of the two captures (b00001111 or b00000001). The answer might seem obvious (just use the original match), but someone might want to do something like token b, and when matching b11110001 b10100011 expect b to blobify to b00001111 b00001010.

Some off-the-top-of-my-head potential solutions, without regard for complexity of implementation and no particular order:

  • Only modify literals within a given token
    Token c above would blobify r-shifting the first byte, but leaving the second in place, but blobifying $<c><x> would reveal the modification specified in x
  • Create two different methods of blobifying.
    One would return a match directly, and the other would return the modified value (probably the direct match as default). The problem of token a would remain, though, as there would now be two modified values, and if there were a second junction, at least four, etc., with no clear way to distinguish them.
  • Scrap the idea entirely
    I don't really like this one, but I s'pose it's one solution.
  • Only certain tokens allow Z or X
    This could be done via a trait is scoured or with a different declarator. Those tokens would gain the ability to use Z and X in their definitions, but lose the ability to use &, && (| and || would not be affected, since they match one item). Those special tokens that include other special tokens will use the modified values in place, since the lack of & operands means we can guarantee no overlapped values. To use the match values, use a regular token, or as a one-off option, perhaps the syntax <,foo> which is currently otherwise invalid.

My test implementation doesn't yet handle the operators, so I've not had to deal with the question too much yet, but it's looming.

Non-aligned tokens

Perhaps in a later version of the standard (because of the complexity of the code to support it and no doubt speed implications), an optional trait "is maligned" (because O(fun) names) could be added later to allow for non-full-byte/word tokens, without compromising previous code.

Object grammar

The idea for the object grammar came to me when I was processing some part-of-speech tagged text. Each word was an object whose class looks something like this (simplified for this document).

class Word {
  has $.word;
  has $.lexeme;
  has $.part-of-speech;
  has $.number;
  has $.gender;
  has $.tense;
  has $.person;
}

For matching with objects, I think usurping the character class syntax, and hacking it a bit would provide a nice, generally clear syntax to allow for matching on types or attributes/values.

grammar ObjexMatcherSyntax {
  rule  TOP             { '<'  ~  '>' <match-container>+ }
  rule  match-container { '['  ~  ']' <match>            }
  rule  match           { <type>* ':' <arguments>        }
  rule  type            { <sign>      <typename>         }
  token sign            { '-' || '+'? }
}

Arguments would follow standard Raku syntax, with the following interpretations:

  • Positional arguments are smartmatched against the object, e.g. <[Int: 1]> would match an Int value of 1, and <[Int: * > 5, * %% 2]> would match all even Int values over 5 (6, 8, 10, etc., but not 1 or 7).
  • Attribute arguments are similarly smartmatched against the object's attribute. So <[Rat: :denominator(1)]> would match only whole number Rats, and <[Rat: :denominator(1,2,4 ... *)]> would match any power-of-two denominator because smart matching a list checks to see if it contains the element.
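The two rules can be modeled with a small Python sketch. Here smartmatching is approximated, as an assumption, by three cases: a callable acts as a predicate, a collection checks membership, and anything else is compared for equality. The names smartmatch and objex_match are hypothetical, and Fraction stands in for Rat.

```python
from fractions import Fraction


def smartmatch(value, pattern) -> bool:
    """Rough stand-in for smartmatching (an assumption of this sketch)."""
    if callable(pattern):
        return bool(pattern(value))
    if isinstance(pattern, (set, frozenset, list, tuple, range)):
        return value in pattern
    return value == pattern


def objex_match(obj, type_, *positional, **attributes) -> bool:
    """Model of <[Type: positional..., :attr(pattern)...]>."""
    if not isinstance(obj, type_):
        return False
    if not all(smartmatch(obj, p) for p in positional):
        return False
    return all(
        hasattr(obj, name) and smartmatch(getattr(obj, name), pattern)
        for name, pattern in attributes.items()
    )


# <[Int: * > 5, * %% 2]>
print(objex_match(6, int, lambda n: n > 5, lambda n: n % 2 == 0))  # True
print(objex_match(7, int, lambda n: n > 5, lambda n: n % 2 == 0))  # False

# <[Rat: :denominator(1)]>
print(objex_match(Fraction(3, 1), Fraction, denominator=1))        # True
```

An infinite list such as 1, 2, 4 ... * has no direct Python equivalent here, so a power-of-two check would have to be passed as a predicate instead.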

It may be that adding the +/- syntax for the type is overkill, and it would be better to keep with only additives, using the pipe | that's used elsewhere in Raku (after all, if someone really wanted, they could define a subset that explicitly handled more complex types). That would greatly simplify the syntax. Thoughts?

Problems/questions

Maybe it's just for my initial use case, but I feel like the typical use case for an Objex would want quicker access to the values/attributes of matched objects. Maybe that's just me though. But it definitely presents a different use case from strings. Rarely, if ever, do we care about the distinction between a character (single element) and a string (sequence) because Raku doesn't distinguish them. But when dealing with objects, such a distinction IS suddenly important as character : object :: string : list. For this reason, I think it might be a good idea to add an additional declarator to an Objex, which would be simply object (surprisingly and luckily, this is not used at all in the Raku spec!). The contents of the object would be identical to the selector described above (just without the angle brackets, and requiring square brackets only if there is more than one selector). Thus the custom declarators of a grammar that is objecty would be:

  • objex: backtracking, Match contains a List/Seq
  • rule/token: synonymous in our case, Match contains a List/Seq
  • object: Match contains an object directly.

I suppose it's possible to avoid adding new declarators and just say rule = sequence, token = one off object, but a new concept deserves a new declarator to avoid confusion. Using this idea, assuming I wanted to identify a sequence as a valid noun + adjective sequence, I might do the following:

  grammar ModifiedNoun is objecty { 
    token TOP { 
      <noun>               # the base noun
      <adj-list:           # followed by adjectives that
        $<noun>.gender,    #   match the noun's gender and
        $<noun>.number>    #   match the noun's number
    }
    
    token adj-list($g = *, $n = *) {
      [
        <adj: $g, $n>+        # any number of adjectives that agree
        <coordinator>         # if there's a list, need an and/or at the end.
      ]?
      <adj: $g, $n>      # an agreeing adjective
    }
    
    object noun { Word: :part-of-speech<noun> }
    
    object coordinator  {
      Word: 
        :part-of-speech<coordinator>
        :lexeme('y'|'o')   # only want and/or 
    }
    
    object adj($g = *, $n = *)  { 
      Word: 
        :part-of-speech<adjective> 
        :gender($g)   # default of Whatever matches all
        :number($n)
    }
  }

Without an object option, the TOP and noun tokens would be a bit messier:

  grammar ModifiedNoun {
    token TOP { 
      <noun>
      <adj-list: 
        $<noun>[0].gender,
        $<noun>[0].number
      >
    }
    token noun { 
      <[Word: :part-of-speech<noun>]> 
    }

which works, I guess, but just isn't as clean.
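To show what the ModifiedNoun grammar is meant to accept, here is a Python sketch that checks a list of Word objects by hand. It assumes the intent stated in the grammar's comments: a noun, then either a single agreeing adjective or a list of agreeing adjectives whose last member is preceded by a coordinator. The Word dataclass mirrors a subset of the class above, and parse_modified_noun is a hypothetical name.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Word:
    word: str
    lexeme: str
    part_of_speech: str
    number: Optional[str] = None
    gender: Optional[str] = None


def parse_modified_noun(words: list) -> bool:
    """True when the whole list is noun + agreeing adjective(s)."""
    if not words or words[0].part_of_speech != "noun":
        return False
    noun, rest = words[0], words[1:]

    def agrees(w: Word) -> bool:
        return (w.part_of_speech == "adjective"
                and w.gender == noun.gender
                and w.number == noun.number)

    def is_coordinator(w: Word) -> bool:
        return w.part_of_speech == "coordinator" and w.lexeme in ("y", "o")

    if len(rest) == 1:                      # noun + one adjective
        return agrees(rest[0])
    if len(rest) >= 3:                      # adj+ coordinator adj
        *listed, coord, last = rest
        return (all(agrees(w) for w in listed)
                and is_coordinator(coord) and agrees(last))
    return False


casa = Word("casa", "casa", "noun", "sg", "f")
roja = Word("roja", "rojo", "adjective", "sg", "f")
y = Word("y", "y", "coordinator")
bonita = Word("bonita", "bonito", "adjective", "sg", "f")
rojo = Word("rojo", "rojo", "adjective", "sg", "m")

print(parse_modified_noun([casa, roja, y, bonita]))  # True
print(parse_modified_noun([casa, rojo]))             # False: gender mismatch
```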

Inline Objex

For use in inline situations, all of the // syntax would be available, with :obj added as an adverb:

  • match: m:obj:options/find/
  • substitution: s:obj:options/find/replace/
  • substitution (nondestructive): S:obj:options/find/replace/
  • transliteration: tr:obj:options/swap/swap/
  • transliteration (nondestructive): TR:obj:options/swap/swap/

Document history

  • Updated April 17th to discuss the clash of the & operator with the X and Z values, and fixed a few other typos (/foo/bar/options isn't Raku, duh)
  • Updated April 9th to integrate b2gills' excellent suggestions on X and Z and inline naming, and fixed typos
@b2gills

b2gills commented Apr 10, 2021

I really, really like the idea about X and Z, as it's a great way of isolating data, and elegantly automates what would otherwise be boilerplate +& and +>/ ops. How do you imagine that X should be interpreted in these situations where it's not purely right-aligned?

Ignore the implementation of X and Z for the moment.
The idea was for what you might want the premade .ast/.made value to be.

0b1111_1111 ~~ m:bin/ bXXXX_.... /
    # equiv to X/Z --> 0b0000_1111
    # left shift   --> 0b1111_0000
0b1111_1111 ~~ m:bin / b..XX_XX.. /
    # right shift --> 0b0000_1111
    # left shift  --> 0b1111_0000
    # equiv to Z  --> 0b1100_0011
0b1111_1111 ~~ m:bin / bXX.._..XX /
    # right shift --> 0b0000_1111
    # left shift  --> 0b1111_0000
    # equiv to Z  --> 0b0011_1100

I feel like creating complicated rules of when it'd be left or right shift is going to create more headache for users than it's worth, so I'd be fine with catching it and giving an error that it can only be on the trailing edge of a byte/word. I can't imagine left shift being very commonly used like right shifting, but perhaps an L could be done just in case.

Like I said, ignore the implementation of X and Z.
The idea was for the high-level semantics.
(Torture the implementers on behalf of the users.)

The idea was for X to remove itself from the result, and for Z to set itself to zero.

That is X does not mean shift. It just happens that shifting is usually the inevitable result of having used X.

I only ever considered X shifting rightwards (smaller number). If someone wanted them to shift leftwards (larger number) I figured they could do it themselves afterwards. (It wouldn't be a common operation.)
My reasoning being that it would most likely be used as a number for further matching. Like a count of following things to match.

As an example of what I was thinking is the following.
(Pretend that a/b/c etc match the same as .)

/aaaa_aaaX/
# 0aaaa_aaa

/aaaX_bbbb/
# 0aaabbbb

/aaXb_bXcc/
# 00aabbcc

/aaZZXXbb/  or /aaXXZZbb/ or /aaXZXZbb/ or /aaZXZXbb/ or /aaZXXZbb/ or /aaXZZXbb/
# 00aa00bb

Note that that last example I wrote in 6 different but equivalent ways.
The code to produce the .ast (00aa00bb) from them could/should be exactly the same for them.
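These examples can be checked against a deliberately naive reference implementation. The Python sketch below walks the pattern bit by bit: X drops the bit, Z keeps the position but zeroes it, and anything else keeps the input bit, with the survivors packed toward the low end. The function name collapse is hypothetical, and the sketch ignores matching entirely; it only computes the resulting value.

```python
def collapse(value: int, pattern: str) -> int:
    """Reference semantics: X removes a bit, Z zeroes it, others are kept."""
    bits = pattern.replace("_", "")
    width = len(bits)
    result = 0
    for i, c in enumerate(bits):
        if c == "X":
            continue                      # the bit vanishes from the result
        bit = (value >> (width - 1 - i)) & 1
        result = (result << 1) | (0 if c == "Z" else bit)
    return result


print(format(collapse(0b1111_1111, "aaaa_aaaX"), "08b"))  # 01111111
print(format(collapse(0b1011_0110, "aaXb_bXcc"), "08b"))  # 00101010

# All six spellings of the last example give the same result.
spellings = ["aaZZXXbb", "aaXXZZbb", "aaXZXZbb", "aaZXZXbb", "aaZXXZbb", "aaXZZXbb"]
print({format(collapse(0b1111_1111, p), "08b") for p in spellings})  # {'00110011'}
```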


Now to specifics.

For some number of Xs on the right, a simple shift works. (Pseudo code)

/...._.XXX/
$_ +> X.count

It works because the bits fall off the end.

For some combination of X and Z on the right

/...._ZZXX/
/...._XXZZ/
/...._XZXZ/
/...._ZXZX/
/...._ZXXZ/
/...._XZZX/

#00.._..00

Since those have identical results:

You can either do two shifts

($_ +> (Z.count + X.count)) +< Z.count
# ($_ +> 4) +< 2

or a shift and a mask

($_ +> X.count) +& +^((2 ** Z.count) - 1)
# ($_ +> 2) +& 0b1111_1100

or

($_ +& +^((2 ** (X.count + Z.count)) - 1) ) +> X.count
# ($_ +& 0b1111_0000) +> 2
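That the three formulations agree can be confirmed by brute force. The Python sketch below evaluates each one for every byte value with two X bits and two Z bits on the right; the function names are hypothetical, and Python's ~ plays the role of the bitwise negation in the pseudo code.

```python
def two_shifts(v: int, x: int, z: int) -> int:
    return (v >> (x + z)) << z


def shift_then_mask(v: int, x: int, z: int) -> int:
    return (v >> x) & ~((1 << z) - 1)


def mask_then_shift(v: int, x: int, z: int) -> int:
    return (v & ~((1 << (x + z)) - 1)) >> x


X_COUNT, Z_COUNT = 2, 2
agree = all(
    two_shifts(v, X_COUNT, Z_COUNT)
    == shift_then_mask(v, X_COUNT, Z_COUNT)
    == mask_then_shift(v, X_COUNT, Z_COUNT)
    for v in range(256)
)
print(agree)                                               # True
print(format(two_shifts(0xFF, X_COUNT, Z_COUNT), "08b"))   # 00111100
```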

The only thing that someone using it is concerned with is that the result is what they expect. So any of the above is fine.

Any other code that ends up with identical results would also be fine.

For X in the middle

/...X_X.../

You will have to separate the parts, and combine them again later.

# /aaaXXbbb/
my $aaa = $_ +& 0b1110_0000; # mask
my $bbb = $_ +& 0b0000_0111;

my $result = ($aaa +> 2) +| $bbb; # combine

Of course there are more ways to implement something like that.

For X on the left

/XX.._..../

This would have identical results to having used Z. At least those were my thoughts.

I can see how you came to a different conclusion to what I meant. I should have been more clear.

It would have moved something on its left into those two spots, but there is nothing there to move so just zero it.

$_ +& 0b0011_1111

# or

$_ +& +^0b1100_0000   # AND with the inverted mask (a bare +^ XOR would flip those bits rather than clear them)

Simplified handling

The simple way to create code which ends up with the correct result (using the semantics I came up with) is to group parts together.

/ZXaa_ZZXX_bbZZ_XXcc/
#ZX
#  aa
#     ZZXX
#          bb
#            ZZ_XX
#                 cc


#   aa      bb     cc
#   \\      \\     ||
#    \===\   \\    ||
#        \\   ||   ||
# 0000_00aa_00bb_00cc

You would mask and capture the aa / bb / cc, shift them, and combine them after.

my $aa = $input +& 0b0011_0000_0000_0000;
my $bb = $input +& 0b0000_0000_1100_0000;
my $cc = $input +& 0b0000_0000_0000_0011;

$aa +>= 12 - 8; # from 12 to 8
$bb +>= 6 - 4;
$cc +>= 0 - 0; # nop

my $result = $aa +| $bb +| $cc;

# from __aa ____ bb__ __cc
#        \\      \\     ||
#         \===\   \\    ||
#             \\   ||   ||
#   to ____ __aa __bb __cc

The following is somewhat of an (untested) pseudo code for compilation of the action code.

my $bin-regex = 'ZXaa_ZZXX_bbZZ_XXcc';
$bin-regex .= trans( '_' => '' );

my @parts = $bin-regex.comb( / <[ZX]>+ | <:Ll + [.]>+ / );
# ["ZX", "aa", "ZZXX", "bb", "ZZXX", "cc"]

my $from-pos = 0;
my $to-pos = 0;
my @actions;

for @parts.reverse {
  when .starts-with('Z'|'X') {
    $from-pos += .chars; # skip the values from the input
    $to-pos += .comb('Z'); # move leftwards the output only for Z actions
  }
  default {
    push @actions, BinCap.new( :$from-pos, :$to-pos, count => .chars );

    # setup for next loop
    $from-pos += .chars;
    $to-pos += .chars;
  }
}

# this is the code that gets installed into the actions for the bin regex
# in the future it could be generated using RAKU-AST
my $action-code = -> $input {
  my $result = 0;
  for @actions -> $action {
    $result +|= $action.mask-and-shift($input)
  }
  $result
}

class BinCap {
  has $.from-pos;
  has $.to-pos;
  has $.count;

  method input-mask () {
    (2 ** $!count - 1) +< $!from-pos
  }

  method mask-and-shift ( $input ){
    ($input +& $.input-mask) +> ($!from-pos - $!to-pos)  # parens needed: +> binds tighter than -
  }
}
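Since the pseudo code above is marked as untested, here is a runnable Python port of the same idea, offered as a sketch rather than as the intended implementation. It keeps the structure (split into parts, walk them right to left, record one mask-and-shift action per kept group); BinCap is kept as a name, while compile_pattern is hypothetical.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class BinCap:
    from_pos: int  # bit position of the group in the input (from the right)
    to_pos: int    # bit position of the group in the result
    count: int     # width of the group in bits

    def input_mask(self) -> int:
        return ((1 << self.count) - 1) << self.from_pos

    def mask_and_shift(self, value: int) -> int:
        return (value & self.input_mask()) >> (self.from_pos - self.to_pos)


def compile_pattern(pattern: str):
    """Build a function mapping an input word to its collapsed value."""
    parts = re.findall(r"[ZX]+|[a-z.]+", pattern.replace("_", ""))
    from_pos = to_pos = 0
    actions = []
    for part in reversed(parts):
        if part[0] in "ZX":
            from_pos += len(part)        # skip these bits in the input
            to_pos += part.count("Z")    # only Z reserves room in the output
        else:
            actions.append(BinCap(from_pos, to_pos, len(part)))
            from_pos += len(part)
            to_pos += len(part)

    def action(value: int) -> int:
        result = 0
        for cap in actions:
            result |= cap.mask_and_shift(value)
        return result

    return action


action = compile_pattern("ZXaa_ZZXX_bbZZ_XXcc")
print(format(action(0xFFFF), "016b"))  # 0000001100110011
```

The shift amount is never negative because to_pos can only lag behind from_pos, which is what makes a single right shift per group sufficient.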

Also, should there be a way to, instead of zero-ing it out, one-ing something in? (or some other value). For binary, it'd be easy to say Y = fill with 1, but in higher bases, that wouldn't work. If it'd be useful, perhaps x...Z(f) to replace the fourth nibble with 0b1111.

I'm sure there is a better letter for it. (Perhaps something Unicodey?)

It would actually be fairly easy.

# /...._YYYY/
my $result = $input +| 0b0000_1111;

Re + inside of a byte/word, I can see how that might be useful, but visually throws me off some. E.g., does b11+_0_Z+ get interpreted as [b11+_0_Z]+ or b1[1+]0[Z+]? Logically, of course, the former wouldn't make sense since it'd allow for a split byte, but visually, it makes me think a few times what's intended.

b1[1+]_0_[Z+] of course, to match Raku regexes. (I think there should be a way to write b[11]+_0_[Z+].)

  match    →   result
1111_110Z
1111_1101  →  1111_1100
1111_1100 

1111_10ZZ
1111_1011  →  1111_1000
1111_1010
1111_1001
1111_1000

…

110Z_ZZZZ
1101_1111  →  1100_0000
1101_1110
1101_1101
…
1101_0111
…
1101_1100
…
1100_1100
…
1100_0000

…

1111_110Z
1111_1101 →  1111_1100
1111_1100

These would not match (needs at least two 1 bits)

1011_1111
1000_0000
1010_0000

Neither would this (no zero bit)

1111_1111

This also wouldn't match (nothing for Z+ to match)

1111_1110

Basically b1[1+]_0_[Z+] would mean match

  1. some number of 1 bits (at least 2)
  2. a 0 bit
  3. ignore the rest of the bits (at least 1), but in the result set them to 0
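The three steps can be emulated today by matching against the byte written out as a bit string. The Python sketch below does that with an ordinary regular expression; it is only a model of the described behavior for 8-bit units, and match_byte is a hypothetical name.

```python
import re
from typing import Optional

# "1" then "1+" then "0", followed by at least one ignored bit (the Z+ part).
PATTERN = re.compile(r"(11+0)([01]+)")


def match_byte(byte: int) -> Optional[int]:
    """Return the result value for b1[1+]_0_[Z+], or None if no match."""
    bits = format(byte, "08b")
    m = PATTERN.fullmatch(bits)
    if m is None:
        return None
    # Keep the matched prefix and zero out everything covered by Z+.
    return int(m.group(1) + "0" * len(m.group(2)), 2)


print(format(match_byte(0b1111_1101), "08b"))  # 11111100
print(format(match_byte(0b1101_1111), "08b"))  # 11000000
print(match_byte(0b1111_1110))                 # None: nothing left for Z+
```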

This is a bit more advanced thought, and may need to wait for a v2


After thinking on it, someone may want to collapse to the left sometimes.
So I may have a rethink on how X works. Or if it should even be using the letter X for it.

The only reason I used X is because texts about binary tend to use X for don't-care.

I then realized that I might want to either collapse or set to zero. So I redesignated X to be collapse, and Z for zero.
I hadn't thought on it much more than that.
I hadn't even thought about higher bases at all.

The second rule of language design is that ~~Larry~~ I'm allowed to change my mind. 😄

It took many years and iterations to get where Raku regexes are currently. It would be unlikely that a binary version would be as well designed in the first try.

(Let's see if it takes another full year for me to respond this time.)

@alabamenhu
Author

(Let's see if it takes another full year for me to respond this time.)

Ha, no worries. I put this on the back burner until RakuAST comes out. I really appreciate the detailed examples — they will give me plenty of food for thought.

Thankfully, I'm happy to take time with this. I'd rather get it right than do it fast.

@Skarsnik

You should maybe release some working code so we can start tweaking with to see if we encounter issues/thing not nice :)

@alabamenhu
Author

You should maybe release some working code so we can start tweaking with to see if we encounter issues/thing not nice :)

I have. See the test implementation. Development was put on hold pending RakuAST, so right now I'm just tossing around syntax ideas / uses / needs. Probably in the next month or two I'll begin work again in earnest implementing it.
