Skip to content

Instantly share code, notes, and snippets.

Embed
What would you like to do?
Grammar proposal

This is designed to be a working proposal. Comments/corrections/suggestions are welcome, as the first draft was written fairly hastily. I'm working on doing a rough implementation to play around with, beginning with the Binex proposal, which can be found here. It is not currently a full implementation of this proposal, but progressing rapidly.

Background

Grammars in Raku are awesome, and allow for some truly amazing text parsing.
Unfortunately, they are less than ideal for binary files, and there is no way to have them support matching objects, both of which would be very useful (image being able to pattern match on an AST!) This requires writing complex and/or error prone workaround code.

Goal

Create an easy-to-use, Raku-ish way to handle grammars that are binary or objecty. Practically speaking, these are two separate proposals, and will likely involve different optimizations, but are treated together so that their end-user solutions are as similar as posisble, e.g., saying Grammar is binary or Grammar is objecty and then modifying the interpretation of the tokens to a regex-like slang.

Proposal

Binary

A basic proposal binary grammar would look something like this:

grammar UTF-8 is binary[8] {
  token TOP { <byte-order-mark>? <utf-8-glyph>* }

  token byte-order-mark        { xEF xBB xFF }
  
  proto token utf-8-glyph { * }
  
  token utf-8-glyph:single     { b0.......                              }
  token utf-8-glyph:double     { b110..... b10......                    }
  token utf-8-glyph:triple     { b1110.... b10...... b10......          }
  token utf-8-glyph:quadruple  { b11110... b10...... b10...... b10......}

  proto token utf-8-stream { <byte-order-mark> <utf-8-glyph> * }
}

Where x00 represents a byte in hexademical, o0000 in octal, b00000000 etc. For simplicity's sake, each byte should be written out in full. Because some grammar definitions may benefit from it, while the default unit would be a byte, it might be useful to base the grammar not on a byte by byte sequence, but rather words of 16, 32, or 64 bits, enabled via parameterization (is binary[16]). In such cases, an underscore may delineate groups but is otherwise ignored, e.g. 0xffff_ffff for a 32 bit hex value, although 0xf_f_f_f_f_f_f_f would be theoretically valid too).

In a binary grammar, strings are considered invalid, either bare or quoted, although they could included via a method that returns a Blob (similar to a method that returns a string).

Alternatives can be given like in regular Regex, using | (LTM) or || (short circuit).

For character classes, I see two useful ideas:

  • <[ x00 .. x1f ]> would match values from 0 to 31.
  • <[ b.......1 ]> would match odd numbers.
  • <[ b.......1 b00000000]> would match odd numbers or 0.

The middle one, of course, would seem to be pointless given a bare b.......1 would be valid, but when used as a negator, it could be a fair bit more powerful, where < +[x80 .. xff] -[b.......1]> would represent all odd upper ASCII values. I think it would be optimal and not particularly complex to allow a construction like o00.0 .. o04.8 and treating it similar to the string range, e.g., 00.0, 00.1 … 00.8, 01.0, 01.2 with the dot preserved as a wildcard in all. An optimization stage can try to determine if there's a compact representation < +[x80 .. xff] -[b.......1]> becomes b1..._...1, and if not, fall back to a sequential test.

Inline Binex

For use in inline situations, all of the // syntax would be available but adding on as an option :bin:

  • match: m:bin:options/find/
  • substition: s:bin:options/find/replace/
  • substition (nondestructive): S:bin:options/find/replace/
  • transliteration: tr:bin:options/swap/swap/options
  • transliteration (nondestructive): TR:bin:options/swap/swap/

Split bytes

One issue that seems odd, but with real world use, would be to allow captures/tokens betwixt bytes/words. In the aforementioned Zelda 3 article, the format would effectively be for us:

grammar is binary[8] {
  token TOP          { <chunk>+? <end>                       }
  token end          { xFF                                   }
  token chunk        { <header> <data: +$<header><length> >  }
  token header       { b..._.....                            }
  token data($count) { x.. ** {$count}                       } 
}

The catch, however, is how to handle the splitting up of header into the command (first three bits) and the length (latter five bits). I'm not sure what the best syntax to use here would be. No doubt there are other formats where a sub byte item might even be repeated. In this particular case, a work around could be to say

grammar is binary[8] {
  token TOP   { <chunk>+? <end> }
  token end   { xFF }
  token chunk {
    my $*cmd; 
    my $*length; 
    <header> 
    <data: $*cmd, $*length>
  }
  token header { 
    b..._..... { 
      $*cmd    = +$¢ +> 5;
      $*length = +$¢ +& 31 + 1; # length 0 is 1
    }
  }
  
  enum ( Copy => 0, ByteRept => 1, WordRept => 2, ByteIncr => 3, CopyExst => 4);
  
  multi token data(              Copy , $count) { x.. ** {$count} } 
  multi token data(ByteRept | ByteIncl, $count) { x..             } 
  multi token data(WordRept | CopyExst, $count) { x.. ** 2        } 
}

While that would work, it seems inelegant (and making it impossible to handle a token that ends in the middle of a byte/word). Instead, we'll provide an additional option of X and Z, where X means “bit I don't care about, shove it out of the way” and Z means “bit I don't care about, but want it zeroed out”.

The &/&& conjunctions are not commonly used in string-based grammars, but this could be a great place to use them with regularity. Because you could do (at least if needing to split a single byte):

grammar is binary[8] {
  token TOP   { <chunk>+? <end> }
  token end   { xFF }
  token chunk {
    [<cmd> && <length>]
    <data: $<cmd>.head, $<length>.head>
  }
  token cmd    { b..._XXXXX }
  token length { bZZZ_..... }
  
  enum ( Copy => 0, ByteRept => 1, WordRept => 2, ByteIncr => 3, CopyExst => 4);
  
  multi token data(              Copy , $count) { x.. ** {$count} } 
  multi token data(ByteRept | ByteIncl, $count) { x..             } 
  multi token data(WordRept | CopyExst, $count) { x.. ** 2        } 
}

The open question with this approach is what the match value of should be. To reduce the problem:

  token a { <x> & <y>     }
  token b { <x>   <x>     }
  token c { b....XXXX <x> }
  
  token x { b....XXXX     }
  token y { bZZZZ....     }

When matching b11110001 on token a, we'd want x to blobify to b00001111, and yto b00000001. But what would we want to a to blobify to? We have three options: the original match (b11110001), and either of the two captures (b00001111 or b00000001). The answer might seem obvious to just use the original match, but someone might want to do something like in token b, and when matching b11110001 b10100011 expect b to blobify to b00001111 b00001010.

Some off-the-top-of-my-head potential solutions, without regard for complexity of implementation and no particular order:

  • Only modify literals within a given token
    Token c above would blobify r-shifting the first byte, but leaving the second in place, but blobifying $<c><x> would reveal the modification specified in x
  • Create two different methods of blobifying.
    One would return a match directly, and the other would return the modified value (probably the direct match as default). The problem of token a would remain, though, as there would now be two modified values, and if there were a second junction, at least four, etc., with no clear way to distinguish them.
  • Scrap the idea entirely
    I don't really like this one, but I s'pose it's one solution.
  • Only certain tokens allow Z or X
    This could be done via a trait is scoured or with a different declarator. Those tokens would gain the ability to use Z and X in their definitions, but lose the ability to use &, && (| and || would not be affected, since they match one item). Those special tokens that include other special tokens will use the modified values in place, since the lack of & operands means we can guarantee no overlapped values. To use the match values, use a regular token, or as a one-off option, perhaps the syntax <,foo> which is currently otherwise invalid.

My test implementation doesn't yet handle the operators, so I've not had to deal with the question too much yet, but it's looming.

Non-aligned tokens

Perhaps in a later version of the standard (because of the complexity of the code to support it and no doubt speed implications), an optional trait "is maligned" (because O(fun) names) could be added later to allow for non-full-byte/word tokens, without compromising previous code.

Object grammar

The idea for the object grammar came to me when I was processing some part-of-speech tagged text. Each word was an object whose class looks something like this (simplified for this document).

class Word {
  has $.word;
  has $.lexeme;
  has $.part-of-speech;
  has $.number;
  has $.gender;
  has $.tense;
  has $.person;
}

For matching with objects, I think usurping the character class syntax, and hacking it a bit would provide a nice, generally clear syntax to allow for matching on types or attributes/values.

grammar ObjexMatcherSyntax {
  rule  TOP             { '<'  ~  '>' <match-container>+ }
  rule  match-container { '['  ~  ']' <match>            }
  rule  match           { <type>* ':' <arguments>        }
  rule  type            { <sign>      <typename>         }
  token sign            { '-' || '+'? }
}

Arguments would follow standard Raku syntax, with the following interpretations:

  • Positional arguments are smartmatched against the object (e.g. <[Int: 1]> would match an Int value of 1, and <[Int: * > 5, * %% 2]> would match all even Int values over 5 (6,8,10, etc, but not 1 or 7).
  • Attribute arguments are similar smartmatched against the object's attribute. So <[Rat: :denominator(1)]> would match only whole number Rats, and <[Rat: :denominator(1,2,4 ... *)]> would match any power-of-two denominator because smart matching a list checks to see if it contains the element.

It may be that adding the +/- syntax for the type is overkill, and it would be better to keep with only additives, using the pipe | that's used elsewhere in Raku (after all, if someone really wanted, they could define a subset that explicitly handled more complex types). That would greatly simplify the syntax. Thoughts?

Problems/questions

Maybe it's just for my initial use case, but I feel like the typical use case for an Objex would want quicker access to the values/attributes of matched objects. Maybe that's just me though. But it definitely presents a different usecase over strings. Rarely, if ever, do we care about the distinction between a character (single element) and a string (sequence) because Raku doesn't distinguish them. But when dealing with objects, such a distinction IS suddenly important as character : object :: string : list. For this reason, I think it might be a good idea to add an additional declarator to an Objex, which would be simply object (surprisingly and luckily, this is not used at all in the Raku spec!). The contents of the object would be identically to the selector described above (just without the arrow brackets, and only require brackets if more than one selector). Thus the custom declarators of an grammar is objecty would be:

  • objex: backtracking, Match contains a List/Seq
  • rule/token: synonymous in our case, Match contains a List/Seq
  • object: Match contains an object directly.

I suppose it's possible to avoid adding new declarators and just say rule = sequence, token = one off object, but a new concept deserves a new declarator to avoid confusion. Using this idea, assuming I wanted to identify a sequence as a valid noun + adjective sequence, I might do the following:

  grammar ModifiedNoun is objecty { 
    token TOP { 
      <noun>               # the base noun
      <adj-list:           # followed by adjectives that
        $<noun>.gender,        match the noun's gender and
        $<noun>.number>        match the noun's number
    }
    
    token adj-list($g = *, $n = *) {
      [
        <adj: $g, $n>+        # any number of adjectives that agree
        <list-coordinator>    # if there's a list, need an and/or at the end.
      ]
      <adj: $g, $n>      # an agreeing adjective
    }
    
    object noun { Word: :part-of-speech<noun> }
    
    object coordinator  {
      Word: 
        :part-of-speech<coordinator>
        :lexeme('y'|'o')   # only want and/or 
    }
    
    object adj($g = *, $n = *)  { 
      Word: 
        :part-of-speech<adjective> 
        :gender($g)   # default of Whatever matches all
        :number($n)
    }
  }

Without an object option, the TOP and noun tokens would be a bit messier:

  grammar ModifiedNoun {
    token TOP { 
      <noun>
      <adj-list: 
        $<noun>[0].gender,
        $<noun>[0].number
      >
    }
    token noun { 
      <[Word: :part-of-speech<noun>]> 
    }

which works, I guess, but just isn't as clean.

Inline Objex

For use in inline situations, all of the // syntax would be available but adding on as an option :obj:

  • match: m:obj:options/find/
  • substition: s:obj:options/find/replace/
  • substition (nondestructive): S:obj:options/find/replace/
  • transliteration: tr:obj:options/swap/swap/
  • transliteration (nondestructive): TR:obj:options/swap/swap/

Document history

  • Updated April 17th to discuss the class of the & operator with the X and Z values, and fixed a few other typos (/foo/bar/options isn't Raku, duh)
  • Updated April 9th to integrate bgills' excellent suggestions on X and Z and inline naming, and fixed typos
@b2gills

This comment has been minimized.

Copy link

@b2gills b2gills commented Apr 9, 2020

It should be :my $*cmd; not my $*cmd; as :my is already used for this in regular regexes.

I think that whatever the syntax is, things which already exist in regexes should look the same.

So I think bin/…/ should be m:bin/…/.
Perhaps even m:bin(8)/…/.

That also means that anything that looks like an existing regular regex feature, should most likely be roughly equivalent to that regex feature.


The header from your example could be split up something like this:

token header       {
    & $<command> = b..._XXXXX
    & $<length>  = bXXX_.....
}

(The first & is optional.)

Something like that already works in regexs.

'12345678' ~~ /
    & $<command> = (...) .....
    & ...  $<length>  = (.....)
/

Both sides of & have to be the same length. The same as regular regexs.
I don't think that there should be a binary regex way to add 1 though. There is a reason we are able to embed regular Raku code.

X is commonly used in binary truth tables as don't care.
We could have b..._XXXXX shift right 5 bits.

For the times where you want it to be forced to zero, use Z instead.

0b01011101 ~~ m:bin/ b...Z_ZZZZ /
0b01011101 +& 0b1110_0000

0b01011101 ~~ m:bin/ b...X_XXXX /
0b01011101 +> 5

0b01011101 ~~ m:bin/ b01..._ZXX /
(0b01011101 +> 2) +& b111111110
(0b01011101 +> 3) +< 1 # or this

0b01011101 ~~ m:bin/ b...X_.... /
((0b01011101 +& b1110_0000) +> 1) +| (0b01011101 +& b0000_1111)

I would like it if the following worked.
Particularly b11+_0_Z+ and bZ+?_0_.+.

grammar UTF-8 is binary[8] {
  token TOP { <byte-order-mark>? <utf8-glyph>* }

  token byte-order-mark        { xEF xBB xFF }

  proto token utf8-glyph { * }

  token utf8-glyph:ASCII {
    b0.......
    { make chr +$/ }
  }

  token utf8-glyph:long {
    [
    &&  $<length> = b11+_0_Z+
        :my $trailing-count = 8 - $<length>.lsb - 1;
    &&  $<first> = bZ+?_0_.+
    ]
    <trailing> ** {$trailing-count}

    {
      my Int $make = +$<first>;
      for @<trailing> {
        $make = ($make +< 6)   +|   +$_
      }
      make chr $make;
    }
  }

  token trailing {
    # make sure it starts off with b10
    <?before(b10_......)>
    # but only actually match the last 6 bits
    bXX_......
    # bZZ_...... # would also work
  }
}
@alabamenhu

This comment has been minimized.

Copy link
Owner Author

@alabamenhu alabamenhu commented Apr 9, 2020

First, thanks for all the really great comments/observations!

Re :my, thanks for catching the typo. (I warned it was written hastily ha).

Absolutely agree on keeping things looking the same as much as possible. Honestly, I kind of forgot about the m/…/ syntax (I just use bare /…/). I think the idea of using the colon is interesting and am probably already convinced assuming I can get it to work correctly. Right now I'm still having issues overriding Raku's grammar, so the "outer" syntax of that sort has to go to the backburner. (seems I stumbled on a minor bug but lizmat couldn't track it down on a quick glance).

I really, really like the idea about X and Z, as it's a great way for isolating data, and elegantly automates what would otherwise be boilerplate +& and +>/ ops. How do you imagine that X should be interpretted in these situations where it's not purely right-aligned?

0b1111_1111 ~~ m:bin/ bXXXX_.... /
    # equiv to X/Z --> 0b0000_1111
    # left shift   --> 0b1111_0000
0b1111_1111 ~~ m:bin / b..XX_XX.. /
    # right shift --> 0b0000_1111
    # left shift  --> 0b1111_0000
    # equiv to Z  --> 0b1100_0011
0b1111_1111 ~~ m:bin / bXX.._..XX /
    # right shift --> 0b0000_1111
    # left shift  --> 0b1111_0000
    # equiv to Z  --> 0b0011_1100

I feel like creating complicated rules of when it'd be left or right shift is going to create more headache for users than it's worth, so I'd be fine with catching it and giving an error that it can only be on the trailing edge of a byte/word. I can't imagine left shift being very commonly used like right shifting, but perhaps an L could be done just in case.

Also, should there be a way to, instead of zero-ing it out, one-ing something in? (or some other value). For binary, it'd be easy to say Y = fill with 1, but in higher bases, that wouldn't work. If it'd be useful, perhaps x...Z(f) to replace the fourth nibble with 0b1111.

Re + inside of a byte/word, I can see how that might be useful, but visually throws me off some. E.g., does b11+_0_Z+ get interpreted as [b11+_0_Z]+ or b1[1+]_0_[Z+]? Logically, of course, the former wouldn't make sense since it'd allow for a split byte, but visually, it makes me think a few times what's intended.

I'll try to update the proposal here again soon. Thanks again for looking it over!

@skids

This comment has been minimized.

Copy link

@skids skids commented Apr 23, 2020

I'd throw the word "endianness" into the mix. If you're dealing with binary formats you'll find places where embedded values are not in the same endian format as their surrounding material. Worst case situations may exist where shifts and endianness interact so that can get a little maddening.

Other than that, it's a bit of a tangent, but maybe worth a read... there have been several speculations on how to merge grammars and objects such that one definition covers not just parsing a blob of data into a structure or object, but also going from object/structure back to blob.

(Whoops wrong link before: https://gist.github.com/skids/cfb129509b0849f50deb)

@alabamenhu

This comment has been minimized.

Copy link
Owner Author

@alabamenhu alabamenhu commented Apr 23, 2020

I'd throw the word "endianness" into the mix. If you're dealing with binary formats you'll find places where embedded values are not in the same endian format as their surrounding material. Worst case situations may exist where shifts and endianness interact so that can get a little maddening.

Indeed, that's one think I've been mulling over. Perhaps something like this:

grammar Endian is binary(16) { # endian('big') as default
  token TOP { <lil-exact> && <big-exact> && <lil-shift> && <big-shift> }
  token lil-exact is endian('little')            { 0x.... }
  token lil-shift is endian('little') is scoured { 0x..XX }
  token big-exact           { 0x.... }
  token big-shift is scoured { 0x..XX }
}
my $blob = blob[uint16].new: 0x00ff;
say Endian.parse: $blob;
# TOP: 00 FF
#   lil-exact: FF 00
#   lil-shift: 00 FF
#   big-exact: 00 FF
#   big-shift: 00 00
#

The .parse() takes a blob which doesn't have an Endianness attribute. In this case, the grammar will assume that it matches the grammar's endianness.

Edit: Perhaps also the trait could be applied only to the grammar, and then instead using an inline modification à la regex: token foo { :l 0x..ff } or token foo { :b 0xff.. }.

Another interesting question is if it should be possible to have a multi token for diffeent endiannesses. For instance:

   multi token neg-int(:little) { 0b...._...._1..._.... }
   multi token neg-int(:big)    { 0b1..._...._...._.... }

From a backend stand point, everything is possible. Question is just how to make it evident to the user (and might be cool to support the two middle endians too)

@Skarsnik

This comment has been minimized.

Copy link

@Skarsnik Skarsnik commented Apr 27, 2020

I will look into writing the full zelda3 grammar later with your proposal, you could also look at like the elf binary structure to have a more 'real' case.

@alabamenhu

This comment has been minimized.

Copy link
Owner Author

@alabamenhu alabamenhu commented Apr 27, 2020

I will look into writing the full zelda3 grammar later with your proposal, you could also look at like the elf binary structure to have a more 'real' case.

Ooh, the ELF is a great test case. As soon as I have the library working to handle assertions (<foo>/<.bar: 2> type stuff) and named captures, I'll try to implement that.

@gitjeff2

This comment has been minimized.

Copy link

@gitjeff2 gitjeff2 commented May 1, 2020

I poke around undocumented binaries as a hobby, mostly in the retrocomputing space. So, I think I have some useful thoughts and feedback. I have looked high and low for something that does what you're trying to do without much real luck, so this proposal is very much of interest to me. All that said, I'm very new to Raku and that's just going to show in some of my comments. I'm still wrapping my head around grammars, so please forgive my obvious ignorance on that front.

I was on the fence about giving feedback because "hobby level" understanding is not the same thing as professional understanding, but here goes...

@alabamenhu for the Inline Binex, you probably want to support bitwise AND, OR, XOR, and NOT along with arithmetic shifts. All of these could potentially be chained. I realize this can be done later once the struct is extracted and placed into a Raku-native data structure. But, I expect it to be especially convenient to do right in the binex in many cases. I think spotted a post where someone may have mentioned this already.

An example use-case might be extracting data with this grammar, placing it into a data structure. Then, reusing the bytes that were just extracted and performing a rolling xor to compute a simple checksum to verify the integrity of the data against some other struct in the file containing the expected value.

As for the split bytes situation, all I can say is there are a lot of sub-byte data structures out there. So, yes, it's a good idea to express bytes as b00000000 instead of not representing them to the programmer and making him fish for them. Two specific sub-byte data structure examples that come to mind as reasonably common are:

  1. Bitfields. Think along the lines of "the processor status register" but to declare certain things on or off within a struct. This might impact how a given chunk need to be parsed.
  2. Nibble-sized data structures. By far the most common example of this would be Binary Coded Decimal. Some financial software uses BCD internally to avoid precision issues with floats.

I really look forward to seeing your solution and hope the end result is elegant. Would the comment about the m/../ selector work for these cases?

Questions:

  1. If I had a series of 9 bit chunks, 8-bits with 1 parity bit to make for a simple example, could this grammar handle that? Obviously, I expect there would be a performance penalty for dealing with non-byte aligned structs.
  2. Can this handle either partly non-standard or completely non-standard character sets?

A modern example of where a non-standard character set might show up is desktop publishing files. Sometimes the Unicode Private Use Area is used to represent alternate glyphs of a specific character. An older example would be game cartridges that have completely bonkers character sets that don't conform to any convention.

Some other thoughts:

I realize you probably can't make this grammar bidirectional from the same definition. (I think you'd run into the Halting problem trying to do that — but don't quote me.) That said, some facility to go from Raku native structures directly to a binary struct via a grammar would be amazing. Even the well-known Kaitai Struct only extracts, it can't be used to write back to a binary blob. Is the trip from Raku native struct back to binary via grammar on your proverbial radar?

This is going to seem farther afield, but hear me out.... Is there an analogous grammar (in use or proposed) for handling digital signals? The reason I ask is lots of binary data isn't just baked into an EEPROM lots of binary data is flying over the air in some sort of digital encoding scheme. I think being extract meaningful data up the stack from encoded waveform to binary frame/packet/blob and then extract useful data all by chaining grammars would be mind-bending and extremely useful in a lot of applications.

Very cool project. I hope at least some of my comments were useful.

@alabamenhu

This comment has been minimized.

Copy link
Owner Author

@alabamenhu alabamenhu commented May 2, 2020

I was on the fence about giving feedback because "hobby level" understanding is not the same thing as professional understanding, but here goes...

@gitjeff2 Thank you for taking a look at it! I actually do more work with object pattern matching (the whole impetus for this project came from wanting regex-like matching on a language corpus), but I decided to work on Binex first because I felt like it would be more useful for people than Objex, so any insight from people that work or have worked with binary stuff is really appreciated. Sorry for the long answers coming but you had lots of interesting stuff/questions.

@alabamenhu for the Inline Binex, you probably want to support bitwise AND, OR, XOR, and NOT along with arithmetic shifts. All of these could potentially be chained.

This is definitely an interesting use case. What would you imagine the interface looking like? Right now, the way for doing that type of stuff would be (assuming a grammar that defines a token a and token b)

method foo ($/) {
   make $<a> ~& $<b>
}

Where ~& is Raku's blobby binary AND (this is how it works in regex, I haven't implemented actions just yet). But that does mean that it's not as cleanly available: you'd need to reference $<foo>.made instead of just $<foo>. That can be done without an actions object several different ways, by using a code block in the token, like:1

token foo { 
  $<a> $<b> { make $<a> ~& $<b> }
}

or by using a dynamic variable that would have been defined in the enclosing token:

token foo { 
  $<a> $<b> { $*and = $<a> ~& $<b> }
}

The latter is probably the better one, because the dynamic variable can still be accessed in the method call. Using make will thwart anyone writing an actions class. I wonder, though, if there could be a better interface (or honestly, if there should be one to avoid kitchen-sinking it). Perhaps in addition to make to push data to the AST, there could also be a scour to generate the scoured data. The big problem, of course, is that any modification of data in place takes a performance penalty because we have to allocate, calculate, and hold on to an extra blob. Dynamic variables are a bit more lightweight here.

An example use-case might be extracting data with this grammar, placing it into a data structure. Then, reusing the bytes that were just extracted and performing a rolling xor to compute a simple checksum to verify the integrity of the data against some other struct in the file containing the expected value.

There are actually much better ways to handle this type of a situation. You can define a dynamic var $*checksum and at each place where it needs to be recomputed, do something like:

token data-chunk { 
    <chunk>+ <.checksum: @<chunk>>
}

token checksum (@chunk { 
  <?{   #`[edit $*checksum here, and return true if it checks out, a false dies]  }>
}

If the checksum is only being done on the data-chunk, it could even be inlined directly in data-chunk. But if you can't check it until you're finished parsing, it might be better to delay doing the checksum until you have the full AST.

As for the split bytes situation, all I can say is there are a lot of sub-byte data structures out there. 1. Bitfields. Think along the lines of "the processor status register" but to declare certain things on or off within a struct. This might impact how a given chunk need to be parsed.

Changing how a chunk is parsed based on a bit actually can already be done easily (if you peak around in Raku's own grammar, a similar technique gets used a lot):

proto token integer { * }
token integer:positive { b0...... }
token integer:negative { b1...... }

That's one way. Here's another way that will separately capture the flag and the (absolute) value. It's wordier, but perhaps more self-documenting and better expands to other situations:

proto token integer { * }
token integer:positive { <plus>  && <abs-int> }
token integer:negative { <minus> && <abs-int> }

token plus  { b1....... }
token minus { b0....... }

token abs-int is scoured { bZ....... }
  1. Nibble-sized data structures. By far the most common example of this would be Binary Coded Decimal. Some financial software uses BCD internally to avoid precision issues with floats.

When aligned to bytes as these are, they could still be handled easily without needing to go to a subbyte level:

token bcd { x.. ** { $*length } }

Or alternatively, the following might make procesing a bit easier, but wouldn't be strictly necessary.

token bcd          { <double-digit> ** { $*length } }
token double-digit { <upper-digit> & <lower-digit> }
token  upper-digit { x.X }
token  lower-digit { xZ. }
  1. If I had a series of 9 bit chunks, 8-bits with 1 parity bit to make for a simple example, could this grammar handle that? Obviously, I expect there would be a performance penalty for dealing with non-byte aligned structs.
    This actually would effectively require not aligning on bytes. If I were to implement the idea several comments above about using an is maligned token, that might just handle it. As you recognize, it would be a pretty significant performance penalty. I think it needs to be implemented, I'm just not sure if the is maligned trait is the best way to go about doing it. It might be that if you know in advance that you're working with a regular, if non-standard, bit size, it would be substantially more performant to reread the 9-bit value into a 16-bit blob. More memory, but then all operations could be run as if 16bit. That could allow defining the grammar (or some section of it) to be unique bit and then handling the conversion at the start. Nothing to do to save performance though if the length can't be known in advance.
  1. Can this handle either partly non-standard or completely non-standard character sets?

Since it runs on just the binary data, it won't care. This is more of a concern for regular regex which handles private use without problem (it's still part of the Unicode standard, just defined as a glyph without defined properties). What I would do for this is after processing it (assuming it, like UTF-8 for instance, needs to be handled in some interesting way at the byte level), handle the encoding in an actions object. That way you could say FormatGrammar.parse($string-blob, :actions<FormatActions>).made where your TOP token makes a Unicode-encoded value.

I realize you probably can't make this grammar bidirectional from the same definition. That said, some facility to go from Raku native structures directly to a binary struct via a grammar would be amazing. Is the trip from Raku native struct back to binary via grammar on your proverbial radar?

There acutally has been some interesting work on this done by someone (I can't find the link) with regular regex, by doing some sort of Raku magickery. If in Raku we defined a standard format for serialization, than there's no reason we could at least parse it. I would definitely leave that to module-space, though. I believe someone already did similar for JSON, so there's a model to work from (I mean, you can already .raku an object to get Raku code that, in theory, should recreate the object if you EVAL it)

This is going to seem farther afield, but hear me out.... Is there an analogous grammar (in use or proposed) for handling digital signals? The reason I ask is lots of binary data isn't just baked into an EEPROM lots of binary data is flying over the air in some sort of digital encoding scheme. I think being extract meaningful data up the stack from encoded waveform to binary frame/packet/blob and then extract useful data all by chaining grammars would be mind-bending and extremely useful in a lot of applications.

I don't have any experience working with that, so I fear any speculation would be idle and not helpful. If the data you get comes in in chunks, you probably could use it on each chunk, and let the result of that dictate how future chunks are processed (by setting a flag, etc). @jnthn has been doing some stuff with audio processing lately, so maybe he could chime in.

Also, welcome to Raku! I think you'll find it a language that's legitimately fun to code in. One thing I'd definitely say is to check out doing some grammars and try writing accompanying actions classes for them (or make a separate new action class for an extant grammar). It sounds a bit like you might be doing what I did early on and try to put too much action into the grammar (I tried writing a brainfuck interpreter that would run inside of the grammar itself. Doesn't work beacuse tokens might fail but the side effects were still there. Making an AST in an actions class afterwards instead made everything work better and cleaner). If you have any questions, definitely check out #raku on freenode, it might take a bit before someone sees the question, but someone will always eventually get around to answering.


  1. in the current test implementation, you'd need double brackets, because in module space we can't access the exact pblock definition to know which } is the end of the block.
@gitjeff2

This comment has been minimized.

Copy link

@gitjeff2 gitjeff2 commented May 2, 2020

Not being fully up to speed on Raku, I can't comment on an alternative. Yor solution for the standard bitwise operators seems reasonable. The use of $<foo>.made seems good. If I read that right — correct me if I'm wrong — the original bits are preserved and the output of ~& lives at $<foo>.made. I think that's actually the way it should work.

I wonder, though, if there could be a better interface (or honestly, if there should be one to avoid kitchen-sinking it). "The perfect is the enemy of the good," I say start with the reasonable and don't fear eventual rewrites as feedback comes in.

Oh, your lightweight checksum comment makes sense.

There actually has been some interesting work on this [bidirectional parsing] done by someone (I can't find the link) with regular regex.

Oh, wow. I thought that ran smack dab into a limitation of the properties of computation. Don't drive yourself crazy, but if you do happen to find the link I'd really like to read it.

Also, welcome to Raku! I think you'll find it a language that's legitimately fun to code in.

Thanks for the warm welcome and for taking the time to respond in such detail. I was really afraid of the "I look forward to your patch [/sarcasm]" response. (RTFM not really applying here.) The Perl/Raku crowd isn't what I remember. I'm glad you found useful nuggets in my thoughts, questions, and other musings.

I hope @jnthn gets the chance to chime in. I'd like to hear what he says about Raku and DSP.

One thing I'd definitely say is to check out doing some grammars and try writing accompanying actions classes for them (or make a separate new action class for an extant grammar). It sounds a bit like you might be doing what I did early on and try to put too much action into the grammar (I tried writing a brainfuck interpreter that would run inside of the grammar itself.

You're not far wrong! I'm bouncing between brian d foy's Learning Perl 6 and Parsing with Perl 6 and Grammars. Things still haven't quite come into focus. Ideas are still disconnected and they haven't stratified at the proper conceptual layer. Any suggestions on additional material would be most welcome.

I have a strong Bash/Python grep/sed/awk/perl 5 one-liners background, so traditional regexes I know. But, grammars feel a bit like "black magic" the way regexes once did when I first tried to learn them. They're definitely a step up conceptually.

Making an AST in an actions class afterwards instead made everything work better and cleaner).

Interesting. That's immediately helpful.

On the "writing an interpreter" front, it seems we have similar interests. I have a small collection of pet projects going right now. I'm working on a script to help me clean up MOS 6502 dissasembly. Basically, I'm having to massage radare2 output so that it takes less work get 64tass to assemble it.

I've also thought about writing a Forth interpreter later down the line. Why? Because, all of the math is in RPN and, well, that's a special kind of brain-eff. So, it seems like an interesting challenge.

I'm nowhere near as old as some of this makes me sound. I just like preserving old stuff.

@b2gills

This comment has been minimized.

Copy link

@b2gills b2gills commented Apr 10, 2021

I really, really like the idea about X and Z, as it's a great way for isolating data, and elegantly automates what would otherwise be boilerplate +& and +>/ ops. How do you imagine that X should be interpretted in these situations where it's not purely right-aligned?

Ignore the implementation of X and Z for the moment.
The idea was for what you might want the premade .ast/.made value to be.

0b1111_1111 ~~ m:bin/ bXXXX_.... /
    # equiv to X/Z --> 0b0000_1111
    # left shift   --> 0b1111_0000
0b1111_1111 ~~ m:bin / b..XX_XX.. /
    # right shift --> 0b0000_1111
    # left shift  --> 0b1111_0000
    # equiv to Z  --> 0b1100_0011
0b1111_1111 ~~ m:bin / bXX.._..XX /
    # right shift --> 0b0000_1111
    # left shift  --> 0b1111_0000
    # equiv to Z  --> 0b0011_1100

I feel like creating complicated rules of when it'd be left or right shift is going to create more headache for users than it's worth, so I'd be fine with catching it and giving an error that it can only be on the trailing edge of a byte/word. I can't imagine left shift being very commonly used like right shifting, but perhaps an L could be done just in case.

Like I said, ignore the implementation of X and Z.
The idea was for the high-level semantics.
(Torture the implementers on behalf of the users.)

The idea was for X to remove itself from the result, and for Z to set itself to zero.

That is X does not mean shift. It just happens that shifting is usually the inevitable result of having used X.

I only ever considered X shifting rightwards (smaller number). If someone wanted them to shift leftwards (larger number) I figured they could do it themselves afterwards. (It wouldn't be a common operation.)
My reasoning being that it would most likely be used as a number for further matching. Like a count of following things to match.

As an example of what I was thinking is the following.
(Pretend that a/b/c etc match the same as .)

/aaaa_aaaX/
# 0aaaa_aaa

/aaaX_bbbb/
# 0aaabbbb

/aaXb_bXcc/
# 00aabbcc

/aaZZXXbb/  or /aaXXZZbb/ or /aaXZXZbb/ or /aaZXZXbb/ or /aaZXXZbb/ or /aaXZZXbb/
# 00aa00bb

Note that that last example I wrote in 6 different but equivalent ways.
The code to produce the .ast (00aa00bb) from them could/should be exactly the same for them.


Now to specifics.

For some number of Xs on the right, a simple shift works. (Pseudo code)

/...._.XXX/
$_ +> X.count

It works because the bits fall off the end.

For some combination of X and Z on the right

/...._ZZXX/
/...._XXZZ/
/...._XZXZ/
/...._ZXZX/
/...._ZXXZ/
/...._XZZX/

#00.._..00

Since those have identical results:

You can either do two shifts

($_ +> (Z.count + X.count)) <+ Z.count
# ($_ +> 4) +< 2

or a shift and a mask

($_ +> X.count) +&  +~((2 ** Z.count) - 1)
# ($_ +> 2) +& 0b1111_1100

or

($_ +& +~((2 ** (X.count + Z.count)) - 1) ) +> 2
# ($_ +& 0b1111_0000) +> 2

The only thing that someone using it is concerned is that the result is what they expect. So any of the above is fine.

Any other code that ends up with identical results would also be fine.

For X in the middle

/...X_X.../

You will have to separate the parts, and combine them again later.

# /aaaXXbbb/
my $aaa = $_ +& 0b1110_0000; # mask
my $bbb = $_ +& 0b0000_0111;

my $result = ($aaa +> 2) +| $bbb # combine

Of course there are more ways to implement something like that.

For X on the left

/XX.._..../

This would have identical results as having used Z. At least that was my thoughts.

I can see how you came to a different conclusion to what I meant. I should have been more clear.

It would have moved something on it's left into those two spots, but there is nothing there to move so just zero it.

$_ +& 0b0011_1111

# or

$_ +^ 0b1100_0000

Simplified handling

The simple way to create code which ends up with the correct result (using the semantics I came up with) is to group parts together.

/ZXaa_ZZXX_bbZZ_XXcc/
#ZX
#  aa
#     ZZXX
#          bb
#            ZZ_XX
#                 cc


#   aa      bb     cc
#   \\      \\     ||
#    \===\   \\    ||
#        \\   ||   ||
# 0000_00aa_00bb_00cc

You would mask and capture the aa / bb / cc, shift them, and combine them after.

my $aa = $input +& 0b0011_0000_0000_0000;
my $bb = $input +& 0b0000_0000_1100_0000;
my $cc = $input +& 0b0000_0000_0000_0011;

$aa +>= 12 - 8; # from 12 to 8
$bb +>= 6 - 4;
$cc +>= 0 - 0; # nop

my $result = $aa +| $bb +| $cc;

# from __aa ____ bb__ __cc
#        \\      \\     ||
#         \===\   \\    ||
#             \\   ||   ||
#   to ____ __aa __bb __cc

The following is somewhat of an (untested) pseudo code for compilation of the action code.

my $bin-regex = 'ZXaa_ZZXX_bbZZ_XXcc';
$bin-regex .= trans( '_' => '' );

my @parts = $bin-regex.comb( / <[ZX]>+ | <:Ll + [.]>+ / );
# ["ZX", "aa", "ZZXX", "bb", "ZZXX", "cc"]

my $from-pos = 0;
my $to-pos = 0;
my @actions;

for @parts.reverse {
  when .starts-with('Z'|'X') {
    $from-pos += .chars; # skip the values from the input
    $to-pos += .comb('Z'); # move leftwards the output only for Z actions
  }
  default {
    push @actions, BinCap.new( :$from-pos, :$to-pos, count => .chars )

    # setup for next loop
    $from-pos += .chars;
    $to-pos += .chars;
  }
}

# this is the code that gets installed into the actions for the bin regex
# in the future it could be generated using RAKU-AST
my $action-code = -> $input {
  my $result = 0;
  for @actions -> $action {
    $result +|= $action.mask-and-shift($input)
  }
  $result
}

class BinCap {
  has $.from-pos;
  has $.to-pos;
  has $.count;

  method input-mask () {
    (2 ** $!count - 1) +< $!from-pos
  }

  method mask-and-shift ( $input ){
    ($input +& $.input-mask) +> $!from-pos - $!to-pos
  }
}

Also, should there be a way to, instead of zero-ing it out, one-ing something in? (or some other value). For binary, it'd be easy to say Y = fill with 1, but in higher bases, that wouldn't work. If it'd be useful, perhaps x...Z(f) to replace the fourth nibble with 0b1111.

I'm sure there is a better letter for it. (Perhaps something Unicodey?)

It would actually be fairly easy.

# /...._YYYY/
my $result = $input +| 0b0000_1111;

Re + inside of a byte/word, I can see how that might be useful, but visually throws me off some. E.g., does b11+_0_Z+ get interpreted as [b11+_0_Z]+ or b1[1+]0[Z+]? Logically, of course, the former wouldn't make sense since it'd allow for a split byte, but visually, it makes me think a few times what's intended.

b1[1+]_0_[Z+] of course, to match Raku regexes. (I think there should be a way to write b[11]+_0_[Z+].)

  match    →   result
1111_110Z
1111_1101  →  1111_1100
1111_1100 

1111_10ZZ
1111_1011  →  1111_1000
1111_1010
1111_1001
1111_1000

…

110Z_ZZZZ
1101_1111  →  1100_0000
1101_1110
1101_1101
…
1101_0111
…
1101_1100
…
1100_1100
…
1100_0000

…

1111_110Z
1111_1101 →  1111_1100
1111_1100

These would not match (needs at least two 1 bits)

1011_1111
1000_0000
1010_0000

Neither would this (no zero bit)

1111_1111

This also wouldn't match ( nothing for Z+ to match

1111_1110

Basically b1[1+]_0_[Z+] would mean match

  1. some number of 1 bits (at least 2)
  2. a 0 bit
  3. ignore the rest of the bits (at least 1), but in the result set them to 0

This is a bit more advanced thought, and may need to wait for a v2


After thinking on it, someone may want to collapse to the left sometimes.
So I may have a rethink on how X works. Or if it should even be using the letter X for it.

The only reason I used X is because texts about binary tend to use X for don't-care.

I then realized that I might want to either collapse or set to zero. So I redesignated X to be collapse, and Z for zero.
I hadn't thought on it much more than that.
I hadn't even thought about higher bases at all.

The second rule of language design is Larry I'm allowed to change my mind. 😄

It took many years and iterations to get where Raku regexes are currently. It would be unlikely that a binary version would be as well designed in the first try.

(Let's see if it takes another full year for me to respond this time.)

@alabamenhu

This comment has been minimized.

Copy link
Owner Author

@alabamenhu alabamenhu commented Apr 11, 2021

(Let's see if it takes another full year for me to respond this time.)

Ha, no worries. I put this on the back burner until RakuAST comes out. I really appreciate the detailed examples — they will give me plenty of food for thought.

Thankfully, I'm happy to take time with this. I'd rather get it right than do it fast.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment