smls/perl6_binary_parsing.md

## perl6_binary_parsing.md

      
    Raw
  

              perl6_binary_parsing.md
            
          
    Binary Parsing in Perl 6

When parsing things like binary file format headers, you generally need to know three
things for each data field:

how many, and what kind of, bytes to gobble

(e.g. input stream --> Buf:0x<0b 00>)
how to unpack those bytes into a number, or other low-level type

(e.g. Buf:0x<0b 00> --> 11)
how to transform that number into meaningful data

(e.g. 11 --> "v1.1")

Of course, those three things are not necessarily orthogonal - for example the
unpack rule often already implies the number of bytes to gobble, and the
boundary between the 'unpack' and 'transform' steps is fuzzy. Also, constraints
become applicable during the 'gobble' step, but are often more conveniently
expressed in terms of the post-unpack value.
The traditional Perl way to do such parsing, is using imperative
while/if/else logic that gobbles one or more fields with read(), unpacks
them with unpack(), and transforms them with custom Perl code. This approach
is flexible, but often tedious.
Enter grammars. Perl 6 grammars support all the parsing logic you could hope
for, but unfortunately they operate only on opaque Unicode string characters -
not on bytes. The design documents
speculate that operating on the
byte level will at some point be supported, but jnthn++ clarified that this
feature won't make it into the first stable release of Perl 6.
Nonetheless, I think it is a good time to start thinking about this now, in
particular with these goals:


Determine what the "optionally operate on the byte level" feature
would actually entail, and if the current Perl 6 regex & grammar stuff is
flexible enough to allow adding this feature in a future Perl 6.x release
without major upheaval.


Determine what kind of additional features and syntactic sugar would be
needed beyond that bare-bones support, to minimize redundancy between the
gobble + unpack + transform steps and to make grammar-based binary
parsing pleasant enough for actual use.


Determine how much of the above really needs Perl 6 core additions, and how
much of it could be provided by modules. (Or at least, whether modules could
provide alternative interim solutions.)


This post is meant to help start that discovery process.

Table of Contents

Case study 1: Parsing an ICO header (strictly validating)

With bare-bones binary parsing support
With convenience methods for binary parsing & unpacking


Case study 2: Parsing a ZIP local header (non-validating)

With convenience methods for binary parsing & unpacking

With a separate action class


The :bytes modifier/pragma
The hypothetical Grammar::Binary role
Another crazy idea: A quote-like operator for regex-based unpacking


Case study 1: Parsing an ICO header (strictly validating)

An ICO (Windows icon)
file can contain multiple bitmap images which portray the same logical icon at
different sizes / bit depths. The ICO file header has an entry for each
contained image (specifying the meta-data needed to select and decode it); the
actual bitmap data of the images starts after the file header.
Suppose that we want to read the header of such a file, and represent each of
its image entries as an object of the following class:
class ICO::ImageEntry {
    has $.width;         # between 1 and 256
    has $.heigth;        # between 1 and 256
    has $.palette-size;  # between 1 and 255, or undefined
    has $.color-planes;  # between 1 and 4
    has $.bit-depth;     # 1, 4, 8, 16, 24, 32, or undefined
    has $.data-size;
    has $.data-offset;
}
We shall do this by parsing the binary file using a grammar:
my Buf $raw = slurp "someicon.ico", :bin;
my ICO::ImageEntry @entries = ICO::Header::Grammar.subparse($raw).ast;
But what would that ICO::Header::Grammar grammar look like?
With bare-bones binary parsing support

If the speculated :bytes modifier
were implemented (details below), one could probably use that together with the
built-in Buf.unpack method to write the grammar like this:
grammar ICO::Header::Grammar {
    token TOP {
        :bytes
        \x00 \x00 \x01 \x00  # signature
        <image-count>
        <image-entry> ** { $<image_count>.made }
        
        { make $<image-entry>».made; }s
    }
    
    token image-count { :bytes . . { make $/.Buf.unpack('S<') } }
    
    token image-entry {
        :bytes
        <image-width> <image-height>
        <palette-size> [ \x00 | \xff ] <color-planes> <bit-depth>
        <data-size> <data-offset>
        
        { make ICO::ImageEntry.new: |$/.hash.deepmap(*.made); }
    }
    
    token image-width  {
        :bytes .
        { make $/.Buf.unpack('C') || 256  }
    }
    token image-height {
        :bytes .
        { make $/.Buf.unpack('C') || 256  }
    }
    token palette-size {
        :bytes .
        { make $/.Buf.unpack('C') || Any  }
    }
    token color-planes {
        :bytes <[\x00..\x04]> \x00
        { make $/.Buf.unpack('S<') || 1 }
    }
    token bit-depth {
        :bytes [\x01 | \x04 | \x08 | \x10 | \x18 | \x20] \x00
        { make $/.Buf.unpack('S<') || Any }
    }
    token data-size {
        :bytes <!before \x00 \x00 \x00 \x00> . . . .
        { make $/.Buf.unpack('L<') }
    }
    token data-offset {
        :bytes . . . .
        { make $/.Buf.unpack('L<') }
    }
}
Note: I made it a strictly-validating grammar, to demonstrate how various value
constraints can be expressed.
With convenience methods for binary parsing & unpacking

Imagine a Grammar::Binary role (possibly provided by a CPAN module) which you
can mix into your grammar, to provide a little more convenience. I'll expand on
what exactly it would do in a dedicated section below; for now, a demonstration
using a rewritten version of the grammar from the previous section:
grammar ICO::Header::Grammar does Grammar::Binary {
    token TOP {
        \x00 \x00 \x01 \x00  # signature
        <image-count=.uint16l(1..*)>
        <image-entry> ** { $<image_count>.made }
        
        { make $<image-entry>».made; }
    }
    
    token image-entry {
        $<width>        = <.uint8(1..255, 0=>256)>
        $<height>       = <.uint8(1..255, 0=>256)>
        $<palette-size> = <.uint8(1..255, 0=>Nil)>
        [ \x00 | \xff ]
        $<color-planes> = <.uint16l(0=>1, 1..4)>
        $<bit-depth>    = <.uint16l(0=>Nil, 1, 4, 8, 16, 24, 32)>
        $<data-size>    = <.uint32l(1..*)>
        $<data-offset>  = <.uint32l>
        
        { make ICO::ImageEntry.new: |$/.hash.deepmap(*.made); }
    }
}
Definitely shorter, and I think also easier to read and maintain.
Case study 2: Parsing a ZIP local header (non-validating)

ZIP archives are structured quite differently from ICO icons. The central
index of contained files is near the end of the stream, not at the beginning,
and within the stream each contained file is directly prefixed by a "local file header" which could be represented by an object of the following class:
class ZIP::LocalHeader {
    has $.format-version;    # e.g. "1.0"
    has $.file-attr-type;    # e.g. ZIP::AttrType::MS-DOS
    has $.flags;             # e.g. [?«<0 0 0 1 0 0 1 0 0 0 0 1 1 0 0 0>]
    has $.compression;       # e.g. ZIP::Compr::Deflate
    has $.modification-date; # e.g. DateTime.new(...)
    has $.crc-32;            # e.g. Buf.new(...)
    has $.compressed-size;   # e.g. 283618
    has $.uncompressed-size; # e.g. 357323
    has $.filename;          # e.g. "report.pdf"
    has $.extra-header;      # e.g. Buf.new(...)
}
Unlike in ICO::ImageEntry, the fields aren't all numbers: They're a mix of
numbers, strings, buffers, a date, an array, and enum types.
For the sake of simplicity, let's say we only want to parse the first
embedded file's local header, which is found right at the start of the zip
archive:
my Buf $raw = slurp "archive.zip", :bin;
my ZIP::LocalHeader $firstheader = ZIP::LocalHeader::Grammar.subparse($raw).ast;
With convenience methods for binary parsing & unpacking

Assuming the same Grammar::Binary role as before, the grammar could look like
this:
enum ZIP::AttrType <
    MS-DOS       Amiga OpenVMS   UNIX       VM_CMS
    Atari_ST     OS_2  Macintosh Z-System   CP_M
    Windows_NTFS MVS   VSE       Acorn Risc VFAT
    alt_MVS      BeOS  Tandem    OS_400     OS_X
>;

enum ZIP::Compr (
    Uncompressed =>  0, Shrink    =>  1, Reduce1  =>  2, Reduce2  =>  3,
    Reduce3      =>  4, Reduce4   =>  5, Implode  =>  6, Tokenize =>  7,
    Deflate      =>  8, Deflate64 =>  9, ImplodeX => 10, Bzip2    => 12,
    Lzma         => 14, Terse     => 18, Lz77     => 19, Wavpack  => 97,
    Ppmd         => 98
);

grammar ZIP::LocalHeader::Grammar does Grammar::Binary {
    token TOP {
        PK \x03 \x04  # signature
        
        <format-version=.uint8(Any() => { "{Int $_/10}.{$_ % 10}" })>
        
        <file-attr-type=.uint8(Any() => ZIP::AttrType>
        
        <flags=.bitfield16l)>
        
        <compression=.uint16l(Any() => ZIP::Compr>
        
        <modification-date=.uint32l(Any() =>
            { DateTime.new(year   => vec($_, 25, 7),
                           month  => vec($_, 21, 4),
                           day    => vec($_, 16, 5),
                           hour   => vec($_, 11, 5),
                           minute => vec($_,  5, 6),
                           second => vec($_,  0, 5)) }
        )>
        
        <crc-32=.uint32l>
        
        <compressed-size=.uint32l>
        
        <uncompressed-size=.uint32l>
        
        <_filename-size=.uint16l>
        
        <_extra-header-size=.uint16l>
        
        <filename=.str(:bytes($<_filename-size>.made),
                        :enc($<flags>.made.[11] ?? 'utf8' !! 'cp437'))>
        
        <extra-header=.buf(:bytes($<_extra-header-size>.made))>
        
        
        { make ZIP::LocalHeader.new:
            |$/.hash.map({ next if .key ~~ /^_/; .key => .value.made }) }
    }
}


#| From the binary representation of a given number, return a "substring"
#| of bits as a new number, e.g. vec(150, 1, 4) returns 11:
#|     150 = 0b10010110
#|                ^^^^ = 0b1011 = 11

sub vec ($number, $rindex, $bits) {
    $number +> $rindex +& (2 ** $bits - 1)
}
Unlike the ICO grammar, this one is mostly about unpacking, and does not do any
validating except for matching the signature at the start. Both kinds of
use-cases are common, so it's good to have an example of each.
With a separate action class

Instead of make'ing all result values directly in the grammar, sometimes
it's better to factor out any post-processing beyond the immediate unpacking
step into a separate action class. (In fact some people prefer to always do
that).
The question is, how to access the unpacked values in that case? One could do:
grammar ...::Grammar ... {
    ...
    token foo       { <uint16(...)> }
}

class ...::Actions {
    ...
    method foo ($/) { make some-transformation( $<uint16>.made ) }
}
But that's cumbersome, requires the action method to know which unpack type the
grammar method used, and is inefficient due to the extra named capture that
needs to be created and passed around.
It could be written more elegantly if there were a way for the foo token to
specify that (as far as $/ is concerned) it wants to be <uint16(...)>, and
not just contain that as a subrule. Off the top of my head, here are some
possibilities for new syntax which could be allowed for this:

$/ = <.uint16(...)> -- like $<foo> = <.uint16(...)> in the parent token
<goto uint16(...)> -- akin to goto &sub in Perl 5
<=uint16(...)>
<as uint16(...)>

I think I like the fourth syntax best, so assuming that existed, the grammar and
action class could be written like this (using the same enums and sub vec as
above):
grammar ZIP::LocalHeader::Grammar does Grammar::Binary {
    token TOP {
        PK \x03 \x04  # signature
        
        <format-version>
        <file-attr-type>
        <flags>
        <compression>
        <modification-date>
        <crc-32>
        <compressed-size>
        <uncompressed-size>
        <_filename-size>
        <_extra-header-size>
        <filename( $<_filename-size>.made, $<flags>.made.[11] )>
        <extra-header( $<_extra-header-size>.made )>
    }
    
    token format-version     { <as uint8> }
    token file-attr-type     { <as uint8> }
    token flags              { <as bitfield16l> }
    token compression        { <as uint16l> }
    token modification-date  { <as uint32l> }
    token crc-32             { <as uint32l> }
    token compressed-size    { <as uint32l> }
    token uncompressed-size  { <as uint32l> }
    token _filename-size     { <as uint16l> }
    token _extra-header-size { <as uint16l> }
    token filename ($bytes, $flag11) {
        <as str(:$bytes, :enc($flag11 ?? 'utf8' !! 'cp437'))>
    }
    token extra-header ($bytes) {
        <as buf(:$bytes)>
    }
}

class ZIP::LocalHeader::Actions {
    method TOP ($/) {
        make ZIP::LocalHeader.new:
                |$/.hash.map({ next if .key ~~ /^_/; .key => .value.made })
    }
    
    method format-version ($_) {
        .make: "{ Int .made / 10 }.{ .made % 10 }"
    }
    
    method file-attr-type ($_) {
        .make: ZIP::AttrType[.made]
    }
    
    method compression ($_) {
        .make: ZIP::Compr[.made]
    }
    
    method modification-date ($/) {
        .make: DateTime.new(year   => vec(.made, 25, 7),
                            month  => vec(.made, 21, 4),
                            day    => vec(.made, 16, 5),
                            hour   => vec(.made, 11, 5),
                            minute => vec(.made,  5, 6),
                            second => vec(.made,  0, 5))
    }
}
Alternatively, maybe there needs to be an way to propose subrule aliases to
proper tokens, so that e.g. the format-version subrule can be specified via a
simple alias of the form <format-version=.uint8> inside token TOP, but still
cause ZIP::LocalHeader::Actions.format-version to fire.
The :bytes modifier/pragma

S05 doesn't go into too much detail
on the :bytes modifier that would allow regexes to operate on bytes instead of
characters.
I assume that when it is activated for a regex or grammar, it will have the
following consequences:

Grammar.parse and Regex.ACCEPTS need to be passed a Buf/Blob instead
of a Str
. matches a single byte
character literals operate on bytes:

hex character literals match the bytes with the corresponding unsigned int
values, i.e. \x00 matches the null byte
ASCII character literals match the bytes with the corresponding ASCII
representation, i.e. z matches the same byte as \x7A
non-ASCII Unicode character literals throw an error when encountered
(compile-time, if possible)


character classes operate on bytes too, e.g. <-[\x00]> matches any byte but
the null byte
The resulting Match objects provide access to matched fragments via .Buf
instead of via .Str

The hypothetical Grammar::Binary role

Letting a grammar compose this role would:

Enable the :bytes modifier for all the grammar's tokens (as a convenience).
Provide several grammar methods for parsing and unpacking (in one go) various
common binary data types.

Which methods to include should be subject to proper research and deliberation;
for now, here's a preliminary list of a bunch of methods that seem sensible:
# Cross-platform fixed size integers:

token int8    (**@rule)  #= 1 byte, as a signed integer
token int16l  (**@rule)  #= 2 bytes, as a signed integer (little-endian)
token int16b  (**@rule)  #= 2 bytes, as a signed integer (big-endian)
token int32l  (**@rule)  #= 4 bytes, as a signed integer (little-endian)
token int32b  (**@rule)  #= 4 bytes, as a signed integer (big-endian)

token uint8   (**@rule)  #= 1 byte, as an unsigned integer
token uint16l (**@rule)  #= 2 bytes, as an unsigned integer (little-endian)
token uint16b (**@rule)  #= 2 bytes, as an unsigned integer (big-endian)
token uint32l (**@rule)  #= 4 bytes, as an unsigned integer (little-endian)
token uint32b (**@rule)  #= 4 bytes, as an unsigned integer (big-endian)

# Stringy values:

token cstr (**@rule, :enc)                  #= null-terminated string
token str  (**@rule, :bytes!, :enc, :trim)  #= fixed-sized string
token buf  (**@rule, :bytes!)               #= fixed-sized buffer
The **@rule parameter can be used to specify the allowed values (for the
purposes of matching) via one or more smart-matchable templates, e.g.
<uint16l(10..50, 127, * %% 2)> would only match byte pairs whose
little-endian unsigned integer interpretation is either between 10 and 50,
or exactly 127, or an even number.
As a special case, when Pair values are passed to **@rule, only the key
is used as matcher: the value is instead used to transform unpacked values which
matched that particular matcher:


example
behavior


<.uint8(1..255, 0=>256)>
any byte is matched; it is returned as its unpacked value, except if it's a null byte in which case it is returned as number 256.


<.uint8(0=>256, Any)>
same as previous


<.uint8(0..19 => MyEnumType)>
only a byte in the range <[\x00..\x13]> is matched; its unpacked value is treated as an index to look up an enum member which to return.


The following transformation types would be allowed:

Numeric: use as a direct replacement value
Callable: pass unpacked value as parameter, and use return value
Associative: look up a member using the unpacked value as key
Positional: look up a member using the unpacked value as index
an enum type: look up a member using the unpacked value as index

Performance may be a concern here; instead of actually invoking smart-match on the
unpacked values at runtime, maybe those methods actually want to be macros
instead, so that they can transform simple cases (like Range matchers) into
grammar code at compile time:
user writes macro generates
<foo=.uint16l(1..*)>
$<foo> = ( <!before \x00 \x00> . .
           { make $/.Buf.unpack("S<") } )
<foo=.uint16l(0..300)>
$<foo> = ( [ . \x00 | <[\x00..\x2c]> \x01 ]
           { make $/.Buf.unpack("S<") } )

Of course, all of this is just idle speculation. Maybe an altogether different approach to
providing convenience functionality for binary grammars, ends up being preferable.
Another crazy idea: A quote-like operator for regex-based unpacking

For maximum convenience in simple unpacking use-cases that don't need a whole
grammar, a quote-like operator could be provided (Maybe called
u//?), which would behave like m// except that it:

Automatically enables :bytes.
Automatically makes the Grammar::Binary methods available.
Collects any unpacked values into a positional list, and returns that instead
of a Match object.

So for example this unpacking operation from the
Perl 5 docs:
my @registers = unpack 'v2 (vXXCC)5 v5', $stackframe;

...could be written more readably (albeit also more verbosely) like this:
my @registers = $stackframe ~~ u/ <uint16l> ** 2
                                  [ <uint16l> && <uint8> <uint8> ] ** 5
                                  <uint16l> ** 5 /;
example	behavior
`<.uint8(1..255, 0=>256)>`	any byte is matched; it is returned as its unpacked value, except if it's a null byte in which case it is returned as number 256.
`<.uint8(0=>256, Any)>`	same as previous
`<.uint8(0..19 => MyEnumType)>`	only a byte in the range <[\x00..\x13]> is matched; its unpacked value is treated as an index to look up an enum member which to return.
user writes	macro generates
`<foo=.uint16l(1..*)>`	`$<foo> = ( <!before \x00 \x00> . . { make $/.Buf.unpack("S<") } )`
`<foo=.uint16l(0..300)>`	$<foo> = ( [ . \x00 \| <[\x00..\x2c]> \x01 ] { make $/.Buf.unpack("S<") } )
field	size
filename_size	2 bytes
extra_header_size	2 bytes
filename	variable
extra_header	variable