Skip to content

Instantly share code, notes, and snippets.

@smls
Created July 25, 2015 06:31
Show Gist options
  • Save smls/bc5d0fb42f199574e339 to your computer and use it in GitHub Desktop.
Save smls/bc5d0fb42f199574e339 to your computer and use it in GitHub Desktop.

Binary Parsing in Perl 6

When parsing things like binary file format headers, you generally need to know three things for each data field:

  • how many, and what kind of, bytes to gobble
    (e.g. input stream --> Buf:0x<0b 00>)
  • how to unpack those bytes into a number, or other low-level type
    (e.g. Buf:0x<0b 00> --> 11)
  • how to transform that number into meaningful data
    (e.g. 11 --> "v1.1")

Of course, those three things are not necessarily orthogonal - for example the unpack rule often already implies the number of bytes to gobble, and the boundary between the 'unpack' and 'transform' steps is fuzzy. Also, constraints become applicable during the 'gobble' step, but are often more conveniently expressed in terms of the post-unpack value.

The traditional Perl way to do such parsing, is using imperative while/if/else logic that gobbles one or more fields with read(), unpacks them with unpack(), and transforms them with custom Perl code. This approach is flexible, but often tedious.

Enter grammars. Perl 6 grammars support all the parsing logic you could hope for, but unfortunately they operate only on opaque Unicode string characters - not on bytes. The design documents speculate that operating on the byte level will at some point be supported, but jnthn++ clarified that this feature won't make it into the first stable release of Perl 6.

Nonetheless, I think it is a good time to start thinking about this now, in particular with these goals:

  1. Determine what the "optionally operate on the byte level" feature would actually entail, and if the current Perl 6 regex & grammar stuff is flexible enough to allow adding this feature in a future Perl 6.x release without major upheaval.

  2. Determine what kind of additional features and syntactic sugar would be needed beyond that bare-bones support, to minimize redundancy between the gobble + unpack + transform steps and to make grammar-based binary parsing pleasant enough for actual use.

  3. Determine how much of the above really needs Perl 6 core additions, and how much of it could be provided by modules. (Or at least, whether modules could provide alternative interim solutions.)

This post is meant to help start that discovery process.

Case study 1: Parsing an ICO header (strictly validating)

An ICO (Windows icon) file can contain multiple bitmap images which portray the same logical icon at different sizes / bit depths. The ICO file header has an entry for each contained image (specifying the meta-data needed to select and decode it); the actual bitmap data of the images starts after the file header.

Suppose that we want to read the header of such a file, and represent each of its image entries as an object of the following class:

class ICO::ImageEntry {
    has $.width;         # between 1 and 256
    has $.heigth;        # between 1 and 256
    has $.palette-size;  # between 1 and 255, or undefined
    has $.color-planes;  # between 1 and 4
    has $.bit-depth;     # 1, 4, 8, 16, 24, 32, or undefined
    has $.data-size;
    has $.data-offset;
}

We shall do this by parsing the binary file using a grammar:

my Buf $raw = slurp "someicon.ico", :bin;
my ICO::ImageEntry @entries = ICO::Header::Grammar.subparse($raw).ast;

But what would that ICO::Header::Grammar grammar look like?

With bare-bones binary parsing support

If the speculated :bytes modifier were implemented (details below), one could probably use that together with the built-in Buf.unpack method to write the grammar like this:

grammar ICO::Header::Grammar {
    token TOP {
        :bytes
        \x00 \x00 \x01 \x00  # signature
        <image-count>
        <image-entry> ** { $<image_count>.made }
        
        { make $<image-entry>».made; }s
    }
    
    token image-count { :bytes . . { make $/.Buf.unpack('S<') } }
    
    token image-entry {
        :bytes
        <image-width> <image-height>
        <palette-size> [ \x00 | \xff ] <color-planes> <bit-depth>
        <data-size> <data-offset>
        
        { make ICO::ImageEntry.new: |$/.hash.deepmap(*.made); }
    }
    
    token image-width  {
        :bytes .
        { make $/.Buf.unpack('C') || 256  }
    }
    token image-height {
        :bytes .
        { make $/.Buf.unpack('C') || 256  }
    }
    token palette-size {
        :bytes .
        { make $/.Buf.unpack('C') || Any  }
    }
    token color-planes {
        :bytes <[\x00..\x04]> \x00
        { make $/.Buf.unpack('S<') || 1 }
    }
    token bit-depth {
        :bytes [\x01 | \x04 | \x08 | \x10 | \x18 | \x20] \x00
        { make $/.Buf.unpack('S<') || Any }
    }
    token data-size {
        :bytes <!before \x00 \x00 \x00 \x00> . . . .
        { make $/.Buf.unpack('L<') }
    }
    token data-offset {
        :bytes . . . .
        { make $/.Buf.unpack('L<') }
    }
}

Note: I made it a strictly-validating grammar, to demonstrate how various value constraints can be expressed.

With convenience methods for binary parsing & unpacking

Imagine a Grammar::Binary role (possibly provided by a CPAN module) which you can mix into your grammar, to provide a little more convenience. I'll expand on what exactly it would do in a dedicated section below; for now, a demonstration using a rewritten version of the grammar from the previous section:

grammar ICO::Header::Grammar does Grammar::Binary {
    token TOP {
        \x00 \x00 \x01 \x00  # signature
        <image-count=.uint16l(1..*)>
        <image-entry> ** { $<image_count>.made }
        
        { make $<image-entry>».made; }
    }
    
    token image-entry {
        $<width>        = <.uint8(1..255, 0=>256)>
        $<height>       = <.uint8(1..255, 0=>256)>
        $<palette-size> = <.uint8(1..255, 0=>Nil)>
        [ \x00 | \xff ]
        $<color-planes> = <.uint16l(0=>1, 1..4)>
        $<bit-depth>    = <.uint16l(0=>Nil, 1, 4, 8, 16, 24, 32)>
        $<data-size>    = <.uint32l(1..*)>
        $<data-offset>  = <.uint32l>
        
        { make ICO::ImageEntry.new: |$/.hash.deepmap(*.made); }
    }
}

Definitely shorter, and I think also easier to read and maintain.

Case study 2: Parsing a ZIP local header (non-validating)

ZIP archives are structured quite differently from ICO icons. The central index of contained files is near the end of the stream, not at the beginning, and within the stream each contained file is directly prefixed by a "local file header" which could be represented by an object of the following class:

class ZIP::LocalHeader {
    has $.format-version;    # e.g. "1.0"
    has $.file-attr-type;    # e.g. ZIP::AttrType::MS-DOS
    has $.flags;             # e.g. [?«<0 0 0 1 0 0 1 0 0 0 0 1 1 0 0 0>]
    has $.compression;       # e.g. ZIP::Compr::Deflate
    has $.modification-date; # e.g. DateTime.new(...)
    has $.crc-32;            # e.g. Buf.new(...)
    has $.compressed-size;   # e.g. 283618
    has $.uncompressed-size; # e.g. 357323
    has $.filename;          # e.g. "report.pdf"
    has $.extra-header;      # e.g. Buf.new(...)
}

Unlike in ICO::ImageEntry, the fields aren't all numbers: They're a mix of numbers, strings, buffers, a date, an array, and enum types.

For the sake of simplicity, let's say we only want to parse the first embedded file's local header, which is found right at the start of the zip archive:

my Buf $raw = slurp "archive.zip", :bin;
my ZIP::LocalHeader $firstheader = ZIP::LocalHeader::Grammar.subparse($raw).ast;

With convenience methods for binary parsing & unpacking

Assuming the same Grammar::Binary role as before, the grammar could look like this:

enum ZIP::AttrType <
    MS-DOS       Amiga OpenVMS   UNIX       VM_CMS
    Atari_ST     OS_2  Macintosh Z-System   CP_M
    Windows_NTFS MVS   VSE       Acorn Risc VFAT
    alt_MVS      BeOS  Tandem    OS_400     OS_X
>;

enum ZIP::Compr (
    Uncompressed =>  0, Shrink    =>  1, Reduce1  =>  2, Reduce2  =>  3,
    Reduce3      =>  4, Reduce4   =>  5, Implode  =>  6, Tokenize =>  7,
    Deflate      =>  8, Deflate64 =>  9, ImplodeX => 10, Bzip2    => 12,
    Lzma         => 14, Terse     => 18, Lz77     => 19, Wavpack  => 97,
    Ppmd         => 98
);

grammar ZIP::LocalHeader::Grammar does Grammar::Binary {
    token TOP {
        PK \x03 \x04  # signature
        
        <format-version=.uint8(Any() => { "{Int $_/10}.{$_ % 10}" })>
        
        <file-attr-type=.uint8(Any() => ZIP::AttrType>
        
        <flags=.bitfield16l)>
        
        <compression=.uint16l(Any() => ZIP::Compr>
        
        <modification-date=.uint32l(Any() =>
            { DateTime.new(year   => vec($_, 25, 7),
                           month  => vec($_, 21, 4),
                           day    => vec($_, 16, 5),
                           hour   => vec($_, 11, 5),
                           minute => vec($_,  5, 6),
                           second => vec($_,  0, 5)) }
        )>
        
        <crc-32=.uint32l>
        
        <compressed-size=.uint32l>
        
        <uncompressed-size=.uint32l>
        
        <_filename-size=.uint16l>
        
        <_extra-header-size=.uint16l>
        
        <filename=.str(:bytes($<_filename-size>.made),
                        :enc($<flags>.made.[11] ?? 'utf8' !! 'cp437'))>
        
        <extra-header=.buf(:bytes($<_extra-header-size>.made))>
        
        
        { make ZIP::LocalHeader.new:
            |$/.hash.map({ next if .key ~~ /^_/; .key => .value.made }) }
    }
}


#| From the binary representation of a given number, return a "substring"
#| of bits as a new number, e.g. vec(150, 1, 4) returns 11:
#|     150 = 0b10010110
#|                ^^^^ = 0b1011 = 11

sub vec ($number, $rindex, $bits) {
    $number +> $rindex +& (2 ** $bits - 1)
}

Unlike the ICO grammar, this one is mostly about unpacking, and does not do any validating except for matching the signature at the start. Both kinds of use-cases are common, so it's good to have an example of each.

With a separate action class

Instead of make'ing all result values directly in the grammar, sometimes it's better to factor out any post-processing beyond the immediate unpacking step into a separate action class. (In fact some people prefer to always do that).

The question is, how to access the unpacked values in that case? One could do:

grammar ...::Grammar ... {
    ...
    token foo       { <uint16(...)> }
}

class ...::Actions {
    ...
    method foo ($/) { make some-transformation( $<uint16>.made ) }
}

But that's cumbersome, requires the action method to know which unpack type the grammar method used, and is inefficient due to the extra named capture that needs to be created and passed around.

It could be written more elegantly if there were a way for the foo token to specify that (as far as $/ is concerned) it wants to be <uint16(...)>, and not just contain that as a subrule. Off the top of my head, here are some possibilities for new syntax which could be allowed for this:

  • $/ = <.uint16(...)> -- like $<foo> = <.uint16(...)> in the parent token
  • <goto uint16(...)> -- akin to goto &sub in Perl 5
  • <=uint16(...)>
  • <as uint16(...)>

I think I like the fourth syntax best, so assuming that existed, the grammar and action class could be written like this (using the same enums and sub vec as above):

grammar ZIP::LocalHeader::Grammar does Grammar::Binary {
    token TOP {
        PK \x03 \x04  # signature
        
        <format-version>
        <file-attr-type>
        <flags>
        <compression>
        <modification-date>
        <crc-32>
        <compressed-size>
        <uncompressed-size>
        <_filename-size>
        <_extra-header-size>
        <filename( $<_filename-size>.made, $<flags>.made.[11] )>
        <extra-header( $<_extra-header-size>.made )>
    }
    
    token format-version     { <as uint8> }
    token file-attr-type     { <as uint8> }
    token flags              { <as bitfield16l> }
    token compression        { <as uint16l> }
    token modification-date  { <as uint32l> }
    token crc-32             { <as uint32l> }
    token compressed-size    { <as uint32l> }
    token uncompressed-size  { <as uint32l> }
    token _filename-size     { <as uint16l> }
    token _extra-header-size { <as uint16l> }
    token filename ($bytes, $flag11) {
        <as str(:$bytes, :enc($flag11 ?? 'utf8' !! 'cp437'))>
    }
    token extra-header ($bytes) {
        <as buf(:$bytes)>
    }
}

class ZIP::LocalHeader::Actions {
    method TOP ($/) {
        make ZIP::LocalHeader.new:
                |$/.hash.map({ next if .key ~~ /^_/; .key => .value.made })
    }
    
    method format-version ($_) {
        .make: "{ Int .made / 10 }.{ .made % 10 }"
    }
    
    method file-attr-type ($_) {
        .make: ZIP::AttrType[.made]
    }
    
    method compression ($_) {
        .make: ZIP::Compr[.made]
    }
    
    method modification-date ($/) {
        .make: DateTime.new(year   => vec(.made, 25, 7),
                            month  => vec(.made, 21, 4),
                            day    => vec(.made, 16, 5),
                            hour   => vec(.made, 11, 5),
                            minute => vec(.made,  5, 6),
                            second => vec(.made,  0, 5))
    }
}

Alternatively, maybe there needs to be an way to propose subrule aliases to proper tokens, so that e.g. the format-version subrule can be specified via a simple alias of the form <format-version=.uint8> inside token TOP, but still cause ZIP::LocalHeader::Actions.format-version to fire.

The :bytes modifier/pragma

S05 doesn't go into too much detail on the :bytes modifier that would allow regexes to operate on bytes instead of characters.

I assume that when it is activated for a regex or grammar, it will have the following consequences:

  • Grammar.parse and Regex.ACCEPTS need to be passed a Buf/Blob instead of a Str
  • . matches a single byte
  • character literals operate on bytes:
    • hex character literals match the bytes with the corresponding unsigned int values, i.e. \x00 matches the null byte
    • ASCII character literals match the bytes with the corresponding ASCII representation, i.e. z matches the same byte as \x7A
    • non-ASCII Unicode character literals throw an error when encountered (compile-time, if possible)
  • character classes operate on bytes too, e.g. <-[\x00]> matches any byte but the null byte
  • The resulting Match objects provide access to matched fragments via .Buf instead of via .Str

The hypothetical Grammar::Binary role

Letting a grammar compose this role would:

  • Enable the :bytes modifier for all the grammar's tokens (as a convenience).
  • Provide several grammar methods for parsing and unpacking (in one go) various common binary data types.

Which methods to include should be subject to proper research and deliberation; for now, here's a preliminary list of a bunch of methods that seem sensible:

# Cross-platform fixed size integers:

token int8    (**@rule)  #= 1 byte, as a signed integer
token int16l  (**@rule)  #= 2 bytes, as a signed integer (little-endian)
token int16b  (**@rule)  #= 2 bytes, as a signed integer (big-endian)
token int32l  (**@rule)  #= 4 bytes, as a signed integer (little-endian)
token int32b  (**@rule)  #= 4 bytes, as a signed integer (big-endian)

token uint8   (**@rule)  #= 1 byte, as an unsigned integer
token uint16l (**@rule)  #= 2 bytes, as an unsigned integer (little-endian)
token uint16b (**@rule)  #= 2 bytes, as an unsigned integer (big-endian)
token uint32l (**@rule)  #= 4 bytes, as an unsigned integer (little-endian)
token uint32b (**@rule)  #= 4 bytes, as an unsigned integer (big-endian)

# Stringy values:

token cstr (**@rule, :enc)                  #= null-terminated string
token str  (**@rule, :bytes!, :enc, :trim)  #= fixed-sized string
token buf  (**@rule, :bytes!)               #= fixed-sized buffer

The **@rule parameter can be used to specify the allowed values (for the purposes of matching) via one or more smart-matchable templates, e.g. <uint16l(10..50, 127, * %% 2)> would only match byte pairs whose little-endian unsigned integer interpretation is either between 10 and 50, or exactly 127, or an even number.

As a special case, when Pair values are passed to **@rule, only the key is used as matcher: the value is instead used to transform unpacked values which matched that particular matcher:

example behavior
<.uint8(1..255, 0=>256)> any byte is matched; it is returned as its unpacked value, except if it's a null byte in which case it is returned as number 256.
<.uint8(0=>256, Any)> same as previous
<.uint8(0..19 => MyEnumType)> only a byte in the range <[\x00..\x13]> is matched; its unpacked value is treated as an index to look up an enum member which to return.

The following transformation types would be allowed:

  • Numeric: use as a direct replacement value
  • Callable: pass unpacked value as parameter, and use return value
  • Associative: look up a member using the unpacked value as key
  • Positional: look up a member using the unpacked value as index
  • an enum type: look up a member using the unpacked value as index

Performance may be a concern here; instead of actually invoking smart-match on the unpacked values at runtime, maybe those methods actually want to be macros instead, so that they can transform simple cases (like Range matchers) into grammar code at compile time:

user writesmacro generates
<foo=.uint16l(1..*)>
$<foo> = ( <!before \x00 \x00> . .
           { make $/.Buf.unpack("S<") } )
<foo=.uint16l(0..300)>
$<foo> = ( [ . \x00 | <[\x00..\x2c]> \x01 ]
           { make $/.Buf.unpack("S<") } )

Of course, all of this is just idle speculation. Maybe an altogether different approach to providing convenience functionality for binary grammars, ends up being preferable.

Another crazy idea: A quote-like operator for regex-based unpacking

For maximum convenience in simple unpacking use-cases that don't need a whole grammar, a quote-like operator could be provided (Maybe called u//?), which would behave like m// except that it:

  • Automatically enables :bytes.
  • Automatically makes the Grammar::Binary methods available.
  • Collects any unpacked values into a positional list, and returns that instead of a Match object.

So for example this unpacking operation from the Perl 5 docs:

my @registers = unpack 'v2 (vXXCC)5 v5', $stackframe;

...could be written more readably (albeit also more verbosely) like this:

my @registers = $stackframe ~~ u/ <uint16l> ** 2
                                  [ <uint16l> && <uint8> <uint8> ] ** 5
                                  <uint16l> ** 5 /;
@skids
Copy link

skids commented Jul 26, 2015

I think a lot of these issues are shared with native data structures as well, and are not limited to parsing. The packed data structure support in the core spec + NativeCall is a good first step, but to do anything with a binary structure that varies in size like a TLV, especially with write access, currently will require a lot of application-side code, and has the potential to explode into boilerplate.

I'd like to see these issues worked out (even down to the level of having things like a uint16le native type perhaps, even if it can only apply to object/structure fields) and I think once they are worked out, a lot of the above requirements could be acheived simply by defining a class, and then when parsing parent data blobs, having a rule that simply says "and now parse an instance of this compact perl class."

That way the structural meta-information goes in the class rather than inline in the regex, and can be used in serialization and REPR access routines as well as deserialization.

@smls
Copy link
Author

smls commented Jul 26, 2015

@skids: That does sound like an interesting approach.

Having a declarative way to describe binary structures that could be used for both reading and writing, would definitely be cool.

However, I doubt it could replace the generic grammar-based binary parsing support completely, in particular with respect to the following:

  • How would the "transform" step be handled? I.e. what would turn the uint16 value 11 into the version string v1.1, or the uint32 value 904635223 into the datetime 2007-02-24, 14:33:06? I think it's nice to keep the information how to interpret a value, together with the information how to pack/unpack it.

  • Do you think a data type for TLV fields could handle slightly more unusual cases such as the ZIP header shown above, where the a variable-length field is not immediately preceded by its length:

    fieldsize
    filename_size2 bytes
    extra_header_size2 bytes
    filenamevariable
    extra_headervariable

    A Grammar can easily handle that, but a rigid data-structure based solution would have trouble including all possible cases of "non-standard" layouts and complexities.

  • What if, instead of merely unpacking a structure whose precise offset and layout in the input buffer is known, you actually need to search for it in the input buffer, or parse an incompletely known structure, or parse one of multiple different possible structures? In that case, you'd still want a regex/grammar right?

Also, the needs of NativeCall are somewhat different from the needs of parsing binary formats, e.g. data structures tend to be subject to memory-aligned in the former case but not the latter. Your proposed "compact structure" classes would have to be able to handle both cases I guess?

PS: One more thing: Do you envision those "compact structure" classes to be a copy of the data from the input stream, or actually references into it? Think of a case like wanting to modify some header fields of a binary file - would that involve parsing the header into a "struct" class, modifying it, and then manually splicing the struct back into the larger data buffer - or would it update automatically?

@skids
Copy link

skids commented Jul 26, 2015

I do think you'd still want some level of binary matching capabilities to deal with the interstices in raw packets, for example, which you just want to throw away, but things like the "transform" are the part I'm most inclined to put over on the object side. e.g. a packed .version or .date attribute would be of its native/packed structure, but have accessors at a standard place saying what to generate from it (and a trait or something to turn that behavior on.) That way, the transform to an object is lazy and not done unless used.

As to what to use for the throwaway portion of the data, we would need some rx syntax for the simple stuff but it might end up that a way to create a quick anonymous "packed class" inline in the rx would be preferable. But I'm not sure of that, it's hard to predict that far forward without some worked examples, and I'm inclined to think some parts lke what you've proposed would still be needed.

(One advantage to the object approach is that it's a little more introspect-friendly than an rx where all you get is a parse tree of the expressions.)

For the zip field question, I'd envision a way to pull together an object definition that was flexible enough to handle that, yes, and it is not an unusual case actually, more the norm. A lot of network protocols have a packed field containing bits saying "this structure is present" or even length fields mixed in with those bits., before the actual structures, but not necessarily right before them. A "TLV" data structure would not of course handle that because TLVs aren't that, it would be its own thing.

Just a quick spitball it would be something like:

class myZipHeader is REPR('PointerPacker') does UnalignedStruct {
has uint16 $.filename_size is endian;
has uint16 $.extra_header_size is endian;
has buf8 $.filename is size { self.filename_size }
...
method my_filename_as_str (--> Str) is aliasof<.filename> { ... }
}

...though that particular case is common enough it might have sugar but that would be the longhand. Things could go deep into MOP/slang for sugar where we start getting that "too much boilerplate" feeling.
Like rther than traits and roles the above might just be a "unalignedstruct {...}" instead of a "class {...}".

For searching for structures, what we'd do is teach the rx compiler to introspect the special "PointerPacker" classes when it encounters them in a rule, and automatically build a rule that will only match a valid instance of that object. But yes, as I said above, there may need to be a bit of extra glue made available to make skipping over cruft easy.

The data access scenarios are a complex subject on their own. We don't even have themm worked out for higher level stuff yet (e.g.https://gist.github.com/skids/aabd2aad3d0b5ad8481b.) You may be doing 0-copy and want things to explode if any attempt to change the size of the buffer is made, or you may want to support shrinking but not growing, or a copy-on-write with some attached procedure on how to commit the final serialization, or a cache-things-at-an-HLL-level fancy copy-on-write. A sensible system might be to control that by mixing in magic roles (like the UnalignedStruct above) or something. And then there is nesting on top of all that, and maybe the buffer itself has some alignment/endian rules. It can get hairy fast, which is why I'm less optimistic about doing all of it inline in rx.

@skids
Copy link

skids commented Jul 26, 2015

Oh one thing I forgot to mention is this is actually starting to look like it may become an essential/important feature if the "load network packets directly to general purpose CPU cache" crowd wins out over the "make network NICs really smart" crowd.

@avuserow
Copy link

avuserow commented Sep 4, 2016

@skids, @smls: I stumbled upon this gist when doing research for my Binary::Structured module. I'd like to point to it (https://github.com/avuserow/perl6-binary-structured), the Python module construct (https://github.com/construct/construct/ -- a big source of inspiration), and an answer on Stackoverflow that sent me in this direction (http://stackoverflow.com/a/26893801). They all go the method that skids described in the second reply above this one, and provide some answers on how to do this.

I'm a relative newcomer to parsing and emitting binaries, but some of my current projects are exposing me to several interesting binary file formats, so I'll continue to enhance Binary::Structured to support them as a good real world test. Especially the part about re-serializing the data after a change -- that direction is not something that grammars are good at.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment