Skip to content

Instantly share code, notes, and snippets.

@ShimmerFairy
Last active October 2, 2015 09:26
Show Gist options
  • Star 0 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
  • Save ShimmerFairy/72a25f87939aab077158 to your computer and use it in GitHub Desktop.
Save ShimmerFairy/72a25f87939aab077158 to your computer and use it in GitHub Desktop.
My thoughts on possible binary grammars in P6 (incomplete)
=config C<> :allow<R>
=begin pod
=head1 Basics
Binary grammars, henceforth referred to as "formats" (taken from e.g. "file
format"), are in my mind sketched as follows:
=item Formats work on the byte level (which for sanity's sake we'll consider
equivalent to octets). Bitfields are handled by a special syntax.
=item Formats are formatted similarly to grammars, but with a smaller feature
set; not because we don't want to support binary parsing as much, but
because binary data needs less (and different) tools than parsing text.
=item The main goals for formats are, in my mind:
=item2 Pulling information at a specific position (as opposed to grammars
pulling information after a specific set of preconditions)
=item2 Getting data based on size constraints rather than contents. That is,
you're more often concerned with getting data of a certain size than
getting data matching a particular pattern.
=item2 Choices are often based on previously-collected information. In other
words, alternatives are usually known to appear at a certain location.
=item2 Strings of text should be matchable to regexes, but formats shouldn't
inherit all of the grammars just to do that.
=begin item
Everything in formats, as I sketch them, are based on written order. For
multi-byte numbers, this means big-endian is assumed (though we of course let
you handle source endianness easily). Bitfields are written with the most
significant bits first, and so on.
However, since 'big-endian' has a more narrow meaning than we want here, and
because there are numerous differences between endian schemes, we'll instead
refer to "written order" in this document. For example, given the 32-bit number
C<0xFACE_F00D>:
=begin table
Endianness Written Order
============ ================
Big :16{FA CE F0 0D}
Little on 8 :16{0D F0 CE FA}
Little on 16 :16{F0 0D FA CE}
"PDP-Endian" :16{CE FA 0D F0}
=end table
In each of these examples, the "Written Order" column shows how the bytes
would have to be written out literally to match C<0xFACE_F00D> in the given
endianness.
Simply put, write your formats in the mindset of reading increasing positions in
the data.
=end item
=head1 Literal Values
=head2 Numbers
The most basic thing is matching byte literals:
42 0x21 0o12 0b1110111
Literals are assumed to be of the smallest number of bytes that can contain the
number, and that number of bytes is matched. E.g.
255
would be a 1-byte number (0xFF), while
256
would be a 2-byte number (0x100, aka 0x01_00)
Remember that formats go by written order, so these two lines would match the
same thing:
f/0xFACE/
f/0xFA 0xCE/
(see {{{below}}} on what C<f//> means)
However, there is the C«<le>» special rule that helps ease writing little-endian
sequences, see {{{below}}}.
=head2 Strings
Literal text can be matched with string literals, specified as would any Q-lang
string:
"abc" Q:s[def $ghi]
These strings are taken as NFG strings unless an appropriate adverb is given
(C<Q:nfc> and the like is the main reason for allowing any standard Q string
formats). These NFG strings are then taken down to NFC level, and then converted
to utf-8 bytes.
To change how text is encoded, use one of the C<:utf16be>, C<:utf16le>,
C<:utf32be>, or C<:utf32le> adverbs in the format:
[Conjecture: make the 16- and 32-bit adverbs be C«:utf16<be>» instead?]
f/"☃☄"/ # matches 0xE2_98_83 0xE2_98_84 (6 bytes)
f/:utf16be "☃☄"/ # matches 0x2603 0x2604 (4 bytes)
It's unclear how unquoted text will work out currently. Possibly NFG-only, with
the given encoding adverbs affecting it as well.
=head1 Generic Data Matching
These rules allow for matching more general patterns than simply literals. Use
of C«<rule>» syntax like regexes is debatable. [Conjecture: we could just have
literal identifiers, since those are up in the air (see the section on literal
strings just above).]
=head2 Unsigned integers
The only primitive data type, in the opinion of formats, are unsigned integers,
in other words a certain number of bytes. Their names are programmatic, of the
format:
=for code :allow<R>
<uR<bits>>
Where C<R<bits>> is, predictably, the number of bits in the integer you're
taking. Implementations should reject and die on bit values that aren't a whole
number of bytes. L<#Bitfields> are the proper way to handle strange bit widths,
such as C<u9> on an octet-based system.
=head2 Signed integers
Signed integers have a similar programmatic name, though there is a method of
specifying the encoding method:
=for code :allow<R>
<iR<bits>>
<iR<enc>R<bits>>
R<enc> can be any of C<2c>, C<1c>, C<sm>, or C<R<num>k>; for two's complement,
one's complement, sign-magnitude, and excess-I<k>, respectively. Note that for
excess-I<k>, you'll need to provide an additional number that is the biasing
value used.
<i2c8> # 8-bit signed int (two's-complement)
<i1c16> # 16-bit signed int (one's-complement)
<ism4> # 4-bit signed int (sign-magnitude)
<i128k8> # 8-bit signed int (excess-128)
Note that the default is two's-complement, so these two lines are equivalent:
f/<i2c8>/
f/<i8>/
=head2 Text strings
There are three different kinds of strings that formats provide ways to match:
=defn C«<nullstr R<char-size>>»
Matches a standard C-style null-terminated string. In other words, will keep
taking R<char-size>-bit numbers until a null number is reached. R<char-size> is
an optional bitwidth, by default C<8> (aka C<u8>, aka C<char> to most systems' C
compilers).
=defn C«<sizedstr R<int-type> R<char-size>>»
Matches a string that begins with a size parameter, indicating the number of
characters in the string. The required R<int-type> parameter is one of the
previously shown signed or unsigned integer names. (A negative length found for
the string will cause an error however, so sticking to the unsigned names is
recommended.) The optional R<char-size> parameter specifies the width of a
character, by default C<8>.
=defn C«<fixedstr R<chars> R<char-size>>» Matches a string that's R<chars>
characters long. Meant for strings with a fixed length (hence the name), found
in e.g. file headers. R<char-size> is optional, and specifies the bitwidth of a
character. It's C<8> by default.
For all these rules, implementations should reject non-byte-sized bitwidths
given as the R<char-size>. Each of these will return a C<Blob> in their C<.ast>s
(see {{{below}}}). It's up to you to decode these strings appropriately.
=head2 Named capture
The primary way to collect data in a format should be through named
captures. They look like their regex counterparts:
$<version>=[<u8>]
$<version>=(<u8>)
<version=u8>
With the same behaviors as each of those variations.
=head2 Capture Series
When you need to capture a series of values of the same datatype, use the
"capture series" construct:
=for code :allow<R>
<,R<int-type> R<name1> R<name2> ... R<nameN>>
This helps clarify a section of binary data that's just a series of values in
one type. For example:
<,u8 v-major v-minor v-revision>
This will capture three C<u8> values in written order, one each for the given
names. This is equivalent to
<v-major=u8>
<v-minor=u8>
<v-revision=u8>
but shorter, and easier to update.
=head2 Bitfields
To match a collection of flags, or other kinds of data organized at the bit
level, use the bitfield construct:
=for code :allow<R>
<:R<bit-size> R<flag-name1> R<flag-name2> ... R<flag-nameN>>
This version of the construct names a collection of bit flags, in written
order. R<bit-size> is the total size of the bitfield. Non-whole-byte sizes of
R<bit-size> should be rejected by the compiler.
Bits that you want to skip can have a C<*> specified for them. To skip the rest
of the bytes, use a C<**>.
As an example, for the number C<0x90> (binary C<0x1001_0000>), the bit field
would look like such:
<:8 high_flag * * flag2 **>
There is a second form of the bitfield construct, shown as such:
=for code :allow<R>
<:R<bit-size>,R<int-type> R<name1> R<name2> ,R<int-type2> R<name3> ... R<nameN>>
This version allows grabbing bit-packed values. These examples will extract the
parts of various IEEE floating-point formats:
<:32,u1 sign ,i127k8 exponent ,u23 fraction> # binary32
<:128,u1 sign ,i16383k15 exponent ,u112 fraction> # binary128
<:64,u1 sign ,u5 combined ,u8 expcont ,u50 coeffcont> # decimal64
Putting C<Whatever>s in this form will skip the number of bits of the current
type. Note that the first form essentially is essentially the second form, but
with an implied C<,u1> after the R<bit-size>. Note that the first and second
forms are separate, meaning you can't specify a data type in the middle of a
first form:
<:32 flag1 flag2 ,u30 number> # WRONG, will fail
<:32,u1 flag1 flag2 ,u30 number> # OK
In both forms, leaving out the necessary parameters, C<Whatever>s, or
C<HyperWhatever> to use all R<bit-size> bits will result in the compiler
assuming an implicit C<**> at the end of the list, with a warning. Specifying
too many items will cause a fatal error.
<:16,u4 high_half low_half> # assumed ** after low_half, and warning
<:8,u4 high_half low_half * *> # dies, too many parameters
<:16,u4 high_half low_half **> # OK, and no warnings
=head1 Other constructs
=head2 C«<le>»
The C<le> construct helps with matching literal numbers that's not in written
order. The form is
=for code :allow<R>
<le R<bits-per-unit> R<number literals>>
This takes the string of bytes formed by the R<number literals> in written
order, and then reverse on R<bits-per-unit>-sized pieces. Implementations may
reject non-whole-byte bit sizes, or alternately allow them when enough whole
bytes are specified to specify a whole number of R<bits-per-unit>-sized pieces.
For example:
f/<le 8 0xFACEF00D>/ is eqv. to f/0x0DF0CEFA/
f/<le 16 0xFACEF00D>/ is eqv. to f/0xF00DFACE/
=head2 General subrule collection
Plenty of binary data formats have a set list of offsets in the file, which
dictate collecting information from the rest of the data. This construct is
meant to ease the specification of that setup.
<&& address1 => subrule1, address2 => subrule2 ..., addressN => subruleN>
This specifies a list of subrules, and where they can be found in the data. For
example:
=begin code
<file-header>
<&& $<file-header><dictoffset> => <dictionary>
$<file-header><mdoffset> => <metadata>
$<file-header><fdoffset> => <filedata> >
=end code
The rule needs all the given subrules to successfully match in order to
succeed. This construct allows the format to check these subrules in whatever
order is most appropriate, since the responsibility of going everywhere lays
with the construct.
=head2 Data offset examine/control
=end pod
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment