ShimmerFairy/bingram.pod6

## bingram.pod6
=config C<> :allow<R>

=begin pod

=head1 Basics

Binary grammars, henceforth referred to as "formats" (taken from e.g. "file
format"), are in my mind sketched as follows:

=item Formats work on the byte level (which for sanity's sake we'll consider
      equivalent to octets). Bitfields are handled by a special syntax.

=item Formats are formatted similarly to grammars, but with a smaller feature
      set; not because we don't want to support binary parsing as much, but
      because binary data needs less (and different) tools than parsing text.

=item The main goals for formats are, in my mind:

  =item2 Pulling information at a specific position (as opposed to grammars
         pulling information after a specific set of preconditions)

  =item2 Getting data based on size constraints rather than contents. That is,
         you're more often concerned with getting data of a certain size than
         getting data matching a particular pattern.

  =item2 Choices are often based on previously-collected information. In other
         words, alternatives are usually known to appear at a certain location.

  =item2 Strings of text should be matchable to regexes, but formats shouldn't
         inherit all of the grammars just to do that.

=begin item

Everything in formats, as I sketch them, are based on written order. For
multi-byte numbers, this means big-endian is assumed (though we of course let
you handle source endianness easily). Bitfields are written with the most
significant bits first, and so on.

However, since 'big-endian' has a more narrow meaning than we want here, and
because there are numerous differences between endian schemes, we'll instead
refer to "written order" in this document. For example, given the 32-bit number
C<0xFACE_F00D>:

    =begin table
      Endianness  Written Order
    ============  ================
             Big  :16{FA CE F0 0D}
     Little on 8  :16{0D F0 CE FA}
    Little on 16  :16{F0 0D FA CE}
    "PDP-Endian"  :16{CE FA 0D F0}
    =end table

In each of these examples, the "Written Order" column shows how the bytes
would have to be written out literally to match C<0xFACE_F00D> in the given
endianness.

Simply put, write your formats in the mindset of reading increasing positions in
the data.

=end item

=head1 Literal Values

=head2 Numbers

The most basic thing is matching byte literals:

    42  0x21 0o12 0b1110111

Literals are assumed to be of the smallest number of bytes that can contain the
number, and that number of bytes is matched. E.g.

    255

would be a 1-byte number (0xFF), while

    256

would be a 2-byte number (0x100, aka 0x01_00)

Remember that formats go by written order, so these two lines would match the
same thing:

    f/0xFACE/
    f/0xFA 0xCE/

(see {{{below}}} on what C<f//> means)

However, there is the C«<le>» special rule that helps ease writing little-endian
sequences, see {{{below}}}.

=head2 Strings

Literal text can be matched with string literals, specified as would any Q-lang
string:

    "abc" Q:s[def $ghi]

These strings are taken as NFG strings unless an appropriate adverb is given
(C<Q:nfc> and the like is the main reason for allowing any standard Q string
formats). These NFG strings are then taken down to NFC level, and then converted
to utf-8 bytes.

To change how text is encoded, use one of the C<:utf16be>, C<:utf16le>,
C<:utf32be>, or C<:utf32le> adverbs in the format:

[Conjecture: make the 16- and 32-bit adverbs be C«:utf16<be>» instead?]

    f/"☃☄"/           # matches 0xE2_98_83 0xE2_98_84 (6 bytes)
    f/:utf16be "☃☄"/  # matches 0x2603 0x2604 (4 bytes)

It's unclear how unquoted text will work out currently. Possibly NFG-only, with
the given encoding adverbs affecting it as well.

=head1 Generic Data Matching

These rules allow for matching more general patterns than simply literals. Use
of C«<rule>» syntax like regexes is debatable. [Conjecture: we could just have
literal identifiers, since those are up in the air (see the section on literal
strings just above).]

=head2 Unsigned integers

The only primitive data type, in the opinion of formats, are unsigned integers,
in other words a certain number of bytes. Their names are programmatic, of the
format:

    =for code :allow<R>
    <uR<bits>>

Where C<R<bits>> is, predictably, the number of bits in the integer you're
taking. Implementations should reject and die on bit values that aren't a whole
number of bytes. L<#Bitfields> are the proper way to handle strange bit widths,
such as C<u9> on an octet-based system.

=head2 Signed integers

Signed integers have a similar programmatic name, though there is a method of
specifying the encoding method:

    =for code :allow<R>
    <iR<bits>>
    <iR<enc>R<bits>>

R<enc> can be any of C<2c>, C<1c>, C<sm>, or C<R<num>k>; for two's complement,
one's complement, sign-magnitude, and excess-I<k>, respectively. Note that for
excess-I<k>, you'll need to provide an additional number that is the biasing
value used.

    <i2c8>    # 8-bit signed int (two's-complement)
    <i1c16>   # 16-bit signed int (one's-complement)
    <ism4>    # 4-bit signed int (sign-magnitude)
    <i128k8>  # 8-bit signed int (excess-128)

Note that the default is two's-complement, so these two lines are equivalent:

    f/<i2c8>/
    f/<i8>/

=head2 Text strings

There are three different kinds of strings that formats provide ways to match:

=defn C«<nullstr R<char-size>>»
Matches a standard C-style null-terminated string. In other words, will keep
taking R<char-size>-bit numbers until a null number is reached. R<char-size> is
an optional bitwidth, by default C<8> (aka C<u8>, aka C<char> to most systems' C
compilers).

=defn C«<sizedstr R<int-type> R<char-size>>»
Matches a string that begins with a size parameter, indicating the number of
characters in the string. The required R<int-type> parameter is one of the
previously shown signed or unsigned integer names. (A negative length found for
the string will cause an error however, so sticking to the unsigned names is
recommended.) The optional R<char-size> parameter specifies the width of a
character, by default C<8>.

=defn C«<fixedstr R<chars> R<char-size>>» Matches a string that's R<chars>
characters long. Meant for strings with a fixed length (hence the name), found
in e.g. file headers. R<char-size> is optional, and specifies the bitwidth of a
character. It's C<8> by default.

For all these rules, implementations should reject non-byte-sized bitwidths
given as the R<char-size>. Each of these will return a C<Blob> in their C<.ast>s
(see {{{below}}}). It's up to you to decode these strings appropriately.

=head2 Named capture

The primary way to collect data in a format should be through named
captures. They look like their regex counterparts:

    $<version>=[<u8>]
    $<version>=(<u8>)
    <version=u8>

With the same behaviors as each of those variations.

=head2 Capture Series

When you need to capture a series of values of the same datatype, use the
"capture series" construct:

    =for code :allow<R>
    <,R<int-type> R<name1> R<name2> ... R<nameN>>

This helps clarify a section of binary data that's just a series of values in
one type. For example:

    <,u8 v-major v-minor v-revision>

This will capture three C<u8> values in written order, one each for the given
names. This is equivalent to

    <v-major=u8>
    <v-minor=u8>
    <v-revision=u8>

but shorter, and easier to update.

=head2 Bitfields

To match a collection of flags, or other kinds of data organized at the bit
level, use the bitfield construct:

    =for code :allow<R>
    <:R<bit-size> R<flag-name1> R<flag-name2> ... R<flag-nameN>>

This version of the construct names a collection of bit flags, in written
order. R<bit-size> is the total size of the bitfield. Non-whole-byte sizes of
R<bit-size> should be rejected by the compiler.

Bits that you want to skip can have a C<*> specified for them. To skip the rest
of the bytes, use a C<**>.

As an example, for the number C<0x90> (binary C<0x1001_0000>), the bit field
would look like such:

    <:8 high_flag * * flag2 **>

There is a second form of the bitfield construct, shown as such:

    =for code :allow<R>
    <:R<bit-size>,R<int-type> R<name1> R<name2> ,R<int-type2> R<name3> ... R<nameN>>

This version allows grabbing bit-packed values. These examples will extract the
parts of various IEEE floating-point formats:

    <:32,u1 sign ,i127k8 exponent ,u23 fraction>           # binary32
    <:128,u1 sign ,i16383k15 exponent ,u112 fraction>      # binary128
    <:64,u1 sign ,u5 combined ,u8 expcont ,u50 coeffcont>  # decimal64

Putting C<Whatever>s in this form will skip the number of bits of the current
type. Note that the first form essentially is essentially the second form, but
with an implied C<,u1> after the R<bit-size>. Note that the first and second
forms are separate, meaning you can't specify a data type in the middle of a
first form:

    <:32 flag1 flag2 ,u30 number>     # WRONG, will fail
    <:32,u1 flag1 flag2 ,u30 number>  # OK

In both forms, leaving out the necessary parameters, C<Whatever>s, or
C<HyperWhatever> to use all R<bit-size> bits will result in the compiler
assuming an implicit C<**> at the end of the list, with a warning. Specifying
too many items will cause a fatal error.

    <:16,u4 high_half low_half>    # assumed ** after low_half, and warning
    <:8,u4 high_half low_half * *> # dies, too many parameters
    <:16,u4 high_half low_half **> # OK, and no warnings

=head1 Other constructs

=head2 C«<le>»

The C<le> construct helps with matching literal numbers that's not in written
order. The form is

    =for code :allow<R>
    <le R<bits-per-unit> R<number literals>>

This takes the string of bytes formed by the R<number literals> in written
order, and then reverse on R<bits-per-unit>-sized pieces. Implementations may
reject non-whole-byte bit sizes, or alternately allow them when enough whole
bytes are specified to specify a whole number of R<bits-per-unit>-sized pieces.

For example:

    f/<le 8 0xFACEF00D>/  is eqv. to f/0x0DF0CEFA/
    f/<le 16 0xFACEF00D>/ is eqv. to f/0xF00DFACE/

=head2 General subrule collection

Plenty of binary data formats have a set list of offsets in the file, which
dictate collecting information from the rest of the data. This construct is
meant to ease the specification of that setup.

    <&& address1 => subrule1, address2 => subrule2 ..., addressN => subruleN>

This specifies a list of subrules, and where they can be found in the data. For
example:

    =begin code
    <file-header>

    <&& $<file-header><dictoffset> => <dictionary>
        $<file-header><mdoffset>   => <metadata>
        $<file-header><fdoffset>   => <filedata> >
    =end code

The rule needs all the given subrules to successfully match in order to
succeed. This construct allows the format to check these subrules in whatever
order is most appropriate, since the responsibility of going everywhere lays
with the construct.

=head2 Data offset examine/control

=end pod
	=config C<> :allow<R>

	=begin pod

	=head1 Basics

	Binary grammars, henceforth referred to as "formats" (taken from e.g. "file
	format"), are in my mind sketched as follows:

	=item Formats work on the byte level (which for sanity's sake we'll consider
	equivalent to octets). Bitfields are handled by a special syntax.

	=item Formats are formatted similarly to grammars, but with a smaller feature
	set; not because we don't want to support binary parsing as much, but
	because binary data needs less (and different) tools than parsing text.

	=item The main goals for formats are, in my mind:

	=item2 Pulling information at a specific position (as opposed to grammars
	pulling information after a specific set of preconditions)

	=item2 Getting data based on size constraints rather than contents. That is,
	you're more often concerned with getting data of a certain size than
	getting data matching a particular pattern.

	=item2 Choices are often based on previously-collected information. In other
	words, alternatives are usually known to appear at a certain location.

	=item2 Strings of text should be matchable to regexes, but formats shouldn't
	inherit all of the grammars just to do that.

	=begin item

	Everything in formats, as I sketch them, are based on written order. For
	multi-byte numbers, this means big-endian is assumed (though we of course let
	you handle source endianness easily). Bitfields are written with the most
	significant bits first, and so on.

	However, since 'big-endian' has a more narrow meaning than we want here, and
	because there are numerous differences between endian schemes, we'll instead
	refer to "written order" in this document. For example, given the 32-bit number
	C<0xFACE_F00D>:

	=begin table
	Endianness Written Order
	============ ================
	Big :16{FA CE F0 0D}
	Little on 8 :16{0D F0 CE FA}
	Little on 16 :16{F0 0D FA CE}
	"PDP-Endian" :16{CE FA 0D F0}
	=end table

	In each of these examples, the "Written Order" column shows how the bytes
	would have to be written out literally to match C<0xFACE_F00D> in the given
	endianness.

	Simply put, write your formats in the mindset of reading increasing positions in
	the data.

	=end item

	=head1 Literal Values

	=head2 Numbers

	The most basic thing is matching byte literals:

	42 0x21 0o12 0b1110111

	Literals are assumed to be of the smallest number of bytes that can contain the
	number, and that number of bytes is matched. E.g.

	255

	would be a 1-byte number (0xFF), while

	256

	would be a 2-byte number (0x100, aka 0x01_00)

	Remember that formats go by written order, so these two lines would match the
	same thing:

	f/0xFACE/
	f/0xFA 0xCE/

	(see {{{below}}} on what C<f//> means)

	However, there is the C«<le>» special rule that helps ease writing little-endian
	sequences, see {{{below}}}.

	=head2 Strings

	Literal text can be matched with string literals, specified as would any Q-lang
	string:

	"abc" Q:s[def $ghi]

	These strings are taken as NFG strings unless an appropriate adverb is given
	(C<Q:nfc> and the like is the main reason for allowing any standard Q string
	formats). These NFG strings are then taken down to NFC level, and then converted
	to utf-8 bytes.

	To change how text is encoded, use one of the C<:utf16be>, C<:utf16le>,
	C<:utf32be>, or C<:utf32le> adverbs in the format:

	[Conjecture: make the 16- and 32-bit adverbs be C«:utf16<be>» instead?]

	f/"☃☄"/ # matches 0xE2_98_83 0xE2_98_84 (6 bytes)
	f/:utf16be "☃☄"/ # matches 0x2603 0x2604 (4 bytes)

	It's unclear how unquoted text will work out currently. Possibly NFG-only, with
	the given encoding adverbs affecting it as well.

	=head1 Generic Data Matching

	These rules allow for matching more general patterns than simply literals. Use
	of C«<rule>» syntax like regexes is debatable. [Conjecture: we could just have
	literal identifiers, since those are up in the air (see the section on literal
	strings just above).]

	=head2 Unsigned integers

	The only primitive data type, in the opinion of formats, are unsigned integers,
	in other words a certain number of bytes. Their names are programmatic, of the
	format:

	=for code :allow<R>
	<uR<bits>>

	Where C<R<bits>> is, predictably, the number of bits in the integer you're
	taking. Implementations should reject and die on bit values that aren't a whole
	number of bytes. L<#Bitfields> are the proper way to handle strange bit widths,
	such as C<u9> on an octet-based system.

	=head2 Signed integers

	Signed integers have a similar programmatic name, though there is a method of
	specifying the encoding method:

	=for code :allow<R>
	<iR<bits>>
	<iR<enc>R<bits>>

	R<enc> can be any of C<2c>, C<1c>, C<sm>, or C<R<num>k>; for two's complement,
	one's complement, sign-magnitude, and excess-I<k>, respectively. Note that for
	excess-I<k>, you'll need to provide an additional number that is the biasing
	value used.

	<i2c8> # 8-bit signed int (two's-complement)
	<i1c16> # 16-bit signed int (one's-complement)
	<ism4> # 4-bit signed int (sign-magnitude)
	<i128k8> # 8-bit signed int (excess-128)

	Note that the default is two's-complement, so these two lines are equivalent:

	f/<i2c8>/
	f/<i8>/

	=head2 Text strings

	There are three different kinds of strings that formats provide ways to match:

	=defn C«<nullstr R<char-size>>»
	Matches a standard C-style null-terminated string. In other words, will keep
	taking R<char-size>-bit numbers until a null number is reached. R<char-size> is
	an optional bitwidth, by default C<8> (aka C<u8>, aka C<char> to most systems' C
	compilers).

	=defn C«<sizedstr R<int-type> R<char-size>>»
	Matches a string that begins with a size parameter, indicating the number of
	characters in the string. The required R<int-type> parameter is one of the
	previously shown signed or unsigned integer names. (A negative length found for
	the string will cause an error however, so sticking to the unsigned names is
	recommended.) The optional R<char-size> parameter specifies the width of a
	character, by default C<8>.

	=defn C«<fixedstr R<chars> R<char-size>>» Matches a string that's R<chars>
	characters long. Meant for strings with a fixed length (hence the name), found
	in e.g. file headers. R<char-size> is optional, and specifies the bitwidth of a
	character. It's C<8> by default.

	For all these rules, implementations should reject non-byte-sized bitwidths
	given as the R<char-size>. Each of these will return a C<Blob> in their C<.ast>s
	(see {{{below}}}). It's up to you to decode these strings appropriately.

	=head2 Named capture

	The primary way to collect data in a format should be through named
	captures. They look like their regex counterparts:

	$<version>=[<u8>]
	$<version>=(<u8>)
	<version=u8>

	With the same behaviors as each of those variations.

	=head2 Capture Series

	When you need to capture a series of values of the same datatype, use the
	"capture series" construct:

	=for code :allow<R>
	<,R<int-type> R<name1> R<name2> ... R<nameN>>

	This helps clarify a section of binary data that's just a series of values in
	one type. For example:

	<,u8 v-major v-minor v-revision>

	This will capture three C<u8> values in written order, one each for the given
	names. This is equivalent to

	<v-major=u8>
	<v-minor=u8>
	<v-revision=u8>

	but shorter, and easier to update.

	=head2 Bitfields

	To match a collection of flags, or other kinds of data organized at the bit
	level, use the bitfield construct:

	=for code :allow<R>
	<:R<bit-size> R<flag-name1> R<flag-name2> ... R<flag-nameN>>

	This version of the construct names a collection of bit flags, in written
	order. R<bit-size> is the total size of the bitfield. Non-whole-byte sizes of
	R<bit-size> should be rejected by the compiler.

	Bits that you want to skip can have a C<*> specified for them. To skip the rest
	of the bytes, use a C<**>.

	As an example, for the number C<0x90> (binary C<0x1001_0000>), the bit field
	would look like such:

	<:8 high_flag * * flag2 **>

	There is a second form of the bitfield construct, shown as such:

	=for code :allow<R>
	<:R<bit-size>,R<int-type> R<name1> R<name2> ,R<int-type2> R<name3> ... R<nameN>>

	This version allows grabbing bit-packed values. These examples will extract the
	parts of various IEEE floating-point formats:

	<:32,u1 sign ,i127k8 exponent ,u23 fraction> # binary32
	<:128,u1 sign ,i16383k15 exponent ,u112 fraction> # binary128
	<:64,u1 sign ,u5 combined ,u8 expcont ,u50 coeffcont> # decimal64

	Putting C<Whatever>s in this form will skip the number of bits of the current
	type. Note that the first form essentially is essentially the second form, but
	with an implied C<,u1> after the R<bit-size>. Note that the first and second
	forms are separate, meaning you can't specify a data type in the middle of a
	first form:

	<:32 flag1 flag2 ,u30 number> # WRONG, will fail
	<:32,u1 flag1 flag2 ,u30 number> # OK

	In both forms, leaving out the necessary parameters, C<Whatever>s, or
	C<HyperWhatever> to use all R<bit-size> bits will result in the compiler
	assuming an implicit C<**> at the end of the list, with a warning. Specifying
	too many items will cause a fatal error.

	<:16,u4 high_half low_half> # assumed ** after low_half, and warning
	<:8,u4 high_half low_half * *> # dies, too many parameters
	<:16,u4 high_half low_half **> # OK, and no warnings

	=head1 Other constructs

	=head2 C«<le>»

	The C<le> construct helps with matching literal numbers that's not in written
	order. The form is

	=for code :allow<R>
	<le R<bits-per-unit> R<number literals>>

	This takes the string of bytes formed by the R<number literals> in written
	order, and then reverse on R<bits-per-unit>-sized pieces. Implementations may
	reject non-whole-byte bit sizes, or alternately allow them when enough whole
	bytes are specified to specify a whole number of R<bits-per-unit>-sized pieces.

	For example:

	f/<le 8 0xFACEF00D>/ is eqv. to f/0x0DF0CEFA/
	f/<le 16 0xFACEF00D>/ is eqv. to f/0xF00DFACE/

	=head2 General subrule collection

	Plenty of binary data formats have a set list of offsets in the file, which
	dictate collecting information from the rest of the data. This construct is
	meant to ease the specification of that setup.

	<&& address1 => subrule1, address2 => subrule2 ..., addressN => subruleN>

	This specifies a list of subrules, and where they can be found in the data. For
	example:

	=begin code
	<file-header>

	<&& $<file-header><dictoffset> => <dictionary>
	$<file-header><mdoffset> => <metadata>
	$<file-header><fdoffset> => <filedata> >
	=end code

	The rule needs all the given subrules to successfully match in order to
	succeed. This construct allows the format to check these subrules in whatever
	order is most appropriate, since the responsibility of going everywhere lays
	with the construct.

	=head2 Data offset examine/control

	=end pod