Last active
October 2, 2015 09:26
-
-
Save ShimmerFairy/72a25f87939aab077158 to your computer and use it in GitHub Desktop.
My thoughts on possible binary grammars in P6 (incomplete)
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
=config C<> :allow<R> | |
=begin pod | |
=head1 Basics | |
Binary grammars, henceforth referred to as "formats" (taken from e.g. "file | |
format"), are in my mind sketched as follows: | |
=item Formats work on the byte level (which for sanity's sake we'll consider | |
equivalent to octets). Bitfields are handled by a special syntax. | |
=item Formats are formatted similarly to grammars, but with a smaller feature | |
set; not because we don't want to support binary parsing as much, but | |
because binary data needs less (and different) tools than parsing text. | |
=item The main goals for formats are, in my mind: | |
=item2 Pulling information at a specific position (as opposed to grammars | |
pulling information after a specific set of preconditions) | |
=item2 Getting data based on size constraints rather than contents. That is, | |
you're more often concerned with getting data of a certain size than | |
getting data matching a particular pattern. | |
=item2 Choices are often based on previously-collected information. In other | |
words, alternatives are usually known to appear at a certain location. | |
=item2 Strings of text should be matchable to regexes, but formats shouldn't | |
inherit all of the grammars just to do that. | |
=begin item | |
Everything in formats, as I sketch them, are based on written order. For | |
multi-byte numbers, this means big-endian is assumed (though we of course let | |
you handle source endianness easily). Bitfields are written with the most | |
significant bits first, and so on. | |
However, since 'big-endian' has a more narrow meaning than we want here, and | |
because there are numerous differences between endian schemes, we'll instead | |
refer to "written order" in this document. For example, given the 32-bit number | |
C<0xFACE_F00D>: | |
=begin table | |
Endianness Written Order | |
============ ================ | |
Big :16{FA CE F0 0D} | |
Little on 8 :16{0D F0 CE FA} | |
Little on 16 :16{F0 0D FA CE} | |
"PDP-Endian" :16{CE FA 0D F0} | |
=end table | |
In each of these examples, the "Written Order" column shows how the bytes | |
would have to be written out literally to match C<0xFACE_F00D> in the given | |
endianness. | |
Simply put, write your formats in the mindset of reading increasing positions in | |
the data. | |
=end item | |
=head1 Literal Values | |
=head2 Numbers | |
The most basic thing is matching byte literals: | |
42 0x21 0o12 0b1110111 | |
Literals are assumed to be of the smallest number of bytes that can contain the | |
number, and that number of bytes is matched. E.g. | |
255 | |
would be a 1-byte number (0xFF), while | |
256 | |
would be a 2-byte number (0x100, aka 0x01_00) | |
Remember that formats go by written order, so these two lines would match the | |
same thing: | |
f/0xFACE/ | |
f/0xFA 0xCE/ | |
(see {{{below}}} on what C<f//> means) | |
However, there is the C«<le>» special rule that helps ease writing little-endian | |
sequences, see {{{below}}}. | |
=head2 Strings | |
Literal text can be matched with string literals, specified as would any Q-lang | |
string: | |
"abc" Q:s[def $ghi] | |
These strings are taken as NFG strings unless an appropriate adverb is given | |
(C<Q:nfc> and the like is the main reason for allowing any standard Q string | |
formats). These NFG strings are then taken down to NFC level, and then converted | |
to utf-8 bytes. | |
To change how text is encoded, use one of the C<:utf16be>, C<:utf16le>, | |
C<:utf32be>, or C<:utf32le> adverbs in the format: | |
[Conjecture: make the 16- and 32-bit adverbs be C«:utf16<be>» instead?] | |
f/"☃☄"/ # matches 0xE2_98_83 0xE2_98_84 (6 bytes) | |
f/:utf16be "☃☄"/ # matches 0x2603 0x2604 (4 bytes) | |
It's unclear how unquoted text will work out currently. Possibly NFG-only, with | |
the given encoding adverbs affecting it as well. | |
=head1 Generic Data Matching | |
These rules allow for matching more general patterns than simply literals. Use | |
of C«<rule>» syntax like regexes is debatable. [Conjecture: we could just have | |
literal identifiers, since those are up in the air (see the section on literal | |
strings just above).] | |
=head2 Unsigned integers | |
The only primitive data type, in the opinion of formats, are unsigned integers, | |
in other words a certain number of bytes. Their names are programmatic, of the | |
format: | |
=for code :allow<R> | |
<uR<bits>> | |
Where C<R<bits>> is, predictably, the number of bits in the integer you're | |
taking. Implementations should reject and die on bit values that aren't a whole | |
number of bytes. L<#Bitfields> are the proper way to handle strange bit widths, | |
such as C<u9> on an octet-based system. | |
=head2 Signed integers | |
Signed integers have a similar programmatic name, though there is a method of | |
specifying the encoding method: | |
=for code :allow<R> | |
<iR<bits>> | |
<iR<enc>R<bits>> | |
R<enc> can be any of C<2c>, C<1c>, C<sm>, or C<R<num>k>; for two's complement, | |
one's complement, sign-magnitude, and excess-I<k>, respectively. Note that for | |
excess-I<k>, you'll need to provide an additional number that is the biasing | |
value used. | |
<i2c8> # 8-bit signed int (two's-complement) | |
<i1c16> # 16-bit signed int (one's-complement) | |
<ism4> # 4-bit signed int (sign-magnitude) | |
<i128k8> # 8-bit signed int (excess-128) | |
Note that the default is two's-complement, so these two lines are equivalent: | |
f/<i2c8>/ | |
f/<i8>/ | |
=head2 Text strings | |
There are three different kinds of strings that formats provide ways to match: | |
=defn C«<nullstr R<char-size>>» | |
Matches a standard C-style null-terminated string. In other words, will keep | |
taking R<char-size>-bit numbers until a null number is reached. R<char-size> is | |
an optional bitwidth, by default C<8> (aka C<u8>, aka C<char> to most systems' C | |
compilers). | |
=defn C«<sizedstr R<int-type> R<char-size>>» | |
Matches a string that begins with a size parameter, indicating the number of | |
characters in the string. The required R<int-type> parameter is one of the | |
previously shown signed or unsigned integer names. (A negative length found for | |
the string will cause an error however, so sticking to the unsigned names is | |
recommended.) The optional R<char-size> parameter specifies the width of a | |
character, by default C<8>. | |
=defn C«<fixedstr R<chars> R<char-size>>» Matches a string that's R<chars> | |
characters long. Meant for strings with a fixed length (hence the name), found | |
in e.g. file headers. R<char-size> is optional, and specifies the bitwidth of a | |
character. It's C<8> by default. | |
For all these rules, implementations should reject non-byte-sized bitwidths | |
given as the R<char-size>. Each of these will return a C<Blob> in their C<.ast>s | |
(see {{{below}}}). It's up to you to decode these strings appropriately. | |
=head2 Named capture | |
The primary way to collect data in a format should be through named | |
captures. They look like their regex counterparts: | |
$<version>=[<u8>] | |
$<version>=(<u8>) | |
<version=u8> | |
With the same behaviors as each of those variations. | |
=head2 Capture Series | |
When you need to capture a series of values of the same datatype, use the | |
"capture series" construct: | |
=for code :allow<R> | |
<,R<int-type> R<name1> R<name2> ... R<nameN>> | |
This helps clarify a section of binary data that's just a series of values in | |
one type. For example: | |
<,u8 v-major v-minor v-revision> | |
This will capture three C<u8> values in written order, one each for the given | |
names. This is equivalent to | |
<v-major=u8> | |
<v-minor=u8> | |
<v-revision=u8> | |
but shorter, and easier to update. | |
=head2 Bitfields | |
To match a collection of flags, or other kinds of data organized at the bit | |
level, use the bitfield construct: | |
=for code :allow<R> | |
<:R<bit-size> R<flag-name1> R<flag-name2> ... R<flag-nameN>> | |
This version of the construct names a collection of bit flags, in written | |
order. R<bit-size> is the total size of the bitfield. Non-whole-byte sizes of | |
R<bit-size> should be rejected by the compiler. | |
Bits that you want to skip can have a C<*> specified for them. To skip the rest | |
of the bytes, use a C<**>. | |
As an example, for the number C<0x90> (binary C<0x1001_0000>), the bit field | |
would look like such: | |
<:8 high_flag * * flag2 **> | |
There is a second form of the bitfield construct, shown as such: | |
=for code :allow<R> | |
<:R<bit-size>,R<int-type> R<name1> R<name2> ,R<int-type2> R<name3> ... R<nameN>> | |
This version allows grabbing bit-packed values. These examples will extract the | |
parts of various IEEE floating-point formats: | |
<:32,u1 sign ,i127k8 exponent ,u23 fraction> # binary32 | |
<:128,u1 sign ,i16383k15 exponent ,u112 fraction> # binary128 | |
<:64,u1 sign ,u5 combined ,u8 expcont ,u50 coeffcont> # decimal64 | |
Putting C<Whatever>s in this form will skip the number of bits of the current | |
type. Note that the first form essentially is essentially the second form, but | |
with an implied C<,u1> after the R<bit-size>. Note that the first and second | |
forms are separate, meaning you can't specify a data type in the middle of a | |
first form: | |
<:32 flag1 flag2 ,u30 number> # WRONG, will fail | |
<:32,u1 flag1 flag2 ,u30 number> # OK | |
In both forms, leaving out the necessary parameters, C<Whatever>s, or | |
C<HyperWhatever> to use all R<bit-size> bits will result in the compiler | |
assuming an implicit C<**> at the end of the list, with a warning. Specifying | |
too many items will cause a fatal error. | |
<:16,u4 high_half low_half> # assumed ** after low_half, and warning | |
<:8,u4 high_half low_half * *> # dies, too many parameters | |
<:16,u4 high_half low_half **> # OK, and no warnings | |
=head1 Other constructs | |
=head2 C«<le>» | |
The C<le> construct helps with matching literal numbers that's not in written | |
order. The form is | |
=for code :allow<R> | |
<le R<bits-per-unit> R<number literals>> | |
This takes the string of bytes formed by the R<number literals> in written | |
order, and then reverse on R<bits-per-unit>-sized pieces. Implementations may | |
reject non-whole-byte bit sizes, or alternately allow them when enough whole | |
bytes are specified to specify a whole number of R<bits-per-unit>-sized pieces. | |
For example: | |
f/<le 8 0xFACEF00D>/ is eqv. to f/0x0DF0CEFA/ | |
f/<le 16 0xFACEF00D>/ is eqv. to f/0xF00DFACE/ | |
=head2 General subrule collection | |
Plenty of binary data formats have a set list of offsets in the file, which | |
dictate collecting information from the rest of the data. This construct is | |
meant to ease the specification of that setup. | |
<&& address1 => subrule1, address2 => subrule2 ..., addressN => subruleN> | |
This specifies a list of subrules, and where they can be found in the data. For | |
example: | |
=begin code | |
<file-header> | |
<&& $<file-header><dictoffset> => <dictionary> | |
$<file-header><mdoffset> => <metadata> | |
$<file-header><fdoffset> => <filedata> > | |
=end code | |
The rule needs all the given subrules to successfully match in order to | |
succeed. This construct allows the format to check these subrules in whatever | |
order is most appropriate, since the responsibility of going everywhere lays | |
with the construct. | |
=head2 Data offset examine/control | |
=end pod |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment