jnthn/binary.md Secret

## binary.md

      
# Binary handling primitives in MoarVM and Perl 6

Current support for working with binary data in Perl 6 is less than awesome.
The problem goes down to the VM abstraction layer: the nqp:: op set is also
quite impoverished in this area. Binary data is often used in situations where
performance matters, so an interface that allows the VM to optimize well and
generate good code for binary data handling is also important.

## Goals of this proposal


* Define a means to provide a view into an array without having to make a
  copy
* Define an nqp:: op API that can, from a buffer (either of the VMArray
  REPR or a 1-dimensional array using the MultiDimArray REPR), or a view
  of one, read and write:
    * Integers, signed and unsigned, of size 8-bit, 16-bit, 32-bit, and
      64-bit, with handling of endian swapping
    * IEEE floating point numbers of size 32-bit and 64-bit, with handling
      of endian swapping
* Provide a Perl 6 low-level API proposal for working with binary data. Of
  course, higher-level things can be built atop it, but they need a more
  boring base API to build upon. Further, given the performance
  sensitivities, boring but relatively easy to optimize is perhaps more
  valuable anyway.

## The new ArrayView representation

The ArrayView representation presents a view into either a VMArray,
MultiDimArray, or another ArrayView (this latter case will not build a
chain of objects, but just re-calculate the offsets and lengths). It
implements the positional REPR API, but all reads and writes are forwarded
to the underlying representation. (This means it's a mutable view.) Its uses
include:

* Being able to pass a chunk of binary data to parse off to another routine
  without having to pass offsets around (potentially Blob.subbuf could also
  use this, since it's immutable)
* To decode a string from part of a larger buffer without having to copy the
  source bytes making up the string
* Implementing partial views of multi-dimensional arrays
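
The offset re-calculation described above can be modelled outside the VM. What follows is a minimal Python sketch, not MoarVM code: the `ArrayView` class, its `target`, `offset` and `length` fields, and the bounds-checking behaviour are all assumptions made for illustration. It shows a view of a view resolving directly to the original backing array, and writes passing through to it.

```python
class ArrayView:
    """Hypothetical model of an ArrayView: a window onto a backing array."""

    def __init__(self, source, offset, length):
        source_len = source.length if isinstance(source, ArrayView) else len(source)
        if offset < 0 or length < 0 or offset + length > source_len:
            raise IndexError("view out of range")
        if isinstance(source, ArrayView):
            # View of a view: point at the original backing array and add
            # the offsets, rather than chaining view objects.
            self.target = source.target
            self.offset = source.offset + offset
        else:
            self.target = source
            self.offset = offset
        self.length = length

    def __len__(self):
        return self.length

    def _resolve(self, i):
        # Translate a view-relative index into a backing-array index.
        if not 0 <= i < self.length:
            raise IndexError("index out of range")
        return self.offset + i

    def __getitem__(self, i):
        return self.target[self._resolve(i)]

    def __setitem__(self, i, value):
        # Writes are forwarded, so the view is mutable.
        self.target[self._resolve(i)] = value


backing = bytearray(range(10))
outer = ArrayView(backing, 2, 6)   # covers backing[2..7]
inner = ArrayView(outer, 1, 3)     # covers backing[3..5], no chain to outer
inner[0] = 99                      # lands in backing[3]
```

Because `inner` holds the backing array and an absolute offset, element access costs one addition regardless of how many views were taken along the way.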

## The nqp::op binary data API extensions

For these definitions, buffer refers to a concrete object with a REPR of
either VMArray or MultiDimArray, the latter being constrained to a single
dimension. (Note: dimensionality is a property of the type, meaning that type
specialization is already sufficient to optimize out both the REPR and shape
checks.) In either case, the array must be an 8-bit integer array (as a Perl 6
Blob or Buf will be). An ArrayView onto such an array is also allowed
for use with the binary data manipulation ops.

### Constants

The following new nqp::const entries are defined for use with the new ops,
and specify sizes to use in reads and writes:

* BINARY_SIZE_8_BIT
* BINARY_SIZE_16_BIT
* BINARY_SIZE_32_BIT
* BINARY_SIZE_64_BIT

These nqp::const entries are defined for specifying the endianness of the data
to read or write:

* BINARY_ENDIAN_LITTLE
* BINARY_ENDIAN_BIG

Operations not configured with one of these options will assume native endian.
Reading or writing little endian on a little endian machine will, of course,
carry no transformation overhead.
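
One plausible reading is that a size constant and an optional endianness constant are combined into a single `$flags` integer. The Python sketch below illustrates that reading only. The numeric values, the use of bitwise OR to combine them, and the helper name `signed_format` are assumptions; the proposal names the constants but does not fix their values or encoding.

```python
import struct

# Hypothetical values: two low bits select the size, two more the endianness.
BINARY_SIZE_8_BIT = 0
BINARY_SIZE_16_BIT = 1
BINARY_SIZE_32_BIT = 2
BINARY_SIZE_64_BIT = 3
BINARY_ENDIAN_LITTLE = 4
BINARY_ENDIAN_BIG = 8

_SIZE_MASK = 3
_SIGNED_CODES = "bhiq"  # struct codes for signed 8/16/32/64-bit integers


def signed_format(flags):
    """Map a flags value to a struct format string for a signed integer."""
    code = _SIGNED_CODES[flags & _SIZE_MASK]
    if flags & BINARY_ENDIAN_BIG:
        order = ">"
    elif flags & BINARY_ENDIAN_LITTLE:
        order = "<"
    else:
        order = "="  # no endian flag: native byte order, standard sizes
    return order + code


# Pack the value 1 as a 32-bit little-endian integer.
packed = struct.pack(signed_format(BINARY_SIZE_32_BIT | BINARY_ENDIAN_LITTLE), 1)
```

With an encoding like this, the size and byte order are known from a constant `$flags` argument, which is the kind of information a specializer can use to pick a fixed-width load or store.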

### nqp::writeint(buffer $target, int $offset, int $value, int $flags)

Writes the signed integer $value at $offset into the buffer $target,
with the size and endianness specified by $flags.

### nqp::writeuint(buffer $target, int $offset, uint $value, int $flags)

Writes the unsigned integer $value at $offset into the buffer $target,
with the size and endianness specified by $flags.

### nqp::writenum(buffer $target, int $offset, num $value, int $flags)

Writes the floating point $value at $offset into the buffer $target,
with the size and endianness specified by $flags. Only 32-bit and 64-bit
sizes are supported.

### nqp::readint(buffer $source, int $offset, int $flags --> int)

Reads a signed integer at offset $offset from $source with size and
endianness specified by $flags. Returns that value, widened to a 64-bit
int.

### nqp::readuint(buffer $source, int $offset, int $flags --> uint)

Reads an unsigned integer at offset $offset from $source with size and
endianness specified by $flags. Returns that value, widened to a 64-bit
uint.

### nqp::readnum(buffer $source, int $offset, int $flags --> num)

Reads a floating point number at offset $offset from $source with size and
endianness specified by $flags. Returns that value, widened to a 64-bit
num.
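
The difference between the signed and unsigned reads, and the effect of widening, is easiest to see on concrete bytes. The following Python sketch models the semantics rather than the ops themselves: `write_int` and `read_int` are hypothetical helpers, Python integers stand in for the 64-bit `int` and `uint` results, and truncating an over-wide value on write is an assumption the proposal does not spell out.

```python
import struct


def write_int(buf, offset, value, size, byteorder):
    """Store the low `size` bytes of `value` at `offset` (assumed truncation)."""
    if offset < 0 or offset + size > len(buf):
        raise IndexError("write out of range")
    mask = (1 << (8 * size)) - 1
    buf[offset:offset + size] = (value & mask).to_bytes(size, byteorder)


def read_int(buf, offset, size, byteorder, signed):
    """Read `size` bytes at `offset`, sign- or zero-extending the result."""
    if offset < 0 or offset + size > len(buf):
        raise IndexError("read out of range")
    return int.from_bytes(buf[offset:offset + size], byteorder, signed=signed)


buf = bytearray(8)
write_int(buf, 1, -2, 2, "big")                  # stores 0xFF 0xFE at offsets 1..2
as_signed = read_int(buf, 1, 2, "big", True)     # sign-extended
as_unsigned = read_int(buf, 1, 2, "big", False)  # zero-extended

# A 32-bit float widens to a 64-bit num on read; a value that is not
# exactly representable in 32 bits comes back slightly changed.
struct.pack_into("<f", buf, 4, 0.1)
widened = struct.unpack_from("<f", buf, 4)[0]
```

Here the same two bytes read back as -2 when signed and 65534 when unsigned, and `widened` is close to, but not equal to, 0.1.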

## The nqp::op view extensions

### nqp::view(buffer $source, int $offset, int $length --> arrayview)

Creates a 1-dimensional view of the $source buffer starting at element
$offset and spanning $length elements.

### nqp::viewdim(buffer $source, int $idx --> arrayview)

Provided $source has at least 2 dimensions, forms a view with 1 dimension
fewer, in which the index $idx is prepended to the indices used for each
lookup into $source. This allows for an (n-1)-dimensional view of an
n-dimensional array.
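
The index-prepending behaviour can be sketched in a few lines. This Python model is illustrative only: the `Shaped` and `DimView` classes, the row-major layout, and the `at` method are assumptions, not part of the proposal. A view stores the fixed leading indices and puts them in front of whatever indices the caller supplies.

```python
class Shaped:
    """Hypothetical n-dimensional array stored flat in row-major order."""

    def __init__(self, dims, data):
        self.dims = tuple(dims)
        self.data = data

    def at(self, *indices):
        if len(indices) != len(self.dims):
            raise IndexError("wrong number of indices")
        flat = 0
        for i, d in zip(indices, self.dims):
            if not 0 <= i < d:
                raise IndexError("index out of range")
            flat = flat * d + i
        return self.data[flat]


class DimView:
    """View with one dimension fewer; remembers the fixed leading indices."""

    def __init__(self, source, idx):
        if len(source.dims) < 2:
            raise ValueError("source needs at least 2 dimensions")
        if not 0 <= idx < source.dims[0]:
            raise IndexError("index out of range")
        if isinstance(source, DimView):
            # View of a view: extend the prefix instead of chaining.
            self.base = source.base
            self.prefix = source.prefix + (idx,)
        else:
            self.base = source
            self.prefix = (idx,)
        self.dims = source.dims[1:]

    def at(self, *indices):
        # Prepend the fixed indices, then look up in the original array.
        return self.base.at(*self.prefix, *indices)


grid = Shaped((2, 3, 4), list(range(24)))
plane = DimView(grid, 1)   # 3 x 4 view of grid[1]
row = DimView(plane, 2)    # 4-element view of grid[1][2]
```

In this model `row.at(3)`, `plane.at(2, 3)` and `grid.at(1, 2, 3)` all address the same element.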

## New blob8/buf8 methods in Perl 6

The following methods are provided on blob8 and buf8 to provide a
low-level API for reading sized integers and floating point numbers:

* read-int8(int $offset, Bool :$big-endian, Bool :$little-endian --> int)
* read-int16(int $offset, Bool :$big-endian, Bool :$little-endian --> int)
* read-int32(int $offset, Bool :$big-endian, Bool :$little-endian --> int)
* read-int64(int $offset, Bool :$big-endian, Bool :$little-endian --> int)
* read-uint8(int $offset, Bool :$big-endian, Bool :$little-endian --> uint)
* read-uint16(int $offset, Bool :$big-endian, Bool :$little-endian --> uint)
* read-uint32(int $offset, Bool :$big-endian, Bool :$little-endian --> uint)
* read-uint64(int $offset, Bool :$big-endian, Bool :$little-endian --> uint)
* read-num32(int $offset, Bool :$big-endian, Bool :$little-endian --> num)
* read-num64(int $offset, Bool :$big-endian, Bool :$little-endian --> num)

Failing to specify endianness implies native endian. The offset is given in
bytes and identifies where the value is read from.

A matching set of write methods exists, taking the value to write:

* write-int8(int $offset, int $value, Bool :$big-endian, Bool :$little-endian --> Nil)
* write-int16(int $offset, int $value, Bool :$big-endian, Bool :$little-endian --> Nil)
* write-int32(int $offset, int $value, Bool :$big-endian, Bool :$little-endian --> Nil)
* write-int64(int $offset, int $value, Bool :$big-endian, Bool :$little-endian --> Nil)
* write-uint8(int $offset, uint $value, Bool :$big-endian, Bool :$little-endian --> Nil)
* write-uint16(int $offset, uint $value, Bool :$big-endian, Bool :$little-endian --> Nil)
* write-uint32(int $offset, uint $value, Bool :$big-endian, Bool :$little-endian --> Nil)
* write-uint64(int $offset, uint $value, Bool :$big-endian, Bool :$little-endian --> Nil)
* write-num32(int $offset, num $value, Bool :$big-endian, Bool :$little-endian --> Nil)
* write-num64(int $offset, num $value, Bool :$big-endian, Bool :$little-endian --> Nil)

These provide a lowest common denominator for dealing with integer and floating
point number data when doing binary data processing. They allow unaligned reads
and writes (with the usual caveats about efficiency).
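
To show what unaligned access looks like in practice, here is a Python analogue using the standard `struct` module, whose explicit byte-order prefixes disable alignment padding. The 13-byte record layout and the `parse_record` helper are invented for illustration; the comments note which of the proposed methods each call roughly corresponds to.

```python
import struct


def parse_record(buf):
    """Parse a hypothetical 13-byte record: tag, length, value."""
    tag = struct.unpack_from("<B", buf, 0)[0]      # like read-uint8(0)
    length = struct.unpack_from(">I", buf, 1)[0]   # like read-uint32(1, :big-endian)
    value = struct.unpack_from("<d", buf, 5)[0]    # like read-num64(5, :little-endian)
    return tag, length, value


# Build the record with the matching writes; offsets 1 and 5 are not
# multiples of the field sizes, so both multi-byte fields are unaligned.
record = bytearray(13)
struct.pack_into("<B", record, 0, 7)
struct.pack_into(">I", record, 1, 258)
struct.pack_into("<d", record, 5, 1.5)

fields = parse_record(record)
```

Packed wire formats routinely place fields at offsets like these, which is why a lowest-common-denominator API cannot insist on alignment.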

## Blob and Buf views

The view methods on Blob and Buf provide a view of a certain element
range of the Blob or Buf that they are called on.

* view(Blob:D: int $offset, int $elems --> BlobView)
* view(Buf:D: int $offset, int $elems --> BufView)

Here BlobView ~~ Blob and BufView ~~ Buf, allowing them to be passed to
code type-constrained on Blob or Buf. Note that BlobView is immutable, just
as the underlying Blob is.

These types are, like Blob and Buf themselves, parametric roles that are
parameterized on a sized integer type. It is possible to decode from a view,
so:

    say $foo.view($offset, $length).decode('utf-8');

would allow for decoding a string from a range of bytes without needing to do
a copying operation for those bytes, as is required today.
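
For comparison, Python's built-in `memoryview` offers the same pattern, and a short sketch with it may help make the intent concrete. This is an analogy, not the proposed Perl 6 API; the buffer contents and variable names are made up for the example.

```python
payload = "h\u00e9llo".encode("utf-8")   # 6 bytes: the accented letter takes two
data = bytearray(b"\x00\x06") + payload + b"\xff\xff"

offset, length = 2, len(payload)

# Slicing a memoryview yields another view onto the same buffer;
# no bytes are copied at this point.
view = memoryview(data)[offset:offset + length]

# Decode directly from the view. The resulting string is new storage,
# but no intermediate copy of the source bytes is requested.
text = str(view, "utf-8")
```

The decoded string still has to be allocated, so the saving is the intermediate byte copy that a `subbuf`-style slice would otherwise make.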