Current support for working with binary data in Perl 6 is less than awesome.
The problem goes down to the VM abstraction layer: the nqp:: op set is also
quite impoverished in this area. Binary data is often used in situations where
performance matters, so an interface that allows the VM to optimize well and
generate good code for binary data handling is also important.
- Define a means to provide a view into an array without having to make a copy
- Define an nqp::op API that can, from a buffer (either of the VMArray REPR or a 1-dimensional array using the MultiDimArray REPR), or a view of one, read and write:
  - Integers, signed and unsigned, of size 8-bit, 16-bit, 32-bit, and 64-bit, with handling of endian swapping
  - IEEE floating point numbers of size 32-bit and 64-bit, with handling of endian swapping
- Provide a Perl 6 low-level API proposal for working with binary data. Of course, higher-level things can be built atop it, but they need a more boring base API to build upon. Further, given the performance sensitivities, boring but relatively easy to optimize is perhaps more valuable anyway.
The ArrayView representation presents a view into either a VMArray,
MultiDimArray, or another ArrayView (this latter case will not build a
chain of objects, but just re-calculate the offsets and lengths). It
implements the positional REPR API, but all reads and writes are forwarded
to the underlying representation. (This means it's a mutable view.) Its uses
include:
- Being able to pass a chunk of binary data to parse off to another routine
  without having to pass offsets around (potentially Blob.subbuf could also
  use this, since it's immutable)
- To decode a string from part of a larger buffer without having to copy the source bytes making up the string
- Implementing partial views of multi-dimensional arrays
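The view semantics described above (no copy, writes forwarded to the underlying storage, and a view of a view still pointing straight at the original buffer) can be sketched outside Perl 6. What follows is only an analogy in Python, using the standard memoryview type over a bytearray. It is not the proposed ArrayView REPR, and the names buf, header, payload and inner are made up for illustration.

```python
# Analogy only: Python's memoryview standing in for the proposed ArrayView.
buf = bytearray(b"HEADERpayload")   # plays the role of the underlying VMArray

whole = memoryview(buf)
header = whole[0:6]                 # view of elements 0..5, no bytes copied
payload = whole[6:]                 # view of the remaining elements

# Writes through a view are forwarded to the underlying storage.
payload[0] = ord("P")
assert buf == bytearray(b"HEADERPayload")

# A view of a view still refers directly to the original buffer.
inner = payload[3:7]
assert inner.obj is buf
assert bytes(inner) == b"load"
```

A routine handed header or payload can parse it from index 0 without knowing where that chunk sits in the larger buffer, which is the first use case listed above.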
For these definitions, buffer refers to a concrete object with a REPR of
either VMArray or MultiDimArray, the latter being constrained to a single
dimension. (Note: dimensionality is a property of the type, meaning that type
specialization is already sufficient to optimize out both the REPR and shape
checks.) In either case, the array must be an 8-bit integer array (as a Perl 6
Blob or Buf will be). An ArrayView onto such an array is also allowed
for use with the binary data manipulation ops.
The following new nqp::const entries are defined for use with the new ops,
and specify sizes to use in reads and writes:
- BINARY_SIZE_8_BIT
- BINARY_SIZE_16_BIT
- BINARY_SIZE_32_BIT
- BINARY_SIZE_64_BIT
These nqp::const entries are defined for specifying the endianness of the data
to read or write:
- BINARY_ENDIAN_LITTLE
- BINARY_ENDIAN_BIG
Operations not configured with one of these options will assume native endian. Reading or writing little endian on a little endian machine will, of course, carry no transformation overhead.
Writes the signed integer $value at $offset into the buffer $target,
with the size and endianness specified by $flags.
Writes the unsigned integer $value at $offset into the buffer $target,
with the size and endianness specified by $flags.
Writes the floating point $value at $offset into the buffer $target,
with the size and endianness specified by $flags. Only 32-bit and 64-bit
sizes are supported.
Reads a signed integer at offset $offset from $source with size and
endianness specified by $flags. Returns that value, widened to a 64-bit
int.
Reads an unsigned integer at offset $offset from $source with size and
endianness specified by $flags. Returns that value, widened to a 64-bit
uint.
Reads a floating point number at offset $offset from $source with size and
endianness specified by $flags. Returns that value, widened to a 64-bit
num.
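To make the intended semantics of the write and read ops concrete, here is a rough Python model built on the standard struct module. The function names (write_int, read_int, read_uint, write_num, read_num) and the numeric flag values are hypothetical. Only the constant names come from the text above, and the proposal does not say how size and endianness flags are encoded or combined, so bitwise OR is assumed here.

```python
import struct

# Hypothetical flag encodings; the proposal names the constants but not their values.
BINARY_SIZE_8_BIT, BINARY_SIZE_16_BIT = 0, 1
BINARY_SIZE_32_BIT, BINARY_SIZE_64_BIT = 2, 3
BINARY_ENDIAN_LITTLE, BINARY_ENDIAN_BIG = 4, 8   # neither set: native endian

_SIGNED = "bhiq"      # struct codes for 8/16/32/64-bit signed integers
_UNSIGNED = "BHIQ"    # and their unsigned counterparts
_FLOAT = {BINARY_SIZE_32_BIT: "f", BINARY_SIZE_64_BIT: "d"}   # no 8/16-bit floats

def _order(flags):
    if flags & BINARY_ENDIAN_LITTLE:
        return "<"
    if flags & BINARY_ENDIAN_BIG:
        return ">"
    return "="        # native byte order, standard sizes, no alignment padding

def write_int(target, offset, value, flags):
    struct.pack_into(_order(flags) + _SIGNED[flags & 3], target, offset, value)

def read_int(source, offset, flags):
    # Sign-extends the stored value into a full-width integer.
    return struct.unpack_from(_order(flags) + _SIGNED[flags & 3], source, offset)[0]

def read_uint(source, offset, flags):
    # Zero-extends the stored value into a full-width integer.
    return struct.unpack_from(_order(flags) + _UNSIGNED[flags & 3], source, offset)[0]

def write_num(target, offset, value, flags):
    struct.pack_into(_order(flags) + _FLOAT[flags & 3], target, offset, value)

def read_num(source, offset, flags):
    return struct.unpack_from(_order(flags) + _FLOAT[flags & 3], source, offset)[0]

buf = bytearray(8)
write_int(buf, 1, -2, BINARY_SIZE_16_BIT | BINARY_ENDIAN_BIG)   # unaligned offset
```

Reading those two bytes back with read_int gives -2, while read_uint gives 65534. That sign-extension versus zero-extension difference is why the signed and unsigned reads are separate ops.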
Creates a 1-dimensional view of the $source buffer starting at
element $offset and spanning $length elements.
Provided $source has at least 2 dimensions, forms a view with 1 dimension
fewer, and where the index $idx will be prepended to the dimensions being
used to do a lookup into $source. This allows for an (n-1)-dimensional view
of an n-dimensional array.
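The "prepend $idx to the dimensions" rule can be illustrated with a small Python model. This is a sketch under assumptions, not the MultiDimArray implementation: row-major flat storage is assumed, and the class names NDArray and DimView and the method names at and put are invented for the example.

```python
# Analogy only: a row-major n-dimensional array and an (n-1)-dimensional view of it.
class NDArray:
    def __init__(self, dims, data):
        self.dims, self.data = tuple(dims), data   # data is flat, row-major

    def _flat(self, indices):
        if len(indices) != len(self.dims):
            raise IndexError("wrong number of indices")
        pos = 0
        for dim, i in zip(self.dims, indices):
            if not 0 <= i < dim:
                raise IndexError(indices)
            pos = pos * dim + i
        return pos

    def at(self, *indices):
        return self.data[self._flat(indices)]

    def put(self, value, *indices):
        self.data[self._flat(indices)] = value


class DimView:
    """View with one dimension fewer: idx is prepended on every lookup."""

    def __init__(self, source, idx):
        if len(source.dims) < 2:
            raise ValueError("source needs at least 2 dimensions")
        self.source, self.idx, self.dims = source, idx, source.dims[1:]

    def at(self, *indices):
        return self.source.at(self.idx, *indices)

    def put(self, value, *indices):
        self.source.put(value, self.idx, *indices)


grid = NDArray((2, 3), bytearray([1, 2, 3, 4, 5, 6]))
row = DimView(grid, 1)          # 1-dimensional view of the second row
row.put(9, 0)                   # forwarded to grid as index (1, 0)
```

After the last line, grid.at(1, 0) is 9 and row.at(2) still reads the original 6, since the view holds no data of its own.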
The following methods are provided on blob8 and buf8 to provide a
low-level API for reading sized integers and floating point numbers:
- read-int8(int $offset, Bool :$big-endian, Bool :$little-endian --> int)
- read-int16(int $offset, Bool :$big-endian, Bool :$little-endian --> int)
- read-int32(int $offset, Bool :$big-endian, Bool :$little-endian --> int)
- read-int64(int $offset, Bool :$big-endian, Bool :$little-endian --> int)
- read-uint8(int $offset, Bool :$big-endian, Bool :$little-endian --> uint)
- read-uint16(int $offset, Bool :$big-endian, Bool :$little-endian --> uint)
- read-uint32(int $offset, Bool :$big-endian, Bool :$little-endian --> uint)
- read-uint64(int $offset, Bool :$big-endian, Bool :$little-endian --> uint)
- read-num32(int $offset, Bool :$big-endian, Bool :$little-endian --> num)
- read-num64(int $offset, Bool :$big-endian, Bool :$little-endian --> num)
If no endianness is specified, native endian is assumed. $offset is the byte
offset to read the value from.
A matching set of write methods exists, taking the value to write:
- write-int8(int $offset, int $value, Bool :$big-endian, Bool :$little-endian --> Nil)
- write-int16(int $offset, int $value, Bool :$big-endian, Bool :$little-endian --> Nil)
- write-int32(int $offset, int $value, Bool :$big-endian, Bool :$little-endian --> Nil)
- write-int64(int $offset, int $value, Bool :$big-endian, Bool :$little-endian --> Nil)
- write-uint8(int $offset, uint $value, Bool :$big-endian, Bool :$little-endian --> Nil)
- write-uint16(int $offset, uint $value, Bool :$big-endian, Bool :$little-endian --> Nil)
- write-uint32(int $offset, uint $value, Bool :$big-endian, Bool :$little-endian --> Nil)
- write-uint64(int $offset, uint $value, Bool :$big-endian, Bool :$little-endian --> Nil)
- write-num32(int $offset, num $value, Bool :$big-endian, Bool :$little-endian --> Nil)
- write-num64(int $offset, num $value, Bool :$big-endian, Bool :$little-endian --> Nil)
These provide a lowest common denominator for dealing with integer and floating point number data when doing binary data processing. They allow unaligned reads and writes (with the usual caveats about efficiency).
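As a rough feel for how the named endianness flags and the native-endian default might behave, here is a Python stand-in for two of the methods. The class name Buf8 and the snake_case method names are hypothetical, and rejecting a call that sets both flags is an assumption, since the proposal does not say what should happen in that case.

```python
import struct
import sys

class Buf8(bytearray):
    """Hypothetical Python stand-in for buf8 with two of the proposed methods."""

    @staticmethod
    def _order(big_endian, little_endian):
        if big_endian and little_endian:
            # Assumption: the proposal leaves this case unspecified.
            raise ValueError("choose at most one endianness")
        return ">" if big_endian else "<" if little_endian else "="

    def read_uint32(self, offset, *, big_endian=False, little_endian=False):
        fmt = self._order(big_endian, little_endian) + "I"
        return struct.unpack_from(fmt, self, offset)[0]

    def write_uint32(self, offset, value, *, big_endian=False, little_endian=False):
        fmt = self._order(big_endian, little_endian) + "I"
        struct.pack_into(fmt, self, offset, value)

buf = Buf8(7)
buf.write_uint32(3, 0xDEADBEEF, big_endian=True)   # offset 3 is unaligned

# With no flag given, the read uses whatever byte order the host has.
native = buf.read_uint32(3)
explicit = buf.read_uint32(3, big_endian=(sys.byteorder == "big"),
                           little_endian=(sys.byteorder == "little"))
assert native == explicit
```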
The view methods on Blob and Buf provide a view of a certain element
range of the Blob or Buf that they are called on.
- view(Blob:D: int $offset, int $elems --> BlobView)
- view(Buf:D: int $offset, int $elems --> BufView)
Here BlobView ~~ Blob and BufView ~~ Buf, so views can be passed to
code type-constrained on Blob or Buf. Note that BlobView is immutable, as
the underlying Blob is.
These types are, like Blob and Buf themselves, parametric roles that are
parameterized on a sized integer type. It is possible to decode from a view,
so:
say $foo.view($offset, $length).decode('utf-8');
Would allow for decoding a string from a range of bytes without needing to do a copying operation for those bytes, as is required today.
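The same idea, decoding text directly from a window onto a larger buffer, can be tried today in Python as an analogy. The record layout below (a two-byte prefix, five payload bytes, two trailing bytes) is invented purely for the example.

```python
data = bytearray(b"\x00\x05hello\xff\xff")   # hypothetical length-prefixed record
view = memoryview(data)[2:7]                  # the 5 payload bytes, not copied
text = str(view, "utf-8")                     # decode straight from the view
```

Only the resulting string is newly allocated. The bytes being decoded are read in place from data, including the invalid UTF-8 bytes that sit just outside the view and are never touched.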
I like it, I have a few comments though:
- ArrayView is going to end up polymorphic. In which case, we'll want to inline access.
- ArrayView (which is basically equivalent to golang's slice, minus the mutable operations) that you may want to have a different object for it.