Current support for working with binary data in Perl 6 is less than awesome.
The problem goes down to the VM abstraction layer: the nqp:: op set is also
quite impoverished in this area. Binary data is often used in situations where
performance matters, so an interface that allows the VM to optimize well and
generate good code for binary data handling is also important.
- Define a means to provide a view into an array without having to make a copy
- Define an nqp:: op API that can, from a buffer (either of the VMArray REPR or a 1-dimensional array using the MultiDimArray REPR), or a view of one, read and write:
  - Integers, signed and unsigned, of size 8-bit, 16-bit, 32-bit, and 64-bit, with handling of endian swapping
  - IEEE floating point numbers of size 32-bit and 64-bit, with handling of endian swapping
- Provide a Perl 6 low-level API proposal for working with binary data. Of course, higher-level things can be built atop it, but they need a more boring base API to build upon. Further, given the performance sensitivities, boring but relatively easy to optimize is perhaps more valuable anyway.
The ArrayView representation presents a view into either a VMArray, MultiDimArray, or another ArrayView (this latter case will not build a chain of objects, but just re-calculate the offsets and lengths). It implements the positional REPR API, but all reads and writes are forwarded to the underlying representation. (This means it's a mutable view.) Its uses include:
- Being able to pass a chunk of binary data to parse off to another routine without having to pass offsets around (potentially Blob.subbuf could also use this, since it's immutable)
- To decode a string from part of a larger buffer without having to copy the source bytes making up the string
- Implementing partial views of multi-dimensional arrays
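The non-chaining behaviour described above can be modelled outside the VM. The following Python sketch is illustrative only: the class layout and the attribute names `target`, `offset`, and `length` are assumptions, not part of the proposal. It shows a mutable view that forwards reads and writes, and a view of a view that folds its offset into the parent's rather than holding a reference to it.

```python
class ArrayView:
    """Hypothetical model of the proposed ArrayView: a window onto a buffer."""

    def __init__(self, target, offset, length):
        if offset < 0 or length < 0 or offset + length > len(target):
            raise IndexError("view extends past the end of its target")
        if isinstance(target, ArrayView):
            # A view of a view does not chain: fold the offsets together and
            # point straight at the underlying storage.
            offset += target.offset
            target = target.target
        self.target = target
        self.offset = offset
        self.length = length

    def __len__(self):
        return self.length

    def _index(self, i):
        if not 0 <= i < self.length:
            raise IndexError("index out of range for view")
        return self.offset + i

    def __getitem__(self, i):
        # Reads are forwarded to the underlying storage.
        return self.target[self._index(i)]

    def __setitem__(self, i, value):
        # Writes are forwarded too, which is what makes the view mutable.
        self.target[self._index(i)] = value


storage = bytearray(range(10))
outer = ArrayView(storage, 2, 6)   # covers storage[2:8]
inner = ArrayView(outer, 1, 3)     # covers storage[3:6], no reference to outer
inner[0] = 99                      # lands in storage[3]
```

Because `inner` refers directly to `storage`, an access through it costs one offset addition no matter how many views it was derived from.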
For these definitions, buffer refers to a concrete object with a REPR of either VMArray or MultiDimArray, the latter being constrained to a single dimension. (Note: dimensionality is a property of the type, meaning that type specialization is already sufficient to optimize out both the REPR and shape checks.) In either case, the array must be an 8-bit integer array (as a Perl 6 Blob or Buf will be). An ArrayView onto such an array is also allowed for use with the binary data manipulation ops.
The following new nqp::const entries are defined for use with the new ops, and specify sizes to use in reads and writes:
BINARY_SIZE_8_BIT
BINARY_SIZE_16_BIT
BINARY_SIZE_32_BIT
BINARY_SIZE_64_BIT
These nqp::const entries are defined for specifying the endianness of the data to read or write:
BINARY_ENDIAN_LITTLE
BINARY_ENDIAN_BIG
Operations not configured with one of these options will assume native endian. Reading or writing little endian on a little endian machine will, of course, carry no transformation overhead.
Writes the signed integer $value at $offset into the buffer $target, with the size and endianness specified by $flags.
Writes the unsigned integer $value at $offset into the buffer $target, with the size and endianness specified by $flags.
Writes the floating point $value at $offset into the buffer $target, with the size and endianness specified by $flags. Only 32-bit and 64-bit sizes are supported.
Reads a signed integer at offset $offset from $source with size and endianness specified by $flags. Returns that value, widened to a 64-bit int.
Reads an unsigned integer at offset $offset from $source with size and endianness specified by $flags. Returns that value, widened to a 64-bit uint.
Reads a floating point number at offset $offset from $source with size and endianness specified by $flags. Returns that value, widened to a 64-bit num.
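To make the semantics of the six read and write operations concrete, here is a rough Python model built on the standard `struct` module. Several things in it are assumptions made for illustration: the function names (`write_int`, `read_uint`, and so on), the numeric values given to the constants, and the idea that one size constant and one endianness constant are OR-ed together to form `$flags`. The proposal names the constants but does not specify any of this.

```python
import struct

# Hypothetical values: the proposal names these constants but assigns no numbers.
BINARY_ENDIAN_LITTLE = 1
BINARY_ENDIAN_BIG = 2
BINARY_SIZE_8_BIT = 0
BINARY_SIZE_16_BIT = 4
BINARY_SIZE_32_BIT = 8
BINARY_SIZE_64_BIT = 12

_ENDIAN_MASK = 3
_SIZE_MASK = 12
_INT_CODES = {BINARY_SIZE_8_BIT: "b", BINARY_SIZE_16_BIT: "h",
              BINARY_SIZE_32_BIT: "i", BINARY_SIZE_64_BIT: "q"}
_NUM_CODES = {BINARY_SIZE_32_BIT: "f", BINARY_SIZE_64_BIT: "d"}


def _format(flags, codes, unsigned=False):
    """Translate a flags value into a struct format string."""
    endian = flags & _ENDIAN_MASK
    if endian == _ENDIAN_MASK:
        raise ValueError("both endianness flags set")
    # No endianness flag means native byte order ("=" in struct terms).
    order = {0: "=", BINARY_ENDIAN_LITTLE: "<", BINARY_ENDIAN_BIG: ">"}[endian]
    size = flags & _SIZE_MASK
    if size not in codes:
        raise ValueError("size not supported by this operation")
    code = codes[size]
    return order + (code.upper() if unsigned else code)


def write_int(target, offset, value, flags):
    struct.pack_into(_format(flags, _INT_CODES), target, offset, value)


def write_uint(target, offset, value, flags):
    struct.pack_into(_format(flags, _INT_CODES, unsigned=True), target, offset, value)


def write_num(target, offset, value, flags):
    struct.pack_into(_format(flags, _NUM_CODES), target, offset, value)


def read_int(source, offset, flags):
    return struct.unpack_from(_format(flags, _INT_CODES), source, offset)[0]


def read_uint(source, offset, flags):
    return struct.unpack_from(_format(flags, _INT_CODES, unsigned=True), source, offset)[0]


def read_num(source, offset, flags):
    return struct.unpack_from(_format(flags, _NUM_CODES), source, offset)[0]


buf = bytearray(8)
write_int(buf, 1, -2, BINARY_SIZE_16_BIT | BINARY_ENDIAN_BIG)
signed = read_int(buf, 1, BINARY_SIZE_16_BIT | BINARY_ENDIAN_BIG)
unsigned = read_uint(buf, 1, BINARY_SIZE_16_BIT | BINARY_ENDIAN_BIG)
```

Reading the same two bytes as signed and as unsigned gives different values, which is why the signed and unsigned reads are separate operations. Python integers are arbitrary precision, so the widening to 64 bits described above has no visible counterpart in this model.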
Creates a 1-dimensional view of the $source buffer starting at element $offset and spanning $length elements.
Provided $source has at least 2 dimensions, forms a view with 1 dimension fewer, where the index $idx is prepended to the indices used to do a lookup into $source. This allows for an (n - 1)-dimensional view of an n-dimensional array.
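The index-prepending rule can be sketched in a few lines of Python. The classes below are hypothetical stand-ins: `MultiDimArray` here is a simple row-major model rather than the actual REPR, and `DimensionView`, `get`, and `set` are names invented for the example. For brevity, a view of a view delegates to its parent instead of collapsing the way ArrayView is described as doing.

```python
class MultiDimArray:
    """Toy row-major n-dimensional array, standing in for the real REPR."""

    def __init__(self, dims):
        self.dims = tuple(dims)
        total = 1
        for d in self.dims:
            total *= d
        self.storage = [0] * total

    def _flat(self, indices):
        if len(indices) != len(self.dims):
            raise IndexError("wrong number of indices")
        flat = 0
        for i, d in zip(indices, self.dims):
            if not 0 <= i < d:
                raise IndexError("index out of range")
            flat = flat * d + i
        return flat

    def get(self, indices):
        return self.storage[self._flat(indices)]

    def set(self, indices, value):
        self.storage[self._flat(indices)] = value


class DimensionView:
    """View with one dimension fewer: idx is prepended on every lookup."""

    def __init__(self, source, idx):
        if len(source.dims) < 2:
            raise ValueError("source needs at least 2 dimensions")
        if not 0 <= idx < source.dims[0]:
            raise IndexError("index out of range")
        self.source = source
        self.idx = idx
        self.dims = source.dims[1:]

    def get(self, indices):
        return self.source.get((self.idx,) + tuple(indices))

    def set(self, indices, value):
        self.source.set((self.idx,) + tuple(indices), value)


cube = MultiDimArray((2, 3, 4))
plane = DimensionView(cube, 1)     # 3 x 4 view of cube[1]
line = DimensionView(plane, 2)     # 4-element view of cube[1][2]
line.set((3,), 5)                  # same element as cube[1][2][3]
```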
The following methods are provided on blob8 and buf8 to provide a low-level API for reading sized integers and floating point numbers:
read-int8(int $offset, Bool :$big-endian, Bool :$little-endian --> int)
read-int16(int $offset, Bool :$big-endian, Bool :$little-endian --> int)
read-int32(int $offset, Bool :$big-endian, Bool :$little-endian --> int)
read-int64(int $offset, Bool :$big-endian, Bool :$little-endian --> int)
read-uint8(int $offset, Bool :$big-endian, Bool :$little-endian --> uint)
read-uint16(int $offset, Bool :$big-endian, Bool :$little-endian --> uint)
read-uint32(int $offset, Bool :$big-endian, Bool :$little-endian --> uint)
read-uint64(int $offset, Bool :$big-endian, Bool :$little-endian --> uint)
read-num32(int $offset, Bool :$big-endian, Bool :$little-endian --> num)
read-num64(int $offset, Bool :$big-endian, Bool :$little-endian --> num)
Failing to specify endianness implies native endian. The offset, in bytes, is the position to read a value from.
A matching set of write methods exists, taking the value to write:
write-int8(int $offset, int $value, Bool :$big-endian, Bool :$little-endian --> Nil)
write-int16(int $offset, int $value, Bool :$big-endian, Bool :$little-endian --> Nil)
write-int32(int $offset, int $value, Bool :$big-endian, Bool :$little-endian --> Nil)
write-int64(int $offset, int $value, Bool :$big-endian, Bool :$little-endian --> Nil)
write-uint8(int $offset, uint $value, Bool :$big-endian, Bool :$little-endian --> Nil)
write-uint16(int $offset, uint $value, Bool :$big-endian, Bool :$little-endian --> Nil)
write-uint32(int $offset, uint $value, Bool :$big-endian, Bool :$little-endian --> Nil)
write-uint64(int $offset, uint $value, Bool :$big-endian, Bool :$little-endian --> Nil)
write-num32(int $offset, num $value, Bool :$big-endian, Bool :$little-endian --> Nil)
write-num64(int $offset, num $value, Bool :$big-endian, Bool :$little-endian --> Nil)
These provide a lowest common denominator for dealing with integer and floating point number data when doing binary data processing. They allow unaligned reads and writes (with the usual caveats about efficiency).
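As a rough illustration of how the two named arguments might resolve, the Python sketch below models one read and write pair with `int.from_bytes` and `int.to_bytes`. The function names mirror `read-int16` and `write-int16`, but the sketch is not the proposed implementation, and rejecting a call that passes both `big_endian` and `little_endian` is an assumption, since the text does not say what should happen in that case.

```python
import sys


def _byteorder(big_endian, little_endian):
    if big_endian and little_endian:
        # Assumption: the proposal does not define this combination.
        raise ValueError("specify at most one endianness")
    if big_endian:
        return "big"
    if little_endian:
        return "little"
    return sys.byteorder  # neither given: native endian


def _check(buf, offset, size):
    if offset < 0 or offset + size > len(buf):
        raise IndexError("access outside the buffer")


def read_int16(buf, offset, *, big_endian=False, little_endian=False):
    _check(buf, offset, 2)
    order = _byteorder(big_endian, little_endian)
    return int.from_bytes(buf[offset:offset + 2], order, signed=True)


def write_int16(buf, offset, value, *, big_endian=False, little_endian=False):
    _check(buf, offset, 2)
    order = _byteorder(big_endian, little_endian)
    buf[offset:offset + 2] = value.to_bytes(2, order, signed=True)


buf = bytearray(5)
write_int16(buf, 1, -300, big_endian=True)       # odd offset: an unaligned write
as_big = read_int16(buf, 1, big_endian=True)
as_little = read_int16(buf, 1, little_endian=True)
```

The offset of 1 is deliberately unaligned for a 16-bit value; the byte-wise approach used here works at any offset, at the cost of the efficiency caveat noted above.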
The view methods on Blob and Buf provide a view of a certain element range of the Blob or Buf that they are called on.
view(Blob:D: int $offset, int $elems --> BlobView)
view(Buf:D: int $offset, int $elems --> BufView)
Where BlobView ~~ Blob and BufView ~~ Buf, allowing them to be passed to code type-constrained on Blob or Buf. Note that BlobView is immutable, as is the underlying Blob.
These types are, like Blob and Buf themselves, parametric roles that are parameterized on a sized integer type. It is possible to decode from a view, so:
say $foo.view($offset, $length).decode('utf-8');
Would allow for decoding a string from a range of bytes without needing to do a copying operation for those bytes, as is required today.
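Python's built-in `memoryview` behaves much like the proposed views, so it can serve as an analogy for the copy-free decode. The snippet is only an analogy: the length-prefixed layout of `data` is made up for the example, and nothing here is Perl 6 API.

```python
# Made-up layout: one tag byte, one length byte, then that many UTF-8 bytes.
data = bytearray(b"\x00\x05hello\xff")
length = data[1]

view = memoryview(data)[2:2 + length]   # a window onto data, no bytes copied
text = str(view, "utf-8")               # decode directly from the view

# The view is live: a change to the underlying buffer shows through it.
data[2] = ord("j")
changed = str(view, "utf-8")

# A view over immutable bytes is itself read-only, in the way BlobView
# is described as immutable because the underlying Blob is.
frozen = memoryview(bytes(data))[2:2 + length]
```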
I like it, I have a few comments though:
- ArrayView is going to end up polymorphic. In which case, we'll want to inline access.
- ArrayView (which is basically equivalent to Golang's slice, minus the mutable operations) that you may want to have a different object for it.