@jnthn /binary.md Secret
Created Aug 5, 2018

MoarVM and Perl 6 binary data handling draft proposal

Binary handling primitives in MoarVM and Perl 6

Current support for working with binary data in Perl 6 is less than awesome. The problem goes down to the VM abstraction layer: the nqp:: op set is also quite impoverished in this area. Binary data is often used in situations where performance matters, so an interface that allows the VM to optimize well and generate good code for binary data handling is also important.

Goals of this proposal

  • Define a means to provide a view into an array without having to make a copy
  • Define an nqp:: op API that can, from a buffer (either of the VMArray REPR or a 1-dimensional array using the MultiDimArray REPR), or a view of one, read and write:
    • Integers, signed and unsigned, of size 8-bit, 16-bit, 32-bit, and 64-bit, with handling of endian swapping
    • IEEE floating point numbers of size 32-bit and 64-bit, with handling of endian swapping
  • Provide a Perl 6 low-level API proposal for working with binary data. Of course, higher-level things can be built atop it, but they need a more boring base API to build upon. Further, given the performance sensitivities, boring but relatively easy to optimize is perhaps more valuable anyway.

The new ArrayView representation

The ArrayView representation presents a view into either a VMArray, MultiDimArray, or another ArrayView (this latter case will not build a chain of objects, but just re-calculate the offsets and lengths). It implements the positional REPR API, but all reads and writes are forwarded to the underlying representation. (This means it's a mutable view.) Its uses include:

  • Being able to pass a chunk of binary data to parse off to another routine without having to pass offsets around (potentially Blob.subbuf could also use this, since it's immutable)
  • To decode a string from part of a larger buffer without having to copy the source bytes making up the string
  • Implementing partial views of multi-dimensional arrays
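
As a rough analogy only, Python's `memoryview` (not the proposed ArrayView REPR) shows the two properties described above: writes through a view reach the underlying array, and a view of a view refers directly to the original storage rather than to the intermediate view. The variable names are illustrative.

```python
# Analogy only: Python's memoryview stands in for the proposed ArrayView.
backing = bytearray(range(16))      # plays the role of an 8-bit VMArray

view = memoryview(backing)[4:12]    # elements 4..11 of `backing`, no copy made
view[0] = 0xFF                      # the write is forwarded to the backing array

# A view of a view: the offset is re-calculated against the original buffer,
# so `inner` covers elements 6..9 of `backing`.
inner = view[2:6]
inner[0] = 0xEE

print(backing[4], backing[6])       # both writes are visible in `backing`
print(inner.obj is backing)         # the inner view refers to the original buffer
```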

The nqp::op binary data API extensions

For these definitions, buffer refers to a concrete object with a REPR of either VMArray or MultiDimArray, the latter being constrained to a single dimension. (Note: dimensionality is a property of the type, meaning that type specialization is already sufficient to optimize out both the REPR and shape checks.) In either case, the array must be an 8-bit integer array (as a Perl 6 Blob or Buf will be). An ArrayView onto such an array is also allowed for use with the binary data manipulation ops.

Constants

The following new nqp::const entries are defined for use with the new ops, and specify sizes to use in reads and writes:

  • BINARY_SIZE_8_BIT
  • BINARY_SIZE_16_BIT
  • BINARY_SIZE_32_BIT
  • BINARY_SIZE_64_BIT

These nqp::const entries are defined for specifying the endianness of the data to read or write:

  • BINARY_ENDIAN_LITTLE
  • BINARY_ENDIAN_BIG

Operations not configured with one of these options will assume native endian. Reading or writing little endian on a little endian machine will, of course, carry no transformation overhead.

nqp::writeint(buffer $target, int $offset, int $value, int $flags)

Writes the signed integer $value at $offset into the buffer $target, with the size and endianness specified by $flags.

nqp::writeuint(buffer $target, int $offset, uint $value, int $flags)

Writes the unsigned integer $value at $offset into the buffer $target, with the size and endianness specified by $flags.

nqp::writenum(buffer $target, int $offset, num $value, int $flags)

Writes the floating point $value at $offset into the buffer $target, with the size and endianness specified by $flags. Only 32-bit and 64-bit sizes are supported.

nqp::readint(buffer $source, int $offset, int $flags --> int)

Reads a signed integer at offset $offset from $source with size and endianness specified by $flags. Returns that value, widened to a 64-bit int.

nqp::readuint(buffer $source, int $offset, int $flags --> uint)

Reads an unsigned integer at offset $offset from $source with size and endianness specified by $flags. Returns that value, widened to a 64-bit uint.

nqp::readnum(buffer $source, int $offset, int $flags --> num)

Reads a floating point number at offset $offset from $source with size and endianness specified by $flags. Returns that value, widened to a 64-bit num.
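
To make the intended semantics of the six ops concrete, here is a minimal Python model built on the standard `struct` module. It is a sketch under stated assumptions, not the MoarVM implementation: the numeric values of the constants and the idea of combining size and endianness with bitwise OR are guesses, since the proposal only names the constants. Note also that `struct` raises on out-of-range values, whereas the proposal does not say what the ops should do in that case.

```python
import struct
import sys

# Hypothetical flag values: the proposal names these constants but does not
# assign numbers or say how they combine; bitwise OR is assumed here.
BINARY_SIZE_8_BIT, BINARY_SIZE_16_BIT = 0x00, 0x01
BINARY_SIZE_32_BIT, BINARY_SIZE_64_BIT = 0x02, 0x03
BINARY_ENDIAN_LITTLE, BINARY_ENDIAN_BIG = 0x10, 0x20

_INT_CODES = {BINARY_SIZE_8_BIT: "b", BINARY_SIZE_16_BIT: "h",
              BINARY_SIZE_32_BIT: "i", BINARY_SIZE_64_BIT: "q"}
_NUM_CODES = {BINARY_SIZE_32_BIT: "f", BINARY_SIZE_64_BIT: "d"}

def _order(flags):
    if flags & BINARY_ENDIAN_BIG:
        return ">"
    if flags & BINARY_ENDIAN_LITTLE:
        return "<"
    return "<" if sys.byteorder == "little" else ">"   # native endian default

def _fmt(flags, codes, unsigned=False):
    code = codes[flags & 0x0F]          # KeyError for an unsupported size
    return _order(flags) + (code.upper() if unsigned else code)

def writeint(target, offset, value, flags):
    struct.pack_into(_fmt(flags, _INT_CODES), target, offset, value)

def writeuint(target, offset, value, flags):
    struct.pack_into(_fmt(flags, _INT_CODES, unsigned=True), target, offset, value)

def writenum(target, offset, value, flags):
    struct.pack_into(_fmt(flags, _NUM_CODES), target, offset, value)

def readint(source, offset, flags):
    # Python ints are unbounded, so "widening" is implicit; the sign is kept.
    return struct.unpack_from(_fmt(flags, _INT_CODES), source, offset)[0]

def readuint(source, offset, flags):
    return struct.unpack_from(_fmt(flags, _INT_CODES, unsigned=True), source, offset)[0]

def readnum(source, offset, flags):
    return struct.unpack_from(_fmt(flags, _NUM_CODES), source, offset)[0]

buf = bytearray(8)
writeint(buf, 0, -2, BINARY_SIZE_16_BIT | BINARY_ENDIAN_BIG)
print(buf[:2].hex())                                                # bytes ff fe
print(readint(buf, 0, BINARY_SIZE_16_BIT | BINARY_ENDIAN_BIG))      # sign-extended
print(readuint(buf, 0, BINARY_SIZE_16_BIT | BINARY_ENDIAN_BIG))     # zero-extended
```

The same two bytes read back as -2 through readint and as 65534 through readuint, which is the difference between sign extension and zero extension when widening to 64 bits.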

The nqp::op view extensions

nqp::view(buffer $source, int $offset, int $length --> arrayview)

Creates a 1-dimensional view of the $source buffer, starting at element $offset and spanning $length elements.

nqp::viewdim(buffer $source, int $idx --> arrayview)

Provided $source has at least 2 dimensions, forms a view with 1 dimension fewer, in which the index $idx is prepended to the indices used for any lookup into $source. This allows for an (n - 1)-dimensional view of an n-dimensional array.
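
The index-prepending behaviour can be sketched with a toy Python model. Everything here is hypothetical (the `DimView` class, row-major `flat` storage, the `at` and `set` helpers); it only illustrates how a fixed leading index turns an n-dimensional lookup into an (n - 1)-dimensional one without copying, and how a view of a view keeps pointing at the same storage.

```python
class DimView:
    """Toy model of viewdim: fixes leading indices of an n-dimensional array.

    `flat` is row-major storage and `shape` its dimensions; both names are
    illustrative and not part of the proposal.
    """
    def __init__(self, flat, shape, prefix=()):
        self.flat, self.shape, self.prefix = flat, shape, prefix

    def viewdim(self, idx):
        if len(self.shape) - len(self.prefix) < 2:
            raise ValueError("source needs at least 2 remaining dimensions")
        # Same storage, longer prefix: no chain of view objects is built.
        return DimView(self.flat, self.shape, self.prefix + (idx,))

    def _pos(self, indices):
        full = self.prefix + tuple(indices)   # prepend the fixed indices
        if len(full) != len(self.shape):
            raise IndexError("wrong number of indices")
        pos = 0
        for i, dim in zip(full, self.shape):
            if not 0 <= i < dim:
                raise IndexError("index out of range")
            pos = pos * dim + i
        return pos

    def at(self, *indices):
        return self.flat[self._pos(indices)]

    def set(self, value, *indices):
        self.flat[self._pos(indices)] = value


cube = DimView(list(range(24)), (2, 3, 4))   # a 2 x 3 x 4 array
plane = cube.viewdim(1)                      # 3 x 4 view
row = plane.viewdim(2)                       # 1-dimensional view of 4 elements
row.set(-1, 0)                               # mutates the shared storage
print(plane.at(2, 3), cube.at(1, 2, 0))
```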

New blob8/buf8 methods in Perl 6

The following methods are provided on blob8 and buf8 to provide a low-level API for reading sized integers and floating point numbers:

  • read-int8(int $offset, Bool :$big-endian, Bool :$little-endian --> int)
  • read-int16(int $offset, Bool :$big-endian, Bool :$little-endian --> int)
  • read-int32(int $offset, Bool :$big-endian, Bool :$little-endian --> int)
  • read-int64(int $offset, Bool :$big-endian, Bool :$little-endian --> int)
  • read-uint8(int $offset, Bool :$big-endian, Bool :$little-endian --> uint)
  • read-uint16(int $offset, Bool :$big-endian, Bool :$little-endian --> uint)
  • read-uint32(int $offset, Bool :$big-endian, Bool :$little-endian --> uint)
  • read-uint64(int $offset, Bool :$big-endian, Bool :$little-endian --> uint)
  • read-num32(int $offset, Bool :$big-endian, Bool :$little-endian --> num)
  • read-num64(int $offset, Bool :$big-endian, Bool :$little-endian --> num)

If no endianness is specified, native endianness is used. $offset is the byte offset at which the value is read.

A matching set of write methods exists, taking the value to write:

  • write-int8(int $offset, int $value, Bool :$big-endian, Bool :$little-endian --> Nil)
  • write-int16(int $offset, int $value, Bool :$big-endian, Bool :$little-endian --> Nil)
  • write-int32(int $offset, int $value, Bool :$big-endian, Bool :$little-endian --> Nil)
  • write-int64(int $offset, int $value, Bool :$big-endian, Bool :$little-endian --> Nil)
  • write-uint8(int $offset, uint $value, Bool :$big-endian, Bool :$little-endian --> Nil)
  • write-uint16(int $offset, uint $value, Bool :$big-endian, Bool :$little-endian --> Nil)
  • write-uint32(int $offset, uint $value, Bool :$big-endian, Bool :$little-endian --> Nil)
  • write-uint64(int $offset, uint $value, Bool :$big-endian, Bool :$little-endian --> Nil)
  • write-num32(int $offset, num $value, Bool :$big-endian, Bool :$little-endian --> Nil)
  • write-num64(int $offset, num $value, Bool :$big-endian, Bool :$little-endian --> Nil)

These provide a lowest common denominator for dealing with integer and floating point number data when doing binary data processing. They allow unaligned reads and writes (with the usual caveats about efficiency).
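
The following Python sketch mirrors the shape of two of these methods to show the native-endian default and an unaligned access. It is an illustration only: the function names are Python stand-ins for the proposed methods, and raising an error when both endianness flags are passed is an assumption, since the proposal does not say what should happen.

```python
import struct
import sys

def _prefix(big_endian, little_endian):
    # Rejecting both flags at once is an assumption; the proposal is silent.
    if big_endian and little_endian:
        raise ValueError("specify at most one endianness")
    if big_endian:
        return ">"
    if little_endian:
        return "<"
    # No flag: native byte order. "<" or ">" (rather than "@") keeps the
    # access unpadded, so any byte offset is accepted.
    return "<" if sys.byteorder == "little" else ">"

def read_uint32(buf, offset, *, big_endian=False, little_endian=False):
    return struct.unpack_from(_prefix(big_endian, little_endian) + "I", buf, offset)[0]

def write_uint32(buf, offset, value, *, big_endian=False, little_endian=False):
    struct.pack_into(_prefix(big_endian, little_endian) + "I", buf, offset, value)

def read_num64(buf, offset, *, big_endian=False, little_endian=False):
    return struct.unpack_from(_prefix(big_endian, little_endian) + "d", buf, offset)[0]

def write_num64(buf, offset, value, *, big_endian=False, little_endian=False):
    struct.pack_into(_prefix(big_endian, little_endian) + "d", buf, offset, value)

# A 13-byte record: 1-byte tag, 32-bit length at offset 1, 64-bit float at offset 5.
record = bytearray(13)
record[0] = 0x7F
write_uint32(record, 1, 0xDEADBEEF, big_endian=True)    # unaligned write
write_num64(record, 5, 2.5, little_endian=True)

print(hex(read_uint32(record, 1, big_endian=True)))
print(hex(read_uint32(record, 1, little_endian=True)))  # same bytes, swapped order
print(read_num64(record, 5, little_endian=True))
```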

Blob and Buf views

The view methods on Blob and Buf provide a view of a certain element range of the Blob or Buf that they are called on.

  • view(Blob:D: int $offset, int $elems --> BlobView)
  • view(Buf:D: int $offset, int $elems --> BufView)

Here BlobView ~~ Blob and BufView ~~ Buf, allowing views to be passed to code type-constrained on Blob or Buf. Note that BlobView is immutable, as is the underlying Blob.

These types are, like Blob and Buf themselves, parametric roles that are parameterized on a sized integer type. It is possible to decode from a view, so:

say $foo.view($offset, $length).decode('utf-8');

This would allow decoding a string from a range of bytes without copying those bytes, as is required today.
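
For comparison, the same idea in Python, again using `memoryview` purely as a stand-in for the proposed view types: the slice of the view shares storage with the buffer, and the decoder reads straight from it. The sample data and offsets are made up for the illustration.

```python
data = bytearray(b"HDR:h\xc3\xa9llo;TAIL")
offset, length = 4, 6                    # the UTF-8 bytes of "h\u00e9llo"

# Slicing a memoryview shares storage with `data`; no bytes are copied here.
view = memoryview(data)[offset:offset + length]

# str() accepts any bytes-like object, so the view is decoded directly.
text = str(view, "utf-8")
print(text)
```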

bdw Aug 5, 2018

I like it, I have a few comments though:

  • I think ArrayView is going to end up polymorphic. In which case, we'll want to inline access.
  • I don't think you've specified it, but I'm assuming that we'll want to do range checking. That means that, to make this perform well, we'll need a way to analyze the range checking away. That will be a nice challenge.
  • I think the notion of an arrayview viewing a dimension of a multi-dimensional source is nice, but I also think that it may be sufficiently distinct from the other use of an ArrayView (which is basically equivalent to Go's slice, minus the mutable operations) that you may want to have a different object for it.

bbkr Aug 11, 2018

Hooray! I'm glad that we will finally be able to write faster pure Perl drivers and decoders.

I'd like to add a few ideas:

  1. Fix decode() naming. It's too generic and won't be consistent with proposed API. It should be renamed to read-string() or read-str() while old name should be marked as deprecated.
  2. Proposed API is too limited - for example decoding 24-bit or 128-bit or any arbitrary sized integers common in many binary formats will still be painfully slow in high level code. String decoding can work on arbitrary buffer sizes so users should get complementary read-int( :is-signed ) --> Int method that can slurp everything in buffer. Or size param in read method. Anything that will replicate pack/unpack freedom and usability.
  3. I don't understand why there are 2 mechanisms to do the same stuff. What's the difference between $buf[10..19].decode() and $buf.view(10,10).decode()? Direct indexing is more natural.
  4. The proposed read methods are inconsistent with rest of P6 behavior. IO::Handle.read() causes cursor to move, Buf.read-int32() does not. Maybe that's the key to simplify this API a bit? Maybe Buf should track position by default and get seek() method? No views. No offsets in read-*() methods. Instead of them simple and intuitive array operations: my $buf = Buf.new(^9); say $buf.read-int32(); say $buf.read-str(2); say $buf.read-int32(); say $buf.eof?
  5. I'd love to get shortcuts to IO::Handle. So $handle.read(4).read-int32() can simply become $handle.read-int32(). This read.read sequence does not look pretty and users will be forced to use it a lot.
