@Juerd
Last active April 25, 2016 20:30
RFC: A more Perl6-esque "unpack"
================================
This is an idea for an "unpack" replacement. The basic reasoning behind it is
that number encodings and string encodings needn't be treated all that
differently. Instead of passing the name of a string encoding, you can pass
a native type object. When decoding things of determinable length, any number
of types can be given.
A variable-length thing without a length indication can only appear at the
end of the template.
Decode according to a template:
$blob.decode( [ ... ] )
Decode a string:
my $s = $blob.decode("utf8")
# actually short for: $blob.decode([ ::Inf => "utf8" ])
Decode a natively encoded numeric value:
my $i = $blob.decode(uint16);
Decode a natively encoded numeric value, and a string:
my ($n, $s) = $blob.decode([ num, "latin1" ]);
This doesn't work:
my ($s, $i) = $blob.decode([ "latin1", uint16 ]); # FAILS
# Can't determine string length!
Force endianness for a single value:
my $i = $blob.decode([ :big(uint32) ]);
Set default endianness for the rest of the template:
my @i = $blob.decode([ :big, uint32, uint16, uint8 ]);
Decode two byte-length-prefixed blobs:
my ($blob1, $blob2) = $blob.decode([ ::uint32 => Blob, ::uint32 => Blob ]);
or:
my ($blob1, $blob2) = $blob.decode([ (::uint32 => Blob) xx 2 ]);
Decode any number of byte-length-prefixed blobs:
my @blobs = $blob.decode([ ::Inf => [ ::uint32 => Blob ] ]);
Decode any number of byte-length-prefixed strings:
my @strings = $blob.decode([ ::Inf => [ ::uint32 => "Windows-1252" ] ]);
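For comparison, here is a rough sketch of what the "any number of
byte-length-prefixed blobs" template would have to do internally, written
with Blob methods that a current Rakudo provides (read-uint32, subbuf). The
helper name split-length-prefixed and the little-endian default are
assumptions made for this illustration; neither is part of the proposal.

```raku
# Hand-rolled equivalent of the proposed [ ::Inf => [ ::uint32 => Blob ] ].
# split-length-prefixed is a hypothetical helper name, not part of the RFC.
sub split-length-prefixed(Blob $blob, :$endian = LittleEndian) {
    my @chunks;
    my $pos = 0;
    while $pos < $blob.elems {
        my $len = $blob.read-uint32($pos, $endian);   # byte-length prefix
        $pos += 4;
        die "Prefix claims $len bytes, only {$blob.elems - $pos} left"
            if $pos + $len > $blob.elems;
        @chunks.push: $blob.subbuf($pos, $len);        # payload, prefix dropped
        $pos += $len;
    }
    @chunks;
}

my $blob = Blob.new(2, 0, 0, 0,  65, 66,   3, 0, 0, 0,  67, 68, 69);
say split-length-prefixed($blob).map(*.decode("latin1"));   # (AB CDE)
```

As in the template form, the length prefixes steer the decoding but do not
show up in the result.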
A list of identically typed things, with a counter prefix (as opposed to byte length):
my @i = $blob.decode([ :elems(uint8) => uint32 ]);
A sub-template with a typed byte length prefix:
[ ::uint32 => [ int32, uint16, "latin1" ] ]
A list of identically typed things, with a BYTE length prefix:
[ ::uint32 => uint32 ]
Skipping a byte with Nil (when packing (encoding), Nil becomes \0):
[ int, int, int, Nil, int, int ]
User-defined number encoding in the mix:
my ($command, $param) = $blob.decode([ :big, uint8, MQTT::Length => Blob ]);
if $command == 0x30 {
    my ($topic, $message) = $param.decode([:big,
        ::uint16 => "utf8",
        Blob
    ]);
}
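To make "user-defined number encoding" concrete: MQTT's Remaining Length
field stores 7 value bits per byte, least significant group first, and uses
the high bit as a "more bytes follow" flag, for at most 4 bytes. A type like
MQTT::Length would need decoding logic roughly like the sketch below. The
sub name decode-mqtt-length and its return shape are hypothetical; the
proposal does not say how a user-defined type would plug into decode.

```raku
# Sketch of the decoding a user-defined MQTT::Length type would need to do.
# decode-mqtt-length is a hypothetical name; the RFC does not define this hook.
sub decode-mqtt-length(Blob $blob, Int $start = 0) {
    my $value = 0;
    my $shift = 0;
    my $pos   = $start;
    loop {
        die "Truncated length field"           if $pos >= $blob.elems;
        die "Length field longer than 4 bytes" if $pos - $start >= 4;
        my $byte = $blob[$pos++];
        $value +|= ($byte +& 0x7F) +< $shift;   # low 7 bits carry the value
        $shift += 7;
        last unless $byte +& 0x80;              # high bit set: more bytes follow
    }
    return $value, $pos - $start;               # (length, bytes consumed)
}

my ($len, $used) = decode-mqtt-length(Blob.new(0xC1, 0x02, 0xFF));
say "$len bytes follow a {$used}-byte length field";
# 321 bytes follow a 2-byte length field
```

Because the field is self-delimiting, it can sit in the middle of a template
just like a fixed-width native type.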
Note that:
* The KEY of a pair is part of the template, but NOT of the actual data returned
by decode. This holds true for length prefixes (key is a type object) and for
hints like :big and :little (key is a string).
* Pairs can nest like this:
    :big(uint16) => Blob
    :elems(:big(uint16)) => uint64
* The compiler will eat pairs, thinking they're named arguments. This is why
templates are arrays.
Things that P5's unpack does, that this proposal does not cover:
* Hexadecimal, binary, or uuencoded strings. These are actually string
encodings, and should be implemented as such. (p5 <b B h H u U>)
* Absolute position based extraction ('@' and '.' in p5's pack). Don't know if
this is actually ever used, or how it even works.
* Pointers to strings.
* Null-terminated strings. Just have a Nil in there.
Juerd <juerd@tnx.nl>
@smls

smls commented Dec 23, 2015

Force endianness for a single value:

my $i = $blob.decode([ :big(uint32) ]);

Another option for this would be to allow nested arrays:

my $i = $blob.decode([ ..., [:big, uint32], ... ]);

Or to simply set the flags for all following parameters until unset:

my $i = $blob.decode([ ..., :big, uint32, :little, uint32, ... ]);

Not sure if this is better. I'm just brainstorming...

@bluebear94

Let's get more people to see this!

Also, encoding/decoding into num32 and num64 will be supported, right?

It would be even better to support encoding/decoding Ints as well.

@bluebear94

By "counter prefix", you mean something like this, right?

{
  my $blob = Blob.new(
    0, 0xCB, 0x16, 0, 0,
    1, 0x29, 0x23, 0, 0,
    2, 0x40, 0x42, 0x0F, 0
  );
  my @i = $blob.decode([:elems(uint8) => uint32]);
  is-deeply @i, (5835, 9001, 1_000_000),
    "extracting equityped things with a counter prefix";
}

Or rather:

{
  my $blob = Blob.new(
    3,
    0xCB, 0x16, 0, 0,
    0x29, 0x23, 0, 0,
    0x40, 0x42, 0x0F, 0
  );
  my @i = $blob.decode([:elems(uint8) => uint32]);
  is-deeply @i, (5835, 9001, 1_000_000),
    "extracting equityped things with a counter prefix";
}

(Now that I think about it, I infer the latter.)

@Juerd
Author

Juerd commented Dec 27, 2015

The counter prefix is indeed like your second example. It indicates the number of times the subtemplate is applied. It's a length prefix, but it counts structures rather than bytes.
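To illustrate with methods Blob already has: the sketch below reads a one-byte count and then applies the uint32 "subtemplate" that many times. It assumes little-endian values and encodes 5835, 9001 and 1_000_000 in that byte order; the proposed `:elems(uint8) => uint32` template would presumably do the same thing in one call.

```raku
# Manual decoding of a counter-prefixed list: one uint8 count, then that
# many uint32 values. Little-endian byte order is an assumption here.
my $blob = Blob.new(
    3,
    0xCB, 0x16, 0, 0,
    0x29, 0x23, 0, 0,
    0x40, 0x42, 0x0F, 0,
);
my $count = $blob.read-uint8(0);
my @i = (^$count).map: { $blob.read-uint32(1 + 4 * $_, LittleEndian) };
say @i;   # [5835 9001 1000000]
```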

@bluebear94

And this would be correct as well, right?

{
    my $blob = Blob.new(
      8, 0, 0, 0,
      1, 0, 0, 0,
      2, 0,
      65, 66,
      9, 0, 0, 0,
      3, 1, 0, 0,
      2, 2,
      67, 68, 69
    );
    my @i = $blob.decode([::uint32 => [int32, uint16, "latin1"]]);
    my @expected = (
      (1, 2, "AB"),
      (259, 514, "CDE")
    );
    is-deeply @i, @expected,
        "extracting a sub-template with a byte length prefix";
}

@Juerd
Author

Juerd commented Jan 6, 2016

Yes. I don't know which endianness should be the default, though. Let's ask Gulliver when he returns from his travels...
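For what it's worth, the choice is far from cosmetic. A small sketch using the existing `read-uint32` method shows the same four bytes giving very different numbers depending on the byte order. This only illustrates the question; it does not suggest which default the proposal should pick.

```raku
# The same four bytes read under both byte orders.
my $bytes = Blob.new(0xCB, 0x16, 0, 0);
say $bytes.read-uint32(0, LittleEndian);   # 5835
say $bytes.read-uint32(0, BigEndian);      # 3407216640
```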

@timo

timo commented Apr 25, 2016

It'd really be nice to have terminator-specified decoding work somehow. For example, a "latin1" or "utf8" entry could be understood (or configured) to stop at the first null byte it finds.
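As a rough idea of the behaviour being asked for, here is a sketch with today's Blob methods. The helper name `decode-until-nul` is made up for this example and is not part of the proposal. It only makes sense for encodings in which a zero byte cannot occur inside a character, such as latin1 or utf8.

```raku
# decode-until-nul is a hypothetical helper, not part of the proposal.
sub decode-until-nul(Blob $blob, Str $encoding = "utf8") {
    # Index of the first NUL byte, or the full length if there is none.
    my $end = $blob.list.first(* == 0, :k) // $blob.elems;
    $blob.subbuf(0, $end).decode($encoding);
}

say decode-until-nul(Blob.new(72, 105, 0, 33, 33), "latin1");   # Hi
```

A template-based version would also have to report how many bytes were consumed, terminator included, so that decoding can continue after the string.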

@Xliff

Xliff commented Apr 25, 2016

It took me a while, but I understand this and it makes some degree of sense, especially since, as designed, it will fit into the already existing implementation.

Here's what took me some time to understand. Using the above example:

      8, 0, 0, 0,     # uint32 length 0x08
      1, 0, 0, 0,     # int32
      2, 0,           # uint16
      65, 66,         # "latin1" (can only be 2 chars here, since it all needs to fit in 0x08 bytes)
      9, 0, 0, 0,     # uint32 length 0x09
      3, 1, 0, 0,     # int32
      2, 2,           # uint16
      67, 68, 69      # "latin1" (again, can only be 3 chars because the length prefix is 9 bytes)
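The same walk-through can be written out with existing Blob methods. This is only a sketch of the reading described above, assuming little-endian values throughout: each length prefix bounds one record, and the trailing string takes whatever bytes remain inside that bound.

```raku
# Hand decoding of the layout above; little-endian is assumed throughout.
my $blob = Blob.new(
    8, 0, 0, 0,   1, 0, 0, 0,   2, 0,   65, 66,
    9, 0, 0, 0,   3, 1, 0, 0,   2, 2,   67, 68, 69,
);

my @records;
my $pos = 0;
while $pos < $blob.elems {
    my $len    = $blob.read-uint32($pos, LittleEndian);   # byte-length prefix
    my $record = $blob.subbuf($pos + 4, $len);            # the bytes it covers
    @records.push: (
        $record.read-int32(0, LittleEndian),
        $record.read-uint16(4, LittleEndian),
        $record.subbuf(6).decode("latin1"),   # string takes whatever is left
    );
    $pos += 4 + $len;
}
say @records;   # [(1 2 AB) (259 514 CDE)]
```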

How do you suggest handling things like headers, though? In situations where the string length is known, it seems remiss to not include them in this design. Here's a suggestion following what you have already thought up. We just extend the byte-prefix notation to include a static length:

my $b = Buf.new(65, 66, 67, 68);
my @i = $b.decode( 4 => "latin1" );
my @expected = ("ABCD");
is-deeply @i, @expected,
        "extracting a string with a static byte length";

Is there a particular reason why you think something like this is unnecessary?
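For reference, this is roughly what a fixed-width header field looks like with the Blob methods that exist today. It is a sketch only: the two extra bytes and the little-endian read are assumptions added to show a field following the string, and the `4 => "latin1"` pair above remains a suggested notation rather than working syntax.

```raku
# A 4-byte latin1 tag followed by a little-endian uint16 (assumed layout).
my $b = Buf.new(65, 66, 67, 68, 1, 0);
say $b.subbuf(0, 4).decode("latin1");    # ABCD
say $b.read-uint16(4, LittleEndian);     # 1
```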
