@frsyuki
Last active December 14, 2015 19:49

MessagePack update proposal v3.5

Purpose

  • msgpack provides users with a mechanism to serialize/deserialize string objects transparently without causing incompatibility
  • msgpack provides upper-layer code with a mechanism to define its own types without changing the msgpack spec
  • msgpack stays compatible and has no impact on existing code even if new types are added to msgpack in the future

Overview

  • Add "Extension" type
    • Add FixExt, ext 8, ext 16, ext 32 formats
    • Define Binary type as a part of the Extension type (Extension tag=0)
  • Applications can define custom types using the "Extension" type
  • Deserializers with "binary_extension" enable users to distinguish byte arrays from strings
    • Deserializers without "binary_extension" offer perfect compatibility with existing data and implementations
  • Change the current "Raw" type to "String" type
    • That is, the current FixRaw, raw 16, and raw 32 become FixString, string 16, and string 32

Strings and byte arrays

Programs are classified into 3 groups here:

  • weak-string code: programs where the distinction between strings and byte arrays is ambiguous (from the serializers' point of view)
    • programs written in languages which don't have types to distinguish strings and byte arrays (PHP, C++, Erlang, OCaml, etc.)
    • programs written in languages which use flags or other additional information to distinguish strings, but where that information is not used in some common use cases (e.g. Perl)
  • statically-typed strong-string code: programs where the distinction between strings and byte arrays is clear (from the serializers' point of view) and type information is given to deserializers (deserializers can use the type information as a schema)
  • dynamically-typed strong-string code: programs where the distinction between strings and byte arrays is clear but type information is not given to deserializers, and which expect restored objects to carry the expected types

It's unrealistic to expect that msgpack implementations can distinguish strings and byte arrays in weak-string code, because programmers would need extra work to set markers meaning "this is a string" on all strings, or markers meaning "this is a byte array" on all byte arrays.

Validating every string before storing it also impacts performance significantly. Thus even if msgpack has a type to represent UTF-8 strings, deserializers can't assume that it always contains valid UTF-8.

On the other hand, in object exchanges where the deserializer is written in a dynamically-typed strong-string language, data stored as a string has to be transparently restored as a string, and data serialized as a byte array has to be restored as a byte array. This requirement exists only in this case.

Applications have 2 options for the above problem:

  • ambiguity-tolerant behavior: deserializers accept data which doesn't distinguish strings and byte arrays clearly. These deserializers don't require serializers to distinguish strings and byte arrays clearly. (This is the same as the current msgpack.)
  • ambiguity-strict behavior: deserializers assume that data distinguishes strings and byte arrays clearly. These deserializers require serializers to distinguish strings and byte arrays clearly.

Solution

I assume that ambiguity-tolerant behavior and ambiguity-strict behavior don't coexist.

  • If there is at least one deserializer which works with ambiguity-strict behavior, strings and byte arrays have to be distinguished clearly in all data
  • Otherwise, all deserializers have to work with ambiguity-tolerant behavior

Deserializers working with ambiguity-strict behavior assume that String type includes only VALID UTF-8 strings, and byte arrays are stored using Binary type (which is newly added as a part of Extension type).

The above limitation provides the following advantages:

  • We don't have to add both String and Binary types. Thus msgpack can store strings in fewer bytes
  • Applications which don't use byte arrays at all don't have to worry about the ambiguity of strings and byte arrays
    • This assumes that there are more applications which don't use byte arrays than applications which don't use strings
  • We can keep msgpack's type system simple

On the other hand, it brings the following disadvantage:

  • It's difficult to switch the behavior of deserializers to ambiguity-strict behavior
    • Users need to keep using ambiguity-tolerant behavior, or change code and convert all data at the same time

Note: other methods may not be able to solve this disadvantage either.

Changes on the type system

  • Extension type: a tuple of a byte array and an integer called Extension tag
  • Binary: part of the Extension type (Extension tag=0). Binary represents byte arrays
  • String: UTF-8 encoded strings
    • Applications may agree that String represents byte arrays as well if they desire (ambiguity-tolerant behavior)

The result is implementation-dependent when an implementation receives a string object which includes a byte sequence that is invalid as UTF-8. It may raise an exception, or it may return an object which is not a string. However, it's highly recommended to provide users with a mechanism to get the original byte array in that case, so that applications can decide how to validate or handle the string.
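As an illustration of the recommendation above, here is a minimal sketch in C of a deserializer helper that validates a String payload and always keeps the original bytes reachable. The names `is_valid_utf8`, `decoded_string`, and `decode_string_payload` are hypothetical and not part of any msgpack implementation; a real implementation might raise an exception or call a callback instead.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Returns true if buf[0..len) is well-formed UTF-8: no overlong forms,
 * no surrogates, and nothing above U+10FFFF. */
bool is_valid_utf8(const uint8_t *buf, size_t len)
{
    size_t i = 0;
    while (i < len) {
        uint8_t b = buf[i];
        size_t extra;   /* number of continuation bytes expected */
        uint32_t cp;    /* code point being assembled */
        uint32_t min;   /* smallest code point allowed for this length */

        if (b < 0x80)                { extra = 0; cp = b;        min = 0; }
        else if ((b & 0xe0) == 0xc0) { extra = 1; cp = b & 0x1f; min = 0x80; }
        else if ((b & 0xf0) == 0xe0) { extra = 2; cp = b & 0x0f; min = 0x800; }
        else if ((b & 0xf8) == 0xf0) { extra = 3; cp = b & 0x07; min = 0x10000; }
        else return false;           /* stray continuation or invalid lead byte */

        if (extra > len - i - 1) return false;            /* truncated sequence */
        for (size_t k = 1; k <= extra; k++) {
            uint8_t c = buf[i + k];
            if ((c & 0xc0) != 0x80) return false;         /* not a continuation */
            cp = (cp << 6) | (c & 0x3f);
        }
        if (cp < min) return false;                       /* overlong encoding */
        if (cp >= 0xd800 && cp <= 0xdfff) return false;   /* surrogate */
        if (cp > 0x10ffff) return false;
        i += extra + 1;
    }
    return true;
}

/* Hypothetical result type: the flag tells the application whether the
 * payload is valid UTF-8, and the original bytes stay available either way. */
typedef struct {
    bool valid_utf8;
    const uint8_t *bytes;
    size_t len;
} decoded_string;

decoded_string decode_string_payload(const uint8_t *payload, size_t len)
{
    decoded_string r = { is_valid_utf8(payload, len), payload, len };
    return r;
}
```

With this shape the application, not the deserializer, decides what to do when `valid_utf8` is false.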

Changes on the format

0xc0 11000000 nil          (Nil type)
0xc1 11000001 (never used)
0xc2 11000010 false        (Boolean type)
0xc3 11000011 true         (Boolean type)

0xc4 11000100 FixExt 4     (Extension type 4byte)   // new
0xc5 11000101 FixExt 5     (Extension type 5byte)   // new
0xc6 11000110 FixExt 6     (Extension type 6byte)   // new
0xc7 11000111 FixExt 7     (Extension type 7byte)   // new
0xc8 11001000 FixExt 8     (Extension type 8byte)   // new

0xc9 11001001 ext 8        (Extension type 8bit)    // new

...

0xd4 11010100 FixExt 0     (Extension type 0byte)   // new
0xd5 11010101 FixExt 1     (Extension type 1byte)   // new
0xd6 11010110 FixExt 2     (Extension type 2byte)   // new
0xd7 11010111 FixExt 3     (Extension type 3byte)   // new

0xd8 11011000 ext 16       (Extension type 16bit)   // new
0xd9 11011001 ext 32       (Extension type 32bit)   // new

0xda 11011010 string 16    (String type 16bit)
0xdb 11011011 string 32    (String type 32bit)

0xdc 11011100 array 16     (Array type 16bit)
0xdd 11011101 array 32     (Array type 32bit)

0xde 11011110 map 16       (Map type 16bit)
0xdf 11011111 map 32       (Map type 32bit)

Extension type

Format of the Extension type:

FixExt 1
    +--------+--------+--------+
    |  0xd5  |  0xTT  |XXXXXXXX|
    +--------+--------+--------+
    => 1 byte of application-specific object

ext 8
    +--------+--------+--------+--------
    |  0xc9  |  0xTT  |XXXXXXXX|...N bytes
    +--------+--------+--------+--------
    => XXXXXXXX (=N) bytes of application-specific object

Where "0xTT" means a 1-byte integer which represents an Extension tag.
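The layouts above could be produced by an encoder along the following lines. This is only a sketch under stated assumptions: the function name `encode_ext` is hypothetical, the FixExt N layout (head byte, tag, then N payload bytes) is generalized from the FixExt 1 diagram, and payloads longer than 255 bytes are rejected because the ext 16 and ext 32 layouts are not shown here.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Writes an Extension object with the given tag into `out`.
 * Returns the number of bytes written, or 0 if the payload is longer than
 * 255 bytes or does not fit into `cap` bytes. */
size_t encode_ext(uint8_t tag, const uint8_t *data, size_t len,
                  uint8_t *out, size_t cap)
{
    /* FixExt: head + tag; ext 8: head + tag + 1-byte length */
    size_t header = (len <= 8) ? 2 : 3;

    if (len > 0xff || cap < header + len) return 0;

    if (len < 4)       out[0] = (uint8_t)(0xd4 | len);   /* FixExt 0..3 */
    else if (len <= 8) out[0] = (uint8_t)(0xc0 | len);   /* FixExt 4..8 */
    else               out[0] = 0xc9;                    /* ext 8 */

    out[1] = tag;                                        /* 0xTT */
    if (len > 8) out[2] = (uint8_t)len;                  /* XXXXXXXX (=N) */
    if (len > 0) memcpy(out + header, data, len);
    return header + len;
}
```

For example, a 1-byte payload `0xab` with tag 7 would come out as the three bytes `0xd5 0x07 0xab`, matching the FixExt 1 diagram.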

Binary type

Binary type is a part of the Extension type, and uses 0 for the Extension tag.

Implementation guidelines

Implementations of serializers and deserializers should offer applications an option "binary_extension" so that applications can choose ambiguity-tolerant behavior as well.

  • Serializers:
    • if binary_extension=true, serializers store byte arrays using the Binary type (Extension type where tag=0); this should be the default behavior
    • if binary_extension=false, serializers store byte arrays using the String type
    • for languages which don't have types to distinguish strings and byte arrays, msgpack implementations provide users with a way to set markers on byte arrays (such as a wrapper class)
      • in such weak-string code, serializers may use the String type to store byte arrays if users don't set the markers
  • Deserializers:
    • if binary_extension=true, deserializers restore the String type into a string object (ambiguity-strict behavior); this should be the default behavior
    • if binary_extension=true, deserializers may validate UTF-8 strings when restoring the String type. How deserializers handle strings which include a byte sequence that is invalid as UTF-8 depends on the implementation; here are some examples:
      • it returns an instance of a special class which has a field to hold the original byte sequence
      • it calls a registered callback function and returns the value returned by the function
    • if binary_extension=false, deserializers don't validate UTF-8 at all when restoring the String type. If the language can't include an invalid byte sequence within a string object, deserializers don't restore the String type into the string type (ambiguity-tolerant behavior)
    • if binary_extension=false, deserializers may restore the Binary type and the String type into the same type
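The serializer side of these guidelines can be sketched as follows. The function name `pack_bytes` and the constant `BINARY_TAG` are hypothetical, the FixExt N layout is assumed to follow the FixExt 1 diagram, and the sketch only handles byte arrays of up to 255 bytes. The String branch uses the existing FixRaw (`0xa0 | length`) and raw 16 (`0xda` plus a big-endian 16-bit length) formats, which this proposal renames to FixString and string 16.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BINARY_TAG 0x00   /* Extension tag reserved for the Binary type */

/* Packs a byte array of up to 255 bytes into `out`.
 * Returns the number of bytes written, or 0 if it does not fit. */
size_t pack_bytes(const uint8_t *data, size_t len, bool binary_extension,
                  uint8_t *out, size_t cap)
{
    size_t header;

    if (len > 0xff) return 0;   /* larger payloads are outside this sketch */

    if (binary_extension) {
        /* Binary type: Extension type with tag=0 */
        header = (len <= 8) ? 2 : 3;
        if (cap < header + len) return 0;
        if (len < 4)       out[0] = (uint8_t)(0xd4 | len);   /* FixExt 0..3 */
        else if (len <= 8) out[0] = (uint8_t)(0xc0 | len);   /* FixExt 4..8 */
        else               out[0] = 0xc9;                    /* ext 8 */
        out[1] = BINARY_TAG;
        if (len > 8) out[2] = (uint8_t)len;
    } else {
        /* String type: the formats formerly called FixRaw and raw 16 */
        header = (len <= 31) ? 1 : 3;
        if (cap < header + len) return 0;
        if (len <= 31) {
            out[0] = (uint8_t)(0xa0 | len);                  /* FixString */
        } else {
            out[0] = 0xda;                                   /* string 16 */
            out[1] = (uint8_t)(len >> 8);
            out[2] = (uint8_t)(len & 0xff);
        }
    }
    if (len > 0) memcpy(out + header, data, len);
    return header + len;
}
```

Passing the option per call keeps the sketch short; an implementation would more likely store it once in the packer's configuration.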

Future extensions

If some types are added to msgpack in the future, their implementation would be as follows (using a Time type as an example):

  • Serializers:
    • if time_extension=true, serializers automatically use the Time type (which is a part of the Extension type) to store time objects
    • if time_extension=false, serializers don't automatically use the Time type
  • Deserializers:
    • if time_extension=true, deserializers restore the Time type into a time object
    • if time_extension=false, deserializers restore the object into a tuple of an integer and a byte array

Wrapper libraries of msgpack can define original types using the Extension type without affecting the msgpack specification.
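The "tuple of an integer and a byte array" fallback could look like the sketch below on the deserializer side. The names `ext_object` and `decode_ext` are hypothetical, only FixExt and ext 8 are handled, and the FixExt N layout is assumed to follow the FixExt 1 diagram. A wrapper library would then look at `tag` to decide whether to build one of its own types from `data`.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* The generic representation of an Extension object: (tag, byte array). */
typedef struct {
    uint8_t tag;           /* Extension tag (0 means Binary) */
    const uint8_t *data;   /* points into the input buffer, not copied */
    size_t len;
} ext_object;

/* Parses one FixExt or ext 8 object at the start of buf[0..n).
 * Returns false if the head byte is not one of those formats
 * or if the buffer is truncated. */
bool decode_ext(const uint8_t *buf, size_t n, ext_object *out)
{
    size_t header, len;

    if (n < 2) return false;
    uint8_t b = buf[0];

    if (b >= 0xc4 && b <= 0xc8) {          /* FixExt 4..8 */
        header = 2; len = b & 0x0f;
    } else if (b >= 0xd4 && b <= 0xd7) {   /* FixExt 0..3 */
        header = 2; len = b & 0x03;
    } else if (b == 0xc9) {                /* ext 8: head, tag, length */
        if (n < 3) return false;
        header = 3; len = buf[2];
    } else {
        return false;
    }
    if (n - header < len) return false;    /* payload is truncated */

    out->tag = buf[1];
    out->data = buf + header;
    out->len = len;
    return true;
}
```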

Concept of "Profiles"

Basic Profile

The MessagePack specification without the Extension type is named "Basic Profile." Applications are required to use ambiguity-tolerant behavior.

Note: existing msgpack implementations can be assumed to support only the Basic Profile.

Application Profile

The MessagePack specification with the Extension type is named "Application Profile." Applications can choose ambiguity-strict behavior or ambiguity-tolerant behavior.

Applications can define application-specific types using the Extension type.

Canonical Profile

Note: This is one possible topic for future discussion.

  • type of the keys of maps must be String
  • keys of maps must be sorted by bytes
  • objects must be stored using the smallest format
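Since this profile is still open for discussion, "sorted by bytes" has no fixed definition yet. One possible reading is sketched below as an assumption: keys are compared by their UTF-8 bytes as unsigned values, and a key that is a prefix of another sorts first. The names `map_key`, `compare_keys`, and `sort_keys` are hypothetical.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* A map key as a raw byte sequence (the payload of a String object). */
typedef struct {
    const unsigned char *bytes;
    size_t len;
} map_key;

/* Orders keys byte by byte; when one key is a prefix of the other,
 * the shorter key comes first. */
static int compare_keys(const void *pa, const void *pb)
{
    const map_key *a = pa;
    const map_key *b = pb;
    size_t common = a->len < b->len ? a->len : b->len;
    int c = common ? memcmp(a->bytes, b->bytes, common) : 0;

    if (c != 0) return c;
    return (a->len > b->len) - (a->len < b->len);
}

/* Sorts the keys in place before the map is serialized. */
void sort_keys(map_key *keys, size_t count)
{
    qsort(keys, count, sizeof *keys, compare_keys);
}
```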

Guidelines of new releases of msgpack implementations

  • In a minor release, deserializers support the Extension type with tag=0 (Binary type) and return the same type as for the String type
  • In a major release, deserializers and serializers support the binary_extension option
    • The documentation should state that binary_extension is enabled by default
  • In a major release, deserializers support the Extension type and return an object of an original class (or something) which represents a tuple of an integer and a byte array
  • In a major release, serializers support the Extension type and store objects of the original class (or something) using the Extension type

Q&A

What's the meaning of the FixExt format assignments?

These assignments keep the implementation of serializers and deserializers simple.

We can optimize the implementation of deserializers as follows:

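int length;
switch(b) {
case 0xc4: case 0xc5: case 0xc6: case 0xc7: case 0xc8:
    length = b & 0x0f;   // FixExt 4..8: the low 4 bits are the length
    goto fixext;
case 0xd4: case 0xd5: case 0xd6: case 0xd7:
    length = b & 0x03;   // FixExt 0..3: the low 2 bits are the length
fixext:
    // …
    break;
}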

or:

if((0xc4 <= b && b <= 0xc8) || (0xd4 <= b && b <= 0xd7)) {
    // bit 4 is set only for 0xd4..0xd7; shifted down to bit 2, it turns 4..7 into 0..3
    length = (b & 0x0f) ^ ((b & 0x10) >> 2);
    // …
}

We can optimize the implementation of serializers as follows:

if(length < 4) {
    int b = 0xd4 | length;   // FixExt 0..3 => 0xd4..0xd7
    // …
} else if(length <= 8) {
    int b = 0xc0 | length;   // FixExt 4..8 => 0xc4..0xc8
    // …
} else {
    // …
}
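To check that the head byte assignments for serializers and deserializers are consistent, both mappings can be written as small functions and compared for every FixExt length. This is a sketch; the names `fixext_head` and `fixext_length` are hypothetical.

```c
#include <stdint.h>

/* Head byte for a FixExt payload of `length` bytes (0..8).
 * Returns 0, which is not a FixExt head byte, if `length` is out of range. */
uint8_t fixext_head(unsigned length)
{
    if (length < 4)  return (uint8_t)(0xd4 | length);   /* 0xd4..0xd7 */
    if (length <= 8) return (uint8_t)(0xc0 | length);   /* 0xc4..0xc8 */
    return 0;
}

/* Payload length for a FixExt head byte, or -1 if `b` is not a FixExt head. */
int fixext_length(uint8_t b)
{
    if ((0xc4 <= b && b <= 0xc8) || (0xd4 <= b && b <= 0xd7))
        return (b & 0x0f) ^ ((b & 0x10) >> 2);
    return -1;
}
```

Calling `fixext_length(fixext_head(n))` returns `n` for every `n` from 0 to 8, so the two directions agree. Note that a length of exactly 4 has to map to `0xc4`, because `0xd4 | 4` is `0xd4`, the head byte of FixExt 0.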

Comments on this proposal

msgpack/msgpack#128
