frsyuki/msgpack-update-proposal2.md

## msgpack-update-proposal2.md

      
2013-03-10 19:57:44 -0700: This proposal has been superseded by a third version. See the third version: https://gist.github.com/frsyuki/5131535
Overview


Change the current Raw type to "String" type

That is, the current FixRaw, raw 16, and raw 32 become FixString, string 16, and string 32


Add "Extended" type (type-length-value type)
Assign type=0 of the "Extended" type to "binary" type

Background 1 (string types)

In some languages the distinction between strings and binaries is unclear. For example:

Languages which have no types to distinguish strings from byte arrays (e.g.: PHP, C++, Erlang, OCaml)
Languages which use a flag or other additional information to distinguish strings, where that information is not used in many common use cases (e.g.: Perl, Ruby, Python 2?)

This article calls these weak-string languages.
On the other hand, there are dynamically-typed languages which clearly distinguish strings and byte arrays (e.g.: JavaScript, Objective-C, Python 3, Python 2?), and statically-typed languages which clearly distinguish them (e.g.: Java, C#, Scala). This article calls these strong-string languages.
Note: Which category a language falls into depends on how the language is used.
In object exchanges where the deserializer is written in a strong-string, dynamically-typed language, data stored as a string must be transparently restored as a string, and data serialized as a byte array must be restored as a byte array. (This requirement exists only in this case.)
It's unrealistic to expect msgpack implementations to distinguish strings and byte arrays in programs written in weak-string languages, because programmers would need extra work to set markers meaning "this is a string" on all strings, or markers meaning "this is a byte array" on all byte arrays.
Therefore I propose changes that provide the following advantages at the same time:

in object exchanges across strong-string languages, the new spec makes it possible to transparently exchange a string as a string and a byte array as a byte array
in object exchanges between weak-string languages and strong-string languages:

it makes it possible to transparently restore string/binary type information if users set markers on all byte arrays
here I assume that byte arrays are fewer than strings, and thus it's easier to set markers on all byte arrays than on all strings
even if users don't set markers on all byte arrays, the new spec doesn't prevent programs from exchanging objects without transparency


in object exchanges across weak-string languages, it keeps compatibility with the current msgpack

Background 2 (future extension and compatibility)

TODO
Therefore I propose changes that provide the following advantages at the same time (in addition to the string type issue):
TODO
Changes on the type system


Extended: represents a tuple of type and byte array
Binary: represents a byte array
String: represents a string encoded in UTF-8

this type may include byte sequences that are invalid as UTF-8. There are 2 reasons:

Whether a serializer validates a string when storing it depends on the implementation, because:
Validation when storing strings impacts performance significantly.
Weak-string languages may handle a byte array using the string type, and it's unrealistic to expect msgpack implementations to distinguish strings and byte arrays in programs written in these languages
The current msgpack spec allows users to store byte arrays in the region assigned to the String type


It's the application's business to validate strings and decide how to handle invalid strings


What an implementation does when it receives a string object that includes a byte sequence invalid as UTF-8 depends on the implementation. It may raise an exception, or it may return another object which is not a string. But it's highly recommended to provide users a way to get the original byte array if the string object includes an invalid byte sequence, so that applications can decide how to validate or handle the string.
Changes on the format

0xa0-0xbf: FixString (0 - 31bytes String type)  // changed

0xc4-0xc9 + 1byteTag: FixExtended (0 - 5bytes Extended type)  // new

0xd4-0xd5 + 1byteTag: FixExtended (6 - 7bytes Extended type)  // new

0xd6 + 1byteTag: extended 8 (Extended type)  // new
0xd7 + 1byteTag: extended 16 (Extended type)  // new
0xd8 + 1byteTag: extended 32 (Extended type)  // new

0xd9: string 8 (String type)  // new
0xda: string 16 (String type)  // changed from raw 16
0xdb: string 32 (String type)  // changed from raw 32

Extended type:
Extended = ExtendedTag (1byte) + n

ExtendedTag:
0x00      - binary
0xf0-0xff - private extension
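
The header bytes listed above can be sketched as a small encoder. This is only an illustration of the table, not a reference implementation: the function names `pack_string` and `pack_fix_extended` are hypothetical, the length fields are assumed to be big-endian as in the current raw 16 and raw 32, and string 8 is assumed to carry a 1-byte length right after 0xd9. The layout of extended 8/16/32 is not spelled out in this proposal, so the sketch stops at FixExtended.

```python
import struct


def pack_string(text: str) -> bytes:
    """Encode text with the String headers from the table (length = UTF-8 byte count)."""
    data = text.encode("utf-8")
    n = len(data)
    if n <= 31:
        return bytes([0xA0 | n]) + data                # FixString: 0xa0-0xbf
    if n <= 0xFF:
        return b"\xd9" + struct.pack(">B", n) + data   # string 8 (assumed 1-byte length)
    if n <= 0xFFFF:
        return b"\xda" + struct.pack(">H", n) + data   # string 16 (was raw 16)
    if n <= 0xFFFFFFFF:
        return b"\xdb" + struct.pack(">I", n) + data   # string 32 (was raw 32)
    raise ValueError("string too long")


def pack_fix_extended(tag: int, payload: bytes) -> bytes:
    """Encode a 0-7 byte payload as FixExtended: header, 1-byte tag, payload."""
    if not 0 <= tag <= 0xFF:
        raise ValueError("tag must fit in one byte")
    n = len(payload)
    if n <= 5:
        head = 0xC4 + n            # 0xc4-0xc9: 0 - 5 bytes
    elif n <= 7:
        head = 0xD4 + (n - 6)      # 0xd4-0xd5: 6 - 7 bytes
    else:
        # extended 8/16/32 field order is not specified here, so it is left out
        raise NotImplementedError("payload too long for FixExtended")
    return bytes([head, tag]) + payload
```

For instance, `pack_fix_extended(0x00, b"\x01\x02")` stores a 2-byte binary (tag 0x00) as `0xc6 0x00 0x01 0x02`, and `pack_string("abc")` yields `0xa3` followed by the three UTF-8 bytes.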

Example of extended type
Represent the time Feb 25 03:08:32 2013 (0x512ad5b0) without a time zone,
assuming time uses ExtendedTag=0x31. A 4-byte payload uses the FixExtended header 0xc4 + 4 = 0xc8:

0xc8 0x31 0x51 0x2a 0xd5 0xb0
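
The 4-byte payload of this example can be checked with a few lines of Python. The sketch assumes the payload is an unsigned 32-bit big-endian count of seconds since the Unix epoch, read as UTC; the proposal itself only says "without time zone", so that interpretation is an assumption made for the check.

```python
import struct
from datetime import datetime, timezone

# Payload bytes of the example (everything after the header and the tag).
payload = bytes([0x51, 0x2A, 0xD5, 0xB0])

# Assumption: unsigned 32-bit big-endian seconds since the Unix epoch.
(seconds,) = struct.unpack(">I", payload)
moment = datetime.fromtimestamp(seconds, tz=timezone.utc)

print(seconds)             # 1361761712
print(moment.isoformat())  # 2013-02-25T03:08:32+00:00
```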

Implementation guidelines

for strong-string and dynamically-typed languages


Serializers:

store byte arrays using the Binary type
store strings using the String type
may implement an option to store byte arrays in the String type to keep the backward compatibility with current msgpack implementations


Deserializers:

return an object which can distinguish a byte array from a string
should provide a feature that lets applications decide how to handle strings which include an invalid byte sequence
How to implement this feature is not specified. Here are some examples:

it returns an object which holds the original byte sequence
it defines another class to represent strings which can include an invalid byte sequence, and it returns an instance of that class regardless of whether the object is stored in the Binary or the String type.
it calls a registered callback function if it gets an invalid byte sequence and returns the value returned by the function
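
Two of the strategies above (an object holding the original bytes, and a registered callback) might look like the following in Python. This is a minimal sketch, not a prescribed API: the names `InvalidString`, `decode_string` and `on_invalid` are hypothetical.

```python
from typing import Callable, Optional


class InvalidString:
    """Hypothetical wrapper that keeps the original bytes of a String payload."""

    def __init__(self, raw: bytes):
        self.raw = raw


def decode_string(raw: bytes,
                  on_invalid: Optional[Callable[[bytes], object]] = None):
    """Decode a String payload, leaving the handling of invalid UTF-8 to the application."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        if on_invalid is not None:
            # Strategy: call the registered callback and return its value.
            return on_invalid(raw)
        # Strategy: return an object which holds the original byte sequence.
        return InvalidString(raw)
```

With this shape, `decode_string(b"\xff\xfe")` returns an `InvalidString` whose `raw` field is the untouched payload, while an application that prefers a lossy decode can pass something like `on_invalid=lambda b: b.decode("latin-1")`.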


Extended type (meaning the Extended type except where the tag is 0 = binary):

Deserializers:

return an object of a special type (e.g.: MessagePackExtended) which represents a tuple of a byte and a byte array.
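
A deserializer following this guideline could be sketched as below. Only the FixExtended headers are handled, because the byte layout of extended 8/16/32 is not given in this proposal; the class name `MessagePackExtended` comes from the text above, while `unpack_fix_extended` and its return convention (value plus next read position) are assumptions for the illustration.

```python
from typing import NamedTuple, Tuple, Union


class MessagePackExtended(NamedTuple):
    """Tuple of a 1-byte tag and a byte array."""
    tag: int
    data: bytes


def unpack_fix_extended(buf: bytes, pos: int = 0) -> Tuple[Union[bytes, MessagePackExtended], int]:
    """Decode one FixExtended object at buf[pos]; return (value, next position)."""
    head = buf[pos]
    if 0xC4 <= head <= 0xC9:
        n = head - 0xC4            # 0 - 5 bytes
    elif 0xD4 <= head <= 0xD5:
        n = head - 0xD4 + 6        # 6 - 7 bytes
    else:
        raise ValueError("not a FixExtended header")
    end = pos + 2 + n              # header byte + tag byte + payload
    if len(buf) < end:
        raise ValueError("truncated input")
    tag = buf[pos + 1]
    data = buf[pos + 2:end]
    if tag == 0x00:
        return data, end           # tag 0 is the Binary type: plain byte array
    return MessagePackExtended(tag, data), end
```

Decoding `bytes([0xC6, 0xF0, 0xDE, 0xAD])` this way gives `MessagePackExtended(tag=0xF0, data=b"\xde\xad")`, leaving it to the application to interpret the private tag.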


for weak-string languages


Serializers:

store the object in the String type if they can't tell clearly whether the object represents a string or a byte array
don't have to validate a string when storing it.
should store the object in the Binary type if users set a marker which means "this is a byte array" on the object


Deserializers:

should not, by default, validate and reject a string which includes a byte sequence invalid as UTF-8

because it's the application's business to validate strings and decide how to handle invalid ones


may implement an option (or mode) to validate strings
may return an object with a flag or additional information to indicate that the object was stored in the Binary type

this option is required to implement, in weak-string languages, a tool that reads MessagePack data and writes MessagePack data where the output needs to keep the original type information
an example of implementing this feature in PHP: the deserializer returns an instance of "MessagePackBinary", which has a byte array field, when it restores an object stored in the Binary type
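
The marker idea for weak-string languages can be sketched as follows. Python's `bytes` stands in for the ambiguous string/byte-array type of a weak-string language, and only the short Fix forms are handled to keep the example small. `MessagePackBinary` is the name used in the PHP example above; `pack_raw`, `unpack_raw` and the `keep_type` option are hypothetical names.

```python
class MessagePackBinary:
    """Marker meaning "this is a byte array"; holds the byte array as a field."""

    def __init__(self, data: bytes):
        self.data = data


def pack_raw(obj) -> bytes:
    """Store marked objects in the Binary type, everything else in the String type."""
    if isinstance(obj, MessagePackBinary):
        if len(obj.data) > 5:
            raise NotImplementedError("sketch only covers 0 - 5 byte binaries")
        return bytes([0xC4 + len(obj.data), 0x00]) + obj.data   # FixExtended, tag 0
    if len(obj) > 31:
        raise NotImplementedError("sketch only covers FixString")
    return bytes([0xA0 | len(obj)]) + obj                       # FixString, no validation


def unpack_raw(buf: bytes, keep_type: bool = False):
    """Restore a byte string; with keep_type, Binary objects keep their marker."""
    head = buf[0]
    if 0xA0 <= head <= 0xBF:
        return buf[1:1 + (head & 0x1F)]
    if 0xC4 <= head <= 0xC9 and buf[1] == 0x00:
        data = buf[2:2 + (head - 0xC4)]
        return MessagePackBinary(data) if keep_type else data
    raise ValueError("unsupported header in this sketch")
```

A pass-through tool would call `unpack_raw(..., keep_type=True)` so that re-serializing the result with `pack_raw` reproduces the Binary type, while ordinary programs leave the option off and simply get byte strings.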


Extended type (meaning the Extended type except where the tag is 0 = binary):

Deserializers:

return an object of a special type (e.g.: MessagePackExtended) which represents a tuple of a byte and a byte array.


Compatibility with the current implementations


In a new minor release, the updated deserializer can handle the new string 8, binary 8, binary 16 and binary 32 and treats them as the current Raw type

Thus this implementation is forward compatible with the updated serializers
I guess this is not so difficult


In a new major release, the updated serializer stores byte arrays in the Binary type

The updated serializer should provide an option to store all byte arrays in the String type and not to use the string 8 format

The purpose of this option is that users can generate exactly the same result after upgrading the major version if necessary


The updated deserializer should provide an option to return objects stored in either the String or the Binary type as byte arrays

The purpose of this option is that users can get exactly the same objects in programs after upgrading the major version if necessary


Release notes or documents should state clearly which options need to be enabled to keep backward compatibility
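
The two compatibility options for the major release might be exposed as simple flags. The sketch below covers only the String-type side: a serializer flag that avoids string 8 so the bytes match what the current Raw encoding would produce, and a deserializer flag that hands back the payload as a byte array. The names `pack_string_bytes`, `use_string8`, `unpack_string` and `raw_as_bytes` are hypothetical, and big-endian length fields are assumed as in the current raw 16 and raw 32.

```python
import struct


def pack_string_bytes(data: bytes, use_string8: bool = True) -> bytes:
    """Store data in the String type; use_string8=False mimics the current Raw output."""
    n = len(data)
    if n <= 31:
        return bytes([0xA0 | n]) + data
    if use_string8 and n <= 0xFF:
        return b"\xd9" + struct.pack(">B", n) + data   # new format, skipped in compat mode
    if n <= 0xFFFF:
        return b"\xda" + struct.pack(">H", n) + data
    if n <= 0xFFFFFFFF:
        return b"\xdb" + struct.pack(">I", n) + data
    raise ValueError("payload too long")


def unpack_string(buf: bytes, raw_as_bytes: bool = False):
    """Read one String object; raw_as_bytes=True returns the byte array as before."""
    head = buf[0]
    if 0xA0 <= head <= 0xBF:
        start, n = 1, head & 0x1F
    elif head == 0xD9:
        start, n = 2, buf[1]
    elif head == 0xDA:
        start, n = 3, struct.unpack(">H", buf[1:3])[0]
    elif head == 0xDB:
        start, n = 5, struct.unpack(">I", buf[1:5])[0]
    else:
        raise ValueError("not a String header")
    payload = buf[start:start + n]
    return payload if raw_as_bytes else payload.decode("utf-8")
```

With `use_string8=False`, a 40-byte payload is written with the 0xda header, exactly as raw 16 does today; with the default it uses the shorter 0xd9 form.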