@frsyuki
Last active January 4, 2021 12:59

2013-03-10 19:57:44 -0700: This proposal has been superseded by the third version: https://gist.github.com/frsyuki/5131535

Overview

  • Change the current Raw type to "String" type
    • That is, the current FixRaw, raw 16, and raw 32 become FixString, string 16, and string 32
  • Add "Extended" type (type-length-value type)
  • Assign type=0 of the "Extended" type to "binary" type

Background 1 (string types)

There are languages in which the distinction between strings and binaries is unclear. For example:

  • Languages that have no types to distinguish strings from byte arrays (e.g., PHP, C++, Erlang, OCaml)
  • Languages that use a flag or other additional information to distinguish strings, but where that information is not used in many common use cases (e.g., Perl, Ruby, Python 2?)

This article calls these weak-string languages.

On the other hand, there are dynamically-typed languages that clearly distinguish strings from byte arrays (e.g., JavaScript, Objective-C, Python 3, Python 2?), and statically-typed languages that clearly distinguish strings from byte arrays (e.g., Java, C#, Scala). This article calls these strong-string languages.

Note: Which category a language falls into depends on how the language is used.

In object exchanges where the deserializer is written in a strong-string, dynamically-typed language, there is a requirement to transparently restore data stored as a string as a string, and data serialized as a byte array as a byte array. (This requirement exists only in this case.)

It's unrealistic to expect msgpack implementations to distinguish strings from byte arrays in programs written in weak-string languages, because programmers would need extra work to set markers meaning "this is a string" on all strings, or markers meaning "this is a byte array" on all byte arrays.

Therefore I propose changes that provide the following advantages at the same time:

  • in object exchanges between strong-string languages, the new spec makes it possible to transparently exchange a string as a string and a byte array as a byte array
  • in object exchanges between weak-string languages and strong-string languages:
    • it makes it possible to transparently restore string/binary type information if users set markers on all byte arrays
    • here I assume that byte arrays are fewer than strings, and thus it's easier to set markers on all byte arrays than on all strings
    • even if users don't set markers on all byte arrays, the new spec doesn't prevent programs from exchanging objects without transparency
  • in object exchanges between weak-string languages, it keeps compatibility with the current msgpack

Background 2 (future extension and compatibility)

TODO

Therefore I propose changes that provide the following advantages at the same time (in addition to the string type issue):

TODO

Changes on the type system

  • Extended: represents a tuple of type and byte array
  • Binary: represents a byte array
  • String: represents a string encoded in UTF-8
    • this type may include byte sequences that are invalid as UTF-8, for the following reasons:
      • It depends on the implementation whether a serializer validates a string when storing it
      • Validation when storing strings has a significant impact on performance
      • Weak-string languages may handle a byte array using the string type, and it's unrealistic to expect msgpack implementations to distinguish strings from byte arrays in programs written in these languages
      • The current msgpack spec allows users to store byte arrays in the region assigned to the String type
    • It's the application's business to validate strings and decide how to handle invalid strings

The behavior is implementation-dependent when an implementation receives a string object that includes a byte sequence which is invalid as UTF-8. It may raise an exception, or it may return some other object that is not a string. But it's highly recommended to give users a way to get the original byte array if the string object includes an invalid byte sequence, so that applications can decide how to validate or handle the string.
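
One way to follow this recommendation can be sketched in Python. This is a minimal sketch, not part of the proposal: the names `InvalidString` and `decode_string_payload` are hypothetical, and the choice to return a wrapper (rather than raise) is only one of the behaviors the text allows.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class InvalidString:
    """Hypothetical wrapper: a String-type payload that is not valid UTF-8."""
    raw: bytes  # the original byte array, kept so the application can decide


def decode_string_payload(payload: bytes) -> Union[str, InvalidString]:
    """Return a str for valid UTF-8, otherwise hand back the original bytes."""
    try:
        return payload.decode("utf-8")
    except UnicodeDecodeError:
        # Assumption: do not raise, let the application inspect the bytes.
        return InvalidString(raw=payload)
```

With this shape, an application can check `isinstance(value, InvalidString)` and apply its own policy (reject, repair, or treat as binary).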

Changes on the format

0xa0-0xbf: FixString (0 - 31bytes String type)  // changed

0xc4-0xc9 + 1byteTag: FixExtended (0 - 5bytes Extended type)  // new

0xd4-0xd5 + 1byteTag: FixExtended (6 - 7bytes Extended type)  // new

0xd6 + 1byteTag: extended 8 (Extended type)  // new
0xd7 + 1byteTag: extended 16 (Extended type)  // new
0xd8 + 1byteTag: extended 32 (Extended type)  // new

0xd9: string 8 (String type)  // new
0xda: string 16 (String type)  // changed from raw 16
0xdb: string 32 (String type)  // changed from raw 32
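
The format bytes listed above can be turned into a small lookup, sketched here in Python. The function names are hypothetical, the second FixExtended range is read as 0xd4-0xd5 (an assumption), and every byte not listed in this section is reported as "other" because this sketch covers only the changed and new formats.

```python
def classify_format_byte(b: int) -> str:
    """Map a leading format byte to the format names used in this proposal."""
    if not 0 <= b <= 0xFF:
        raise ValueError("format byte must be in 0..255")
    if 0xA0 <= b <= 0xBF:
        return "FixString"
    # Assumption: the two FixExtended ranges are 0xc4-0xc9 and 0xd4-0xd5.
    if 0xC4 <= b <= 0xC9 or 0xD4 <= b <= 0xD5:
        return "FixExtended"
    named = {
        0xD6: "extended 8",
        0xD7: "extended 16",
        0xD8: "extended 32",
        0xD9: "string 8",
        0xDA: "string 16",
        0xDB: "string 32",
    }
    # Anything else is outside the formats listed in this section.
    return named.get(b, "other")


def fixstring_length(b: int) -> int:
    """Length (0-31) stored in the low five bits of a FixString byte."""
    if classify_format_byte(b) != "FixString":
        raise ValueError("not a FixString format byte")
    return b & 0x1F
```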

Extended type:

Extended = ExtendedTag (1byte) + n

ExtendedTag:

0x00      - binary
0xf0-0xff - private extension

Example of extended type

Represent the time Feb 25 03:08:32 2013 (0x512ad5b0), without a time zone,
assuming that time uses ExtendedTag=0x31:

0xc7 0x31 0x51 0x2a 0xd5 0xb0
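
The example bytes can be read back with a few lines of Python. This is only a sketch under stated assumptions: the payload is taken to be a big-endian count of seconds since the Unix epoch, and it is displayed in UTC purely for illustration (the proposal itself attaches no time zone). The name `split_example` is hypothetical.

```python
from datetime import datetime, timezone

EXAMPLE = bytes([0xC7, 0x31, 0x51, 0x2A, 0xD5, 0xB0])


def split_example(data: bytes):
    """Split the example into format byte, ExtendedTag, and payload bytes."""
    return data[0], data[1], data[2:]


format_byte, tag, payload = split_example(EXAMPLE)

# Assumption: the payload is a big-endian integer of seconds since the epoch.
seconds = int.from_bytes(payload, "big")
moment = datetime.fromtimestamp(seconds, tz=timezone.utc)

print(hex(tag))      # 0x31
print(hex(seconds))  # 0x512ad5b0
print(moment)        # 2013-02-25 03:08:32+00:00
```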

Implementation guidelines

for strong-string and dynamically-typed languages

  • Serializers:

    • store byte arrays using the Binary type
    • store strings using the String type
    • may implement an option to store byte arrays in the String type to keep the backward compatibility with current msgpack implementations
  • Deserializers:

    • return an object which can distinguish a byte array from a string
    • should provide a feature that lets applications decide how to handle strings which include an invalid byte sequence
    • How to implement this feature is not specified. Here are some examples:
      • it returns an object which holds the original byte sequence
      • it defines another class to represent strings which can include an invalid byte sequence, and returns an instance of it regardless of whether the object is stored in the Binary or the String type
      • it calls a registered callback function if it encounters an invalid byte sequence, and returns the value returned by the function
  • Extended type (meaning Extended types other than the one whose tag is 0 = binary):

    • Deserializers:
      • return an object of a special type (e.g., MessagePackExtended) which represents a tuple of a byte and a byte array.
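
The deserializer side of these guidelines could look roughly like the following Python sketch. `MessagePackExtended` is the example name from the text; `restore_extended` and `BINARY_TAG` are hypothetical names, and the sketch assumes the tag and payload have already been read from the stream.

```python
from typing import NamedTuple, Union

BINARY_TAG = 0x00  # ExtendedTag assigned to the Binary type in this proposal


class MessagePackExtended(NamedTuple):
    """A tuple of a one-byte tag and a byte array."""
    tag: int
    data: bytes


def restore_extended(tag: int, data: bytes) -> Union[bytes, MessagePackExtended]:
    """Return bytes for the Binary tag, a MessagePackExtended otherwise."""
    if not 0 <= tag <= 0xFF:
        raise ValueError("ExtendedTag must fit in one byte")
    if tag == BINARY_TAG:
        # Binary: a plain byte array, distinguishable from str in Python 3.
        return bytes(data)
    # Any other tag: keep the tag and payload together for the application.
    return MessagePackExtended(tag, bytes(data))
```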

for weak-string languages

  • Serializers:

    • store the object in the String type if they cannot tell clearly whether the object represents a string or a byte array
    • don't have to validate a string on storing it.
    • should store the object in the Binary type if users set a marker which means "this is a byte array" on the object
  • Deserializers:

    • should not, by default, validate and reject a string which includes a byte sequence that is invalid as UTF-8
      • because it's the application's business to validate strings and decide how to handle invalid ones
    • may implement an option (or mode) to validate strings
    • may return an object with a flag or additional information indicating that the object was stored in the Binary type
      • this option is required to implement a tool in a weak-string language that reads MessagePack data and writes MessagePack data where the output needs to keep the original type information
      • an example of implementing this feature in PHP is that the deserializer returns an instance of "MessagePackBinary", which has a byte array field, when it restores an object stored in the Binary type
  • Extended type (meaning Extended types other than the one whose tag is 0 = binary):

    • Deserializers:
      • return an object of a special type (e.g., MessagePackExtended) which represents a tuple of a byte and a byte array.
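
The marker-based behavior described for weak-string languages can be sketched as follows. Python 3 is itself a strong-string language, so this sketch imitates a weak-string setting by treating every value as raw bytes. `MessagePackBinary` is the example name from the text; `choose_type`, `restore`, and the `keep_type_info` option are hypothetical.

```python
from typing import Tuple, Union


class MessagePackBinary:
    """Marker wrapper meaning "this is a byte array"."""

    def __init__(self, data: bytes) -> None:
        self.data = data


def choose_type(obj: Union[bytes, MessagePackBinary]) -> Tuple[str, bytes]:
    """Serializer side: pick the Binary type only when the marker is set."""
    if isinstance(obj, MessagePackBinary):
        return ("Binary", obj.data)
    # No marker: store in the String type, without validating UTF-8.
    return ("String", obj)


def restore(type_name: str, payload: bytes, keep_type_info: bool = False):
    """Deserializer side: optionally keep the Binary type information."""
    if type_name == "Binary" and keep_type_info:
        return MessagePackBinary(payload)
    # Default: both types come back as the same raw value.
    return payload
```

A read-then-write tool would enable `keep_type_info` so that `choose_type(restore(...))` yields the same type it read.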

Compatibility with the current implementations

  • In a new minor release, the updated deserializer can handle the new string 8, binary 8, binary 16 and binary 32 formats and treats them as the current Raw type
    • Thus this implementation is forward compatible with the updated serializers
    • I guess this is not so difficult
  • In a new major release, the updated serializer stores byte arrays in the Binary type
    • The updated serializer should provide an option to store all byte arrays in the String type and not to use the string 8 format
      • The purpose of this option is to let users generate exactly the same result after upgrading the major version, if necessary
    • The updated deserializer should provide an option to return objects stored in either the String or the Binary type as byte arrays
      • The purpose of this option is to let users get exactly the same objects in their programs after upgrading the major version, if necessary
    • Release notes or documents should clearly show which options need to be enabled to keep backward compatibility
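
The serializer option that avoids the string 8 format can be sketched in Python. This is a minimal sketch, not a definitive implementation: `encode_string` and `use_string8` are hypothetical names, and the length fields are assumed to be big-endian unsigned integers of 1, 2, and 4 bytes, by analogy with the current raw 16 and raw 32 formats.

```python
import struct


def encode_string(payload: bytes, use_string8: bool = True) -> bytes:
    """Encode a String-type object; set use_string8=False for old readers."""
    n = len(payload)
    if n <= 31:
        # FixString: length lives in the low five bits of 0xa0-0xbf.
        return bytes([0xA0 | n]) + payload
    if n <= 0xFF and use_string8:
        # string 8 (new): assumed to carry a 1-byte length.
        return bytes([0xD9, n]) + payload
    if n <= 0xFFFF:
        # string 16: same bytes as the current raw 16.
        return struct.pack(">BH", 0xDA, n) + payload
    if n <= 0xFFFFFFFF:
        # string 32: same bytes as the current raw 32.
        return struct.pack(">BI", 0xDB, n) + payload
    raise ValueError("payload too long for string 32")
```

With `use_string8=False`, a 40-byte payload falls through to string 16, which current deserializers already read as raw 16.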