@frsyuki
Last active January 4, 2021 12:58

2013-02-24 22:28:51 -0800: This is not a fixed proposal. See the second version as well: https://gist.github.com/frsyuki/5028082

Overview

  • Add a "Binary" type
  • Change the current Raw type to a "String" type
    • That is, the current FixRaw, raw 16, and raw 32 become FixString, string 16, and string 32

Background

There are languages in which the distinction between strings and binaries is unclear. For example:

  • Languages which don't have types that distinguish strings from byte arrays (e.g.: PHP, C++, Erlang, OCaml)
  • Languages which use a flag or other additional information to distinguish strings, but where that information is not used in many common use cases (e.g.: Perl, Python 2, Ruby)

This article calls these weak-string languages.

On the other hand, there are languages which are dynamically typed and clearly distinguish strings from byte arrays (e.g.: JavaScript, Objective-C, Python 3). And there are languages which are statically typed and clearly distinguish strings from byte arrays (e.g.: Java, C#, Scala). This article calls these strong-string languages.

Note: Which category a language falls into depends on how the language is used.

In object exchanges where the deserializer is written in a strong-string, dynamically typed language, there is a requirement to transparently restore data serialized as a string as a string, and data serialized as a byte array as a byte array. (This requirement does not arise in any other case.)

It's unrealistic to expect msgpack implementations to distinguish strings from byte arrays in programs written in weak-string languages, because programmers would need to do extra work to set a marker meaning "this is a string" on every string, or a marker meaning "this is a byte array" on every byte array.

Therefore I propose changes that provide the following advantages at the same time:

  • in object exchanges between strong-string languages, the new spec makes it possible to transparently exchange a string as a string and a byte array as a byte array
  • in object exchanges between weak-string languages and strong-string languages:
    • it becomes possible to transparently restore string/binary type information if users set markers on all byte arrays
    • here I assume that byte arrays are less numerous than strings, so it's easier to set markers on all byte arrays than on all strings
    • even if users don't set markers on all byte arrays, the new spec doesn't prevent programs from exchanging objects without transparency
  • in object exchanges between weak-string languages, it keeps compatibility with current msgpack

Changes on the type system

  • Binary: represents byte arrays
  • String: represents strings encoded in UTF-8
    • this type may contain byte sequences that are invalid as UTF-8. There are four reasons:
      • Whether a serializer validates a string when storing it depends on the implementation
      • Validating strings when storing them has a significant impact on performance
      • Weak-string languages may handle a byte array using the string type, and it's unrealistic to expect msgpack implementations to distinguish strings from byte arrays in programs written in these languages
      • The current msgpack spec allows users to store byte arrays in the region assigned to the String type
    • It's the application's business to validate strings and decide how to handle invalid strings

What happens when an implementation receives a String object containing a byte sequence that is invalid as UTF-8 depends on the implementation. It may raise an exception, or it may return an object that is not a string. However, it is highly recommended to give users a way to get the original byte array when the String object contains an invalid byte sequence, so that applications can decide how to validate or handle it.

Changes on the format

0xa0-0xbf FixString (0-31 bytes, String type)  // changed

0xd5 binary 8 (Binary type)  // new
0xd6 binary 16 (Binary type)  // new
0xd7 binary 32 (Binary type)  // new

0xd8 reserved

0xd9 string 8 (String type)  // new
0xda string 16 (String type)  // changed from raw 16
0xdb string 32 (String type)  // changed from raw 32
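
The table above can be read as a small header-selection rule. The following is only a sketch of that rule, not a reference implementation: the function names are hypothetical, and it assumes that the 8/16/32 formats carry a big-endian length of 1, 2, or 4 bytes right after the type byte, as raw 16 and raw 32 do in the current msgpack format.

```python
import struct

# Type bytes from the table above (this proposal, not a published spec).
FIXSTRING_BASE = 0xA0
STRING_TAGS = (0xD9, 0xDA, 0xDB)   # string 8 / 16 / 32
BINARY_TAGS = (0xD5, 0xD6, 0xD7)   # binary 8 / 16 / 32


def _sized_header(tags, n):
    """Pick the smallest 8/16/32-bit length form (big-endian is an assumption)."""
    if n <= 0xFF:
        return struct.pack(">BB", tags[0], n)
    if n <= 0xFFFF:
        return struct.pack(">BH", tags[1], n)
    if n <= 0xFFFFFFFF:
        return struct.pack(">BI", tags[2], n)
    raise ValueError("payload too long for a 32-bit length")


def pack_string(payload: bytes) -> bytes:
    """Store bytes in the String type; no UTF-8 validation is performed."""
    if len(payload) <= 31:
        return bytes([FIXSTRING_BASE | len(payload)]) + payload
    return _sized_header(STRING_TAGS, len(payload)) + payload


def pack_binary(payload: bytes) -> bytes:
    """Store bytes in the Binary type, which has no "fix" form in this proposal."""
    return _sized_header(BINARY_TAGS, len(payload)) + payload
```

Note that `pack_string` takes already-encoded bytes and never validates them, which matches the rule that the String type may carry byte sequences that are invalid as UTF-8.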

Implementation guidelines

for strong-string and dynamically-typed languages

  • Serializers:
    • store byte arrays using the Binary type
    • store strings using the String type
    • may implement an option to store byte arrays in the String type to keep the backward compatibility with current msgpack implementations
  • Deserializers:
    • return an object which makes it possible to tell a byte array from a string
    • should provide a way for applications to decide how to handle strings which contain invalid byte sequences
    • How to implement this feature is not specified. Here are some examples:
      • it returns an object which holds the original byte sequence
      • it defines another class to represent strings which can contain invalid byte sequences, and it returns an instance of that class regardless of whether the object is stored in the Binary or the String type
      • it calls a registered callback function if it encounters an invalid byte sequence, and returns the value returned by that function
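
As an illustration of the deserializer guidelines above, here is a minimal sketch that combines two of the listed options: it returns the original byte sequence when a String payload is not valid UTF-8, unless a callback is registered, in which case the callback decides. The names `unpack_one` and `on_invalid` are hypothetical, and the big-endian length fields are an assumption carried over from the current raw 16 / raw 32 formats.

```python
import struct

# Type byte -> (kind, struct format of the length field); bytes as proposed here.
_LENGTH_FORMATS = {
    0xD5: ("binary", ">B"), 0xD6: ("binary", ">H"), 0xD7: ("binary", ">I"),
    0xD9: ("string", ">B"), 0xDA: ("string", ">H"), 0xDB: ("string", ">I"),
}


def unpack_one(data: bytes, on_invalid=None):
    """Decode one String or Binary object from the start of `data`."""
    tag = data[0]
    if 0xA0 <= tag <= 0xBF:                 # FixString: length in the low 5 bits
        kind, n, offset = "string", tag & 0x1F, 1
    elif tag in _LENGTH_FORMATS:
        kind, fmt = _LENGTH_FORMATS[tag]
        (n,) = struct.unpack_from(fmt, data, 1)
        offset = 1 + struct.calcsize(fmt)
    else:
        raise ValueError(f"not a String or Binary type byte: {tag:#x}")

    payload = data[offset:offset + n]
    if len(payload) != n:
        raise ValueError("truncated input")
    if kind == "binary":
        return payload                      # byte arrays come back as bytes
    try:
        return payload.decode("utf-8")      # strings come back as str
    except UnicodeDecodeError:
        # Leave the decision to the application: callback if registered,
        # otherwise hand back the original byte sequence untouched.
        return payload if on_invalid is None else on_invalid(payload)
```

For example, `unpack_one(b"\xa2\xff\xfe")` hands back the two raw bytes, while passing `on_invalid=lambda raw: raw.decode("latin-1")` lets the application choose its own fallback decoding.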

for weak-string languages

  • Serializers:
    • store the object in the String type if it can't clearly tell whether the object represents a string or a byte array
    • don't have to validate a string when storing it
    • should store the object in the Binary type if users set a marker which means "this is a byte array" on the object
  • Deserializers:
    • by default, should not validate strings or reject strings which contain byte sequences that are invalid as UTF-8
      • because it's the application's business to validate strings and decide how to handle invalid ones
    • may implement an option (or mode) to validate strings
    • may return an object with a flag or additional information indicating that the object was stored in the Binary type
      • this option is required in order to implement, in a weak-string language, a tool which reads MessagePack data and writes MessagePack data while preserving the original type information
      • for example, in PHP the deserializer could return an instance of "MessagePackBinary", which has a byte array field, when it restores an object stored in the Binary type
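
The marker idea for weak-string languages can be sketched as follows. Python 3 is itself a strong-string language, so this sketch imitates a weak-string one by treating every value as plain `bytes`. `MarkedBinary` is a hypothetical stand-in for something like the "MessagePackBinary" class mentioned above, `keep_markers` is a hypothetical option name, and for brevity only the assumed one-byte-length forms (string 8 and binary 8) are handled.

```python
from dataclasses import dataclass

STRING8, BINARY8 = 0xD9, 0xD5   # type bytes as proposed in this document


@dataclass(frozen=True)
class MarkedBinary:
    """Hypothetical marker wrapper meaning "this is a byte array"."""
    data: bytes


def pack(obj) -> bytes:
    """Unmarked values go to String; only marked values go to Binary."""
    if isinstance(obj, MarkedBinary):
        tag, payload = BINARY8, obj.data
    else:
        tag, payload = STRING8, obj     # can't tell string from bytes: use String
    if len(payload) > 0xFF:
        raise ValueError("this sketch only handles payloads up to 255 bytes")
    return bytes([tag, len(payload)]) + payload


def unpack(data: bytes, keep_markers: bool = False):
    """Return plain bytes by default; optionally keep the Binary marker."""
    tag, n = data[0], data[1]
    payload = data[2:2 + n]
    if tag == BINARY8 and keep_markers:
        return MarkedBinary(payload)
    if tag in (STRING8, BINARY8):
        return payload
    raise ValueError(f"unsupported type byte in this sketch: {tag:#x}")
```

With `keep_markers=True`, a read-then-write tool reproduces its input byte for byte, because `pack(unpack(x, keep_markers=True))` sends Binary objects back to the Binary type.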

Compatibility with the current implementations

  • In a new minor release, the updated deserializer can handle the new string 8, binary 8, binary 16, and binary 32 formats and treats them as the current Raw type
    • Thus this implementation is forward compatible with the updated serializers
    • I guess this is not so difficult
  • In a new major release, the updated serializer stores byte arrays in the Binary type
    • The updated serializer should provide an option to store all byte arrays in the String type and not to use the string 8 format
      • The purpose of this option is to let users generate exactly the same output after upgrading the major version, if necessary
    • The updated deserializer should provide an option to return objects stored in either the String or the Binary type as byte arrays
      • The purpose of this option is to let users get exactly the same objects in their programs after upgrading the major version, if necessary
    • Release notes or documentation should clearly state which options need to be enabled to keep backward compatibility
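
A rough sketch of the two release steps described above, under stated assumptions: `unpack_raw` models the minor-release deserializer that accepts the new type bytes but still returns Raw bytes, and `pack_bytes` models the major-release serializer with a hypothetical `legacy` option that avoids Binary and string 8. The function names and the big-endian length fields are assumptions, not part of the proposal.

```python
import struct

# Length-prefixed formats a minor-release deserializer would accept as Raw.
RAW_LENGTH_FORMATS = {
    0xDA: ">H", 0xDB: ">I",              # raw 16 / raw 32 (existing)
    0xD9: ">B",                          # string 8 (new)
    0xD5: ">B", 0xD6: ">H", 0xD7: ">I",  # binary 8 / 16 / 32 (new)
}


def unpack_raw(data: bytes) -> bytes:
    """Minor release: read every String or Binary format as plain Raw bytes."""
    tag = data[0]
    if 0xA0 <= tag <= 0xBF:
        n, offset = tag & 0x1F, 1
    elif tag in RAW_LENGTH_FORMATS:
        fmt = RAW_LENGTH_FORMATS[tag]
        (n,) = struct.unpack_from(fmt, data, 1)
        offset = 1 + struct.calcsize(fmt)
    else:
        raise ValueError(f"not a Raw-compatible type byte: {tag:#x}")
    return data[offset:offset + n]


def pack_bytes(payload: bytes, legacy: bool = False) -> bytes:
    """Major release: use Binary unless the hypothetical `legacy` option is set."""
    n = len(payload)
    if legacy:
        # Same bytes an old serializer would emit: FixRaw, raw 16, raw 32 only.
        if n <= 31:
            return bytes([0xA0 | n]) + payload
        tag, fmt = (0xDA, ">BH") if n <= 0xFFFF else (0xDB, ">BI")
    elif n <= 0xFF:
        tag, fmt = 0xD5, ">BB"
    elif n <= 0xFFFF:
        tag, fmt = 0xD6, ">BH"
    else:
        tag, fmt = 0xD7, ">BI"
    return struct.pack(fmt, tag, n) + payload
```

Either output of `pack_bytes` is readable by `unpack_raw`, which is the forward compatibility the minor release is meant to provide.
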
@dankogai

An extended type is fine, but why change the raw type to a (UTF-8) string type?
Consider the following pseudocode:

packed = msgpack.pack(data);
nested = msgpack.pack({'packed':packed});

The proposal breaks this, even though it is perfectly fine today.

@frsyuki

frsyuki commented Feb 25, 2013

@dankogai

The String type may contain byte sequences that are invalid as UTF-8.
Updated deserializers should provide applications with a way to get the original byte array.
Updated deserializers should provide an option to deserialize both String and Binary into a byte array.

These are described in this article and I think it's enough to keep compatibility.

@lucian1900

Python 2 does differentiate between bytes (str) and strings (unicode); they're just unfortunately named. Python 3 has only renamed them and switched the default literal.

I'm curious, why not do the reverse: leave the current bytes type alone and add a UTF-8 type? This way "weak-string" languages can continue operating on bytes as they currently do and treat the UTF-8 type in exactly the same way, whereas "strong-string" languages can make the difference clear.

It seems to me that "weak-string" languages simply do not have strings at all (only bytes), so it might be better to just let msgpack strings degrade to bytes in that case.

@mjw9100

mjw9100 commented May 28, 2013

Instead of adding a string type, has anyone suggested that we add "descriptor" objects (aka "hints") to the protocol definition? These could provide context for the deserializer, or could be skipped if unneeded. It would be useful to have both "locally defined" and standardized descriptors. A few of the standardized descriptors would be used for the suggested purpose, namely indicating that a UTF-8, UTF-16, etc. string follows. These descriptors could also be used for other interesting extensions (debug or comment blocks, protocol markers, etc.) without breaking "general" deserializers. Your comments are appreciated. Thanks.
