@frsyuki
Last active December 15, 2015 08:59

MessagePack update proposal v4

MessagePack specification

MessagePack is an object serialization specification like JSON.

MessagePack offers two concepts: a type system and formats. Serialization converts application types into MessagePack types, and then converts those types into MessagePack formats. Deserialization converts MessagePack formats into MessagePack types, and then converts those types into application types.

Serialization:
    Application types (objects in memory)
    -->  MessagePack types
    -->  MessagePack formats (byte array)

Deserialization:
    MessagePack formats (byte array)
    -->  MessagePack types
    -->  Application types (object in memory)

This document describes the MessagePack type system, the MessagePack formats, and the conversions between them.

Type system

  • Types
    • Integer represents an integer
    • Nil represents nil
    • Boolean represents true or false
    • Float represents a floating point number
    • Raw
      • String extending Raw type represents a UTF-8 string
      • Binary extending Raw type and implementing Extension interface represents a byte array
    • Array represents a sequence of objects
    • Map represents key-value pairs of objects
    • Extended implements Extension interface: represents a tuple of type information and a byte array where type information is an integer whose meaning is defined by applications
  • Interfaces
    • Extension represents a tuple of an integer and a byte array where the integer represents type information and the byte array represents data. The format of the data is defined by concrete types

Limitation

  • the value of an Integer object is from -(2^63) up to (2^64)-1
  • the value of a Float object is an IEEE 754 single or double precision floating-point number
  • the maximum length of a Binary object is (2^32)-1 bytes
  • the maximum byte size of a String object is (2^32)-1
  • String objects may contain byte sequences that are invalid as UTF-8, and the behavior of deserializers on receiving such a sequence depends on the implementation
    • Deserializers should provide a mechanism to get the original byte array so that applications can decide how to handle the object
  • the maximum number of elements of an Array object is (2^32)-1
  • the maximum number of key-value associations of a Map object is (2^32)-1

Extension interface and Extended type

MessagePack allows applications to define types. These type definitions are built on top of the MessagePack type system, and MessagePack itself uses the Extended type to represent them. Applications assign 0 to 127 to store the type information of application-specific types.

On the other hand, MessagePack anticipates future extensions that add types, which will be described in other documents. MessagePack uses -1 to -128 to store the type information of predefined types.

[0, 127]: application-specific types
[-128, -1]: predefined types

The Binary type is one of the predefined extension types. Its type number is -1.
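
As an illustration of the two ranges, here is a minimal Python sketch. The function name classify_ext_type and the constant BINARY_TYPE are hypothetical names chosen for this example; only the ranges and the Binary type number come from the text above.

```python
BINARY_TYPE = -1  # type number of the predefined Binary type, as stated above


def classify_ext_type(type_id: int) -> str:
    """Classify a type number that must fit in a signed 8-bit integer."""
    if 0 <= type_id <= 127:
        return "application-specific"
    if -128 <= type_id <= -1:
        return "predefined"
    raise ValueError(f"type number {type_id} is outside [-128, 127]")
```

An application that registers its own types would pick numbers for which this sketch returns "application-specific".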

Format

Overview

format name      first byte (in binary)  first byte (in hex)
---------------  ----------------------  -------------------
positive fixint  0xxxxxxx                0x00 - 0x7f
fixmap           1000xxxx                0x80 - 0x8f
fixarray         1001xxxx                0x90 - 0x9f
fixraw           101xxxxx                0xa0 - 0xbf
nil              11000000                0xc0
(never used)     11000001                0xc1
false            11000010                0xc2
true             11000011                0xc3
fixext 0         11000100                0xc4
fixext 1         11000101                0xc5
fixext 2         11000110                0xc6
fixext 3         11000111                0xc7
fixext 4         11001000                0xc8
fixext 5         11001001                0xc9
float 32         11001010                0xca
float 64         11001011                0xcb
uint 8           11001100                0xcc
uint 16          11001101                0xcd
uint 32          11001110                0xce
uint 64          11001111                0xcf
int 8            11010000                0xd0
int 16           11010001                0xd1
int 32           11010010                0xd2
int 64           11010011                0xd3
ext 8 type -1    11010100                0xd4
ext 16 type -1   11010101                0xd5
ext 32 type -1   11010110                0xd6
ext 8            11010111                0xd7
ext 16           11011000                0xd8
ext 32           11011001                0xd9
raw 16           11011010                0xda
raw 32           11011011                0xdb
array 16         11011100                0xdc
array 32         11011101                0xdd
map 16           11011110                0xde
map 32           11011111                0xdf
negative fixint  111xxxxx                0xe0 - 0xff
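
The overview can be read as a lookup from the first byte to a format name. The following Python sketch shows one way to write that lookup; format_name and NAMES_C0_TO_DF are hypothetical names for this example. It takes array 32 to be 0xdd, which matches its binary pattern 11011101 and the array 32 diagram.

```python
# Names for the first bytes 0xc0 .. 0xdf, in order.
NAMES_C0_TO_DF = [
    "nil", "(never used)", "false", "true",
    "fixext 0", "fixext 1", "fixext 2", "fixext 3",
    "fixext 4", "fixext 5", "float 32", "float 64",
    "uint 8", "uint 16", "uint 32", "uint 64",
    "int 8", "int 16", "int 32", "int 64",
    "ext 8 type -1", "ext 16 type -1", "ext 32 type -1", "ext 8",
    "ext 16", "ext 32", "raw 16", "raw 32",
    "array 16", "array 32", "map 16", "map 32",
]


def format_name(first_byte: int) -> str:
    """Return the format name selected by the first byte of an object."""
    if not 0 <= first_byte <= 0xFF:
        raise ValueError("first byte must be in 0..255")
    if first_byte <= 0x7F:
        return "positive fixint"
    if first_byte <= 0x8F:
        return "fixmap"
    if first_byte <= 0x9F:
        return "fixarray"
    if first_byte <= 0xBF:
        return "fixraw"
    if first_byte >= 0xE0:
        return "negative fixint"
    return NAMES_C0_TO_DF[first_byte - 0xC0]
```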

Notation in diagrams

one byte:
+--------+
|        |
+--------+

a variable number of bytes:
+========+
|        |
+========+

variable number of objects stored in MessagePack format:
+~~~~~~~~~~~~~~~~+
|                |
+~~~~~~~~~~~~~~~~+

 X, Y, Z, G, and H are the symbols that will be replaced by an actual bit

nil format

Nil format stores nil in 1 byte:

nil:
+--------+
|  0xc0  |
+--------+

bool format family

Bool format family stores false or true in 1 byte:

false:
+--------+
|  0xc2  |
+--------+

true:
+--------+
|  0xc3  |
+--------+

int format family

Int format family stores an integer in 1, 2, 3, 5, or 9 bytes.

positive fixint stores a 7-bit positive integer
+--------+
|0XXXXXXX|
+--------+

where
* 0XXXXXXX is an 8-bit unsigned integer

negative fixint stores a 5-bit negative integer
+--------+
|111YYYYY|
+--------+

where
* 111YYYYY is an 8-bit signed integer

uint 8 stores an 8-bit unsigned integer
+--------+--------+
|  0xcc  |ZZZZZZZZ|
+--------+--------+

uint 16 stores a 16-bit big-endian unsigned integer
+--------+--------+--------+
|  0xcd  |ZZZZZZZZ|ZZZZZZZZ|
+--------+--------+--------+

uint 32 stores a 32-bit big-endian unsigned integer
+--------+--------+--------+--------+--------+
|  0xce  |ZZZZZZZZ|ZZZZZZZZ|ZZZZZZZZ|ZZZZZZZZ|
+--------+--------+--------+--------+--------+

uint 64 stores a 64-bit big-endian unsigned integer
+--------+--------+--------+--------+--------+--------+--------+--------+--------+
|  0xcf  |ZZZZZZZZ|ZZZZZZZZ|ZZZZZZZZ|ZZZZZZZZ|ZZZZZZZZ|ZZZZZZZZ|ZZZZZZZZ|ZZZZZZZZ|
+--------+--------+--------+--------+--------+--------+--------+--------+--------+

int 8 stores an 8-bit signed integer
+--------+--------+
|  0xd0  |ZZZZZZZZ|
+--------+--------+

int 16 stores a 16-bit big-endian signed integer
+--------+--------+--------+
|  0xd1  |ZZZZZZZZ|ZZZZZZZZ|
+--------+--------+--------+

int 32 stores a 32-bit big-endian signed integer
+--------+--------+--------+--------+--------+
|  0xd2  |ZZZZZZZZ|ZZZZZZZZ|ZZZZZZZZ|ZZZZZZZZ|
+--------+--------+--------+--------+--------+

int 64 stores a 64-bit big-endian signed integer
+--------+--------+--------+--------+--------+--------+--------+--------+--------+
|  0xd3  |ZZZZZZZZ|ZZZZZZZZ|ZZZZZZZZ|ZZZZZZZZ|ZZZZZZZZ|ZZZZZZZZ|ZZZZZZZZ|ZZZZZZZZ|
+--------+--------+--------+--------+--------+--------+--------+--------+--------+
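
A minimal Python sketch of the int format family follows. It assumes the serializer picks the shortest format and prefers the uint formats for non-negative values; pack_int is a hypothetical name, and the struct module handles the big-endian layout.

```python
import struct


def pack_int(value: int) -> bytes:
    """Encode an Integer in the shortest int-family format."""
    if 0 <= value <= 0x7F:
        return struct.pack("B", value)                 # positive fixint: 0XXXXXXX
    if -32 <= value < 0:
        return struct.pack("b", value)                 # negative fixint: 111YYYYY
    if value > 0:
        if value <= 0xFF:
            return b"\xcc" + struct.pack(">B", value)  # uint 8
        if value <= 0xFFFF:
            return b"\xcd" + struct.pack(">H", value)  # uint 16
        if value <= 0xFFFFFFFF:
            return b"\xce" + struct.pack(">I", value)  # uint 32
        if value <= 0xFFFFFFFFFFFFFFFF:
            return b"\xcf" + struct.pack(">Q", value)  # uint 64
    else:
        if value >= -0x80:
            return b"\xd0" + struct.pack(">b", value)  # int 8
        if value >= -0x8000:
            return b"\xd1" + struct.pack(">h", value)  # int 16
        if value >= -0x80000000:
            return b"\xd2" + struct.pack(">i", value)  # int 32
        if value >= -0x8000000000000000:
            return b"\xd3" + struct.pack(">q", value)  # int 64
    raise OverflowError("Integer must be in -(2^63) .. (2^64)-1")
```

For example, 128 no longer fits in a positive fixint, so this sketch emits the two bytes 0xcc 0x80, and -33 becomes 0xd0 0xdf.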

float format family

Float format family stores a floating point number in 5 bytes or 9 bytes:

float 32 stores a floating point number in IEEE 754 single precision floating point number format:
+--------+--------+--------+--------+--------+
|  0xca  |XXXXXXXX|XXXXXXXX|XXXXXXXX|XXXXXXXX|
+--------+--------+--------+--------+--------+

float 64 stores a floating point number in IEEE 754 double precision floating point number format:
+--------+--------+--------+--------+--------+--------+--------+--------+--------+
|  0xcb  |YYYYYYYY|YYYYYYYY|YYYYYYYY|YYYYYYYY|YYYYYYYY|YYYYYYYY|YYYYYYYY|YYYYYYYY|
+--------+--------+--------+--------+--------+--------+--------+--------+--------+

where
* XXXXXXXX_XXXXXXXX_XXXXXXXX_XXXXXXXX is a big-endian IEEE 754 single precision
  floating point number
* YYYYYYYY_YYYYYYYY_YYYYYYYY_YYYYYYYY_YYYYYYYY_YYYYYYYY_YYYYYYYY_YYYYYYYY is a big-endian
  IEEE 754 double precision floating point number
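
A short Python sketch of both float formats follows. The helper names pack_float32 and pack_float64 are hypothetical. The first byte for float 64 is taken as 0xcb, as listed in the overview table.

```python
import struct


def pack_float32(value: float) -> bytes:
    """float 32: 0xca followed by a big-endian IEEE 754 single."""
    return b"\xca" + struct.pack(">f", value)


def pack_float64(value: float) -> bytes:
    """float 64: 0xcb followed by a big-endian IEEE 754 double."""
    return b"\xcb" + struct.pack(">d", value)
```

Which of the two a serializer chooses is an application decision; float 32 is shorter but can lose precision.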

raw format family

Raw format family stores a byte array using 1, 3, or 5 extra bytes in addition to the size of the byte array.

fixraw stores a byte array whose length is up to 31 bytes:
+--------+========+
|101XXXXX|  data  |
+--------+========+

raw 16 stores a byte array whose length is up to (2^16)-1 bytes:
+--------+--------+--------+========+
|  0xda  |YYYYYYYY|YYYYYYYY|  data  |
+--------+--------+--------+========+

raw 32 stores a byte array whose length is up to (2^32)-1 bytes:
+--------+--------+--------+--------+--------+========+
|  0xdb  |ZZZZZZZZ|ZZZZZZZZ|ZZZZZZZZ|ZZZZZZZZ|  data  |
+--------+--------+--------+--------+--------+========+

where:
* XXXXX is a 5-bit unsigned integer which represents N
* YYYYYYYY_YYYYYYYY is a 16-bit big-endian unsigned integer which represents N
* ZZZZZZZZ_ZZZZZZZZ_ZZZZZZZZ_ZZZZZZZZ is a 32-bit big-endian unsigned integer which represents N
* N is the length of data
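
The three raw formats differ only in how N is written. A minimal Python sketch, with pack_raw as a hypothetical name and the shortest header chosen for a given length:

```python
import struct


def pack_raw(data: bytes) -> bytes:
    """Prefix a byte array with the shortest raw-family header."""
    n = len(data)
    if n <= 31:
        header = bytes([0xA0 | n])               # fixraw: 101XXXXX
    elif n <= 0xFFFF:
        header = b"\xda" + struct.pack(">H", n)  # raw 16
    elif n <= 0xFFFFFFFF:
        header = b"\xdb" + struct.pack(">I", n)  # raw 32
    else:
        raise ValueError("raw data longer than (2^32)-1 bytes")
    return header + data
```

A String would be passed in as its UTF-8 encoding, for example pack_raw("abc".encode("utf-8")).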

array format family

Array format family stores a sequence of elements using 1, 3, or 5 extra bytes in addition to the elements.

fixarray stores an array whose length is up to 15 elements:
+--------+~~~~~~~~~~~~~~~~+
|1001XXXX|    N objects   |
+--------+~~~~~~~~~~~~~~~~+

array 16 stores an array whose length is up to (2^16)-1 elements:
+--------+--------+--------+~~~~~~~~~~~~~~~~+
|  0xdc  |YYYYYYYY|YYYYYYYY|    N objects   |
+--------+--------+--------+~~~~~~~~~~~~~~~~+

array 32 stores an array whose length is up to (2^32)-1 elements:
+--------+--------+--------+--------+--------+~~~~~~~~~~~~~~~~+
|  0xdd  |ZZZZZZZZ|ZZZZZZZZ|ZZZZZZZZ|ZZZZZZZZ|    N objects   |
+--------+--------+--------+--------+--------+~~~~~~~~~~~~~~~~+

where:
* XXXX is a 4-bit unsigned integer which represents N
* YYYYYYYY_YYYYYYYY is a 16-bit big-endian unsigned integer which represents N
* ZZZZZZZZ_ZZZZZZZZ_ZZZZZZZZ_ZZZZZZZZ is a 32-bit big-endian unsigned integer which represents N
* N is the size of an array
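
An array is a header carrying N, followed by N serialized objects. In this Python sketch the elements are assumed to be already serialized to bytes; pack_array is a hypothetical name.

```python
import struct
from typing import Sequence


def pack_array(packed_elements: Sequence[bytes]) -> bytes:
    """Write an array header for N elements, then the elements themselves."""
    n = len(packed_elements)
    if n <= 15:
        header = bytes([0x90 | n])               # fixarray: 1001XXXX
    elif n <= 0xFFFF:
        header = b"\xdc" + struct.pack(">H", n)  # array 16
    elif n <= 0xFFFFFFFF:
        header = b"\xdd" + struct.pack(">I", n)  # array 32
    else:
        raise ValueError("array has more than (2^32)-1 elements")
    return header + b"".join(packed_elements)
```

For instance, the array [1, 2, nil] built from the single-byte objects 0x01, 0x02, and 0xc0 becomes the four bytes 0x93 0x01 0x02 0xc0.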

map format family

Map format family stores a sequence of key-value pairs using 1, 3, or 5 extra bytes in addition to the key-value pairs.

fixmap stores a map whose length is up to 15 elements
+--------+~~~~~~~~~~~~~~~~+
|1000XXXX|   N*2 objects  |
+--------+~~~~~~~~~~~~~~~~+

map 16 stores a map whose length is up to (2^16)-1 elements
+--------+--------+--------+~~~~~~~~~~~~~~~~+
|  0xde  |YYYYYYYY|YYYYYYYY|   N*2 objects  |
+--------+--------+--------+~~~~~~~~~~~~~~~~+

map 32 stores a map whose length is up to (2^32)-1 elements
+--------+--------+--------+--------+--------+~~~~~~~~~~~~~~~~+
|  0xdf  |ZZZZZZZZ|ZZZZZZZZ|ZZZZZZZZ|ZZZZZZZZ|   N*2 objects  |
+--------+--------+--------+--------+--------+~~~~~~~~~~~~~~~~+

where:
* XXXX is a 4-bit unsigned integer which represents N
* YYYYYYYY_YYYYYYYY is a 16-bit big-endian unsigned integer which represents N
* ZZZZZZZZ_ZZZZZZZZ_ZZZZZZZZ_ZZZZZZZZ is a 32-bit big-endian unsigned integer which represents N
* N is the size of a map
* odd elements in objects are keys of a map
* the next element of a key is its associated value
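
The key and value ordering can be sketched in Python as follows. The pairs are assumed to be already serialized to bytes, and pack_map is a hypothetical name. Note that N counts pairs, while the body holds N*2 objects.

```python
import struct
from typing import Sequence, Tuple


def pack_map(packed_pairs: Sequence[Tuple[bytes, bytes]]) -> bytes:
    """Write a map header for N pairs, then key, value, key, value, ..."""
    n = len(packed_pairs)
    if n <= 15:
        header = bytes([0x80 | n])               # fixmap: 1000XXXX
    elif n <= 0xFFFF:
        header = b"\xde" + struct.pack(">H", n)  # map 16
    elif n <= 0xFFFFFFFF:
        header = b"\xdf" + struct.pack(">I", n)  # map 32
    else:
        raise ValueError("map has more than (2^32)-1 pairs")
    # Each key is immediately followed by its associated value.
    return header + b"".join(key + value for key, value in packed_pairs)
```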

ext format family

Ext format family stores a tuple of an integer and a byte array.

fixext 0 stores an integer and a byte array whose length is 0 bytes
+--------+--------+
|  0xc4  |  type  |
+--------+--------+

fixext 1 stores an integer and a byte array whose length is 1 byte
+--------+--------+--------+
|  0xc5  |  type  |  data  |
+--------+--------+--------+

fixext 2 stores an integer and a byte array whose length is 2 bytes
+--------+--------+--------+--------+
|  0xc6  |  type  |       data      |
+--------+--------+--------+--------+

fixext 3 stores an integer and a byte array whose length is 3 bytes
+--------+--------+--------+--------+--------+
|  0xc7  |  type  |           data           |
+--------+--------+--------+--------+--------+

fixext 4 stores an integer and a byte array whose length is 4 bytes
+--------+--------+--------+--------+--------+--------+
|  0xc8  |  type  |                data               |
+--------+--------+--------+--------+--------+--------+

fixext 5 stores an integer and a byte array whose length is 5 bytes
+--------+--------+--------+--------+--------+--------+--------+
|  0xc9  |  type  |                    data                    |
+--------+--------+--------+--------+--------+--------+--------+

ext 8 type -1 stores a byte array whose length is up to (2^8)-1 bytes:
+--------+--------+========+
|  0xd4  |XXXXXXXX|  data  |
+--------+--------+========+

ext 16 type -1 stores a byte array whose length is up to (2^16)-1 bytes:
+--------+--------+--------+========+
|  0xd5  |YYYYYYYY|YYYYYYYY|  data  |
+--------+--------+--------+========+

ext 32 type -1 stores a byte array whose length is up to (2^32)-1 bytes:
+--------+--------+--------+--------+--------+========+
|  0xd6  |ZZZZZZZZ|ZZZZZZZZ|ZZZZZZZZ|ZZZZZZZZ|  data  |
+--------+--------+--------+--------+--------+========+

ext 8 stores an integer and a byte array whose length is up to (2^8)-1 bytes:
+--------+--------+--------+========+
|  0xd7  |XXXXXXXX|  type  |  data  |
+--------+--------+--------+========+

ext 16 stores an integer and a byte array whose length is up to (2^16)-1 bytes:
+--------+--------+--------+--------+========+
|  0xd8  |YYYYYYYY|YYYYYYYY|  type  |  data  |
+--------+--------+--------+--------+========+

ext 32 stores an integer and a byte array whose length is up to (2^32)-1 bytes:
+--------+--------+--------+--------+--------+--------+========+
|  0xd9  |ZZZZZZZZ|ZZZZZZZZ|ZZZZZZZZ|ZZZZZZZZ|  type  |  data  |
+--------+--------+--------+--------+--------+--------+========+

where
* XXXXXXXX is an 8-bit unsigned integer which represents N
* YYYYYYYY_YYYYYYYY is a 16-bit big-endian unsigned integer which represents N
* ZZZZZZZZ_ZZZZZZZZ_ZZZZZZZZ_ZZZZZZZZ is a 32-bit big-endian unsigned integer which represents N
* N is the length of data
* type is an 8-bit signed integer; 0xff (-1) is reserved to store the type in 2 bytes in the future
  and causes a format error by default
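
The ext family has two layouts: the "type -1" formats carry only a length and data, while the other formats also carry a type byte. The Python sketch below reflects one reading of the diagrams above. The names pack_binary and pack_ext are hypothetical, and rejecting type -1 in pack_ext is an assumption based on the note that the type byte 0xff is reserved.

```python
import struct


def pack_binary(data: bytes) -> bytes:
    """Binary (type -1): length then data, with no type byte."""
    n = len(data)
    if n <= 0xFF:
        return b"\xd4" + struct.pack(">B", n) + data  # ext 8 type -1
    if n <= 0xFFFF:
        return b"\xd5" + struct.pack(">H", n) + data  # ext 16 type -1
    if n <= 0xFFFFFFFF:
        return b"\xd6" + struct.pack(">I", n) + data  # ext 32 type -1
    raise ValueError("data longer than (2^32)-1 bytes")


def pack_ext(type_id: int, data: bytes) -> bytes:
    """Extended: fixext 0..5 for short data, otherwise ext 8/16/32."""
    if type_id == -1:
        # Assumption: a type byte of 0xff is reserved, so it is refused here.
        raise ValueError("type byte 0xff (-1) is reserved")
    if not -128 <= type_id <= 127:
        raise ValueError("type must fit in a signed 8-bit integer")
    type_byte = struct.pack("b", type_id)
    n = len(data)
    if n <= 5:
        return bytes([0xC4 + n]) + type_byte + data   # fixext 0 .. fixext 5
    if n <= 0xFF:
        return b"\xd7" + struct.pack(">B", n) + type_byte + data  # ext 8
    if n <= 0xFFFF:
        return b"\xd8" + struct.pack(">H", n) + type_byte + data  # ext 16
    if n <= 0xFFFFFFFF:
        return b"\xd9" + struct.pack(">I", n) + type_byte + data  # ext 32
    raise ValueError("data longer than (2^32)-1 bytes")
```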

Serialization: Type to format projection

MessagePack serializers convert MessagePack types into formats as follows:

source types  output format
------------  -------------
Integer       int format family (positive fixint, negative fixint, int 8/16/32/64, or uint 8/16/32/64)
Nil           nil
Boolean       bool format family (false or true)
Float         float format family (float 32 or float 64)
String        raw format family (fixraw, raw 16, or raw 32)
Binary        ext format family where type is -1 (ext 8/16/32 type -1)
Array         array format family (fixarray, array 16, or array 32)
Map           map format family (fixmap, map 16, or map 32)
Extended      ext format family (fixext, or ext 8/16/32)

If an object can be represented in multiple possible output formats, serializers SHOULD use the format which represents the data in the smallest number of bytes.

Deserialization: Format to type projection

MessagePack deserializers convert MessagePack formats into types as follows:

source formats                                                        output type
--------------------------------------------------------------------  -----------
positive fixint, negative fixint, int 8/16/32/64, or uint 8/16/32/64  Integer
nil                                                                   Nil
false and true                                                        Boolean
float 32 and float 64                                                 Float
fixraw, raw 16, and raw 32                                            String
fixext, and ext 8/16/32 type -1                                       Binary
fixarray, array 16, and array 32                                      Array
fixmap, map 16, and map 32                                            Map
fixext, and ext 8/16/32                                               Extended

Profiles

Applications may restrict the semantics of MessagePack, while sharing the same syntax, to adapt MessagePack to certain use cases. MessagePack defines such a set of restrictions as a profile.

Application profile

This is the default profile which restricts nothing.

Primitive profile

Primitive profile removes String type, Binary type, Extended type, and Extension interface. This is useful if applications use a schema.

Type system

  • Integer: represents an integer
  • Nil: represents nil
  • Boolean: represents true or false
  • Float: represents a floating point number
  • Raw: represents a UTF-8 string or byte array
  • Array: represents a sequence of objects
  • Map: represents key-value pairs of objects

Serialization: Type to format projection:

source types  output format
------------  -------------
Integer       int format family (positive fixint, negative fixint, int 8/16/32/64, or uint 8/16/32/64)
Nil           nil
Boolean       bool format family (false or true)
Float         float format family (float 32 or float 64)
Raw           raw format family (fixraw, raw 16, or raw 32)
Array         array format family (fixarray, array 16, or array 32)
Map           map format family (fixmap, map 16, or map 32)

Deserialization: Format to type projection:

source formats                                                        output type
--------------------------------------------------------------------  -----------
positive fixint, negative fixint, int 8/16/32/64, or uint 8/16/32/64  Integer
nil                                                                   Nil
false and true                                                        Boolean
float 32 and float 64                                                 Float
fixraw, raw 16, and raw 32                                            Raw
fixext, ext 8/16/32 type -1                                           Raw
fixarray, array 16, and array 32                                      Array
fixmap, map 16, and map 32                                            Map

Future discussion

Canonical profile

This is useful where identity of two objects is important such as:

  • identifier of a data on a database
  • authentication or digital signature

TODO

Focuses of MessagePack

This document describes the problems that MessagePack focuses on.

schema-full static typing and schema-less dynamic typing

TODO

statically-typed language and dynamically typed language

TODO

weak-string code and strong-string code

TODO

user-defined extension

TODO

Implementation guidelines

The purpose of this document is to provide guidelines for implementing the MessagePack specification.

Streaming API

TODO

Streaming deserializer

TODO

Streaming serializer

TODO

Extended type

TODO

Extending serializers

TODO

Extending deserializers

TODO

Disabling predefined extension

TODO

Disabling all extension

TODO

Disabling String and Binary types

Implementations of serializers and deserializers should offer applications an option "binary_extension" so that applications can choose ambiguity-tolerant behavior.

  • Serializers:
    • if binary_extension=true, serializers store byte arrays using the Binary type (Extension type where type number=-1)
    • if binary_extension=false, serializers store byte arrays using the String type
    • for languages which don't have types to distinguish strings from byte arrays, msgpack implementations provide users with a way to set markers on byte arrays (such as a wrapper class)
      • in such weak-string code, serializers may use the String type to store byte arrays if users don't set the markers
  • Deserializers:
    • if binary_extension=true, deserializers restore String type into a string object (ambiguity-strict behavior)
    • if binary_extension=true, deserializers may validate UTF-8 strings on restoring String type. How deserializers handle strings that contain byte sequences invalid as UTF-8 depends on the implementation; here are some examples:
      • it returns an instance of a special class which has a field to hold the original byte sequence
      • it calls a registered callback function and returns the value returned by the function
    • if binary_extension=false, deserializers don't validate UTF-8 on restoring String type at all. If the language can't hold an invalid byte sequence within a string object, deserializers don't restore String type into the string type (ambiguity-tolerant behavior)
    • if binary_extension=false, deserializers may restore Binary type and String type into the same type
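
The deserializer side of this option could look like the following Python sketch. It is one possible reading of the bullets above, not a prescribed API: restore_string and InvalidUtf8String are hypothetical names, and the wrapper class stands in for the "special class" example. Because a Python str cannot hold an invalid byte sequence, the ambiguity-tolerant branch returns bytes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InvalidUtf8String:
    """Hypothetical wrapper that keeps the original bytes of a malformed String."""
    raw: bytes


def restore_string(raw: bytes, binary_extension: bool):
    """Restore the payload of a String object under the binary_extension option."""
    if not binary_extension:
        # Ambiguity-tolerant: no UTF-8 validation, the bytes are handed over as-is.
        return raw
    try:
        # Ambiguity-strict: String becomes a string object.
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Keep the original byte sequence so the application can decide.
        return InvalidUtf8String(raw)
```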

Type API

TODO

Application-level type projection

TODO

IDL

TODO
