Horusiath/hyperion.md

## hyperion.md

      
    Raw
  

              hyperion.md
            
          
    Hyperion

This is a draft for binary format and implementation proposal for Hyperion, a fast binary serializer for .NET. Currently Hyperion is in beta and uses non-standard binary format. This document aims to specify it. Before we do that, there are few important properties of the serializer we wish to maintain:
Polymorphic

Hyperion is polymorphic serializer, therefore it's tolerant for subtyping rules. Example using C# pseudo-code:
var serializer = new Serializer();
using (var stream = new MemoryStream())
{
	var writer = new StreamWriter(stream);
    var reader = new StreamReader(stream);
    
    // write int message
    int message = 123;
    serializer.Serialize(message, writer, StatelessSession.Default);
    
    stream.Position = 0;
    object decoded = serializer.Deserialize(reader, StatelessSession.Default);
    
    Console.WriteLine(decoded.GetType()); // System.Int32
}
Here, we can see that decoded message has been decoded into a correct type (int, the same as the original message) without requiring from deserialize method to specify output type.
Polymorphism also goes deeper i.e. given abstract class Animal with subtypes Dog and Cat, we can easily serialize and deserialize List<Animal> without specifying any extra steps.
Generic-type aware

This is useful for situations, where we have a library, that needs to define serializable generic types, that use end-user defined types as type params. Example of such is Akka.DistributedData - where we have library type ORSet<T> which is serialized, persisted and passed over the wire, while the T parameter is specified by the akka library user.
Preserves object-graph information

Here, it means, that if we have the same object referenced from multiple fields inside serialized object graph, it should be also deserialized as a single object.
This could be propagated further eg. when we maintain a connection between two communicating nodes, we could maintain a cache of tracked objects that are repeatedly sent between them. Then instead of sending full payload each time, we could simply pass reference ID of that payload on subsequent calls within a scope of that connection.
Reflection-capable

This means, that we are able to serialize and deserialize reflection-specific data, like type member info or even AST trees in form of Expressions.
Version tolerant

Binary format must tolerate migrating back- and forward between different versions of the same message type. This is required for two reasons:

In distributed systems, rolling upgrades are popular technique of moving system from one version to another. In this case we want to be able for two nodes to communicate with each other, even if they describe the same message type in two different versions.
We want to be able to use this binary format for durable storage. In that case we want to be able to evolve message type while still being able to read old versions of it as long as possible.

From the case below following rules regarding version tolerance emerge:

Type name can change as long as its identifier was numeric and not implicit (see: stype).
Field names can change in any way.
New fields can be added.
Old fields can be removed if the field ids are used.
Field types can change in the boundaries of their type families.
For generic types, arity is allowed to change: in case of doubt a bottom type will be used.

Notation in diagrams

one byte:
+--------+
|        |
+--------+

a variable number of bytes:
+========+
|        |
+========+

variable number of objects stored in MessagePack format:
+~~~~~~~~~~~~~~~~~+
|                 |
+~~~~~~~~~~~~~~~~~+

Binary format

To describe a binary format, let's start from the example. Imagine we have following data type:
using Hyperion;

[Schema(15)] // opt-in: with implicit serializer you can omit schema id (see: stype)
class Message 
{
    // opt-in: with implicit serializer you order will be based on class fields order
    [field:Id(1)] 
    public int X { get; set; }
}
Now with this, we want to serialize following message: new Message { X = 123 }. The binary payload would look like the following:
0xa0 0x0f 0x01 0x10 0x7c 0x00

If we'd like to write that format down, this is what we'd get:
Type descriptor         Field id               Field value
    |                       |                       |
1010 0000   0000 1111   0000 0001   0001 0000   0111 1011   0000 0000
                |                       |                       |
             Type id          	    Field type		   End of the message


Type descriptor - since Hyperion is polymorphic serializer, each message contains information about its own type as a part of the payload. Here we inform that the type descriptor is number-based with type ID fitting into 1 byte (highest 4 bits: 1010).
Type ID - unique type identifier, always required for top level message. See: Type identifiers.
Field ID - to guarantee format backward and forward compatibility, payload must describe each field used inside (otherwise we would not be able to tolerate changes in message type). Field ID must be encoded within size of a single byte: this means that we have a limit of 256 fields per message.
Field type - next byte describes field type. In this case the highest 2 bits (00) tell us that we are using signed var int. Next 2 bits (01): that the value can be packed in one full byte.
Field value - since field type described content as 1 byte var int, this byte contains that value.
End message - a message delimiter, necessary to guarantee forward compatibility: in case when we got newer version of message, with fields that we don't understand, we need to be able to fast-forward until the message end.

Specification

The inspiration here is a binary format of Google Protocol Buffers and MsgPack. We use tagged fields  to achieve performance-wise version tolerance. Each payload starts with a single byte describing how to interpret the rest of the message, and will always end with empty byte (0x00) marking the end of the message (this is necessary to maintain not only backward, but also forward compatibility).
Hyperion format offers several core types (all binary data is stored using Big Endian encoding order) categorized in following families:

Var ints for integer values (both signed and unsigned) up to 128bits.
Floats for floating point numbers.
Refs used to preserve references inside serialized object graphs and serializer sessions.
Binaries for raw binary data.
Strings encoded as UTF-8, UTF-16 or Unicode.
Enumerables used to encode lenght-prefixed typed and untyped collections like arrays or maps.
Type IDs used to encode details about types/sub-types including awareness about runtime generic params.

Types inside the same family should be convertible as long as it doesn't result in data loss eg. int8 → uint18 as long as the value was positive or bin8 → bin32 (always). This allows to easily change type of message's field for example: from int to long without any changes in binary format. Cross family conversions are possible, but implementation dependent.
In general, type manifest can be encoded on a single byte.
Var int family


Type
Value (binary)
Value (hex)


fix int
0000 XXXX
0x00-0x0f


fix negative int
0100 XXXX
0x40-0x4f


uint8
0001 0000
0x10


uint16
0010 0000
0x20


uint32
0011 0000
0x30


uint64
0011 0001
0x31


uint128
0011 0010
0x32


int8
0101 0000
0x50


int16
0110 0000
0x60


int32
0111 0000
0x70


int64
0111 0001
0x71


All integer values are represented using 0SNN XXXX mask, where:

S informs if an integer is unsigned (0) or signed (1). For fix int, it informs about the the sign.
NN bits informs about it's size, for example:

00 describes a fix int (a number between 0 and (2^4)-1), in this case XXXX bits are used to encode that value directly. It's useful to efficiently encode values of limited size that can be represented as integers eg. boolean values or enums.
01 describes an int8/uint8 (a number between (2^4) and (2^8)-1). The next byte will be used to store a number. In this case XXXX bits must be all zeros.
10 describes an int16/uint16 (a number between (2^8) and (2^16)-1). The next 2 bytes will be used to store a number. In this case XXXX bits must be all zeros.
11 describes an int32/uint32 (a number between (2^16) and (2^32)-1). The next 4 bytes will be used to store a number.  In this case XXXX bits must be all zeros.


XXXX context depends on leading NN value:

For fix int (NN = 00) it's used to store value directly.
For int8/int16/int32/uint8/uint16/uint32 it must always be 0000.
For int64/uint64 it must be 0001. In this case the next 8 bytes will be used to store a number.
For uint128 it must be 0010. In this case the next 16 bytes will be used to store a value. This is especially useful for storing System.Guid.


Examples:
a positive fix int
+--------+
|0000XXXX|
+--------+

a signed int stored on 1 byte
+--------+--------+
|  0x50  |ZZZZZZZZ|
+--------+--------+

uint 32 stores a 32-bit big-endian unsigned integer
+--------+--------+--------+--------+--------+
|  0x30  |ZZZZZZZZ|ZZZZZZZZ|ZZZZZZZZ|ZZZZZZZZ|
+--------+--------+--------+--------+--------+


Float family


Type
Value (binary)
Value (hex)


float32
0111 0010
0x72


float64
0111 0011
0x73


float128
0111 0100
0x74


Floating point values as encoded using values:

0x72 is float32 (System.Single in .NET). The next 4 bytes are used to store the number.
0x73 is float64 (System.Double in .NET). The next 8 bytes are used to store the number.
0x74 is float128 (System.Decimal in .NET). The next 16 bytes are used to store the number.

Example:
float 32 stores a floating point number in IEEE 754 single precision floating point number format:
+--------+--------+--------+--------+--------+
|  0x72  |XXXXXXXX|XXXXXXXX|XXXXXXXX|XXXXXXXX|
+--------+--------+--------+--------+--------+

float 64 stores a floating point number in IEEE 754 double precision floating point number format:
+--------+--------+--------+--------+--------+--------+--------+--------+--------+
|  0x73  |YYYYYYYY|YYYYYYYY|YYYYYYYY|YYYYYYYY|YYYYYYYY|YYYYYYYY|YYYYYYYY|YYYYYYYY|
+--------+--------+--------+--------+--------+--------+--------+--------+--------+

References


Type
Value (binary)
Value (hex)


ref8
0001 1000
0x18


ref16
0001 1001
0x19


ref32
0001 1010
0x1a


cref8
0001 1100
0x1c


cref16
0001 1101
0x1d


cref32
0001 1110
0x1e


null cref
0001 1111
0x1f


Since Hyperion allows to serialize object graphs with reference tracking, a special prefix can be used to encode references. References stored this way will have a reference ID assigned by the serializer session that can be used (or overriden) by that serializer in scope of that session:

ref types are used to prefix an actual payload with its reference ID:

ref8 stores reference ID inside the next byte followed by payload type bytes and payload itself. Reference ID must be a number (2^4) to (2^8)-1.
ref16 stores reference ID inside the next 2 bytes followed by payload type bytes and payload itself. Reference ID must be a number (2^8) to (2^16)-1.
ref32 stores reference ID inside the next 4 bytes followed by payload type bytes and payload itself. Reference ID must be a number (2^16) to (2^32)-1.


cref types are used to pass only reference IDs themselves - in this case it's assumed that payload must have been passed previously using with correlated reference ID passed in corresponding ref type:

cref8 stores reference ID inside the next byte. Reference ID must be a number (2^4) to (2^8)-1.
cref16 stores reference ID inside the next 2 bytes. Reference ID must be a number (2^8) to (2^16)-1.
cref32 stores reference ID inside the next 4 bytes. Reference ID must be a number (2^16) to (2^32)-1.


null cref a null reference (or an empty value in case of value types). This is not the same as zero number (encoded as 0x00) for a simple reason: if we want to efficiently encode nullable types we need to discriminate between Nullable()/Nullable(0) (or None/Some(0) in F#) .

Binary family


Type
Value (binary)
Value (hex)


bin8
1000 0001
0x81


bin16
1000 0010
0x82


bin32
1000 0011
0x84


There are 3 binary format types, all used to store length-prefixed raw binary data:

bin8 is used to store binaries of 0 to (2^8)-1 bytes. In that case the actual length N is stored in the next byte followed by the N number of bytes being payload itself.
bin16 is used to store binaries of (2^8) to (2^16)-1 bytes. In that case the actual length N is stored in the next 2 bytes followed by the N number of bytes being payload itself.
bin32 is used to store binaries of (2^16) to (2^32)-1 bytes. In that case the actual length N is stored in the next 4 bytes followed by the N number of bytes being payload itself.

String family


Type
Value (binary)
Value (hex)


utf8
1000 01XX
0x85-0x87


utf16
1000 10XX
0x89-0x8b


utf32
1000 11XX
0x8d-0x8f


String families are used to encode length-prefixed binary payload (length describes length in bytes) to be interpreted as a string. There are 3 string types:

utf8 means that string is encoded UTF-8 encoding.
utf16 means that string is encoded using UTF-16 encoding.
utf32 means that string is encoded using Unicode encoding.

The XX component is used in the exactly same manner as in Binary family format described above.
Enumerable family


Type
Value (binary)
Value (hex)


enumerable
1001 XXYY
0x90-0x9f


Enumerable type describes length-prefixed, typeID-prefixed sequence of elements of any type. It can be used to describe both arrays, lists and maps. The XX mask is responsible for describing the arity of typeIDs described later on:

00 is untyped enumerable.
01 a single type ID will be defined, which is useful for most enumerable types (arrays, lists etc.)
10 two type IDs will be defined, which is useful for map types.
11 not used.

YY describes, how many bytes are used to store length of the enumerable:

00 means, this is an empty enumerable. In this case enumerables may contain values of different types (eg. System.Collections.IEnumerable). It's up to implementation to define how to handle it.
01 enumerable length is stored on a single byte, right after this one.
10 enumerable length is stored on the next 2 bytes after this one.
11 enumerable length is stored on the next 4 bytes after this one.

Examples:
empty enumerable of a single type 
+--------+~~~~~~~~~~~~~~~~~+
|  0x94  |     Type ID     |
+--------+~~~~~~~~~~~~~~~~~+

enumerable of a single type, whose length is upto (2^8)-1 elements 
+--------+--------+~~~~~~~~~~~~~~~~~+~~~~~~~~~~~~~~~~~+
|  0x95  |YYYYYYYY|     Type ID     |    N objects    |
+--------+--------+~~~~~~~~~~~~~~~~~+~~~~~~~~~~~~~~~~~+

enumerable of key-value pairs, whose length is upto (2^16)-1 elements
+--------+--------+--------+~~~~~~~~~~~~~~~~~+~~~~~~~~~~~~~~~~~+~~~~~~~~~~~~~~~~~+
|  0x9a  |XXXXXXXX|XXXXXXXX|   Key Type ID   |  Value Type ID  |   N*2 objects   |
+--------+--------+--------+~~~~~~~~~~~~~~~~~+~~~~~~~~~~~~~~~~~+~~~~~~~~~~~~~~~~~+

Type identifiers


Type
Value (binary)
Value (hex)


ntype8
1010 XXXX
0xb0-0xbf


ntype16
1100 XXXX
0xc0-0xcf


stype
1110 0000
0xe0


Since Hyperion is polymorphic serializer, it requires that the output type must be solely identified by the payload itself. Therefore a type identifier is included as a part of the payload, using one of the following formats:

ntype8 describes a type which ID is a number from 0 to (2^8)-1 range. The type ID will be included in the next byte. XXXX describes an N arity of generic type parameters used along this type. In that case after byte containing the identifier, the following bytes will describe N type identifiers of generic type params.
ntype16 describes a type which ID is a number from (2^8) to (2^16)-1 range. The type ID will be included in the next 2 bytes. XXXX describes an N arity of generic type parameters used along this type. In that case after byte containing the identifier, the following bytes will describe N type identifiers of generic type params.
stype is used for legacy reasons (Hyperion allowed to serialize types as is, without need of setting up type ID manually). In that case the following part will contain either a string payload containing a fully qualified type name (including generics) if preserving references option is off or a ref (see References) to that type.

From this design few limitations are quite visible:

Using stypes is highly discouraged, especially when payload is going to be stored in a persistent store.
Type ID must fit into short, which means up to 32767 types (negative values are reserved for types registered natively by Hyperion like. System.Int32, System.DateTime etc.).
Max allowed arity of a type must fit into 4 bit, which means that type ID cannot be used for types having more than 15 generic type parameters.

Keep in mind that type IDs are necessary to be encoded only in few situations:

For fields storing complex message types: this is necessity as serializer is supposed to work in an environment where subtyping is allowed (i.e. field of type Animal, where the actual payload is of type Dog, Cat etc).
For top level messages. This is necessary, for same reason as above. Additionally it's also required for primitive types. Example: given alone message 0x02 0x00 (which is fix int = 2), it's impossible to determine what output type is. It could be byte, short, int, long, char or bool (or user defined primitive type). While it's possible to infer that type easily when it's a field of larger message, it such value is self-contained, we need a type ID for it.
When a type is part of type ref param array for a wrapping generic type.

Examples:
ntype8 with no generic args i.e. System.Int32
+--------+--------+
|  0xb0  |XXXXXXXX|
+--------+--------+

ntype16 with 2 generic args i.e. KeyValuePair`2
+--------+--------+--------+~~~~~~~~~~~~~~~~~+~~~~~~~~~~~~~~~~~+
|  0xc2  |XXXXXXXX|XXXXXXXX|   1st Type ID   |   2nd Type ID   |
+--------+--------+--------+~~~~~~~~~~~~~~~~~+~~~~~~~~~~~~~~~~~+

stype having ref8 type having utf8 short string containing type manifest
+--------+--------+--------+--------+--------+~~~~~~~~~~~~~~~~~+
|  0xe0  |  0x18  |XXXXXXXX|  0x85  |YYYYYYYY|  type manifest  |
+--------+--------+--------+--------+--------+~~~~~~~~~~~~~~~~~+

stype having cref16 type referencing a type described before in same session
+--------+--------+--------+--------+
|  0xe0  |  0x1c  |XXXXXXXX|XXXXXXXX|
+--------+--------+--------+--------+
Type	Value (binary)	Value (hex)
fix int	0000 XXXX	0x00-0x0f
fix negative int	0100 XXXX	0x40-0x4f
uint8	0001 0000	0x10
uint16	0010 0000	0x20
uint32	0011 0000	0x30
uint64	0011 0001	0x31
uint128	0011 0010	0x32
int8	0101 0000	0x50
int16	0110 0000	0x60
int32	0111 0000	0x70
int64	0111 0001	0x71
Type	Value (binary)	Value (hex)
float32	0111 0010	0x72
float64	0111 0011	0x73
float128	0111 0100	0x74
Type	Value (binary)	Value (hex)
ref8	0001 1000	0x18
ref16	0001 1001	0x19
ref32	0001 1010	0x1a
cref8	0001 1100	0x1c
cref16	0001 1101	0x1d
cref32	0001 1110	0x1e
null cref	0001 1111	0x1f
Type	Value (binary)	Value (hex)
utf8	1000 01XX	0x85-0x87
utf16	1000 10XX	0x89-0x8b
utf32	1000 11XX	0x8d-0x8f
Type	Value (binary)	Value (hex)
ntype8	1010 XXXX	0xb0-0xbf
ntype16	1100 XXXX	0xc0-0xcf
stype	1110 0000	0xe0