This is a draft for binary format and implementation proposal for Hyperion, a fast binary serializer for .NET. Currently Hyperion is in beta and uses non-standard binary format. This document aims to specify it. Before we do that, there are few important properties of the serializer we wish to maintain:
Hyperion is polymorphic serializer, therefore it's tolerant for subtyping rules. Example using C# pseudo-code:
var serializer = new Serializer();
using (var stream = new MemoryStream())
{
var writer = new StreamWriter(stream);
var reader = new StreamReader(stream);
// write int message
int message = 123;
serializer.Serialize(message, writer, StatelessSession.Default);
stream.Position = 0;
object decoded = serializer.Deserialize(reader, StatelessSession.Default);
Console.WriteLine(decoded.GetType()); // System.Int32
}
Here, we can see that decoded
message has been decoded into a correct type (int
, the same as the original message) without requiring from deserialize method to specify output type.
Polymorphism also goes deeper i.e. given abstract class Animal
with subtypes Dog
and Cat
, we can easily serialize and deserialize List<Animal>
without specifying any extra steps.
This is useful for situations, where we have a library, that needs to define serializable generic types, that use end-user defined types as type params. Example of such is Akka.DistributedData - where we have library type ORSet<T>
which is serialized, persisted and passed over the wire, while the T
parameter is specified by the akka library user.
Here, it means, that if we have the same object referenced from multiple fields inside serialized object graph, it should be also deserialized as a single object.
This could be propagated further eg. when we maintain a connection between two communicating nodes, we could maintain a cache of tracked objects that are repeatedly sent between them. Then instead of sending full payload each time, we could simply pass reference ID of that payload on subsequent calls within a scope of that connection.
This means, that we are able to serialize and deserialize reflection-specific data, like type member info or even AST trees in form of Expressions.
Binary format must tolerate migrating back- and forward between different versions of the same message type. This is required for two reasons:
- In distributed systems, rolling upgrades are popular technique of moving system from one version to another. In this case we want to be able for two nodes to communicate with each other, even if they describe the same message type in two different versions.
- We want to be able to use this binary format for durable storage. In that case we want to be able to evolve message type while still being able to read old versions of it as long as possible.
From the case below following rules regarding version tolerance emerge:
- Type name can change as long as its identifier was numeric and not implicit (see:
stype
). - Field names can change in any way.
- New fields can be added.
- Old fields can be removed if the field ids are used.
- Field types can change in the boundaries of their type families.
- For generic types, arity is allowed to change: in case of doubt a bottom type will be used.
one byte:
+--------+
| |
+--------+
a variable number of bytes:
+========+
| |
+========+
variable number of objects stored in MessagePack format:
+~~~~~~~~~~~~~~~~~+
| |
+~~~~~~~~~~~~~~~~~+
To describe a binary format, let's start from the example. Imagine we have following data type:
using Hyperion;
[Schema(15)] // opt-in: with implicit serializer you can omit schema id (see: stype)
class Message
{
// opt-in: with implicit serializer you order will be based on class fields order
[field:Id(1)]
public int X { get; set; }
}
Now with this, we want to serialize following message: new Message { X = 123 }
. The binary payload would look like the following:
0xa0 0x0f 0x01 0x10 0x7c 0x00
If we'd like to write that format down, this is what we'd get:
Type descriptor Field id Field value
| | |
1010 0000 0000 1111 0000 0001 0001 0000 0111 1011 0000 0000
| | |
Type id Field type End of the message
- Type descriptor - since Hyperion is polymorphic serializer, each message contains information about its own type as a part of the payload. Here we inform that the type descriptor is number-based with type ID fitting into 1 byte (highest 4 bits:
1010
). - Type ID - unique type identifier, always required for top level message. See: Type identifiers.
- Field ID - to guarantee format backward and forward compatibility, payload must describe each field used inside (otherwise we would not be able to tolerate changes in message type). Field ID must be encoded within size of a single byte: this means that we have a limit of 256 fields per message.
- Field type - next byte describes field type. In this case the highest 2 bits (
00
) tell us that we are using signed var int. Next 2 bits (01
): that the value can be packed in one full byte. - Field value - since field type described content as 1 byte var int, this byte contains that value.
- End message - a message delimiter, necessary to guarantee forward compatibility: in case when we got newer version of message, with fields that we don't understand, we need to be able to fast-forward until the message end.
The inspiration here is a binary format of Google Protocol Buffers and MsgPack. We use tagged fields to achieve performance-wise version tolerance. Each payload starts with a single byte describing how to interpret the rest of the message, and will always end with empty byte (0x00) marking the end of the message (this is necessary to maintain not only backward, but also forward compatibility).
Hyperion format offers several core types (all binary data is stored using Big Endian encoding order) categorized in following families:
- Var ints for integer values (both signed and unsigned) up to 128bits.
- Floats for floating point numbers.
- Refs used to preserve references inside serialized object graphs and serializer sessions.
- Binaries for raw binary data.
- Strings encoded as UTF-8, UTF-16 or Unicode.
- Enumerables used to encode lenght-prefixed typed and untyped collections like arrays or maps.
- Type IDs used to encode details about types/sub-types including awareness about runtime generic params.
Types inside the same family should be convertible as long as it doesn't result in data loss eg. int8
→ uint18
as long as the value was positive or bin8
→ bin32
(always). This allows to easily change type of message's field for example: from int
to long
without any changes in binary format. Cross family conversions are possible, but implementation dependent.
In general, type manifest can be encoded on a single byte.
Type | Value (binary) | Value (hex) |
---|---|---|
fix int | 0000 XXXX | 0x00-0x0f |
fix negative int | 0100 XXXX | 0x40-0x4f |
uint8 | 0001 0000 | 0x10 |
uint16 | 0010 0000 | 0x20 |
uint32 | 0011 0000 | 0x30 |
uint64 | 0011 0001 | 0x31 |
uint128 | 0011 0010 | 0x32 |
int8 | 0101 0000 | 0x50 |
int16 | 0110 0000 | 0x60 |
int32 | 0111 0000 | 0x70 |
int64 | 0111 0001 | 0x71 |
All integer values are represented using 0SNN XXXX
mask, where:
S
informs if an integer is unsigned (0
) or signed (1
). For fix int, it informs about the the sign.NN
bits informs about it's size, for example:00
describes a fix int (a number between 0 and (2^4)-1), in this caseXXXX
bits are used to encode that value directly. It's useful to efficiently encode values of limited size that can be represented as integers eg. boolean values or enums.01
describes anint8
/uint8
(a number between (2^4) and (2^8)-1). The next byte will be used to store a number. In this caseXXXX
bits must be all zeros.10
describes anint16
/uint16
(a number between (2^8) and (2^16)-1). The next 2 bytes will be used to store a number. In this caseXXXX
bits must be all zeros.11
describes anint32
/uint32
(a number between (2^16) and (2^32)-1). The next 4 bytes will be used to store a number. In this caseXXXX
bits must be all zeros.
XXXX
context depends on leadingNN
value:- For fix int (
NN
=00
) it's used to store value directly. - For
int8
/int16
/int32
/uint8
/uint16
/uint32
it must always be0000
. - For
int64
/uint64
it must be0001
. In this case the next 8 bytes will be used to store a number. - For
uint128
it must be0010
. In this case the next 16 bytes will be used to store a value. This is especially useful for storingSystem.Guid
.
- For fix int (
Examples:
a positive fix int
+--------+
|0000XXXX|
+--------+
a signed int stored on 1 byte
+--------+--------+
| 0x50 |ZZZZZZZZ|
+--------+--------+
uint 32 stores a 32-bit big-endian unsigned integer
+--------+--------+--------+--------+--------+
| 0x30 |ZZZZZZZZ|ZZZZZZZZ|ZZZZZZZZ|ZZZZZZZZ|
+--------+--------+--------+--------+--------+
Type | Value (binary) | Value (hex) |
---|---|---|
float32 | 0111 0010 | 0x72 |
float64 | 0111 0011 | 0x73 |
float128 | 0111 0100 | 0x74 |
Floating point values as encoded using values:
0x72
is float32 (System.Single
in .NET). The next 4 bytes are used to store the number.0x73
is float64 (System.Double
in .NET). The next 8 bytes are used to store the number.0x74
is float128 (System.Decimal
in .NET). The next 16 bytes are used to store the number.
Example:
float 32 stores a floating point number in IEEE 754 single precision floating point number format:
+--------+--------+--------+--------+--------+
| 0x72 |XXXXXXXX|XXXXXXXX|XXXXXXXX|XXXXXXXX|
+--------+--------+--------+--------+--------+
float 64 stores a floating point number in IEEE 754 double precision floating point number format:
+--------+--------+--------+--------+--------+--------+--------+--------+--------+
| 0x73 |YYYYYYYY|YYYYYYYY|YYYYYYYY|YYYYYYYY|YYYYYYYY|YYYYYYYY|YYYYYYYY|YYYYYYYY|
+--------+--------+--------+--------+--------+--------+--------+--------+--------+
Type | Value (binary) | Value (hex) |
---|---|---|
ref8 | 0001 1000 | 0x18 |
ref16 | 0001 1001 | 0x19 |
ref32 | 0001 1010 | 0x1a |
cref8 | 0001 1100 | 0x1c |
cref16 | 0001 1101 | 0x1d |
cref32 | 0001 1110 | 0x1e |
null cref | 0001 1111 | 0x1f |
Since Hyperion allows to serialize object graphs with reference tracking, a special prefix can be used to encode references. References stored this way will have a reference ID assigned by the serializer session that can be used (or overriden) by that serializer in scope of that session:
ref
types are used to prefix an actual payload with its reference ID:ref8
stores reference ID inside the next byte followed by payload type bytes and payload itself. Reference ID must be a number (2^4) to (2^8)-1.ref16
stores reference ID inside the next 2 bytes followed by payload type bytes and payload itself. Reference ID must be a number (2^8) to (2^16)-1.ref32
stores reference ID inside the next 4 bytes followed by payload type bytes and payload itself. Reference ID must be a number (2^16) to (2^32)-1.
cref
types are used to pass only reference IDs themselves - in this case it's assumed that payload must have been passed previously using with correlated reference ID passed in correspondingref
type:cref8
stores reference ID inside the next byte. Reference ID must be a number (2^4) to (2^8)-1.cref16
stores reference ID inside the next 2 bytes. Reference ID must be a number (2^8) to (2^16)-1.cref32
stores reference ID inside the next 4 bytes. Reference ID must be a number (2^16) to (2^32)-1.
null cref
a null reference (or an empty value in case of value types). This is not the same as zero number (encoded as0x00
) for a simple reason: if we want to efficiently encode nullable types we need to discriminate betweenNullable()
/Nullable(0)
(orNone
/Some(0)
in F#) .
Type | Value (binary) | Value (hex) |
---|---|---|
bin8 | 1000 0001 | 0x81 |
bin16 | 1000 0010 | 0x82 |
bin32 | 1000 0011 | 0x84 |
There are 3 binary format types, all used to store length-prefixed raw binary data:
bin8
is used to store binaries of 0 to (2^8)-1 bytes. In that case the actual lengthN
is stored in the next byte followed by theN
number of bytes being payload itself.bin16
is used to store binaries of (2^8) to (2^16)-1 bytes. In that case the actual lengthN
is stored in the next 2 bytes followed by theN
number of bytes being payload itself.bin32
is used to store binaries of (2^16) to (2^32)-1 bytes. In that case the actual lengthN
is stored in the next 4 bytes followed by theN
number of bytes being payload itself.
Type | Value (binary) | Value (hex) |
---|---|---|
utf8 | 1000 01XX | 0x85-0x87 |
utf16 | 1000 10XX | 0x89-0x8b |
utf32 | 1000 11XX | 0x8d-0x8f |
String families are used to encode length-prefixed binary payload (length describes length in bytes) to be interpreted as a string. There are 3 string types:
utf8
means that string is encoded UTF-8 encoding.utf16
means that string is encoded using UTF-16 encoding.utf32
means that string is encoded using Unicode encoding.
The XX
component is used in the exactly same manner as in Binary family format described above.
Type | Value (binary) | Value (hex) |
---|---|---|
enumerable | 1001 XXYY | 0x90-0x9f |
Enumerable type describes length-prefixed, typeID-prefixed sequence of elements of any type. It can be used to describe both arrays, lists and maps. The XX
mask is responsible for describing the arity of typeIDs described later on:
00
is untyped enumerable.01
a single type ID will be defined, which is useful for most enumerable types (arrays, lists etc.)10
two type IDs will be defined, which is useful for map types.11
not used.
YY
describes, how many bytes are used to store length of the enumerable:
00
means, this is an empty enumerable. In this case enumerables may contain values of different types (eg.System.Collections.IEnumerable
). It's up to implementation to define how to handle it.01
enumerable length is stored on a single byte, right after this one.10
enumerable length is stored on the next 2 bytes after this one.11
enumerable length is stored on the next 4 bytes after this one.
Examples:
empty enumerable of a single type
+--------+~~~~~~~~~~~~~~~~~+
| 0x94 | Type ID |
+--------+~~~~~~~~~~~~~~~~~+
enumerable of a single type, whose length is upto (2^8)-1 elements
+--------+--------+~~~~~~~~~~~~~~~~~+~~~~~~~~~~~~~~~~~+
| 0x95 |YYYYYYYY| Type ID | N objects |
+--------+--------+~~~~~~~~~~~~~~~~~+~~~~~~~~~~~~~~~~~+
enumerable of key-value pairs, whose length is upto (2^16)-1 elements
+--------+--------+--------+~~~~~~~~~~~~~~~~~+~~~~~~~~~~~~~~~~~+~~~~~~~~~~~~~~~~~+
| 0x9a |XXXXXXXX|XXXXXXXX| Key Type ID | Value Type ID | N*2 objects |
+--------+--------+--------+~~~~~~~~~~~~~~~~~+~~~~~~~~~~~~~~~~~+~~~~~~~~~~~~~~~~~+
Type | Value (binary) | Value (hex) |
---|---|---|
ntype8 | 1010 XXXX | 0xb0-0xbf |
ntype16 | 1100 XXXX | 0xc0-0xcf |
stype | 1110 0000 | 0xe0 |
Since Hyperion is polymorphic serializer, it requires that the output type must be solely identified by the payload itself. Therefore a type identifier is included as a part of the payload, using one of the following formats:
ntype8
describes a type which ID is a number from 0 to (2^8)-1 range. The type ID will be included in the next byte.XXXX
describes anN
arity of generic type parameters used along this type. In that case after byte containing the identifier, the following bytes will describeN
type identifiers of generic type params.ntype16
describes a type which ID is a number from (2^8) to (2^16)-1 range. The type ID will be included in the next 2 bytes.XXXX
describes anN
arity of generic type parameters used along this type. In that case after byte containing the identifier, the following bytes will describeN
type identifiers of generic type params.stype
is used for legacy reasons (Hyperion allowed to serialize types as is, without need of setting up type ID manually). In that case the following part will contain either a string payload containing a fully qualified type name (including generics) if preserving references option is off or aref
(see References) to that type.
From this design few limitations are quite visible:
- Using
stype
s is highly discouraged, especially when payload is going to be stored in a persistent store. - Type ID must fit into short, which means up to 32767 types (negative values are reserved for types registered natively by Hyperion like.
System.Int32
,System.DateTime
etc.). - Max allowed arity of a type must fit into 4 bit, which means that type ID cannot be used for types having more than 15 generic type parameters.
Keep in mind that type IDs are necessary to be encoded only in few situations:
- For fields storing complex message types: this is necessity as serializer is supposed to work in an environment where subtyping is allowed (i.e. field of type
Animal
, where the actual payload is of typeDog
,Cat
etc). - For top level messages. This is necessary, for same reason as above. Additionally it's also required for primitive types. Example: given alone message
0x02 0x00
(which is fix int = 2), it's impossible to determine what output type is. It could be byte, short, int, long, char or bool (or user defined primitive type). While it's possible to infer that type easily when it's a field of larger message, it such value is self-contained, we need a type ID for it. - When a type is part of type ref param array for a wrapping generic type.
Examples:
ntype8 with no generic args i.e. System.Int32
+--------+--------+
| 0xb0 |XXXXXXXX|
+--------+--------+
ntype16 with 2 generic args i.e. KeyValuePair`2
+--------+--------+--------+~~~~~~~~~~~~~~~~~+~~~~~~~~~~~~~~~~~+
| 0xc2 |XXXXXXXX|XXXXXXXX| 1st Type ID | 2nd Type ID |
+--------+--------+--------+~~~~~~~~~~~~~~~~~+~~~~~~~~~~~~~~~~~+
stype having ref8 type having utf8 short string containing type manifest
+--------+--------+--------+--------+--------+~~~~~~~~~~~~~~~~~+
| 0xe0 | 0x18 |XXXXXXXX| 0x85 |YYYYYYYY| type manifest |
+--------+--------+--------+--------+--------+~~~~~~~~~~~~~~~~~+
stype having cref16 type referencing a type described before in same session
+--------+--------+--------+--------+
| 0xe0 | 0x1c |XXXXXXXX|XXXXXXXX|
+--------+--------+--------+--------+