Skip to content

Instantly share code, notes, and snippets.

@Horusiath
Last active June 21, 2023 18:31
Show Gist options
  • Star 1 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
  • Save Horusiath/a153b225e3dcf24d5082b1be05108f72 to your computer and use it in GitHub Desktop.
Save Horusiath/a153b225e3dcf24d5082b1be05108f72 to your computer and use it in GitHub Desktop.
[DRAFT] Hyperion binary format

Hyperion

This is a draft for binary format and implementation proposal for Hyperion, a fast binary serializer for .NET. Currently Hyperion is in beta and uses non-standard binary format. This document aims to specify it. Before we do that, there are few important properties of the serializer we wish to maintain:

Polymorphic

Hyperion is polymorphic serializer, therefore it's tolerant for subtyping rules. Example using C# pseudo-code:

var serializer = new Serializer();
using (var stream = new MemoryStream())
{
	var writer = new StreamWriter(stream);
    var reader = new StreamReader(stream);
    
    // write int message
    int message = 123;
    serializer.Serialize(message, writer, StatelessSession.Default);
    
    stream.Position = 0;
    object decoded = serializer.Deserialize(reader, StatelessSession.Default);
    
    Console.WriteLine(decoded.GetType()); // System.Int32
}

Here, we can see that decoded message has been decoded into a correct type (int, the same as the original message) without requiring from deserialize method to specify output type.

Polymorphism also goes deeper i.e. given abstract class Animal with subtypes Dog and Cat, we can easily serialize and deserialize List<Animal> without specifying any extra steps.

Generic-type aware

This is useful for situations, where we have a library, that needs to define serializable generic types, that use end-user defined types as type params. Example of such is Akka.DistributedData - where we have library type ORSet<T> which is serialized, persisted and passed over the wire, while the T parameter is specified by the akka library user.

Preserves object-graph information

Here, it means, that if we have the same object referenced from multiple fields inside serialized object graph, it should be also deserialized as a single object.

This could be propagated further eg. when we maintain a connection between two communicating nodes, we could maintain a cache of tracked objects that are repeatedly sent between them. Then instead of sending full payload each time, we could simply pass reference ID of that payload on subsequent calls within a scope of that connection.

Reflection-capable

This means, that we are able to serialize and deserialize reflection-specific data, like type member info or even AST trees in form of Expressions.

Version tolerant

Binary format must tolerate migrating back- and forward between different versions of the same message type. This is required for two reasons:

  1. In distributed systems, rolling upgrades are popular technique of moving system from one version to another. In this case we want to be able for two nodes to communicate with each other, even if they describe the same message type in two different versions.
  2. We want to be able to use this binary format for durable storage. In that case we want to be able to evolve message type while still being able to read old versions of it as long as possible.

From the case below following rules regarding version tolerance emerge:

  • Type name can change as long as its identifier was numeric and not implicit (see: stype).
  • Field names can change in any way.
  • New fields can be added.
  • Old fields can be removed if the field ids are used.
  • Field types can change in the boundaries of their type families.
  • For generic types, arity is allowed to change: in case of doubt a bottom type will be used.

Notation in diagrams

one byte:
+--------+
|        |
+--------+

a variable number of bytes:
+========+
|        |
+========+

variable number of objects stored in MessagePack format:
+~~~~~~~~~~~~~~~~~+
|                 |
+~~~~~~~~~~~~~~~~~+

Binary format

To describe a binary format, let's start from the example. Imagine we have following data type:

using Hyperion;

[Schema(15)] // opt-in: with implicit serializer you can omit schema id (see: stype)
class Message 
{
    // opt-in: with implicit serializer you order will be based on class fields order
    [field:Id(1)] 
    public int X { get; set; }
}

Now with this, we want to serialize following message: new Message { X = 123 }. The binary payload would look like the following:

0xa0 0x0f 0x01 0x10 0x7c 0x00

If we'd like to write that format down, this is what we'd get:

Type descriptor         Field id               Field value
    |                       |                       |
1010 0000   0000 1111   0000 0001   0001 0000   0111 1011   0000 0000
                |                       |                       |
             Type id          	    Field type		   End of the message
  • Type descriptor - since Hyperion is polymorphic serializer, each message contains information about its own type as a part of the payload. Here we inform that the type descriptor is number-based with type ID fitting into 1 byte (highest 4 bits: 1010).
  • Type ID - unique type identifier, always required for top level message. See: Type identifiers.
  • Field ID - to guarantee format backward and forward compatibility, payload must describe each field used inside (otherwise we would not be able to tolerate changes in message type). Field ID must be encoded within size of a single byte: this means that we have a limit of 256 fields per message.
  • Field type - next byte describes field type. In this case the highest 2 bits (00) tell us that we are using signed var int. Next 2 bits (01): that the value can be packed in one full byte.
  • Field value - since field type described content as 1 byte var int, this byte contains that value.
  • End message - a message delimiter, necessary to guarantee forward compatibility: in case when we got newer version of message, with fields that we don't understand, we need to be able to fast-forward until the message end.

Specification

The inspiration here is a binary format of Google Protocol Buffers and MsgPack. We use tagged fields to achieve performance-wise version tolerance. Each payload starts with a single byte describing how to interpret the rest of the message, and will always end with empty byte (0x00) marking the end of the message (this is necessary to maintain not only backward, but also forward compatibility).

Hyperion format offers several core types (all binary data is stored using Big Endian encoding order) categorized in following families:

  • Var ints for integer values (both signed and unsigned) up to 128bits.
  • Floats for floating point numbers.
  • Refs used to preserve references inside serialized object graphs and serializer sessions.
  • Binaries for raw binary data.
  • Strings encoded as UTF-8, UTF-16 or Unicode.
  • Enumerables used to encode lenght-prefixed typed and untyped collections like arrays or maps.
  • Type IDs used to encode details about types/sub-types including awareness about runtime generic params.

Types inside the same family should be convertible as long as it doesn't result in data loss eg. int8uint18 as long as the value was positive or bin8bin32 (always). This allows to easily change type of message's field for example: from int to long without any changes in binary format. Cross family conversions are possible, but implementation dependent.

In general, type manifest can be encoded on a single byte.

Var int family

Type Value (binary) Value (hex)
fix int 0000 XXXX 0x00-0x0f
fix negative int 0100 XXXX 0x40-0x4f
uint8 0001 0000 0x10
uint16 0010 0000 0x20
uint32 0011 0000 0x30
uint64 0011 0001 0x31
uint128 0011 0010 0x32
int8 0101 0000 0x50
int16 0110 0000 0x60
int32 0111 0000 0x70
int64 0111 0001 0x71

All integer values are represented using 0SNN XXXX mask, where:

  • S informs if an integer is unsigned (0) or signed (1). For fix int, it informs about the the sign.
  • NN bits informs about it's size, for example:
    • 00 describes a fix int (a number between 0 and (2^4)-1), in this case XXXX bits are used to encode that value directly. It's useful to efficiently encode values of limited size that can be represented as integers eg. boolean values or enums.
    • 01 describes an int8/uint8 (a number between (2^4) and (2^8)-1). The next byte will be used to store a number. In this case XXXX bits must be all zeros.
    • 10 describes an int16/uint16 (a number between (2^8) and (2^16)-1). The next 2 bytes will be used to store a number. In this case XXXX bits must be all zeros.
    • 11 describes an int32/uint32 (a number between (2^16) and (2^32)-1). The next 4 bytes will be used to store a number. In this case XXXX bits must be all zeros.
  • XXXX context depends on leading NN value:
    • For fix int (NN = 00) it's used to store value directly.
    • For int8/int16/int32/uint8/uint16/uint32 it must always be 0000.
    • For int64/uint64 it must be 0001. In this case the next 8 bytes will be used to store a number.
    • For uint128 it must be 0010. In this case the next 16 bytes will be used to store a value. This is especially useful for storing System.Guid.

Examples:

a positive fix int
+--------+
|0000XXXX|
+--------+

a signed int stored on 1 byte
+--------+--------+
|  0x50  |ZZZZZZZZ|
+--------+--------+

uint 32 stores a 32-bit big-endian unsigned integer
+--------+--------+--------+--------+--------+
|  0x30  |ZZZZZZZZ|ZZZZZZZZ|ZZZZZZZZ|ZZZZZZZZ|
+--------+--------+--------+--------+--------+

Float family

Type Value (binary) Value (hex)
float32 0111 0010 0x72
float64 0111 0011 0x73
float128 0111 0100 0x74

Floating point values as encoded using values:

  • 0x72 is float32 (System.Single in .NET). The next 4 bytes are used to store the number.
  • 0x73 is float64 (System.Double in .NET). The next 8 bytes are used to store the number.
  • 0x74 is float128 (System.Decimal in .NET). The next 16 bytes are used to store the number.

Example:

float 32 stores a floating point number in IEEE 754 single precision floating point number format:
+--------+--------+--------+--------+--------+
|  0x72  |XXXXXXXX|XXXXXXXX|XXXXXXXX|XXXXXXXX|
+--------+--------+--------+--------+--------+

float 64 stores a floating point number in IEEE 754 double precision floating point number format:
+--------+--------+--------+--------+--------+--------+--------+--------+--------+
|  0x73  |YYYYYYYY|YYYYYYYY|YYYYYYYY|YYYYYYYY|YYYYYYYY|YYYYYYYY|YYYYYYYY|YYYYYYYY|
+--------+--------+--------+--------+--------+--------+--------+--------+--------+

References

Type Value (binary) Value (hex)
ref8 0001 1000 0x18
ref16 0001 1001 0x19
ref32 0001 1010 0x1a
cref8 0001 1100 0x1c
cref16 0001 1101 0x1d
cref32 0001 1110 0x1e
null cref 0001 1111 0x1f

Since Hyperion allows to serialize object graphs with reference tracking, a special prefix can be used to encode references. References stored this way will have a reference ID assigned by the serializer session that can be used (or overriden) by that serializer in scope of that session:

  1. ref types are used to prefix an actual payload with its reference ID:
    • ref8 stores reference ID inside the next byte followed by payload type bytes and payload itself. Reference ID must be a number (2^4) to (2^8)-1.
    • ref16 stores reference ID inside the next 2 bytes followed by payload type bytes and payload itself. Reference ID must be a number (2^8) to (2^16)-1.
    • ref32 stores reference ID inside the next 4 bytes followed by payload type bytes and payload itself. Reference ID must be a number (2^16) to (2^32)-1.
  2. cref types are used to pass only reference IDs themselves - in this case it's assumed that payload must have been passed previously using with correlated reference ID passed in corresponding ref type:
    • cref8 stores reference ID inside the next byte. Reference ID must be a number (2^4) to (2^8)-1.
    • cref16 stores reference ID inside the next 2 bytes. Reference ID must be a number (2^8) to (2^16)-1.
    • cref32 stores reference ID inside the next 4 bytes. Reference ID must be a number (2^16) to (2^32)-1.
  3. null cref a null reference (or an empty value in case of value types). This is not the same as zero number (encoded as 0x00) for a simple reason: if we want to efficiently encode nullable types we need to discriminate between Nullable()/Nullable(0) (or None/Some(0) in F#) .

Binary family

Type Value (binary) Value (hex)
bin8 1000 0001 0x81
bin16 1000 0010 0x82
bin32 1000 0011 0x84

There are 3 binary format types, all used to store length-prefixed raw binary data:

  • bin8 is used to store binaries of 0 to (2^8)-1 bytes. In that case the actual length N is stored in the next byte followed by the N number of bytes being payload itself.
  • bin16 is used to store binaries of (2^8) to (2^16)-1 bytes. In that case the actual length N is stored in the next 2 bytes followed by the N number of bytes being payload itself.
  • bin32 is used to store binaries of (2^16) to (2^32)-1 bytes. In that case the actual length N is stored in the next 4 bytes followed by the N number of bytes being payload itself.

String family

Type Value (binary) Value (hex)
utf8 1000 01XX 0x85-0x87
utf16 1000 10XX 0x89-0x8b
utf32 1000 11XX 0x8d-0x8f

String families are used to encode length-prefixed binary payload (length describes length in bytes) to be interpreted as a string. There are 3 string types:

  • utf8 means that string is encoded UTF-8 encoding.
  • utf16 means that string is encoded using UTF-16 encoding.
  • utf32 means that string is encoded using Unicode encoding.

The XX component is used in the exactly same manner as in Binary family format described above.

Enumerable family

Type Value (binary) Value (hex)
enumerable 1001 XXYY 0x90-0x9f

Enumerable type describes length-prefixed, typeID-prefixed sequence of elements of any type. It can be used to describe both arrays, lists and maps. The XX mask is responsible for describing the arity of typeIDs described later on:

  • 00 is untyped enumerable.
  • 01 a single type ID will be defined, which is useful for most enumerable types (arrays, lists etc.)
  • 10 two type IDs will be defined, which is useful for map types.
  • 11 not used.

YY describes, how many bytes are used to store length of the enumerable:

  • 00 means, this is an empty enumerable. In this case enumerables may contain values of different types (eg. System.Collections.IEnumerable). It's up to implementation to define how to handle it.
  • 01 enumerable length is stored on a single byte, right after this one.
  • 10 enumerable length is stored on the next 2 bytes after this one.
  • 11 enumerable length is stored on the next 4 bytes after this one.

Examples:

empty enumerable of a single type 
+--------+~~~~~~~~~~~~~~~~~+
|  0x94  |     Type ID     |
+--------+~~~~~~~~~~~~~~~~~+

enumerable of a single type, whose length is upto (2^8)-1 elements 
+--------+--------+~~~~~~~~~~~~~~~~~+~~~~~~~~~~~~~~~~~+
|  0x95  |YYYYYYYY|     Type ID     |    N objects    |
+--------+--------+~~~~~~~~~~~~~~~~~+~~~~~~~~~~~~~~~~~+

enumerable of key-value pairs, whose length is upto (2^16)-1 elements
+--------+--------+--------+~~~~~~~~~~~~~~~~~+~~~~~~~~~~~~~~~~~+~~~~~~~~~~~~~~~~~+
|  0x9a  |XXXXXXXX|XXXXXXXX|   Key Type ID   |  Value Type ID  |   N*2 objects   |
+--------+--------+--------+~~~~~~~~~~~~~~~~~+~~~~~~~~~~~~~~~~~+~~~~~~~~~~~~~~~~~+

Type identifiers

Type Value (binary) Value (hex)
ntype8 1010 XXXX 0xb0-0xbf
ntype16 1100 XXXX 0xc0-0xcf
stype 1110 0000 0xe0

Since Hyperion is polymorphic serializer, it requires that the output type must be solely identified by the payload itself. Therefore a type identifier is included as a part of the payload, using one of the following formats:

  • ntype8 describes a type which ID is a number from 0 to (2^8)-1 range. The type ID will be included in the next byte. XXXX describes an N arity of generic type parameters used along this type. In that case after byte containing the identifier, the following bytes will describe N type identifiers of generic type params.
  • ntype16 describes a type which ID is a number from (2^8) to (2^16)-1 range. The type ID will be included in the next 2 bytes. XXXX describes an N arity of generic type parameters used along this type. In that case after byte containing the identifier, the following bytes will describe N type identifiers of generic type params.
  • stype is used for legacy reasons (Hyperion allowed to serialize types as is, without need of setting up type ID manually). In that case the following part will contain either a string payload containing a fully qualified type name (including generics) if preserving references option is off or a ref (see References) to that type.

From this design few limitations are quite visible:

  • Using stypes is highly discouraged, especially when payload is going to be stored in a persistent store.
  • Type ID must fit into short, which means up to 32767 types (negative values are reserved for types registered natively by Hyperion like. System.Int32, System.DateTime etc.).
  • Max allowed arity of a type must fit into 4 bit, which means that type ID cannot be used for types having more than 15 generic type parameters.

Keep in mind that type IDs are necessary to be encoded only in few situations:

  1. For fields storing complex message types: this is necessity as serializer is supposed to work in an environment where subtyping is allowed (i.e. field of type Animal, where the actual payload is of type Dog, Cat etc).
  2. For top level messages. This is necessary, for same reason as above. Additionally it's also required for primitive types. Example: given alone message 0x02 0x00 (which is fix int = 2), it's impossible to determine what output type is. It could be byte, short, int, long, char or bool (or user defined primitive type). While it's possible to infer that type easily when it's a field of larger message, it such value is self-contained, we need a type ID for it.
  3. When a type is part of type ref param array for a wrapping generic type.

Examples:

ntype8 with no generic args i.e. System.Int32
+--------+--------+
|  0xb0  |XXXXXXXX|
+--------+--------+

ntype16 with 2 generic args i.e. KeyValuePair`2
+--------+--------+--------+~~~~~~~~~~~~~~~~~+~~~~~~~~~~~~~~~~~+
|  0xc2  |XXXXXXXX|XXXXXXXX|   1st Type ID   |   2nd Type ID   |
+--------+--------+--------+~~~~~~~~~~~~~~~~~+~~~~~~~~~~~~~~~~~+

stype having ref8 type having utf8 short string containing type manifest
+--------+--------+--------+--------+--------+~~~~~~~~~~~~~~~~~+
|  0xe0  |  0x18  |XXXXXXXX|  0x85  |YYYYYYYY|  type manifest  |
+--------+--------+--------+--------+--------+~~~~~~~~~~~~~~~~~+

stype having cref16 type referencing a type described before in same session
+--------+--------+--------+--------+
|  0xe0  |  0x1c  |XXXXXXXX|XXXXXXXX|
+--------+--------+--------+--------+

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment