@dustmop
Created May 11, 2018 19:28
BinaryDataTranslator for CBOR to/from Proto conversion
CBOR <-> Proto conversion
Naming:
BinaryDataTranslator
Goal: Convert binary-encoded CBOR to binary-encoded Protobuf, and vice versa, without allocating fully inflated object representations of either binary format. In other words, read bytes from a buffer containing one encoding and directly write bytes to a buffer representing the other encoding. The number of allocations should be minimized; ideally there should be only one allocation if the objects are small enough.
Similarities:
* Both formats use a small number of primitive types.
* Most types are similar; for example, CBOR's integer major types (0 and 1) resemble Proto's varint wire type (0).
* Type tags are bit packed with other relevant information.
* Strings are length-prefixed.
* Integers of multiple precisions are supported, with small integers taking fewer bytes.
* Both support opaque blobs of binary data that are not to be decoded.
Differences:
* Protobuf uses a schema file that needs to be precompiled ahead of time.
* CBOR encodes field names into the binary format.
* Type tags in Protobuf are packed with field numbers, while in CBOR they are packed with "additional information" whose meaning varies depending upon the type.
* CBOR begins a "group" (inner structured data) with a count of how many elements are in the group. Protobuf encodes how many bytes each element of the "group" takes up.
* CBOR supports heterogeneous arrays; Proto does not.
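The bit packing mentioned in both lists can be sketched as follows. This is a minimal illustration, not part of the original design: the helper names cborHead and protoKey are hypothetical, and protoKey assumes the key has already been decoded from its varint form.

```go
package main

import "fmt"

// cborHead splits a CBOR initial byte into its major type (top 3 bits)
// and its additional information (low 5 bits).
func cborHead(b byte) (major, info byte) {
	return b >> 5, b & 0x1f
}

// protoKey splits an already-decoded Protobuf field key into its field
// number (upper bits) and wire type (low 3 bits).
func protoKey(key uint64) (fieldNumber uint64, wireType byte) {
	return key >> 3, byte(key & 0x7)
}

func main() {
	// 0x63 starts a CBOR text string (major type 3) of length 3.
	major, info := cborHead(0x63)
	fmt.Println(major, info) // 3 3

	// 0x12 is the key for field number 2 with wire type 2 (length-delimited).
	num, wt := protoKey(0x12)
	fmt.Println(num, wt) // 2 2
}
```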
Ordering:
The order of fields between Proto and CBOR may not match. Protobuf serializers conventionally write fields in field-number order, although the wire format does not strictly require it and parsers must accept fields in any order. CBOR has no ordering requirement for map keys. The CBOR format may use the "canonical" form, which sorts keys by encoded length and then bytewise, but this is not strictly required either. As a result, it may be necessary to write the destination format out of order from how the source format is read.
Schema:
The protobuf compiler can output a "descriptor" file using the flag "--descriptor_set_out". This is an easily parsable binary file that includes both the field names and the field numbers. By loading this file once, we can construct a two-way mapping between CBOR field names and protobuf field numbers. Then it will be possible to convert seamlessly between CBOR and Proto.
Interface:
Create a "BinaryDataTranslator" object, taking the descriptor file as input. The Translator parses the descriptor and builds a two-way mapping from CBOR fields to Protobuf fields. In addition, it should collect the field names and their byte lengths, since the CBOR wire format includes the names and the Translator needs to know how many bytes they use up. The Translator can then be used to convert either format to the other. It has two methods: ConvertCBORToProto and ConvertProtoToCBOR.
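A minimal sketch of the two-way mapping, under some loud assumptions: the (name, number) pairs are taken as already extracted from the descriptor file, the mapping is flat (a single message, no nesting), and the type names FieldDef and FieldMap as well as the method signatures on Translator are hypothetical. Only the method names ConvertCBORToProto and ConvertProtoToCBOR come from the text above.

```go
package main

import "fmt"

// FieldDef is a hypothetical (name, number) pair, assumed to have been
// read out of the descriptor file already.
type FieldDef struct {
	Name   string
	Number int32
}

// Translator lists the two conversion methods named in the text.
// The signatures are assumptions.
type Translator interface {
	ConvertCBORToProto(src []byte) ([]byte, error)
	ConvertProtoToCBOR(src []byte) ([]byte, error)
}

// FieldMap is a two-way mapping between CBOR field names and Protobuf
// field numbers for one message.
type FieldMap struct {
	numberByName map[string]int32
	nameByNumber map[int32]string
	NameBytes    int // total bytes taken by all field names
}

// NewFieldMap builds the mapping and rejects duplicate names or numbers.
func NewFieldMap(fields []FieldDef) (*FieldMap, error) {
	m := &FieldMap{
		numberByName: make(map[string]int32, len(fields)),
		nameByNumber: make(map[int32]string, len(fields)),
	}
	for _, f := range fields {
		if _, dup := m.numberByName[f.Name]; dup {
			return nil, fmt.Errorf("duplicate field name %q", f.Name)
		}
		if _, dup := m.nameByNumber[f.Number]; dup {
			return nil, fmt.Errorf("duplicate field number %d", f.Number)
		}
		m.numberByName[f.Name] = f.Number
		m.nameByNumber[f.Number] = f.Name
		m.NameBytes += len(f.Name)
	}
	return m, nil
}

// Number looks up the Protobuf field number for a CBOR field name.
func (m *FieldMap) Number(name string) (int32, bool) {
	n, ok := m.numberByName[name]
	return n, ok
}

// Name looks up the CBOR field name for a Protobuf field number.
func (m *FieldMap) Name(number int32) (string, bool) {
	s, ok := m.nameByNumber[number]
	return s, ok
}

func main() {
	m, err := NewFieldMap([]FieldDef{{"title", 1}, {"length", 2}})
	if err != nil {
		panic(err)
	}
	n, _ := m.Number("length")
	s, _ := m.Name(1)
	fmt.Println(n, s, m.NameBytes) // 2 title 11
}
```

A real implementation would need one such mapping per message type, since field names and numbers are only unique within a message.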
Algorithm:
Write the destination format in order of how the fields should be serialized, from start to finish. To do so, collect the fields that exist in the source object, map them to the destination object's layout, and sort them as needed.
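For the CBOR-to-Proto direction, the collect, map, and sort steps might look like the sketch below. It assumes a prior scan has recorded where each field's value sits in the source buffer; the types sourceField and plannedField and the function planProtoOrder are hypothetical names introduced for illustration.

```go
package main

import (
	"fmt"
	"sort"
)

// sourceField is a hypothetical record of one field found while scanning
// the CBOR source: its name and the byte range of its value.
type sourceField struct {
	Name       string
	Start, End int
}

// plannedField is a source field mapped to its Protobuf field number.
type plannedField struct {
	Number int32
	Src    sourceField
}

// planProtoOrder maps each found field to its field number and sorts the
// result by that number, which is the order the output will be written in.
func planProtoOrder(found []sourceField, numberByName map[string]int32) ([]plannedField, error) {
	plan := make([]plannedField, 0, len(found))
	for _, f := range found {
		n, ok := numberByName[f.Name]
		if !ok {
			return nil, fmt.Errorf("unknown field %q", f.Name)
		}
		plan = append(plan, plannedField{Number: n, Src: f})
	}
	sort.SliceStable(plan, func(i, j int) bool {
		return plan[i].Number < plan[j].Number
	})
	return plan, nil
}

func main() {
	numberByName := map[string]int32{"title": 1, "length": 2, "checksum": 3}
	found := []sourceField{{"checksum", 1, 10}, {"title", 10, 20}, {"length", 20, 22}}
	plan, err := planProtoOrder(found, numberByName)
	if err != nil {
		panic(err)
	}
	for _, p := range plan {
		fmt.Println(p.Number, p.Src.Name)
	}
	// 1 title
	// 2 length
	// 3 checksum
}
```

Keeping only byte offsets in the plan, rather than copies of the values, is in line with the goal of avoiding inflated intermediate objects.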
Type Mapping:
-----
CBOR (major type)        Proto (wire type)
0 positive integer    -> 0 varint
1 negative integer    -> 0 varint
2 bytes               -> 2 bytes
3 text utf-8          -> 2 string
4 array               -> * (a repeated list of array elements)
5 map                 -> 2 message (sub-protobuf structure)
6 tag                 -> N/A
7 primitive           -> 0 (bool), 1 (64-bit float), 5 (32-bit float)
-----
Proto (wire type)        CBOR (major type)
0 varint              -> 0 (if positive), 1 (if negative), 7 (primitive)
1 64-bit fixed        -> 7 (floating point), otherwise same as 0
2 length-delimited    -> 2 (bytes), 3 (text), 5 (embedded message)
3 start group         -> N/A
4 end group           -> N/A
5 32-bit fixed        -> same as 1
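The CBOR-to-Proto half of the table can be written as a small lookup. This is a sketch only: the function name protoWireType and the constant names are hypothetical, arrays are reported as an error because their wire type depends on the element type, and CBOR values the table does not cover (tags, null, half-precision floats) are treated as unmapped.

```go
package main

import (
	"errors"
	"fmt"
)

// Protobuf wire types used by the mapping table.
const (
	wireVarint  = 0
	wireFixed64 = 1
	wireBytes   = 2
	wireFixed32 = 5
)

// protoWireType returns the Protobuf wire type for a CBOR major type and
// its additional information, following the table above.
func protoWireType(major, info byte) (int, error) {
	switch major {
	case 0, 1: // positive and negative integers
		return wireVarint, nil
	case 2, 3, 5: // bytes, text, map (embedded message)
		return wireBytes, nil
	case 4: // array
		return 0, errors.New("array: repeated field, wire type depends on the element type")
	case 7: // primitives: the meaning depends on the additional information
		switch info {
		case 20, 21: // false, true
			return wireVarint, nil
		case 26: // 32-bit float
			return wireFixed32, nil
		case 27: // 64-bit float
			return wireFixed64, nil
		}
	}
	return 0, fmt.Errorf("no Protobuf mapping for CBOR major type %d, info %d", major, info)
}

func main() {
	wt, _ := protoWireType(3, 3) // text string of length 3
	fmt.Println(wt)              // 2
	wt, _ = protoWireType(7, 27) // 64-bit float
	fmt.Println(wt)              // 1
}
```

The reverse direction cannot be a pure lookup on the wire type: as the second half of the table shows, it also needs the field's declared type from the schema.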
Caveats:
Proto does not support heterogeneous arrays, while CBOR does, since it more closely mirrors JSON. There are two possible solutions for this:
* Use a "union" type that wraps each element of a heterogeneous array.
* Encode a heterogeneous array as a binary blob, not to be decoded.
For Qri's purposes, our CBOR <-> Proto transformations will only take place for structures defined in Go source code, which don't allow heterogeneous arrays. However, there is one exception: Structure uses jsonschema, which by necessity uses JSON and can have heterogeneous arrays. For this single case, we will encode the schema as a binary blob.
Converting numbers between CBOR and Proto is complicated by the difference in binary encoding. CBOR stores values below 24 directly in the initial byte and otherwise uses the additional-information value to give the size of the integer that follows: 24 for an 8-bit int, 25 for a 16-bit int, 26 for a 32-bit int, and 27 for a 64-bit int. Proto, however, uses varint encoding with an arbitrary number of bytes: each byte carries 7 bits of the value, and its high bit signals whether another byte follows. Generally speaking, it cannot be determined how many bytes an encoded integer uses without inspecting its magnitude.
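The two integer encodings can be compared side by side. The sketch below covers unsigned values only; the function names appendCBORUint and appendProtoVarint are hypothetical, and negative numbers (CBOR major type 1, Proto's signed varints) are left out.

```go
package main

import "fmt"

// appendCBORUint appends v as a CBOR unsigned integer (major type 0).
// Values below 24 fit in the initial byte; larger values are prefixed
// with 24, 25, 26, or 27 and stored big-endian in 1, 2, 4, or 8 bytes.
func appendCBORUint(dst []byte, v uint64) []byte {
	switch {
	case v < 24:
		return append(dst, byte(v))
	case v <= 0xff:
		return append(dst, 24, byte(v))
	case v <= 0xffff:
		return append(dst, 25, byte(v>>8), byte(v))
	case v <= 0xffffffff:
		return append(dst, 26, byte(v>>24), byte(v>>16), byte(v>>8), byte(v))
	default:
		return append(dst, 27,
			byte(v>>56), byte(v>>48), byte(v>>40), byte(v>>32),
			byte(v>>24), byte(v>>16), byte(v>>8), byte(v))
	}
}

// appendProtoVarint appends v as a Protobuf varint: 7 bits per byte,
// least significant group first, with the high bit set on every byte
// except the last.
func appendProtoVarint(dst []byte, v uint64) []byte {
	for v >= 0x80 {
		dst = append(dst, byte(v)|0x80)
		v >>= 7
	}
	return append(dst, byte(v))
}

func main() {
	fmt.Printf("% x\n", appendCBORUint(nil, 300))    // 19 01 2c
	fmt.Printf("% x\n", appendProtoVarint(nil, 300)) // ac 02
}
```

Since both functions branch on the magnitude of the value, the output size for a converted integer has to be computed per value rather than copied from the source encoding.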
References:
https://developers.google.com/protocol-buffers/docs/encoding
https://tools.ietf.org/html/rfc7049