Notes on protocol buffers and deterministic serialization (or lack thereof)

There doesn't seem to be a good resource online describing the issues with protocol buffers and deterministic serialization (or lack thereof). This is a collection of links on the subject.

Protocol Buffers v3.0.0 release notes:

The deterministic serialization is, however, NOT canonical across languages; it is also unstable across different builds with schema changes due to unknown fields.

Maps documentation:

Wire format ordering and map iteration ordering of map values is undefined, so you cannot rely on your map items being in a particular order.
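To see why undefined map ordering matters for hashing or signing, here is a minimal sketch in plain Python with no protobuf library. It assumes a hypothetical message whose only field is `map<string, int32>` at field number 1; on the wire each map entry is a length-delimited submessage with the key at field 1 and the value at field 2. The helper names are made up for illustration.

```python
import hashlib


def read_varint(buf, pos):
    """Decode one base-128 varint at buf[pos:]; return (value, next_pos)."""
    value = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if (byte & 0x80) == 0:
            return value, pos
        shift += 7


def parse_string_int_map(buf):
    """Parse the assumed schema: a single map<string, int32> at field 1."""
    result = {}
    pos = 0
    while pos < len(buf):
        key, pos = read_varint(buf, pos)
        if key != 0x0A:  # field 1, wire type 2 (length-delimited)
            raise ValueError("sketch expects only map entries at field 1")
        length, pos = read_varint(buf, pos)
        entry, pos = buf[pos:pos + length], pos + length
        map_key, map_value, epos = "", 0, 0
        while epos < len(entry):
            entry_key, epos = read_varint(entry, epos)
            if entry_key == 0x0A:  # entry field 1: the string key
                key_len, epos = read_varint(entry, epos)
                map_key = entry[epos:epos + key_len].decode("utf-8")
                epos += key_len
            elif entry_key == 0x10:  # entry field 2: the varint value
                map_value, epos = read_varint(entry, epos)
            else:
                raise ValueError("unexpected field inside map entry")
        result[map_key] = map_value
    return result


entry_a = bytes([0x0A, 0x05, 0x0A, 0x01, 0x61, 0x10, 0x01])  # "a" -> 1
entry_b = bytes([0x0A, 0x05, 0x0A, 0x01, 0x62, 0x10, 0x02])  # "b" -> 2

a_then_b = entry_a + entry_b
b_then_a = entry_b + entry_a

same_map = parse_string_int_map(a_then_b) == parse_string_int_map(b_then_a)
same_digest = (hashlib.sha256(a_then_b).digest()
               == hashlib.sha256(b_then_a).digest())
```

Both byte strings decode to the same map, yet their SHA-256 digests differ, so a hash or signature over the serialized bytes is only stable if the serializer fixes the entry order.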

Encoding & Field Order documentation:

While you can use field numbers in any order in a .proto, when a message is serialized its known fields should be written sequentially by field number, as in the provided C++, Java, and Python serialization code. This allows parsing code to use optimizations that rely on field numbers being in sequence. However, protocol buffer parsers must be able to parse fields in any order, as not all messages are created by simply serializing an object – for instance, it's sometimes useful to merge two messages by simply concatenating them.
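The two points in this passage (fields may arrive in any order, and concatenating two messages merges them) can be sketched with a tiny hand-written parser. This is an illustration only: it handles just varint fields, and the function names and field numbers are hypothetical rather than taken from any real schema or library.

```python
def read_varint(buf, pos):
    """Decode one base-128 varint at buf[pos:]; return (value, next_pos)."""
    value = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if (byte & 0x80) == 0:
            return value, pos
        shift += 7


def parse_varint_fields(buf):
    """Parse a message made only of varint (wire type 0) fields.

    A later occurrence of a scalar field replaces an earlier one.
    """
    fields = {}
    pos = 0
    while pos < len(buf):
        key, pos = read_varint(buf, pos)
        field_number, wire_type = key >> 3, key & 0x07
        if wire_type != 0:
            raise ValueError("sketch handles varint fields only")
        value, pos = read_varint(buf, pos)
        fields[field_number] = value
    return fields


in_order = bytes([0x08, 0x01, 0x10, 0x02])      # field 1 = 1, field 2 = 2
out_of_order = bytes([0x10, 0x02, 0x08, 0x01])  # same fields, reversed
override = bytes([0x10, 0x07])                  # field 2 = 7

same_message = (parse_varint_fields(in_order)
                == parse_varint_fields(out_of_order))
merged = parse_varint_fields(in_order + override)
```

Here `same_message` is true even though the two byte strings differ, and `merged` carries field 2 from the second message, which is the merge-by-concatenation behavior the documentation describes.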

Jason Bouzane:

Proto3 does not help you. There are at least two places in proto3 that allow equivalent messages to differ in their serialized form. One is field order. While the proto3 specification recommends that fields be written in numerical order, this is not required, and it explicitly requires parsers to deal with fields out of order. The second is that packed repeated fields may be specified any number of times and they are to be concatenated. While the specification recommends against encoding more than one packed repeated field for a particular tag number in a message, it does require that parsers deal with this situation correctly.
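The packed repeated case can be illustrated the same way. The sketch below assumes a hypothetical `repeated int32` at field number 4 and decodes it from a single packed chunk, from two packed chunks, and from unpacked elements. The parser is hand-written for this note and is not any library's API.

```python
def read_varint(buf, pos):
    """Decode one base-128 varint at buf[pos:]; return (value, next_pos)."""
    value = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if (byte & 0x80) == 0:
            return value, pos
        shift += 7


def parse_repeated_varints(buf, field_number):
    """Collect one repeated varint field, packed or unpacked, in wire order."""
    values = []
    pos = 0
    while pos < len(buf):
        key, pos = read_varint(buf, pos)
        number, wire_type = key >> 3, key & 0x07
        if number != field_number:
            raise ValueError("sketch expects a single field number")
        if wire_type == 2:  # packed chunk: length prefix, then varints
            length, pos = read_varint(buf, pos)
            end = pos + length
            while pos < end:
                value, pos = read_varint(buf, pos)
                values.append(value)
        elif wire_type == 0:  # one unpacked element
            value, pos = read_varint(buf, pos)
            values.append(value)
        else:
            raise ValueError("unexpected wire type")
    return values


one_chunk = bytes([0x22, 0x03, 0x01, 0x02, 0x03])
two_chunks = bytes([0x22, 0x01, 0x01, 0x22, 0x02, 0x02, 0x03])
unpacked = bytes([0x20, 0x01, 0x20, 0x02, 0x20, 0x03])

decoded = [parse_repeated_varints(b, 4)
           for b in (one_chunk, two_chunks, unpacked)]
```

All three inputs decode to the same list of values, so three distinct byte strings represent one logical message.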

[...]

In any case, the upshot of this is that while a particular implementation of the proto library may deterministically produce the same serialized proto every time when given a particular proto message, there's no guarantee that two different proto libraries will serialize it in the same way, nor are there any guarantees that any particular proto library serializer will be stable over time. While I doubt that any official Google implementation would ever change the serialization, third party implementations may do whatever they like. For example, some serializers may choose to output the fields in hash order instead of ascending order, and that could even make the serialization non-deterministic between invocations of the program.

Feng Xiao:

The nondeterminism comes from unknown fields and a new feature, protobuf maps. If you can guarantee there are no such fields in your proto, the protobuf library will always serialize other fields ordered by field number and thus should output the same bytes.

Petteri Aimonen:

In general, the same data will serialize in exactly the same way.

However, this is not guaranteed by the protobuf specifications. For example, the following differences in encoding are allowable and must decode to the same result in all conforming libraries:

  • Encoding fields in different order than the tag number order.

  • Encoding packed fields as unpacked.

  • Encoding integers as longer varint byte sequences than needed.

  • Encoding same (non-repeated) field multiple times.

  • Probably others.
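The "longer varint" item from the list above is easy to check by hand. This small sketch decodes three different byte sequences with a hypothetical `decode_varint` helper written for this note; continuation bytes that carry only zero bits pad the encoding without changing the value.

```python
def decode_varint(buf):
    """Decode a single base-128 varint that occupies all of buf."""
    value = 0
    for index, byte in enumerate(buf):
        value |= (byte & 0x7F) << (7 * index)
        if (byte & 0x80) == 0:  # high bit clear marks the last byte
            if index != len(buf) - 1:
                raise ValueError("trailing bytes after varint")
            return value
    raise ValueError("truncated varint")


minimal = bytes([0x01])                 # shortest encoding of 1
padded = bytes([0x81, 0x00])            # 1, plus one zero continuation group
more_padded = bytes([0x81, 0x80, 0x00])  # 1, plus two zero groups

values = [decode_varint(b) for b in (minimal, padded, more_padded)]
```

Every entry in `values` is 1, so an encoder that emits the padded forms still produces valid input for a conforming decoder while yielding different bytes.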

pherl:

The main reason the deterministic serialization isn't canonical is unknown fields. As string and message types share the same wire type, when parsing an unknown string/message field, the parser has no idea whether to recursively canonicalize the unknown field. The cross-language inconsistency is mainly due to string field comparison performance, i.e. Java/ObjC use UTF-16 encoding, which orders strings differently from UTF-8 because of surrogate pairs.
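The surrogate-pair ordering difference can be reproduced without protobuf at all. The sketch below sorts two hypothetical map keys once by their UTF-8 bytes and once by their UTF-16 code units (big-endian bytes compare the same way as code units). It only models the comparison, not any particular library's serializer.

```python
# U+FF5E sits in the Basic Multilingual Plane; U+10000 needs a
# surrogate pair (D800 DC00) in UTF-16.
keys = ["\uff5e", "\U00010000"]

# Order by UTF-8 bytes: EF BD 9E sorts before F0 90 80 80.
utf8_order = sorted(keys, key=lambda s: s.encode("utf-8"))

# Order by UTF-16 code units: D800 DC00 sorts before FF5E.
utf16_order = sorted(keys, key=lambda s: s.encode("utf-16-be"))

orders_differ = utf8_order != utf16_order
```

With these two keys the orders come out reversed, so a serializer that sorts map keys by UTF-16 code units and one that sorts by UTF-8 bytes would emit the same map in different byte orders.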
