kchristidis/protobuf-serialization.md

## protobuf-serialization.md

      
There doesn't seem to be a good resource online describing the issues with protocol buffers and deterministic serialization (or lack thereof). This is a collection of links on the subject.

Protocol Buffers v3.0.0 release notes:

The deterministic serialization is, however, NOT canonical across languages; it is also unstable across different builds with schema changes due to unknown fields.

Maps documentation:

Wire format ordering and map iteration ordering of map values is undefined, so you cannot rely on your map items being in a particular order.
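
To make the map issue concrete, here is a minimal sketch that hand-encodes a `map<string, int32>` field without the protobuf library. It assumes the documented wire layout for maps (each entry is a length-delimited submessage with the key in field 1 and the value in field 2); the helper names `varint`, `map_entry` and `encode_map` are made up for this illustration, and values are assumed to be non-negative.

```python
def varint(n: int) -> bytes:
    """Encode a non-negative integer as a base-128 varint."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # continuation bit: more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def map_entry(key: str, value: int) -> bytes:
    """One map entry: key as field 1 (length-delimited), value as field 2 (varint)."""
    k = key.encode("utf-8")
    return b"\x0a" + varint(len(k)) + k + b"\x10" + varint(value)


def encode_map(items, field_number: int = 1) -> bytes:
    """Emit every entry as a length-delimited submessage, in the order given."""
    tag = varint((field_number << 3) | 2)
    out = b""
    for key, value in items:
        entry = map_entry(key, value)
        out += tag + varint(len(entry)) + entry
    return out


m = {"b": 2, "a": 1}
insertion_order = encode_map(m.items())       # entries written as iterated
sorted_order = encode_map(sorted(m.items()))  # entries written by sorted key

print(insertion_order.hex())  # 0a050a016210020a050a01611001
print(sorted_order.hex())     # 0a050a016110010a050a01621002
```

Both byte strings describe the same map, yet they differ. Sorting the keys before writing makes the output repeatable within this one sketch, but nothing obliges another implementation to pick the same order.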

Encoding & Field Order documentation:

While you can use field numbers in any order in a .proto, when a message is serialized its known fields should be written sequentially by field number, as in the provided C++, Java, and Python serialization code. This allows parsing code to use optimizations that rely on field numbers being in sequence. However, protocol buffer parsers must be able to parse fields in any order, as not all messages are created by simply serializing an object – for instance, it's sometimes useful to merge two messages by simply concatenating them.
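
The following sketch illustrates the two points in that passage with a toy parser: field order on the wire does not change the parsed result, and concatenating two serialized messages behaves like a merge. The message layout (an `int32` in field 1 and a `string` in field 2) is a hypothetical example, and `read_varint` and `parse` are illustrative helpers that only handle varint and length-delimited fields.

```python
def read_varint(buf: bytes, pos: int):
    """Decode one varint starting at pos; return (value, next position)."""
    result = shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def parse(buf: bytes) -> dict:
    """Parse fields in whatever order they appear; a later scalar overwrites an earlier one."""
    fields, pos = {}, 0
    while pos < len(buf):
        key, pos = read_varint(buf, pos)
        number, wire_type = key >> 3, key & 7
        if wire_type == 0:    # varint
            fields[number], pos = read_varint(buf, pos)
        elif wire_type == 2:  # length-delimited (string, bytes, submessage)
            length, pos = read_varint(buf, pos)
            fields[number] = buf[pos:pos + length]
            pos += length
        else:
            raise ValueError(f"wire type {wire_type} not handled in this sketch")
    return fields


field_1 = bytes.fromhex("089601")    # field 1, varint 150
field_2 = bytes.fromhex("12026869")  # field 2, string "hi"

ascending = field_1 + field_2
descending = field_2 + field_1
merged = ascending + bytes.fromhex("0801")  # append a second message that sets field 1 to 1

print(parse(ascending) == parse(descending))  # True
print(ascending == descending)                # False
print(parse(merged))                          # {1: 1, 2: b'hi'}
```

Anything that hashes or signs the raw bytes would treat `ascending` and `descending` as different, even though every conforming parser must read them as the same message.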

Jason Bouzane:

Proto3 does not help you. There are at least two places in proto3 that
allow equivalent messages to differ in their serialized form. One is
field order. While the proto3 specification recommends that fields be
written in numerical order, this is not required, and it explicitly
requires parsers to deal with fields out of order. The second is that
packed repeated fields may be specified any number of times and they
are to be concatenated. While the specification recommends against
encoding more than one packed repeated field for a particular tag
number in a message, it does require that parsers deal with this
situation correctly.
[...]
In any case, the upshot of this is that while a particular
implementation of the proto library may deterministically produce the
same serialized proto every time when given a particular proto
message, there's no guarantee that two different proto libraries will
serialize it in the same way, nor are there any guarantees that any
particular proto library serializer will be stable over time. While I
doubt that any official Google implementation would ever change the
serialization, third party implementations may do whatever they like.
For example, some serializers may choose to output the fields in hash
order instead of ascending order, and that could even make the
serialization non-deterministic between invocations of the program.
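
The packed repeated field case can be sketched the same way. The example below assumes a hypothetical `repeated int32` in field 4 and shows three encodings that a parser has to accept as the same list: one packed chunk, two packed chunks, and the unpacked form. `read_varint` and `decode_repeated_int` are illustrative helpers, not library functions.

```python
def read_varint(buf: bytes, pos: int):
    """Decode one varint starting at pos; return (value, next position)."""
    result = shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def decode_repeated_int(buf: bytes, field_number: int) -> list:
    """Collect a repeated varint field, accepting packed chunks and unpacked elements."""
    values, pos = [], 0
    while pos < len(buf):
        key, pos = read_varint(buf, pos)
        number, wire_type = key >> 3, key & 7
        if wire_type == 0:    # unpacked: one element per tag
            value, pos = read_varint(buf, pos)
            if number == field_number:
                values.append(value)
        elif wire_type == 2:  # packed: a run of varints inside one chunk
            length, pos = read_varint(buf, pos)
            end = pos + length
            if number == field_number:
                while pos < end:
                    value, pos = read_varint(buf, pos)
                    values.append(value)
            pos = end
        else:
            raise ValueError(f"wire type {wire_type} not handled in this sketch")
    return values


one_chunk = bytes.fromhex("2203010203")            # packed [1, 2, 3]
two_chunks = bytes.fromhex("22020102" "220103")    # packed [1, 2] followed by packed [3]
unpacked = bytes.fromhex("200120022003")           # three separate varint fields

for encoding in (one_chunk, two_chunks, unpacked):
    print(decode_repeated_int(encoding, 4))  # [1, 2, 3] each time
```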

Feng Xiao:

The non-determinism comes from unknown fields and a new feature, protobuf maps. If you can guarantee there are no such fields in your proto, the protobuf library will always serialize other fields ordered by field number and thus should output the same bytes.

Petteri Aimonen:

In general, the same data will serialize in exactly the same way.
However, this is not guaranteed by the protobuf specifications. For example, the following differences in encoding are allowable and must decode to the same result in all conforming libraries:

* Encoding fields in different order than the tag number order.
* Encoding packed fields as unpacked.
* Encoding integers as longer varint byte sequences than needed.
* Encoding same (non-repeated) field multiple times.
* Probably others.
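
The "longer varint" case is easy to overlook, so here is a small sketch of it. It decodes the value 1 from both its shortest encoding and a padded two-byte encoding, then uses a re-encode-and-compare check to tell them apart. The helpers `read_varint`, `encode_varint` and `is_minimal` are illustrative names, not part of any protobuf library.

```python
def read_varint(buf: bytes, pos: int = 0):
    """Decode one varint starting at pos; return (value, next position)."""
    result = shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def encode_varint(n: int) -> bytes:
    """Encode a non-negative integer using the fewest possible bytes."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)


def is_minimal(buf: bytes) -> bool:
    """True if buf is exactly the shortest varint encoding of its value."""
    value, end = read_varint(buf)
    return end == len(buf) and encode_varint(value) == buf


minimal = bytes([0x01])       # 1 in a single byte
padded = bytes([0x81, 0x00])  # 1 with a continuation bit and an empty second group

print(read_varint(minimal)[0], read_varint(padded)[0])  # 1 1
print(is_minimal(minimal), is_minimal(padded))          # True False
```

A verifier that needs canonical bytes could reject input that fails a check like `is_minimal`, but that is a policy layered on top of protobuf, not something the format itself enforces.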


pherl:

The main concern that the deterministic serialization isn't canonical is due to the unknown fields. As string and message type share the same wire type, when parsing an unknown string/message type, the parser has no idea whether to recursively canonicalize the unknown field.

The cross-language inconsistency is mainly due to the string fields comparison performance, i.e. java/objc use utf16 encoding, which has a different ordering than utf8 strings due to surrogate pairs.
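
The surrogate-pair point can be checked in a few lines. This sketch sorts two strings once by their UTF-8 bytes and once by their UTF-16 code units (big-endian byte comparison is equivalent to comparing 16-bit units). The two sample keys are chosen for illustration: one code point near the top of the Basic Multilingual Plane and one above it. Whether a given protobuf runtime actually sorts map keys this way is an assumption here, not something the quote spells out.

```python
# U+FF5E sits in the Basic Multilingual Plane; U+1F600 needs a surrogate pair in UTF-16.
keys = ["\uff5e", "\U0001f600"]

# UTF-8 byte order: EF BD 9E sorts before F0 9F 98 80.
by_utf8 = sorted(keys, key=lambda s: s.encode("utf-8"))

# UTF-16 code unit order: the lead surrogate D83D sorts before FF5E.
by_utf16 = sorted(keys, key=lambda s: s.encode("utf-16-be"))

print([f"U+{ord(s):04X}" for s in by_utf8])   # ['U+FF5E', 'U+1F600']
print([f"U+{ord(s):04X}" for s in by_utf16])  # ['U+1F600', 'U+FF5E']
```

If two runtimes each sort map keys with their native string comparison, the same map can therefore come out in a different entry order on each side.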