@kchristidis
Last active April 12, 2024 20:09
Notes on protocol buffers and deterministic serialization (or lack thereof)

There doesn't seem to be a good resource online describing the issues with protocol buffers and deterministic serialization (or lack thereof). This is a collection of links on the subject.

Protocol Buffers v3.0.0 release notes:

The deterministic serialization is, however, NOT canonical across languages; it is also unstable across different builds with schema changes due to unknown fields.

Maps documentation:

Wire format ordering and map iteration ordering of map values is undefined, so you cannot rely on your map items being in a particular order.
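
To make the map-ordering point concrete, here is a minimal sketch that hand-encodes a map field in two entry orders. It assumes a hypothetical schema with `map<string, int32> m = 1;`, short keys, and values below 128 so that every length and value fits in one byte. The helper names `encode_entry` and `decode_map` are made up for illustration and are not part of any protobuf library.

```python
def encode_entry(key: str, value: int) -> bytes:
    """Encode one map entry as a length-delimited submessage on field 1."""
    k = key.encode("utf-8")
    # Inside the entry: key is field 1 (tag 0x0A), value is field 2 (tag 0x10).
    body = b"\x0a" + bytes([len(k)]) + k + b"\x10" + bytes([value])
    return b"\x0a" + bytes([len(body)]) + body


def decode_map(buf: bytes) -> dict[str, int]:
    """Decode a sequence of entries produced by encode_entry."""
    result = {}
    pos = 0
    while pos < len(buf):
        assert buf[pos] == 0x0A          # outer tag: field 1, length-delimited
        length = buf[pos + 1]
        body = buf[pos + 2 : pos + 2 + length]
        key_len = body[1]
        key = body[2 : 2 + key_len].decode("utf-8")
        value = body[2 + key_len + 1]    # skip the 0x10 tag byte
        result[key] = value
        pos += 2 + length
    return result


order_ab = encode_entry("a", 1) + encode_entry("b", 2)
order_ba = encode_entry("b", 2) + encode_entry("a", 1)

print(order_ab != order_ba)                          # True: the bytes differ
print(decode_map(order_ab) == decode_map(order_ba))  # True: same map
```

Both byte strings are valid encodings of the same map, so hashing or signing the raw bytes would treat them as different messages.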

Encoding & Field Order documentation:

While you can use field numbers in any order in a .proto, when a message is serialized its known fields should be written sequentially by field number, as in the provided C++, Java, and Python serialization code. This allows parsing code to use optimizations that rely on field numbers being in sequence. However, protocol buffer parsers must be able to parse fields in any order, as not all messages are created by simply serializing an object – for instance, it's sometimes useful to merge two messages by simply concatenating them.

Jason Bouzane:

Proto3 does not help you. There are at least two places in proto3 that allow equivalent messages to differ in their serialized form. One is field order. While the proto3 specification recommends that fields be written in numerical order, this is not required, and it explicitly requires parsers to deal with fields out of order. The second is that packed repeated fields may be specified any number of times and they are to be concatenated. While the specification recommends against encoding more than one packed repeated field for a particular tag number in a message, it does require that parsers deal with this situation correctly.

[...]

In any case, the upshot of this is that while a particular implementation of the proto library may deterministically produce the same serialized proto every time when given a particular proto message, there's no guarantee that two different proto libraries will serialize it in the same way, nor are there any guarantees that any particular proto library serializer will be stable over time. While I doubt that any official Google implementation would ever change the serialization, third party implementations may do whatever they like. For example, some serializers may choose to output the fields in hash order instead of ascending order, and that could even make the serialization non-deterministic between invocations of the program.
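
The two cases described above (field order and split packed fields) can be sketched with a tiny hand-written parser. The schema is an assumption for illustration: `int32 a = 1;` and `repeated int32 values = 2;` (packed). The functions `decode_varint` and `decode` are hypothetical helpers, not a real protobuf API.

```python
def decode_varint(buf: bytes, pos: int) -> tuple[int, int]:
    """Read one base-128 varint starting at pos; return (value, new_pos)."""
    result = 0
    shift = 0
    while True:
        b = buf[pos]
        pos += 1
        result |= (b & 0x7F) << shift
        if not b & 0x80:
            return result, pos
        shift += 7


def decode(buf: bytes) -> tuple[int, list[int]]:
    """Parse the assumed two-field message, accepting fields in any order."""
    a = 0
    values: list[int] = []
    pos = 0
    while pos < len(buf):
        tag, pos = decode_varint(buf, pos)
        field, wire_type = tag >> 3, tag & 7
        if field == 1 and wire_type == 0:
            a, pos = decode_varint(buf, pos)
        elif field == 2 and wire_type == 2:
            # Packed chunk: every chunk for the field is appended in turn.
            length, pos = decode_varint(buf, pos)
            end = pos + length
            while pos < end:
                v, pos = decode_varint(buf, pos)
                values.append(v)
        else:
            raise ValueError(f"unexpected field {field} / wire type {wire_type}")
    return a, values


# Fields in ascending order, one packed chunk.
in_order = bytes.fromhex("08 05 12 03 01 02 03")
# Field 2 first, and the packed data split into two chunks around field 1.
reordered = bytes.fromhex("12 01 01 08 05 12 02 02 03")

print(decode(in_order))   # (5, [1, 2, 3])
print(decode(reordered))  # (5, [1, 2, 3])
```

The two inputs differ in length and content, yet a conforming parser has to produce the same message from both.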

Feng Xiao:

The undeterministic comes from unknown fields and a new feature protobuf maps. If you can guarantee there are no such fields in your proto, the protobuf library will always serialize other fields ordered by field number and thus should output the same bytes.

Petteri Aimonen:

In general, the same data will serialize in exactly the same way.

However, this is not guaranteed by the protobuf specifications. For example, the following differences in encoding are allowable and must decode to the same result in all conforming libraries:

  • Encoding fields in different order than the tag number order.

  • Encoding packed fields as unpacked.

  • Encoding integers as longer varint byte sequences than needed.

  • Encoding same (non-repeated) field multiple times.

  • Probably others.
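
Two items from the list above, longer-than-needed varints and a non-repeated field encoded more than once, can be illustrated with a few hand-written bytes. This is a sketch under assumptions: the message has a single `int32` on field 1, and `decode_varint` and `decode_scalars` are hypothetical helpers written for this note.

```python
def decode_varint(buf: bytes, pos: int) -> tuple[int, int]:
    """Read one base-128 varint starting at pos; return (value, new_pos)."""
    result = 0
    shift = 0
    while True:
        b = buf[pos]
        pos += 1
        result |= (b & 0x7F) << shift
        if not b & 0x80:
            return result, pos
        shift += 7


def decode_scalars(buf: bytes) -> dict[int, int]:
    """Parse varint-typed fields; for a repeated occurrence the last value wins."""
    fields: dict[int, int] = {}
    pos = 0
    while pos < len(buf):
        tag, pos = decode_varint(buf, pos)
        if tag & 7 != 0:
            raise ValueError("this sketch only handles varint fields")
        value, pos = decode_varint(buf, pos)
        fields[tag >> 3] = value  # overwrite any earlier occurrence
    return fields


minimal = bytes.fromhex("08 01")          # field 1 = 1, shortest form
overlong = bytes.fromhex("08 81 00")      # same value, padded to two varint bytes
duplicated = bytes.fromhex("08 07 08 01") # field 1 written twice; last one is 1

for encoding in (minimal, overlong, duplicated):
    print(encoding.hex(), decode_scalars(encoding))
# Each line shows a different hex string and the same decoded result: {1: 1}
```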

pherl:

The main concern that the deterministic serialization isn't canonical is due to the unknown fields. As string and message type share the same wire type, when parsing an unknown string/message type, the parser has no idea whether to recursively canonicalize the unknown field. The cross-language inconsistency is mainly due to the string fields comparison performance, i.e. java/objc uses utf16 encodings which has different orderings than utf8 strings due to surrogate pairs.
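
The surrogate-pair ordering issue is easy to reproduce without protobuf at all. The sketch below sorts the same two keys once by their UTF-8 bytes and once by their UTF-16 code units. It assumes, for illustration, that an implementation sorts string map keys with its native string comparison; the two keys are arbitrary examples chosen to sit on either side of the surrogate range.

```python
# U+FF5E lies inside the Basic Multilingual Plane; U+1F600 lies outside it
# and is therefore stored as a surrogate pair (D83D DE00) in UTF-16.
keys = ["\uff5e", "\U0001f600"]

# Byte-wise order of the UTF-8 encodings (EF BD 9E vs F0 9F 98 80).
by_utf8 = sorted(keys, key=lambda s: s.encode("utf-8"))

# Order by UTF-16 code units; comparing big-endian bytes gives the same
# result as comparing the 16-bit units (FF5E vs D83D DE00).
by_utf16 = sorted(keys, key=lambda s: s.encode("utf-16-be"))

print([hex(ord(k)) for k in by_utf8])   # ['0xff5e', '0x1f600']
print([hex(ord(k)) for k in by_utf16])  # ['0x1f600', '0xff5e']
```

A serializer that emits map entries in UTF-16 order and one that emits them in UTF-8 order would therefore write these two entries in opposite order.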

@anderson-dan-w

Thanks for consolidating these, making it clear that things aren't super clear. Exactly what I needed to be sure (sure that I can't count on deterministic serialization, that is).

@MBoldyrev

Thank you for bringing this together. Please fix the Encoding & Field Order documentation link in your gist; it currently leads back to this same gist (I think it was supposed to point at the docs). Also, here are several other related snippets:

From the C++ API documentation for the SetSerializationDeterministic method, which enables deterministic serialization (disabled by default):
https://github.com/protocolbuffers/protobuf/blob/a1bb147e96b6f74db6cdf3c3fcb00492472dbbfa/src/google/protobuf/io/coded_stream.h#L834-L846

// Deterministic serialization, if requested, guarantees that for a given
// binary, equal messages will always be serialized to the same bytes. This
// implies:
// . repeated serialization of a message will return the same bytes
// . different processes of the same binary (which may be executing on
// different machines) will serialize equal messages to the same bytes.
//
// Note the deterministic serialization is NOT canonical across languages; it
// is also unstable across different builds with schema changes due to unknown
// fields. Users who need canonical serialization, e.g., persistent storage in
// a canonical form, fingerprinting, etc., should define their own
// canonicalization specification and implement the serializer using
// reflection APIs rather than relying on this API.

There is an analogous method in the Java API with a similar note.
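
As a rough idea of what "define their own canonicalization specification" could look like, here is a minimal sketch. Everything in it is an assumption made for illustration and none of it is a protobuf API: a message is modelled as a plain dict from field number to value, only non-negative integers, strings, and string-keyed maps of those are supported, and the made-up rules are "fields in ascending number order, minimal varints, map entries sorted by the UTF-8 bytes of the key". Unknown fields, negative numbers, floats, and nested messages are deliberately left out.

```python
import hashlib


def varint(n: int) -> bytes:
    """Minimal-length base-128 varint for a non-negative integer."""
    out = bytearray()
    while n >= 0x80:
        out.append((n & 0x7F) | 0x80)
        n >>= 7
    out.append(n)
    return bytes(out)


def length_delimited(field: int, payload: bytes) -> bytes:
    """Tag with wire type 2, then the payload length, then the payload."""
    return varint(field << 3 | 2) + varint(len(payload)) + payload


def canonical_bytes(fields: dict) -> bytes:
    """Serialize under the hypothetical canonical rules described above."""
    out = bytearray()
    for number in sorted(fields):                      # rule: ascending field order
        value = fields[number]
        if isinstance(value, int):
            out += varint(number << 3 | 0) + varint(value)
        elif isinstance(value, str):
            out += length_delimited(number, value.encode("utf-8"))
        elif isinstance(value, dict):
            # rule: map entries sorted by the UTF-8 bytes of the key
            for key in sorted(value, key=lambda k: k.encode("utf-8")):
                entry = canonical_bytes({1: key, 2: value[key]})
                out += length_delimited(number, entry)
        else:
            raise TypeError(f"unsupported value type: {type(value).__name__}")
    return bytes(out)


def fingerprint(fields: dict) -> str:
    """SHA-256 over the canonical bytes, suitable for comparing messages."""
    return hashlib.sha256(canonical_bytes(fields)).hexdigest()


# The same logical message built in two different insertion orders.
m1 = {2: {"b": 2, "a": 1}, 1: 7}
m2 = {1: 7, 2: {"a": 1, "b": 2}}

print(canonical_bytes(m1).hex(" "))
print(fingerprint(m1) == fingerprint(m2))  # True
```

The point of the design is that the ordering and encoding rules live in a written specification you control, so every language that implements the same rules produces the same bytes, independent of what the protobuf runtime happens to emit.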

Encoding docs:

By default, repeated invocations of serialization methods on the same protocol buffer message instance may not return the same byte output; i.e. the default serialization is not deterministic.
Deterministic serialization only guarantees the same byte output for a particular binary. The byte output may change across different versions of the binary.

@kchristidis
Author

@MBoldyrev: Thanks for suggesting the edit (done), and for adding more snippets!

@rsmets

rsmets commented Aug 18, 2021

Thank you for the comprehensive references on the topic. I find it odd that still, there is no way to cleanly enforce deterministic byte serialization with protos. Everything was smooth sailing for us across various languages (js, java, swift) until we started to handle signatures over a message with a Struct field. =/

@fmg-lydonchandra

Is this still current, or has any of the above gone stale?

@cheako

cheako commented May 23, 2023

A valid question, but I sus your motivations. Unless you need to be told that deterministic is never expected to be a consideration and that hasn't changed. I think chat would be a better place for this discussion.

@fmg-lydonchandra

OK, thanks for confirming, @cheako. I was under the impression that when deterministic serialization is used and the schema is identical, serialization would produce the same bytes from the Java library and the C++ library.
Obviously I was wrong.

Found this note in the protobuf Java binding:

Note the deterministic serialization is NOT canonical across languages; it is also unstable across different builds with schema changes due to unknown fields. Users who need canonical serialization, e.g. persistent storage in a canonical form, fingerprinting, etc, should define their own canonicalization specification and implement the serializer using reflection APIs rather than relying on this API.

@cheako

cheako commented May 23, 2023

You could use rust with JNI to get Hash and Eq trait implementations derived.

@fmg-lydonchandra

Is the byte output guaranteed to be IDENTICAL between a Windows build and a Linux build that use the same C++ protobuf version and the same message schema (with deterministic serialization enabled)?

Encoding docs:

By default, repeated invocations of serialization methods on the same protocol buffer message instance may not return the same byte output; i.e. the default serialization is not deterministic.
Deterministic serialization only guarantees the same byte output for a particular binary. The byte output may change across different versions of the binary.

@caspermeijn

caspermeijn commented Feb 13, 2024

The text linked to as Encoding & Field Order documentation has changed since the creation of this document.

New text:

Field numbers may be declared in any order in a .proto file. The order chosen has no effect on how the messages are serialized.

When a message is serialized, there is no guaranteed order for how its known or unknown fields will be written. Serialization order is an implementation detail, and the details of any particular implementation may change in the future. Therefore, protocol buffer parsers must be able to parse fields in any order.
