luciopaiva/README.md

Protobuffer-safe bytes for proprietary protocol formats

In a situation where peers can exchange messages in either protobuf or a proprietary format, there must be a way for the recipient to identify whether the incoming message is a protobuf or not.
The simplest solution for that would be to add a header to each message informing the recipient what the payload type is. Let's say, however, that there is an existing protocol using protobuf messages and a proprietary format option must be added without breaking compatibility with existing implementations.
The idea is to pick a byte that is sent at the beginning of every proprietary message and can never begin a valid protobuf message, so the recipient can tell the two formats apart with certainty. For that, one has to answer the question: what values are valid first bytes in a protobuf message?
From the documentation:

A protobuf message is a series of key-value pairs. [...] When a message is encoded, each key-value pair is turned into a record consisting of the field number, a wire type and a payload.

The field number and wire type are encoded together as a single varint, called the tag, which comes first in each record.
Since a non-empty protobuf message has to start with a tag, all that needs to be done is to identify the byte values that can begin a valid tag and avoid them when sending messages in the proprietary format.
The first byte in a protobuf message will look like this:
cnnnnttt

Where:

c is the varint continuation bit: 1 if more bytes of the tag follow, 0 if this is the last byte
n are the 4 least significant bits of the field number
t are 3 bits for encoding the wire type
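The bit layout above can be sketched in Python as follows. This is a minimal illustration, not part of any protobuf library; the function name `split_first_byte` is hypothetical.

```python
def split_first_byte(byte: int) -> tuple[int, int, int]:
    """Split the first byte of a tag into (c, nnnn, ttt)."""
    if not 0 <= byte <= 0xFF:
        raise ValueError("expected a single byte value (0..255)")
    continuation = byte >> 7              # c: 1 means more varint bytes follow
    field_low_bits = (byte >> 3) & 0x0F   # nnnn: low 4 bits of the field number
    wire_type = byte & 0x07               # ttt: wire type
    return continuation, field_low_bits, wire_type


# 0x0a = 0b00001010: field number 1, wire type 2 (length-delimited)
print(split_first_byte(0x0A))  # (0, 1, 2)
```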

Making use of undefined wire types

Since there are only 6 known wire types (0 to 5), ttt needs to be something between 0b000 and 0b101. Values 6 (0b110) and 7 (0b111) do not encode anything valid (as of protobuf v3) and could be used to identify a proprietary protocol. Based on that, the following ranges are available:

bytes with the least significant nibble set to 0x6 (0b????0110) - 0x06, 0x16, 0x26, etc
bytes with the least significant nibble set to 0xe (0b????1110) - 0x0e, 0x1e, 0x2e, etc
bytes with the least significant nibble set to 0x7 (0b????0111) - 0x07, 0x17, 0x27, etc
bytes with the least significant nibble set to 0xf (0b????1111) - 0x0f, 0x1f, 0x2f, etc
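As a quick check of the four ranges above, the sketch below enumerates every byte whose three lowest bits are 6 or 7 and collects the distinct low nibbles. The variable names are illustrative only.

```python
UNDEFINED_WIRE_TYPES = {6, 7}  # ttt values with no meaning as of protobuf v3

# Every byte value whose wire type bits are 0b110 or 0b111
undefined_bytes = [b for b in range(256) if (b & 0x07) in UNDEFINED_WIRE_TYPES]

# The low nibbles these bytes can end with
low_nibbles = sorted({b & 0x0F for b in undefined_bytes})

print(len(undefined_bytes))           # 64
print([hex(n) for n in low_nibbles])  # ['0x6', '0x7', '0xe', '0xf']
```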

Making use of deprecated wire types

One could also take advantage of the fact that wire types 3 and 4 (start group and end group) are deprecated. If the application does not use them, they can be claimed for a proprietary format. This gives these four extra ranges:

bytes with the least significant nibble set to 0x3 (0b????0011) - 0x03, 0x13, 0x23, etc
bytes with the least significant nibble set to 0xb (0b????1011) - 0x0b, 0x1b, 0x2b, etc
bytes with the least significant nibble set to 0x4 (0b????0100) - 0x04, 0x14, 0x24, etc
bytes with the least significant nibble set to 0xc (0b????1100) - 0x0c, 0x1c, 0x2c, etc
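Whether the application really avoids groups can be checked against captured traffic. The sketch below assumes a list of raw protobuf messages is available as `bytes` objects. The sample data and the helper name `starts_with_group_tag` are made up for illustration.

```python
GROUP_WIRE_TYPES = {3, 4}  # deprecated start-group and end-group wire types


def starts_with_group_tag(message: bytes) -> bool:
    """True if the first byte of the message carries wire type 3 or 4."""
    return len(message) > 0 and (message[0] & 0x07) in GROUP_WIRE_TYPES


# Hypothetical captured messages; the last one is an empty group in field 1.
samples = [
    bytes([0x0A, 0x01, 0x41]),  # field 1, length-delimited
    bytes([0x08, 0x01]),        # field 1, varint
    bytes([0x0B, 0x0C]),        # field 1, start group + end group
]

offenders = [m.hex() for m in samples if starts_with_group_tag(m)]
print(offenders)  # ['0b0c']
```

An empty result over a representative capture suggests, but does not prove, that these ranges are free to use.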

Making use of field number zero

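Additionally, field numbers must be positive integers, so a tag with field number zero is never valid. The following ranges have all four field number bits of the first byte set to zero:

0b00000000 to 0b00000111 (0x00 to 0x07)
0b10000000 to 0b10000111 (0x80 to 0x87)

Only the first range always means field number zero. In the second range the continuation bit is set, so the remaining field number bits come from the following bytes: 0x80 0x01, for example, is the tag for field 16 with wire type 0. Bytes 0x80, 0x81, 0x82 and 0x85 are therefore safe only if the application never uses field numbers that are multiples of 16. The other four bytes in that range are already covered by the wire type rules above.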
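To see which first bytes real field numbers produce, the sketch below encodes a tag as a standard base-128 varint. The function `encode_tag` is a hypothetical helper written for this example, not a protobuf library call. Note that field 16 yields a first byte of 0x80, because its low four field number bits are zero and the rest spill into the second byte.

```python
def encode_tag(field_number: int, wire_type: int) -> bytes:
    """Encode (field_number << 3) | wire_type as a little-endian base-128 varint."""
    if field_number < 1:
        raise ValueError("field numbers must be positive")
    if not 0 <= wire_type <= 7:
        raise ValueError("wire type must fit in 3 bits")
    value = (field_number << 3) | wire_type
    out = bytearray()
    while True:
        low7 = value & 0x7F
        value >>= 7
        if value:
            out.append(low7 | 0x80)  # more bytes follow: set the continuation bit
        else:
            out.append(low7)         # last byte: continuation bit clear
            return bytes(out)


print(encode_tag(1, 0).hex())   # 08
print(encode_tag(15, 2).hex())  # 7a
print(encode_tag(16, 0).hex())  # 8001
```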

Final remarks

Keep in mind that these ranges are not guaranteed to be future-proof, as later protobuf versions may assign meaning to them. The ranges that encode field number zero are probably the safest, since the field number rule has little reason to change and the protocol designer can avoid any conflicting field numbers when defining messages anyway. The ranges for the deprecated wire types are probably safe as well, since future protobuf versions cannot reassign those wire types without breaking backwards compatibility. Both should be preferred over the ranges for the undefined wire types.
It is also desirable that the proprietary format check runs before the protobuf parser, so that unnecessary parser exceptions are avoided. Even without that precaution, a protobuf parser is expected to reject a message that starts with one of these bytes. For an undefined wire type in particular, it has no way to continue, because it cannot know how many bytes to skip to reach the rest of the message.
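A minimal sketch of running the proprietary check before the protobuf parser is shown below. The marker value 0x07 and the function name `classify_message` are assumptions made for this example; any byte from the usable ranges could serve as the marker.

```python
# Hypothetical marker: 0x07 decodes as field number zero with wire type 7,
# so two independent rules exclude it from valid protobuf tags.
PROPRIETARY_MARKER = 0x07


def classify_message(message: bytes) -> tuple[str, bytes]:
    """Return ("proprietary", payload) or ("protobuf", message)."""
    if message[:1] == bytes([PROPRIETARY_MARKER]):
        # Strip the marker so the proprietary decoder only sees its payload.
        return "proprietary", message[1:]
    # Everything else, including an empty message, goes to the protobuf parser.
    return "protobuf", message


print(classify_message(bytes([0x07, 0xDE, 0xAD]))[0])  # proprietary
print(classify_message(bytes([0x0A, 0x01, 0x41]))[0])  # protobuf
```

The caller would then hand the second element of the result to the matching decoder.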

Summary of usable bytes

A summary of all unique bytes that can be used by a proprietary format protocol:
0x00 0x01 0x02 0x03 0x04 0x05 0x06 0x07
0x80 0x81 0x82 0x83 0x84 0x85 0x86 0x87
     0x16 0x26 0x36 0x46 0x56 0x66 0x76      0x96 0xa6 0xb6 0xc6 0xd6 0xe6 0xf6
     0x17 0x27 0x37 0x47 0x57 0x67 0x77      0x97 0xa7 0xb7 0xc7 0xd7 0xe7 0xf7
0x0e 0x1e 0x2e 0x3e 0x4e 0x5e 0x6e 0x7e 0x8e 0x9e 0xae 0xbe 0xce 0xde 0xee 0xfe
0x0f 0x1f 0x2f 0x3f 0x4f 0x5f 0x6f 0x7f 0x8f 0x9f 0xaf 0xbf 0xcf 0xdf 0xef 0xff
     0x13 0x23 0x33 0x43 0x53 0x63 0x73      0x93 0xa3 0xb3 0xc3 0xd3 0xe3 0xf3
0x0b 0x1b 0x2b 0x3b 0x4b 0x5b 0x6b 0x7b 0x8b 0x9b 0xab 0xbb 0xcb 0xdb 0xeb 0xfb
     0x14 0x24 0x34 0x44 0x54 0x64 0x74      0x94 0xa4 0xb4 0xc4 0xd4 0xe4 0xf4
0x0c 0x1c 0x2c 0x3c 0x4c 0x5c 0x6c 0x7c 0x8c 0x9c 0xac 0xbc 0xcc 0xdc 0xec 0xfc

For a total of 136 possibilities. Bytes 0x83 and 0x84 satisfy both the field number zero rule and the deprecated wire type rule, so they are listed only once. Note that 0x80, 0x81, 0x82 and 0x85 also begin valid tags for field numbers that are multiples of 16, so they are usable only if the application avoids such field numbers.
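The summary can be recomputed from the three rules, as sketched below. The function and parameter names are illustrative. Collecting the bytes in a set counts 0x83 and 0x84 once even though they match two rules. The optional flag excludes the bytes in 0x80 to 0x87 that rely only on the field number zero rule, for applications that use field numbers divisible by 16.

```python
def usable_first_bytes(include_continued_field_zero: bool = True) -> set[int]:
    """Collect the bytes that the three rules mark as not starting a valid tag."""
    usable = set()
    for b in range(256):
        wire_type = b & 0x07
        field_bits = (b >> 3) & 0x0F
        continued = bool(b & 0x80)
        if wire_type in (6, 7):    # undefined wire types
            usable.add(b)
        elif wire_type in (3, 4):  # deprecated group wire types
            usable.add(b)
        elif field_bits == 0 and (include_continued_field_zero or not continued):
            usable.add(b)          # field number zero in the first byte
    return usable


print(len(usable_first_bytes()))       # 136
print(len(usable_first_bytes(False)))  # 132
```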

Reference: https://protobuf.dev/programming-guides/encoding/