First let's consider the naive approach to var-int (var7) encoding.
In this scenario we just take each 7-bits of an encoded value and put them in chunks:
while value >= 0x80 {
let chunk = ((value as u8) & 0x7F) | 0x80;
buffer.push(chunk);
value >>= 7;
}
buffer.push(value as u8);
E.g. 0x3FFF
- 0b0011_1111_1111_1111
which is encoded as:
1_1111111 0_1111111
Now let's consider: 0x4000
- 0b0100_0000_0000_0000
encoded as:
1_0000000 1_0000000 0_0000001
If we look at the difference between the consecutive encoded values it's hard not to notice that the information about the additional (14) bit is encoded redundantly (in 2 places):
0x3FFF: 1_1111111 0_1111111
0x4000: 1_0000000 1_0000000 0_0000001 <- this bit here is really redundant!
^
this bit informs us that "there is more data"
What we could do instead is to subtract the redundant bit from the remaining bits of data:
Let's see the improved formula:
while value >= 0x80 {
let chunk = ((value as u8) & 0x7F) | 0x80;
buffer.push(chunk);
value = (value - 0x80) >> 7;
}
buffer.push(value as u8);
For example:
original | encoded | |
---|---|---|
0x007F |
0x7F |
|
0x0080 |
0x80 0x00 |
- now the redundant bit is gone! |
0x008F |
0x8F 0x00 |
|
0x00FF |
0xFF 0x00 |
|
0x0100 |
0x80 0x01 |
- now the LSB bit provides the new information instead of a redundant one! |
0x3FFF |
0xFF 0x7E |
|
0x4000 |
0x80 0x7F |
- still 2 bytes! We have successfully compressed information a bit more! |
Here's a link to the playground: https://play.rust-lang.org/?version=stable&mode=debug&edition=2021&gist=904130aad758e52a97abcd32699ab430
How much do we save?
Let's consider storing ordered numbers starting from 0.
As we can see from the playground examples 128 additional values between 0x4000 and 0x407f can still be encoded in 2 bytes. 128 bytes for the first encoded 32'768 bytes.
Next we save 16'511 bytes for values encoded between 0x200000 and 0x20407f. That is 16'511 bytes saved per 6'274'944 encoded bytes.
While this is only less than 0.4% - 0.2% of saved space another reason to use this approach is that decoding algorithm is just simpler.
We only need to shift and add chunks to the result, no need to clip the highest bit with (& 0x7F) because we actually use the MSB bit of each chunk in our calculation. We have subtracted the highest chunk bit while the value was encoded so now adding the highest bit just makes the result even.