royaltm/better-var-7.md

## better-var-7.md

      
    Raw
  

              better-var-7.md
            
          
    First let's consider the naive approach to var-int (var7) encoding.
In this scenario we just take each 7-bits of an encoded value and put them in chunks:
    while value >= 0x80 {
        let chunk = ((value as u8) & 0x7F) | 0x80;
        buffer.push(chunk);
        value >>= 7;
    }
    buffer.push(value as u8);
E.g. 0x3FFF - 0b0011_1111_1111_1111 which is encoded as:
1_1111111 0_1111111

Now let's consider: 0x4000 - 0b0100_0000_0000_0000 encoded as:
1_0000000 1_0000000 0_0000001

If we look at the difference between the consecutive encoded values it's hard not to notice that
the information about the additional (14) bit is encoded redundantly (in 2 places):
0x3FFF: 1_1111111 0_1111111
0x4000: 1_0000000 1_0000000 0_0000001 <- this bit here is really redundant!
                  ^
                  this bit informs us that "there is more data"

What we could do instead is to subtract the redundant bit from the remaining bits of data:
Let's see the improved formula:
    while value >= 0x80 {
        let chunk = ((value as u8) & 0x7F) | 0x80;
        buffer.push(chunk);
        value = (value - 0x80) >> 7;
    }
    buffer.push(value as u8);
For example:


original
encoded


0x007F
0x7F


0x0080
0x80 0x00
- now the redundant bit is gone!


0x008F
0x8F 0x00


0x00FF
0xFF 0x00


0x0100
0x80 0x01
- now the LSB bit provides the new information instead of a redundant one!


0x3FFF
0xFF 0x7E


0x4000
0x80 0x7F
- still 2 bytes! We have successfully compressed information a bit more!


Here's a link to the playground: https://play.rust-lang.org/?version=stable&mode=debug&edition=2021&gist=904130aad758e52a97abcd32699ab430
How much do we save?
Let's consider storing ordered numbers starting from 0.
As we can see from the playground examples 128 additional values between 0x4000 and 0x407f can still be encoded in 2 bytes.
128 bytes for the first encoded 32'768 bytes.
Next we save 16'511 bytes for values encoded between 0x200000 and 0x20407f.
That is 16'511 bytes saved per 6'274'944 encoded bytes.
While this is only less than 0.4% - 0.2% of saved space another reason to use this approach is that
decoding algorithm is just simpler.
We only need to shift and add chunks to the result, no need to clip the highest bit with (& 0x7F) because we
actually use the MSB bit of each chunk in our calculation. We have subtracted the highest chunk bit while
the value was encoded so now adding the highest bit just makes the result even.
original	encoded
`0x007F`	`0x7F`
`0x0080`	`0x80 0x00`	- now the redundant bit is gone!
`0x008F`	`0x8F 0x00`
`0x00FF`	`0xFF 0x00`
`0x0100`	`0x80 0x01`	- now the LSB bit provides the new information instead of a redundant one!
`0x3FFF`	`0xFF 0x7E`
`0x4000`	`0x80 0x7F`	- still 2 bytes! We have successfully compressed information a bit more!