Skip to content

Instantly share code, notes, and snippets.

@royaltm
Last active June 21, 2024 14:13
Show Gist options
  • Save royaltm/f31ce5b1c8878e7c10ff7d335f3826e2 to your computer and use it in GitHub Desktop.
Save royaltm/f31ce5b1c8878e7c10ff7d335f3826e2 to your computer and use it in GitHub Desktop.
how to encode var-int

First let's consider the naive approach to var-int (var7) encoding.

In this scenario we just take each 7-bits of an encoded value and put them in chunks:

    while value >= 0x80 {
        let chunk = ((value as u8) & 0x7F) | 0x80;
        buffer.push(chunk);
        value >>= 7;
    }
    buffer.push(value as u8);

E.g. 0x3FFF - 0b0011_1111_1111_1111 which is encoded as:

1_1111111 0_1111111

Now let's consider: 0x4000 - 0b0100_0000_0000_0000 encoded as:

1_0000000 1_0000000 0_0000001

If we look at the difference between the consecutive encoded values it's hard not to notice that the information about the additional (14) bit is encoded redundantly (in 2 places):

0x3FFF: 1_1111111 0_1111111
0x4000: 1_0000000 1_0000000 0_0000001 <- this bit here is really redundant!
                  ^
                  this bit informs us that "there is more data"

What we could do instead is to subtract the redundant bit from the remaining bits of data:

Let's see the improved formula:

    while value >= 0x80 {
        let chunk = ((value as u8) & 0x7F) | 0x80;
        buffer.push(chunk);
        value = (value - 0x80) >> 7;
    }
    buffer.push(value as u8);

For example:

original encoded
0x007F 0x7F
0x0080 0x80 0x00 - now the redundant bit is gone!
0x008F 0x8F 0x00
0x00FF 0xFF 0x00
0x0100 0x80 0x01 - now the LSB bit provides the new information instead of a redundant one!
0x3FFF 0xFF 0x7E
0x4000 0x80 0x7F - still 2 bytes! We have successfully compressed information a bit more!

Here's a link to the playground: https://play.rust-lang.org/?version=stable&mode=debug&edition=2021&gist=904130aad758e52a97abcd32699ab430

How much do we save?

Let's consider storing ordered numbers starting from 0.

As we can see from the playground examples 128 additional values between 0x4000 and 0x407f can still be encoded in 2 bytes. 128 bytes for the first encoded 32'768 bytes.

Next we save 16'511 bytes for values encoded between 0x200000 and 0x20407f. That is 16'511 bytes saved per 6'274'944 encoded bytes.

While this is only less than 0.4% - 0.2% of saved space another reason to use this approach is that decoding algorithm is just simpler.

We only need to shift and add chunks to the result, no need to clip the highest bit with (& 0x7F) because we actually use the MSB bit of each chunk in our calculation. We have subtracted the highest chunk bit while the value was encoded so now adding the highest bit just makes the result even.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment