klauspost/byte-compression.md

## byte-compression.md

      
NG byte aligned compression

Intro

This is trying to combine the lessons learned from LZ4, Snappy, S2 and friends.

LZ4: Allow matches > 65536. More efficient long matches. More efficient short offsets.
Snappy: Improve the max match length. More efficient longer match offsets.
S2: More efficient repeat storage, relative offsets. Add 24-bit copy lengths.

Additional ideas of interest:

Minimum match length 8
Minimum offset 8
No match start less than 8 bytes from end.

We stay away from anything that would make decompression significantly slower/more intensive.
That means no entropy compression and no ROLZ, no transforms or similar.
Encoding is to remain byte-aligned, with each operation starting at a full byte.

Format

A compressed file is a sequence of operations.
Each operation starts with a tag.
Tags are stored as [ID bits 6...7][VALUE 0...5]
The ID represents the operation to perform.


| ID | Command |
|----|---------|
| 00 | Literals |
| 01 | Lits + Copy Near |
| 10 | Copy |
| 11 | Repeat |


The VALUE is the lower 6 bits and represents an unsigned integer.
The value has different meaning for each operation type.
The maximum length of an operation, excluding any literal data to copy, is 1+3+3 = 7 bytes (offset > 132096 and length > 8235).


| Value | Represents | Range |
|-------|------------|-------|
| 0-60 | Value = value | 0-60 |
| 61 | Read 1 byte. Value is 61 + uint8(read) | 61-316 |
| 62 | Read 2 bytes. Value is 317 + uint16(read) | 317-65852 |
| 63 | Read 3 bytes. Value is 65853 + uint24(read) | 65853-16843068 |


Extra bytes to be read are immediately after the tag.
Values larger than 16843068 cannot be represented. Compressors can split these into multiple operations.
Different operations use the value differently.
It is not possible to encode invalid values. While an operation may get values that are impossible to
satisfy given the amount of data decoded so far, invalid values cannot be represented.
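
The value table can be turned into a small tag reader. The following is a minimal sketch, not the reference implementation: the name `readValue`, the error value and the little-endian order of the extra bytes are assumptions made for illustration (the text only states little endian for copy offsets).

```go
package main

import "errors"

// errShort is a hypothetical error returned when the input ends
// before all extra bytes of a tag could be read.
var errShort = errors.New("input too short")

// readValue decodes one tag from the front of src.
// It returns the 2-bit operation ID, the decoded VALUE and the number
// of bytes consumed (tag plus extra bytes).
// Assumption: multi-byte extras are stored little endian.
func readValue(src []byte) (id uint8, value uint32, n int, err error) {
	if len(src) == 0 {
		return 0, 0, 0, errShort
	}
	tag := src[0]
	id = tag >> 6
	v := uint32(tag & 63)
	switch v {
	case 61: // 1 extra byte, range 61-316
		if len(src) < 2 {
			return 0, 0, 0, errShort
		}
		return id, 61 + uint32(src[1]), 2, nil
	case 62: // 2 extra bytes, range 317-65852
		if len(src) < 3 {
			return 0, 0, 0, errShort
		}
		return id, 317 + (uint32(src[1]) | uint32(src[2])<<8), 3, nil
	case 63: // 3 extra bytes, range 65853-16843068
		if len(src) < 4 {
			return 0, 0, 0, errShort
		}
		x := uint32(src[1]) | uint32(src[2])<<8 | uint32(src[3])<<16
		return id, 65853 + x, 4, nil
	}
	// 0-60: the value is stored directly in the tag.
	return id, v, 1, nil
}
```

Because each range starts where the previous one ends, every value has exactly one encoding, which is what makes invalid values unrepresentable.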

Commands

Copy Literals.

out_length = value + 1
Literals are uncompressed data stored directly in the byte stream.
Copy out_length bytes from the stream to output.
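
On the encoding side, emitting literals amounts to writing a tag with ID `00` and `value = length - 1`, followed by the raw bytes. The sketch below is one possible way to do that; `appendTag` and `appendLiterals` are hypothetical helper names, and the little-endian order of the extra value bytes is an assumption.

```go
package main

// maxValue is the largest VALUE a tag can carry.
const maxValue = 16843068

// appendTag appends a tag with the given 2-bit id and VALUE to dst.
// value must not exceed maxValue.
// Assumption: extra bytes are stored little endian.
func appendTag(dst []byte, id uint8, value uint32) []byte {
	switch {
	case value <= 60:
		return append(dst, id<<6|uint8(value))
	case value <= 316:
		return append(dst, id<<6|61, uint8(value-61))
	case value <= 65852:
		v := value - 317
		return append(dst, id<<6|62, uint8(v), uint8(v>>8))
	default:
		v := value - 65853
		return append(dst, id<<6|63, uint8(v), uint8(v>>8), uint8(v>>16))
	}
}

// appendLiterals emits lits as one or more Literals operations (ID 00).
// Runs longer than maxValue+1 bytes are split into several operations.
func appendLiterals(dst, lits []byte) []byte {
	for len(lits) > 0 {
		n := len(lits)
		if n > maxValue+1 {
			n = maxValue + 1
		}
		dst = appendTag(dst, 0, uint32(n-1)) // out_length = value + 1
		dst = append(dst, lits[:n]...)
		lits = lits[n:]
	}
	return dst
}
```

For example, three literal bytes cost one tag byte, while 100 literal bytes need a tag plus one extra byte.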

Lits + Near Copy

This command outputs a number of literals followed by a match copy with an offset <= 65536
and a maximum copy length of 11 bytes.

	copy_len = (value & 7) + 4     // 4 -> 11
	lit_length = (value >> 3) + 1  // 1 -> 2105384
	offset = read_2_bytes() + 1    // 1 -> 65536
	copy literals
	apply copy
	(next operation)

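The encoded length depends on the number of literals. The following table shows the total number of bytes (excluding literals) for a given number of literals:

| Literals → | 1 -> 8 | 9 -> 40 | 41 -> 8232 | 8233 -> 2105384 |
|------------|--------|---------|------------|-----------------|
| Output size | 3 | 4 | 5 | 6 |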


If more than 11 copy bytes should be emitted, use a repeat command to emit more.
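
The field split and the size table for this command can be checked with a few lines of Go. This is an illustrative sketch only; `litCopyFields`, `litCopyValue` and `litCopySize` are hypothetical names, and the size thresholds are taken from the value table in the Format section.

```go
package main

// litCopyFields splits the VALUE of a "Lits + Copy Near" operation
// into the literal count and the copy length.
func litCopyFields(value uint32) (litLen, copyLen int) {
	copyLen = int(value&7) + 4 // low 3 bits: 4 -> 11
	litLen = int(value>>3) + 1 // remaining bits: 1 and up
	return litLen, copyLen
}

// litCopyValue is the inverse of litCopyFields.
// It expects litLen >= 1 and 4 <= copyLen <= 11.
func litCopyValue(litLen, copyLen int) uint32 {
	return uint32(litLen-1)<<3 | uint32(copyLen-4)
}

// litCopySize returns the encoded size in bytes, excluding the literals:
// one tag byte, 0-3 extra value bytes and 2 offset bytes.
func litCopySize(litLen, copyLen int) int {
	v := litCopyValue(litLen, copyLen)
	switch {
	case v <= 60:
		return 3
	case v <= 316:
		return 4
	case v <= 65852:
		return 5
	}
	return 6
}
```

With a copy length of 4, `litCopySize` reproduces the boundaries in the table. Since the copy length shares the value with the literal count, a longer copy at the upper edge of a range can tip into the next size: `litCopySize(8, 11)` is 4, not 3.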

Copy

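Copy with a long offset, reading a 1, 2 or 3 byte offset.
The two lowest bits of the value determine the copy type:

| Bits | Operation | Read bytes | Bits from value | Offset bits | Offset base |
|------|-----------|------------|-----------------|-------------|-------------|
| 00, 01 | Copy 1 | 1 | 1 | 9 | 1 |
| 10 | Copy 2 | 2 | 1 | 17 | 513 |
| 11 | Copy 3 | 3 | -2 | 22 | 131585 |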


Copy 1

Offsets 1 -> 512. Reads 1 byte.

	Offset = (uint8 | value[1] << 8) + 1
	Length = (value >> 2) + 4

| Length → | 4 -> 19 | 20 -> 83 | 84 -> 16467 | 16468 -> 4210771 |
|----------|---------|----------|-------------|------------------|
| Output size | 2 | 3 | 4 | 5 |


Copy 2

Offsets 513 -> 131584. 17 bits offset.
Read 2 bytes (little endian) as x.

	Offset = (x | value[2] << 16) + 513
	Length = value[3:] + 4

Output sizes:

| Length → | 4 -> 11 | 12 -> 43 | 44 -> 8235 | 8236 -> 2105387 |
|----------|---------|----------|------------|-----------------|
| Output size | 3 | 4 | 5 | 6 |
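
The Copy 2 formulas can be written out as a small decoder. This is a sketch under one stated assumption: `value[2]` is read as bit 2 of the value (counting from 0) and `value[3:]` as all bits above it, which agrees with the length table (values up to 60 give lengths 4 -> 11). The function name `decodeCopy2` is hypothetical, and the two type-selecting low bits are ignored here.

```go
package main

// decodeCopy2 computes offset and length of a Copy 2 operation.
// value is the decoded VALUE of the tag, x the two offset bytes
// already combined in little-endian order.
// Assumption: bit 2 of value is offset bit 16, bits 3 and up are the length.
func decodeCopy2(value uint32, x uint16) (offset, length int) {
	high := (value >> 2) & 1 // 17th offset bit, taken from the value
	offset = int(uint32(x)|high<<16) + 513
	length = int(value>>3) + 4
	return offset, length
}
```

The smallest encodable offset is then 513 (all offset bits zero) and the largest 131584 (all 17 bits set), matching the stated range.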


Copy 3

Offsets 131585 -> 4325888. 22 bits offset. Cheap length on matches.
Length 4 matches are allowed, but not recommended.
Read 3 bytes as x (little endian order).

	Offset = x[0:21] + 131585
	Length = x[22:24] + (value &~3) + 4

Output sizes:

| Length → | 4 -> 64 | 65 -> 320 | 321 -> 65856 | 65857 -> 16843072 |
|----------|---------|-----------|--------------|-------------------|
| Output size | 4 | 5 | 6 | 7 |


It is not possible to represent a copy with an offset larger than 4325888 bytes (~4 MiB back). This seems like a reasonable limit.
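
The offset ranges of the three copy types, and the way a copy is applied to the output, can be sketched as follows. The function names are hypothetical. The byte-by-byte loop assumes that overlapping copies (offset smaller than length) repeat the pattern as they do in LZ4 and Snappy; the text above does not state this explicitly.

```go
package main

import "errors"

// errOffset is a hypothetical error for offsets that point before
// the start of the output.
var errOffset = errors.New("offset out of range")

// copyKind returns which Copy variant (1, 2 or 3) can carry the offset,
// or 0 if the offset cannot be represented by any of them.
func copyKind(offset int) int {
	switch {
	case offset < 1 || offset > 4325888:
		return 0
	case offset <= 512:
		return 1
	case offset <= 131584:
		return 2
	}
	return 3
}

// applyCopy appends length bytes to dst, reading from offset bytes
// before the current end of dst. Copying one byte at a time lets
// the source and destination overlap.
func applyCopy(dst []byte, offset, length int) ([]byte, error) {
	if offset < 1 || offset > len(dst) {
		return dst, errOffset
	}
	for i := 0; i < length; i++ {
		dst = append(dst, dst[len(dst)-offset])
	}
	return dst, nil
}
```

A production decoder would copy in larger chunks when the offset allows it; the loop here favors clarity over speed.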

Repeat (with offset)

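Repeat last copy offset.
out_length = value[2:] + 1 when the offset is used as-is (modifier 00), value[2:] + 4 otherwise.
The offset can be modified. value[0:2] contains the modifier:

| Modifier | Represents | Read bytes | Offset range | Length |
|----------|------------|------------|--------------|--------|
| 00 | Use offset as-is | 0 | 0 | 1-4210768 |
| 01 | offset = offset +- 2 | 0 | -2 -> 2 (excl 0) | 4-... |
| 10 | offset = offset + read_signed_int() | 1 | -128 -> +127 | 4-4210771 |
| 11 | offset = offset + read_signed_int16() | 2 | -32768 -> +32767 | 4-4210771 |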


Note how the modifier is added to the length.
This also ensures that a copy can always be followed by a repeat if the length cannot be represented.
Length has pretty cheap encoding:


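| Offset ↓ / Length → | 1 -> 16 | 17 -> 80 | 81 -> 16464 | 16465 -> 4210768 |
|---------------------|---------|----------|-------------|------------------|
| 0 | 1 | 2 | 3 | 4 |
| -2 -> 2 | 1 | 2 | 3 | 4 |
| -128 -> +127 | 2 | 3 | 4 | 5 |
| -32768 -> +32767 | 3 | 4 | 5 | 6 |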


Length ranges are higher for the modified-offset rows further down the table, i.e. -32768 -> +32767 has ranges 4 -> 19, 20 -> 83, etc.

	delta := int(offset) - int(lastoffset)
	switch {
	case delta == 0:
		// emit modifier '00'
	case delta <= 2 && delta >= -2:
		// emit modifier '01' - can be omitted for speed.
	case delta >= math.MinInt8 && delta <= math.MaxInt8:
		// emit modifier '10'
	case delta >= math.MinInt16 && delta <= math.MaxInt16:
		// emit modifier '11'
	}
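
A self-contained version of the switch above might look like the following. It is only a sketch: `repeatModifier` and `repeatBaseLength` are hypothetical names, and the base lengths are read off the Length column of the modifier table.

```go
package main

import "math"

// repeatModifier picks the cheapest repeat modifier that can express
// the change from lastOffset to offset. It returns the 2-bit modifier,
// the number of extra offset bytes to write, and ok=false when the
// delta is too large for a repeat.
func repeatModifier(offset, lastOffset int) (mod, extra int, ok bool) {
	delta := offset - lastOffset
	switch {
	case delta == 0:
		return 0, 0, true // '00': use offset as-is
	case delta >= -2 && delta <= 2:
		return 1, 0, true // '01': small adjustment, no extra bytes
	case delta >= math.MinInt8 && delta <= math.MaxInt8:
		return 2, 1, true // '10': signed 8-bit delta
	case delta >= math.MinInt16 && delta <= math.MaxInt16:
		return 3, 2, true // '11': signed 16-bit delta
	}
	return 0, 0, false
}

// repeatBaseLength returns the shortest length a repeat with the given
// modifier can encode: 1 for an unmodified offset, 4 otherwise.
func repeatBaseLength(mod int) int {
	if mod == 0 {
		return 1
	}
	return 4
}
```

When `ok` is false the encoder has to fall back to a regular copy.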

Encoding decision tree

Note that encoding can be much simpler for faster compression modes or rely on internal checks to avoid the full decision tree.

	if e.MatchLen == 0 {
		s.emitLits(e.LitLen)
		return
	}

	// Always at least just as good as the alternatives.
	delta := int(e.Offset) - int(s.lastoffset)
	if delta >= math.MinInt8 && delta <= math.MaxInt8 {
		s.emitLits(e.LitLen)
		s.emitRepeat(e.Offset, e.MatchLen)
		return
	}
	bigOffset := e.Offset > shortOffset
	canRepeat := delta >= math.MinInt16 && delta <= math.MaxInt16 && bigOffset

	// If no literals, we don't have to consider the combination
	if e.LitLen == 0 {
		if canRepeat {
			s.emitRepeat(e.Offset, e.MatchLen)
			return
		}
		s.emitCopy(e)
		return
	}

	// Add combined if possible and we have either very small match or lit length.
	if bigOffset && e.Offset <= 65536 && (e.MatchLen <= 11 || e.LitLen < 7) && e.LitLen < 2<<20 {
		s.emitLitCopy(e)
		return
	}

	// Emit lits separately
	s.emitLits(e.LitLen)

	// Repeat if it makes sense.
	if canRepeat {
		s.emitRepeat(e.Offset, e.MatchLen)
		return
	}
	s.emitCopy(e)

IDEAS AND STUFF

Endgame

It would be really nice if a tag could not start within the last 8 bytes of the encoded block.
That way we always know there is enough data to read the command, and we don't have to validate that.
It will, however, more or less force the encoder to output the last 6 bytes as literals.

Repeats


We could keep 2 offsets and maybe add 1 bit to repeats.

Tried it. Worse in all cases.

Literals

We often have extra bits for literal copies in the value. Can't see anything reasonable to use them for.