Skip to content

Instantly share code, notes, and snippets.

@klauspost
Last active June 1, 2023 08:24
Show Gist options
  • Star 0 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
  • Save klauspost/4be0a5e7653a0ee52530052e8c0f3e55 to your computer and use it in GitHub Desktop.
Save klauspost/4be0a5e7653a0ee52530052e8c0f3e55 to your computer and use it in GitHub Desktop.

NG byte aligned compression

Intro

This is trying to combine the lessons learned from LZ4, Snappy, S2 and friends.

  • LZ4: Allow matches > 65536. More efficient long matches. More efficient short offsets.
  • Snappy: Improve the max match length. More efficient longer match offsets.
  • S2: More efficient repeat storage, relative offsets. Add 24 bits copy lengths.

Additional interesting:

  • Minumim match length 8
  • Minimum offset 8
  • No match start less than 8 bytes from end.

We stay away from anything that would make decompression significantly slower/more intensive. That means no entropy compression and no ROLZ, no transforms or similar.

Encoding is to remain byte-aligned, with each operation starting at a full byte.

Format

A compressed file is a sequence of operations.

Each operation starts with a tag.

Tags are stored as [ID bits 6...7][VALUE 0...5]

The ID represents the operation to perform.

ID Command
00 Literals
01 Lits + Copy Near
10 Copy
11 Repeat

The VALUE is the lower 6 bits and represents an unsigned integer. The value has different meaning for each operation type.

The maximum length of an operations, excluding any data to copy for literals is 1+3+3 = 7 bytes. (offset > 132096 and length > 8235)

Value Represents Range
0-60 Value = value 0-60
61 Read 1 byte. Value is 61+uint8(read). 61-316
62 Read 2 bytes. Value is 317 + uint16(read) 317-65852
63 Read 3 Bytes. Value is 65853 + uint24(read) 65853 - 16843068

Extra bytes to be read are immediately after the tag.

Values longer than 16843068 cannot be represented. Compressors can split these into multiple statements.

Different operations have different use of the value.

It is not possible to encode invalid values. While the operation may get values that are impossible to satisfy due to the amount currently decoded, invalid values cannot be represented.

Commands

Copy Literals.

out_length = value + 1

Literals are uncompressed data stored directly in the byte stream.

Copy out_length bytes from the stream to output.

Lits + Near Copy

This command will output a number of literals followed by a match copy with an offset <= 65536 and a max copy length of 11 bytes.

copy_len = (value&7) + 4 // 4 -> 11
lit_length = (value >> 3) + 1 // 1 -> 1052691
offset = read_2_bytes() + 1 // 1 -> 65536
copy literals
apply copy
(next operation)

Length depends on literals. The following table shows the total number of bytes (excluding literals) for a given number of literals:

Literals → 1 -> 8 9 -> 40 40 -> 8232 8233-> 2105384
Output size 3 4 5 6

If more than 11 copy bytes should be emitted, use a repeat command to emit more.

Copy

Copy with long offset, either reading a 1, 2 or 3 byte offset.

Two lowest bits of value determines copy type:

Bits Operation Read bytes Bits from value Offset Bits Offset base
00, 01 Copy 1 1 1 9 1
10 Copy 2 2 1 17 513
11 Copy 3 3 -2 22 131585

Copy 1

Offsets 1->512. Reads 1 byte.

  • Offset = uint8 | value[1] << 8 + 1
  • Length = value >> 2 + 4
Length → 4 -> 19 20 -> 83 84 -> 16467 16468 -> 4210771
Output size 2 3 4 5

Copy 2

Offsets 513 -> 131584. 17 bits offset.

Read 2 bytes (little endian) as x.

  • Offset = x | value[2] << 16 + 513
  • Length = value[3:] + 4

Output sizes:

Length → 4 -> 11 12 -> 43 44 -> 8235 8236 -> 2105388
Output size 3 4 5 6

Copy 3

Offsets 131585 -> 4325888. 22 bits offset. Cheap length on matches.

Length 4 matches are allowed, but not recommended.

Read 3 byte as x (little endian order)

  • Offset = x[0:21] + 131585
  • Length = x[22:24] + (value &~3) + 4

Output sizes:

Length → 4 -> 64 65 -> 320 321 -> 65856 65856 -> 16843072
Output size 4 5 6 7

It is not possible to represent a copy longer than 4325888 bytes, ~4 MiB back. This seems like a reasonable limit.

Repeat (with offset)

Repeat last copy offset.

out_length = value[2:] + 1 || 4

Offset can be modified. value[0:2] contains the modifier:

Modifier Represents Read Bytes Offset Range Length
00 Use offset as-is 0 0 1-4210768
01 offset = offset +- 2 0 -2 -> 2 (excl 0) 4-...
10 offset = offset + read_signed_int() 1 -128 -> +127 4-4210771
11 offset = offset + read_signed_int16() 2 -32768 -> +32767 4-4210771

Note how the modifier is added to the length.

This also ensures that a copy can always be followed by a repeat if the length cannot be represented.

Length has pretty cheap encoding:

Offset ↓ / Length → 1 -> 16 17 -> 80 80 -> 16464 16464-> 4210768
0 1 2 3 4
-2 -> 2 1 2 3 4
-128 -> +127 2 3 4 5
-32768 -> +32767 3 4 5 6

Offsets have are bigger when going down the table. Ie -32768 -> +32767 has range 4 -> 19, 20 -> 83, etc.

	delta := int(offset) - int(lastoffset)
	switch {
	case delta == 0:
		// emit modifier '00'
	case delta <= 2 && delta >= -2:
		// emit modifier '01' - can be omitted for speed.
	case delta >= math.MinInt8 && delta <= math.MaxInt8:
		// emit modifier '10'
	case delta >= math.MinInt16 && delta <= math.MaxInt16:
		// emit modifier '11'
	}

Encoding decision tree

Note that encoding can be much simpler for faster compression modes or rely on internal checks to avoid the full decision tree.

	if e.MatchLen == 0 {
		s.emitLits(e.LitLen)
		return
	}

	// Always at least just as good as the alternatives.
	delta := int(e.Offset) - int(s.lastoffset)
	if delta >= math.MinInt8 && delta <= math.MaxInt8 {
		s.emitLits(e.LitLen)
		s.emitRepeat(e.Offset, e.MatchLen)
		return
	}
	bigOffset := e.Offset > shortOffset
	canRepeat := delta >= math.MinInt16 && delta <= math.MaxInt16 && bigOffset

	// If no literals, we don't have to consider the combination
	if e.LitLen == 0 {
		if canRepeat {
			s.emitRepeat(e.Offset, e.MatchLen)
			return
		}
		s.emitCopy(e)
		return
	}

	// Add combined if possible and we have either very small match or lit length.
	if bigOffset && e.Offset <= 65536 && (e.MatchLen <= 11 || e.LitLen < 7) && e.LitLen < 2<<20 {
		s.emitLitCopy(e)
		return
	}

	// Emit lits separately
	s.emitLits(e.LitLen)

	// Repeat if it makes sense.
	if canRepeat {
		s.emitRepeat(e.Offset, e.MatchLen)
		return
	}
	s.emitCopy(e)

IDEAS AND STUFF

Endgame

It would be really nice if a tag cannot start with less than 8 bytes of the encoded block ending.

That way we always knows there will be enough data to read the command and we don't have to validate that.

It will however more or less force the encoder to output the last 6 bytes as literals.

Repeats

We could keep 2 offsets and maybe add 1 bit to repeats.

Tried it. Worse in all cases.

Literals

We often have extra bits for literal copies in the value. Can't see anything reasonable to use.

@klauspost
Copy link
Author

klauspost commented Mar 8, 2023

NYC: (nyc-taxi-data-10M.csv   s2  3   3325605752  773678211)
    * S2 Size: 765447024
    * NG Size: 686325672 - Brute Force: 686303567
    * Gain: 79121352 bytes (10.34%; brute: 10.34%; 11.26% ex lits)

ENWIK9:(enwik9  s2  3   1000000000  370860824)
    * S2 Size: 364268935
    * NG Size: 336917924 - Brute Force: 336896076
    * Gain: 27351011 bytes (7.51%; brute: 7.51%; 8.97% ex lits)

RANKS: (github-ranks-backup.bin s2  3   1862623243  553965705)
    * S2 Size: 536123900
    * NG Size: 496903025 - Brute Force: 496774758
    * Gain: 39220875 bytes (7.32%; brute: 7.34%; 10.24% ex lits)

GITHUB: (github-june-2days-2019.json s2  3   6273951764  826384742)
    * S2 Size: 809517862
    * NG Size: 788315878 - Brute Force: 788176052
    * Gain: 21201984 bytes (2.62%; brute: 2.64%; 4.03% ex lits)

CONSENSUS: (consensus.db.10gb   s2  3   10737418240 4210593068)
    * S2 Size: 4157536182
    * NG Size: 4103624175 - Brute Force: 4102970365
    * Gain: 53912007 bytes (1.30%; brute: 1.31%; 7.65% ex lits)

SOFIA:(sofia-air-quality-dataset.tar   s2      3       15464463872     4017422246)
    * S2 Size: 3980536668
    * NG Size: 3728298101 - Brute Force: 3728268060
    * Gain: 252238567 bytes (6.34%; brute: 6.34%; 7.15% ex lits)

MINT: (rawstudio-mint14.tar    s2  3   8558382592  3905189070)
    * S2 Size: 3705139121
    * NG Size: 3619864350 - Brute Force: 3618459691
    * Gain: 85274771 bytes (2.30%; brute: 2.34%; 7.29% ex lits)

@klauspost
Copy link
Author

klauspost commented Mar 10, 2023

Alternatively we could just change S2 to emit long matches as 3 bytes instead of 4 (S2 sizes adjusted)

    decode_test.go:218: ENWIK9:
    decode_test.go:219: S2 Size: 340160015
    decode_test.go:220: NG Size: 336917924 - Brute Force: 336896076
    decode_test.go:223: Gain: 3242091 bytes (0.95%; brute: 0.96%; 1.15% ex lits)
    decode_test.go:218: RANKS:
    decode_test.go:219: S2 Size: 510767105
    decode_test.go:220: NG Size: 496903025 - Brute Force: 496774758
    decode_test.go:223: Gain: 13864080 bytes (2.71%; brute: 2.74%; 3.88% ex lits)
    decode_test.go:218: GITHUB:
    decode_test.go:219: S2 Size: 781331477
    decode_test.go:220: NG Size: 788315878 - Brute Force: 788176052
    decode_test.go:223: Gain: -6984401 bytes (-0.89%; brute: -0.88%; -1.40% ex lits)
    decode_test.go:218: CONSENSUS:
    decode_test.go:219: S2 Size: 4121907724
    decode_test.go:220: NG Size: 4103624175 - Brute Force: 4102970365
    decode_test.go:223: Gain: 18283549 bytes (0.44%; brute: 0.46%; 2.73% ex lits)
    decode_test.go:218: NYC:
    decode_test.go:219: S2 Size: 688973678
    decode_test.go:220: NG Size: 686325672 - Brute Force: 686303567
    decode_test.go:223: Gain: 2648006 bytes (0.38%; brute: 0.39%; 0.42% ex lits)
    decode_test.go:218: MINT:
    decode_test.go:219: S2 Size: 3661007999
    decode_test.go:220: NG Size: 3619864350 - Brute Force: 3618459691
    decode_test.go:223: Gain: 41143649 bytes (1.12%; brute: 1.16%; 3.65% ex lits)
    decode_test.go:218: SOFIA:
    decode_test.go:219: S2 Size: 3668146768
    decode_test.go:220: NG Size: 3728298101 - Brute Force: 3728268060
    decode_test.go:223: Gain: -60151333 bytes (-1.64%; brute: -1.64%; -1.87% ex lits)

LOL

@klauspost
Copy link
Author

klauspost commented Mar 11, 2023

With scoring tweaked for this, it might end up with a 2-5% gain, traded for a lot of complexity compared to a one line change in S2.

NG tests: https://github.com/klauspost/compress/blob/ng-comp/s2/decode_test.go#L192

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment