Sorry, folks, gonna brain dump my thoughts on FB's zstd compression here. Context: http://www.zdnet.com/article/facebook-open-sources-zstandard-data-compression-algorithm-aims-to-replace-technology-behind-zip/
The algo seems very similar to Google's brotli, in that both have a decent dictionary of common terms in place even before they start looking at the input to be compressed.
In the terminology of Mark Nelson's Data Compression Book, they generate and use a better data model of the data to be compressed, even before they start. In FB's zstd, they use "learning" to precompute this data model, whereas in brotli, I think they hardcoded it.
See RFC 7932 on brotli, where the bulk of it is Appendix A, the Static Dictionary Data they used: https://www.ietf.org/rfc/rfc7932.txt
(ref: Mark Nelson: https://www.amazon.com/Data-Compression-Book-Mark-Nelson/dp/1558514341 )
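Aside: zstd exposes that dictionary "learning" directly on the command line, so you can precompute a data model from your own corpus of small, similar files. A minimal sketch (the sample paths here are made up):
# train a dictionary from a pile of small, similar files
zstd --train samples/*.json -o samples.dict
# compress/decompress individual files against that dictionary
zstd -D samples.dict one_record.json -o one_record.json.zst
zstd -D samples.dict -d one_record.json.zst -o one_record.roundtrip.json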
As for the actual algo, both zstd and brotli use a variant of LZ77 plus an entropy coder. Not sure if zstd uses LZ77 or LZ78. For the entropy coding, though, zstd uses FSE (FiniteStateEntropy), unlike brotli, which uses the venerable Huffman coding.
Fascinating, because not *that* much has changed since the 80s when it comes to general lossless compression. Just slightly better algos for different stages, and much, much better modelling.
"""By design, Huffman can't break the "1 bit per symbol" limit, hence loses efficiency on squeezed distributions, such as Proba80. FSE is free of such limit, and its compression efficiency remains close to Shannon limit in all circumstances. However, this accuracy is not always necessary, and less compressible distributions show little difference with Huffman. On its side, Huff0 delivers in the form of a massive speed advantage."""
https://github.com/Cyan4973/FiniteStateEntropy
In terms of compression ratio, this means FSE is essentially never worse than Huffman; Huffman's remaining advantage is raw speed.
Before this, the way to get around that 1-bit-per-symbol limitation was arithmetic coding (slow! patent encumbered).
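To make the 1-bit-per-symbol point concrete, a back-of-the-envelope for a skewed two-symbol source in the spirit of the Proba80 case above (P = 0.8 / 0.2):
entropy: H = -(0.8*log2(0.8) + 0.2*log2(0.2)) ≈ 0.72 bits/symbol
Huffman: forced to spend a whole bit per symbol, so ~1.0 bits/symbol (~38% over entropy)
FSE / arithmetic coding: can get arbitrarily close to the 0.72-bit entropy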
Fun fact: bzip2 doesn't compress as well as bzip1, because bzip1 used arithmetic coding and bzip2 reverted to Huffman coding.
The catch: bzip1 is *slow*. bzip2 is orders of magnitude faster than bzip1.
Just had to get that out of my system, as a data compression geek.
Just FYI, one of the use cases of all this compression stuff is to optimize processing and distribution of geodata... planet.osm (if you're macho enough) is *huge*.
Playing with the Squash benchmark, brotli level 10 (Google's) almost always comes out ahead for total download+decompress time, whereas zstd (Facebook's) comes out ahead for compress+upload.
Makes sense.
In other news, "B-Frame" for video compression is literally middle out. https://en.wikipedia.org/wiki/Video_compression_picture_types
Archiving is like backups: it's useless if you can't easily restore it. Stick to lzma/xz.
Brotli and zstd don't make sense for a file manager/archiver like 7zip.
For dynamically generated data, e.g. IPC, you want fast compress and fast decompress, which is zstd's home turf. Brotli decompresses fast but compresses more slowly (though still on par with gzip).
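For that kind of on-the-fly use, the rough shape is just streaming through zstd at a fast compression level. A sketch (producer/consumer names and the ssh hop are hypothetical):
some_producer | zstd --fast=3 -c | ssh otherhost 'zstd -d -c | some_consumer'
# or keep it local and just cap the level low:
some_producer | zstd -1 -c > /tmp/stream.zst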
Related, for voice: https://auphonic.com/blog/2018/06/01/codec2-podcast-on-floppy-disk/ Blame Google DeepMind's WaveNet speech synthesis/recognition work.
Compress once, decompress many, basically. Like web fonts. And map tiles.
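The "compress once" workflow is basically: pre-compress the static assets offline at maximum effort, then serve the pre-compressed files as-is. A sketch with made-up paths:
# one-time, offline, at max effort; both tools keep the original file by default
for f in tiles/*.pbf static/*.js; do
  brotli -q 11 "$f"   # writes $f.br
  zstd -19 -q "$f"    # writes $f.zst (-q just means quiet here)
done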
opcode0x90: I was going to joke that somebody should accelerate zpaq with a GPU, but apparently it is a thing (not zpaq specifically): https://github.com/IlyaGrebnov/libbsc
Blah. H.264 is still sucky on GPUs. Too much weird stuff. But NVENC, when it works, is *fast*.
opcode0x90: Bookmark this: https://github.com/Bulat-Ziganshin/FA
---
angch@server:~$ time gunzip < imagenet_fall11_urls.tgz | gzip -9 > imagenet_fall11_urls.tar.gz
real 1m40.852s
user 1m50.660s
sys 0m1.420s
angch@server:~$ time gunzip < imagenet_fall11_urls.tgz | zstd -9 -T8 > imagenet_fall11_urls.tar.zst
real 0m15.139s
user 1m54.732s
sys 0m2.204s
angch@server:~$ ls -l imagenet_fall11_urls.t*
-rw-rw-r-- 1 angch angch 343455454 Jul 9 09:40 imagenet_fall11_urls.tar.gz
-rw-rw-r-- 1 angch angch 305509480 Jul 9 09:46 imagenet_fall11_urls.tar.zst
-rw-r--r-- 1 angch angch 350302759 Jul 4 10:14 imagenet_fall11_urls.tgz
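For completeness, getting the data back out is just the reverse pipe (assuming GNU tar; --zstd needs a reasonably recent tar, otherwise -I zstd does the same thing):
zstd -d -c imagenet_fall11_urls.tar.zst | tar -xf -
# or, with a recent GNU tar:
tar --zstd -xf imagenet_fall11_urls.tar.zst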
---
generate, zip, and upload to s3: 3m24.464s
generate, tarzst, and upload to s3: 0m46.795s
download from s3, unzip: 0m41.264s
download and unzstdtar directly from s3: 0m14.592s
(uncompressed source is 2.1GB, zip is 401M, tar.zst is 384M) Yeah. zstd rules.
Cheated, I used zstd -9 -T12.
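Not the exact commands used, but the tar.zst path is roughly this shape, streaming straight through zstd into the AWS CLI (bucket and paths are made up):
tar -cf - generated_data/ | zstd -9 -T12 -c | aws s3 cp - s3://some-bucket/data.tar.zst
aws s3 cp s3://some-bucket/data.tar.zst - | zstd -d -c | tar -xf -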
One thing about zstd: multithreaded/partitioned compression gives the same output regardless of thread count, unlike, say, pigz or pbzip2.
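That reproducibility is easy to check yourself: compress the same input with different thread counts and compare checksums (file name is hypothetical):
zstd -9 -T4  -c bigfile.tar > a.zst
zstd -9 -T12 -c bigfile.tar > b.zst
sha256sum a.zst b.zst   # the claim is that these two digests match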
I tried xz -9 -e. It took 20m29.597s. Much smaller, though: 251MB.
---