Sorry, folks, gonna brain dump my thoughts on FB's zstd compression here. Context: http://www.zdnet.com/article/facebook-open-sources-zstandard-data-compression-algorithm-aims-to-replace-technology-behind-zip/
The algo seems very similar to Google's brotli, in that both have a decent dictionary of common terms in place even before they start looking at the input to be compressed.
In the terminology of Mark Nelson's Data Compression Book, they generate and use a better data model of the data to be compressed, even before they start. In FB's zstd, they use "learning" to precompute this data model, whereas in brotli, I think they hardcoded it.
See RFC 7932 on brotli, where the bulk of it is Appendix A, the Static Dictionary Data they used: https://www.ietf.org/rfc/rfc7932.txt
(ref: Mark Nelson: https://www.amazon.com/Data-Compression-Book-Mark-Nelson/dp/1558514341 )
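Aside: zstd exposes that dictionary "learning" directly on the command line, so you can precompute a data model from your own corpus of small, similar files. A minimal sketch (the sample paths here are made up):
# train a dictionary from a pile of small, similar files
zstd --train samples/*.json -o samples.dict
# compress/decompress individual files against that dictionary
zstd -D samples.dict one_record.json -o one_record.json.zst
zstd -D samples.dict -d one_record.json.zst -o one_record.roundtrip.json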
As for the actual algo, both zstd and brotli use a variant of LZ77 plus an entropy coder. Not sure if zstd uses LZ77 or LZ78. For the entropy coding, though, zstd uses FSE (FiniteStateEntropy), unlike brotli, which uses the venerable Huffman coding.
Fascinating, because not *that* much has changed since the 80s when it comes to general lossless compression. Just slightly better algos for different stages, and much, much better modelling.
"""By design, Huffman can't break the "1 bit per symbol" limit, hence loses efficiency on squeezed distributions, such as Proba80. FSE is free of such limit, and its compression efficiency remains close to Shannon limit in all circumstances. However, this accuracy is not always necessary, and less compressible distributions show little difference with Huffman. On its side, Huff0 delivers in the form of a massive speed advantage."""
https://github.com/Cyan4973/FiniteStateEntropy
In terms of compression ratio, this means FSE is essentially never worse than Huffman; Huffman's remaining advantage is raw speed.
Before this, the way to get around that 1-bit-per-symbol limitation was arithmetic coding (slow! patent encumbered).
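To make the 1-bit-per-symbol point concrete, a back-of-the-envelope for a skewed two-symbol source in the spirit of the Proba80 case above (P = 0.8 / 0.2):
entropy: H = -(0.8*log2(0.8) + 0.2*log2(0.2)) ≈ 0.72 bits/symbol
Huffman: forced to spend a whole bit per symbol, so ~1.0 bits/symbol (~38% over entropy)
FSE / arithmetic coding: can get arbitrarily close to the 0.72-bit entropy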
Fun fact: bzip2 doesn't compress as well as bzip1, because bzip1 used arithmetic coding and bzip2 reverted to Huffman coding.
The catch: bzip1 is *slow*. bzip2 is orders of magnitude faster than bzip1.
Just had to get that out of my system, as a data compression geek.
Just FYI, one of the use cases of all this compression stuff is to optimize processing and distribution of geodata... planet.osm (if you're macho enough) is *huge*.
Playing with the Squash benchmark, brotli level 10 (Google's) almost always comes out ahead for total download+decompress time, whereas zstd (Facebook's) comes out ahead for compress+upload.
Makes sense.
In other news, "B-Frame" for video compression is literally middle out. https://en.wikipedia.org/wiki/Video_compression_picture_types
Archiving is like backups: it's useless if you can't easily restore it. Stick to lzma/xz.
Brotli and zstd don't make sense for a file manager/archiver like 7zip.
For dynamically generated data, e.g. IPC, you want fast compress and fast decompress, which is zstd's home turf. Brotli decompresses fast but compresses more slowly (though still on par with gzip).
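For that kind of on-the-fly use, the rough shape is just streaming through zstd at a fast compression level. A sketch (producer/consumer names and the ssh hop are hypothetical):
some_producer | zstd --fast=3 -c | ssh otherhost 'zstd -d -c | some_consumer'
# or keep it local and just cap the level low:
some_producer | zstd -1 -c > /tmp/stream.zst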
Related, for voice: https://auphonic.com/blog/2018/06/01/codec2-podcast-on-floppy-disk/ Blame Google DeepMind's WaveNet speech synthesis/recognition work.
Compress once, decompress many, basically. Like web fonts. And map tiles.
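The "compress once" workflow is basically: pre-compress the static assets offline at maximum effort, then serve the pre-compressed files as-is. A sketch with made-up paths:
# one-time, offline, at max effort; both tools keep the original file by default
for f in tiles/*.pbf static/*.js; do
  brotli -q 11 "$f"   # writes $f.br
  zstd -19 -q "$f"    # writes $f.zst (-q just means quiet here)
done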
opcode0x90: I was going to joke that somebody should accelerate zpaq with a GPU, but apparently it is a thing (not zpaq specifically): https://github.com/IlyaGrebnov/libbsc
Blah. H.264 is still sucky on GPUs. Too much weird stuff. But NVENC, when it works, is *fast*.
opcode0x90: Bookmark this: https://github.com/Bulat-Ziganshin/FA
---
angch@server:~$ time gunzip < imagenet_fall11_urls.tgz | gzip -9 > imagenet_fall11_urls.tar.gz
real 1m40.852s
user 1m50.660s
sys 0m1.420s
angch@server:~$ time gunzip < imagenet_fall11_urls.tgz | zstd -9 -T8 > imagenet_fall11_urls.tar.zst
real 0m15.139s
user 1m54.732s
sys 0m2.204s
angch@server:~$ ls -l imagenet_fall11_urls.t*
-rw-rw-r-- 1 angch angch 343455454 Jul 9 09:40 imagenet_fall11_urls.tar.gz
-rw-rw-r-- 1 angch angch 305509480 Jul 9 09:46 imagenet_fall11_urls.tar.zst
-rw-r--r-- 1 angch angch 350302759 Jul 4 10:14 imagenet_fall11_urls.tgz
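For completeness, getting the data back out is just the reverse pipe (assuming GNU tar; --zstd needs a reasonably recent tar, otherwise -I zstd does the same thing):
zstd -d -c imagenet_fall11_urls.tar.zst | tar -xf -
# or, with a recent GNU tar:
tar --zstd -xf imagenet_fall11_urls.tar.zst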
---
generate, zip, and upload to s3: 3m24.464s
generate, tarzst, and upload to s3: 0m46.795s
download from s3, unzip: 0m41.264s
download and unzstdtar directly from s3: 0m14.592s
(uncompressed source is 2.1GB, zip is 401M, tar.zst is 384M) Yeah. zstd rules.
Cheated, I used zstd -9 -T12.
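Not the exact commands used, but the tar.zst path is roughly this shape, streaming straight through zstd into the AWS CLI (bucket and paths are made up):
tar -cf - generated_data/ | zstd -9 -T12 -c | aws s3 cp - s3://some-bucket/data.tar.zst
aws s3 cp s3://some-bucket/data.tar.zst - | zstd -d -c | tar -xf -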
One thing about zstd: multithreaded/partitioned compression gives the same output regardless of thread count, unlike, say, pigz or pbzip2.
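That reproducibility is easy to check yourself: compress the same input with different thread counts and compare checksums (file name is hypothetical):
zstd -9 -T4  -c bigfile.tar > a.zst
zstd -9 -T12 -c bigfile.tar > b.zst
sha256sum a.zst b.zst   # the claim is that these two digests match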
I tried xz -9 -e. It took 20m29.597s. Much smaller, though: 251MB.
---