@maropu
Last active October 19, 2020 06:17
maropu commented Mar 5, 2016

BitShuffle, proposed by Kiyo Masui, is a novel technique for improving the compression of typed binary data. It is not a compression algorithm in itself; instead, it rearranges typed binary input into a form that LZ-variant algorithms such as LZ4 and Snappy can compress more effectively. Apache Kudu, a Hadoop storage engine for fast analytics, has already incorporated bit-shuffling into its column encodings (see here for details). Typically, LZ-variants cannot compress a sequence of typed data (e.g., ints and floats) as efficiently as type-specific compression algorithms (e.g., delta coding and fpc) can, because their compression models differ. BitShuffle addresses this issue: the rearranged data it outputs becomes well suited to LZ-variants. To check this, I incorporated bit-shuffling into snappy-java and ran quick benchmarks on a low-skewed 32-bit integer array:

Intel(R) Core(TM) i7-4578U CPU @ 3.00GHz
Compress(Lower Skew):               Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
-------------------------------------------------------------------------------------------
vanilla snappy(0.281)                     212 /  258         75.3          13.3       1.0X
snappy + bitshuffle(0.077)                 62 /   78        258.3           3.9       3.4X
parquet encoder(0.072)                    297 /  375         53.9          18.6       0.7X
Intel(R) Core(TM) i7-4578U CPU @ 3.00GHz
Uncompress(Lower Skew):             Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
-------------------------------------------------------------------------------------------
vanilla snappy                             93 /  105        171.8           5.8       1.0X
snappy + bitshuffle                        55 /   59        290.7           3.4       1.7X
parquet encoder                            77 /   78        207.2           4.8       1.2X

I got exciting results! In terms of compression ratio, snappy + bitshuffle beats vanilla snappy by roughly 4x (0.077 vs. 0.281) and is almost comparable with the parquet encoder, which uses an integer-specific encoding. Moreover, snappy + bitshuffle has the fastest compression and decompression speeds of the three. Detailed benchmark code and other experimental results can be found here.
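The core idea behind the rearrangement can be sketched in a few lines of numpy. This is not Kiyo Masui's optimized SIMD implementation or the snappy-java patch above, just a minimal illustration of the bit-transpose: bit k of every element ends up stored contiguously, so low-skew integers (whose high bits are mostly zero) produce long zero runs that an LZ-variant compresses easily.

```python
import zlib

import numpy as np


def bitshuffle(values: np.ndarray) -> np.ndarray:
    """Transpose the bit matrix of a uint32 array into bit-plane order."""
    # Unpack to individual bits, most-significant bit first (big-endian view).
    bits = np.unpackbits(values.astype('>u4').view(np.uint8))
    # One row per element (32 bits each), then transpose to bit-planes:
    # row k of the result holds bit (31 - k) of every element.
    planes = bits.reshape(-1, 32).T
    return np.packbits(planes.ravel())


# Low-skew data: every value fits in the low 8 bits, so the 24 high
# bit-planes become all-zero runs -- ideal input for an LZ-variant.
data = np.arange(256, dtype=np.uint32)
shuffled = bitshuffle(data)

raw_size = len(zlib.compress(data.tobytes()))
shuf_size = len(zlib.compress(shuffled.tobytes()))
print(f"compressed: raw={raw_size} bytes, shuffled={shuf_size} bytes")
```

Here zlib stands in for Snappy (which has no stdlib binding) to show the effect: the shuffled buffer's first 768 bytes are a single zero run and the remaining bit-planes are short periodic patterns, so it compresses noticeably smaller than the raw little-endian byte stream.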
