@maropu
Last active October 19, 2020 06:17
maropu commented Mar 5, 2016

BitShuffle, proposed by Kiyo Masui, is a novel technique for improving the compression of typed binary data. It is not a compression algorithm in itself; instead, it rearranges typed binary input into a form that LZ-variant algorithms such as LZ4 and Snappy can compress more effectively. Apache Kudu, a Hadoop storage engine for fast analytics, has already incorporated bit-shuffling into its column encodings (see here for details). Typically, LZ-variants cannot compress a sequence of typed data (e.g., ints and floats) as efficiently as type-specific compression algorithms (e.g., delta coding and fpc) can, because their compression models differ. BitShuffle addresses this issue: the rearranged data it outputs becomes well suited to LZ-variants. To check this, I incorporated bit-shuffling into snappy-java and ran quick benchmarks on a low-skewed 32-bit integer array:

Intel(R) Core(TM) i7-4578U CPU @ 3.00GHz
Compress(Lower Skew):               Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
-------------------------------------------------------------------------------------------
vanilla snappy(0.281)                     212 /  258         75.3          13.3       1.0X
snappy + bitshuffle(0.077)                 62 /   78        258.3           3.9       3.4X
parquet encoder(0.072)                    297 /  375         53.9          18.6       0.7X
Intel(R) Core(TM) i7-4578U CPU @ 3.00GHz
Uncompress(Lower Skew):             Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
-------------------------------------------------------------------------------------------
vanilla snappy                             93 /  105        171.8           5.8       1.0X
snappy + bitshuffle                        55 /   59        290.7           3.4       1.7X
parquet encoder                            77 /   78        207.2           4.8       1.2X

I got exciting results! In terms of compression ratio, snappy + bitshuffle beats vanilla snappy by roughly 4x (0.077 vs. 0.281) and is almost comparable with the parquet encoder, which uses an integer-specific encoding. Moreover, snappy + bitshuffle has the fastest compression and decompression speeds of the three. Detailed benchmark code and other experimental results can be found here.
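The core idea behind the rearrangement can be sketched in a few lines of numpy. This is not Kiyo Masui's optimized SIMD implementation or the snappy-java patch above, just a minimal illustration of the bit-transpose: bit k of every element ends up stored contiguously, so low-skew integers (whose high bits are mostly zero) produce long zero runs that an LZ-variant compresses easily.

```python
import zlib

import numpy as np


def bitshuffle(values: np.ndarray) -> np.ndarray:
    """Transpose the bit matrix of a uint32 array into bit-plane order."""
    # Unpack to individual bits, most-significant bit first (big-endian view).
    bits = np.unpackbits(values.astype('>u4').view(np.uint8))
    # One row per element (32 bits each), then transpose to bit-planes:
    # row k of the result holds bit (31 - k) of every element.
    planes = bits.reshape(-1, 32).T
    return np.packbits(planes.ravel())


# Low-skew data: every value fits in the low 8 bits, so the 24 high
# bit-planes become all-zero runs -- ideal input for an LZ-variant.
data = np.arange(256, dtype=np.uint32)
shuffled = bitshuffle(data)

raw_size = len(zlib.compress(data.tobytes()))
shuf_size = len(zlib.compress(shuffled.tobytes()))
print(f"compressed: raw={raw_size} bytes, shuffled={shuf_size} bytes")
```

Here zlib stands in for Snappy (which has no stdlib binding) to show the effect: the shuffled buffer's first 768 bytes are a single zero run and the remaining bit-planes are short periodic patterns, so it compresses noticeably smaller than the raw little-endian byte stream.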
