@keith-turner
Last active May 14, 2020 03:50
Accumulo Compaction Use Cases

This document is a work in progress and goes with #1605

Different compression algorithms

Users can get better throughput without sacrificing storage space by using snappy for small compactions and gzip for large compactions. This can be achieved by configuring the CompactionConfigurer implementation CompressionConfigurer for a table. Once configured, it is used for all compactions of that table, unless a user-initiated compaction specifies its own CompactionConfigurer.
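The size-based choice described above can be sketched in a few lines of plain Java. This is a hypothetical illustration and not the CompressionConfigurer source: the class name, the method, and the 100 MB cutoff are assumptions made for the example, and it assumes the decision is based on the total size of the input files.

```java
import java.util.List;

/** Hypothetical sketch: choose a codec from the total size of a compaction's input files. */
final class CompressionChoiceSketch {

    // Assumed cutoff, for illustration only; a real deployment would tune this value.
    static final long LARGE_THRESHOLD_BYTES = 100L * 1024 * 1024;

    /** Returns "gz" when the inputs add up to more than the threshold, otherwise "snappy". */
    static String chooseCompression(List<Long> inputFileSizes, long thresholdBytes) {
        long total = 0;
        for (long size : inputFileSizes) {
            total += size;
        }
        return total > thresholdBytes ? "gz" : "snappy";
    }

    public static void main(String[] args) {
        // A small compaction (12 MB of input) and a large one (1200 MB of input).
        System.out.println(chooseCompression(List.of(5L << 20, 7L << 20), LARGE_THRESHOLD_BYTES));
        System.out.println(chooseCompression(List.of(900L << 20, 300L << 20), LARGE_THRESHOLD_BYTES));
    }
}
```

The point of the sketch is that small, frequent compactions get the cheaper codec, while the rare large compactions that hold most of the data get the denser one.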

Selectively filtering data

Users may wish to filter data from an Accumulo table for many reasons; for example, unwanted data may have been erroneously written to the table. Filtering can be accomplished with compaction and/or scan-time iterators. Compacting an entire Accumulo table is usually something to avoid. Instead, users can initiate a one-time compaction with iterators that is selective, using table ranges to limit tablets and/or a CompactionSelector to limit files within a tablet. For example, a compaction selector could use summary information or file creation times to select only the files that meet given criteria. If the user has configured a compaction configurer for the table, it is still used for a user-initiated compaction that does selection (this was not possible with the old compaction strategies).
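As a rough illustration of selecting files by creation time, the sketch below filters a tablet's files down to those created inside a time window, for example the window during which the unwanted data was written. It is plain Java with hypothetical names (FileSelectionSketch, selectCreatedBetween) and does not use the CompactionSelector interface; the file names and timestamps are made up.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: select only the files created inside a time window. */
final class FileSelectionSketch {

    /**
     * creationTimes maps a file name to its creation time in milliseconds.
     * Returns the names of files created in [startMillis, endMillis), sorted by name.
     */
    static List<String> selectCreatedBetween(Map<String, Long> creationTimes,
                                             long startMillis, long endMillis) {
        List<String> selected = new ArrayList<>();
        for (Map.Entry<String, Long> entry : creationTimes.entrySet()) {
            long created = entry.getValue();
            if (created >= startMillis && created < endMillis) {
                selected.add(entry.getKey());
            }
        }
        Collections.sort(selected);
        return selected;
    }

    public static void main(String[] args) {
        // Made-up files; only those created during the bad ingest window are selected.
        Map<String, Long> files = Map.of("F1.rf", 100L, "F2.rf", 250L, "F3.rf", 400L);
        System.out.println(selectCreatedBetween(files, 200L, 500L));
    }
}
```

Files outside the window are left untouched, so the filtering compaction rewrites only the data that could contain the unwanted entries.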

Controlling compaction execution resources

Currently, user-initiated compactions can monopolize all compaction threads, so the number of files per tablet grows without bound as new data is imported, which negatively impacts scans. With the new design, users can configure multiple compaction services to execute compactions. A table can use one compaction service for user-initiated compactions and another for system compactions by configuring a CompactionDispatcher such as SimpleCompactionDispatcher for the table. Because user compactions can no longer monopolize all compaction threads, the files per tablet can be kept low even while a user is doing something like filtering data. This is also aided by the fact that a tablet can now run concurrent compactions over disjoint sets of its files, so if a user compaction is compacting 5 big files and 30 new small files arrive, a system compaction can process the new files.
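The routing idea can be sketched as a lookup from compaction kind to service name. This is a simplified, hypothetical model and not the SimpleCompactionDispatcher implementation: the Kind enum, the option keys "service" and "service.user", and the service names are assumptions chosen for the example.

```java
import java.util.Locale;
import java.util.Map;

/** Hypothetical sketch: route a compaction to a service based on what initiated it. */
final class DispatchSketch {

    enum Kind { SYSTEM, USER }

    /**
     * Looks for a kind-specific option such as "service.user" first, then falls
     * back to the default "service" option. The option keys are assumed names.
     */
    static String dispatch(Kind kind, Map<String, String> opts) {
        String perKind = opts.get("service." + kind.name().toLowerCase(Locale.ROOT));
        if (perKind != null) {
            return perKind;
        }
        String fallback = opts.get("service");
        if (fallback == null) {
            throw new IllegalArgumentException("no default service configured");
        }
        return fallback;
    }

    public static void main(String[] args) {
        // User compactions go to their own pool; everything else uses the default.
        Map<String, String> opts = Map.of("service", "default", "service.user", "userpool");
        System.out.println(dispatch(Kind.USER, opts));
        System.out.println(dispatch(Kind.SYSTEM, opts));
    }
}
```

With separate services, each backed by its own threads, a long-running user compaction occupies only the threads of its own service while system compactions keep running on the other.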

Continuous selection of files for compaction

An earlier use case covered selecting files for a one-time filtering operation. Another use case is continuous monitoring of tablets to force a compaction based on the data in a tablet. One example is monitoring tablets for too many delete markers using summary data and forcing a compaction of all files when this happens. This use case can be satisfied by configuring the TooManyDeletesSelector for a table.
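The decision such a selector makes can be sketched as a ratio test over summary counts. The code below is a hypothetical illustration, not the TooManyDeletesSelector source: it assumes the summary data provides a total entry count and a delete count per tablet, and that the test compares deletes to non-deletes against a threshold. The actual selector may compute the ratio differently.

```java
/** Hypothetical sketch: decide whether a tablet has accumulated too many delete markers. */
final class DeleteRatioSketch {

    /**
     * Returns true when the ratio of deletes to non-deletes reaches the threshold,
     * meaning all of the tablet's files should be selected for compaction.
     */
    static boolean shouldCompactAll(long totalEntries, long deleteEntries, double threshold) {
        long nonDeletes = totalEntries - deleteEntries;
        if (nonDeletes <= 0) {
            // Nothing but deletes (or no data at all): compact only if deletes exist.
            return deleteEntries > 0;
        }
        return (double) deleteEntries / nonDeletes >= threshold;
    }

    public static void main(String[] args) {
        // 10 deletes against 90 live entries, then 30 deletes against 70 live entries.
        System.out.println(shouldCompactAll(100, 10, 0.25));
        System.out.println(shouldCompactAll(100, 30, 0.25));
    }
}
```

Because the check uses only summary counts, it can be evaluated continuously without reading the underlying files.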

Fluo also has a use case for continual selection to periodically clean up the trail of data left by compactions. This is partially done for Fluo: changes were made to generate summary data in #1071. However, nothing has been done yet for the compaction part, as nothing suitable existed in Accumulo.

TODO add shell commands showing how to create a compaction service and make a table use it.
