BoundedReadCompactionStrategy

This compaction strategy aims to take the place of all three existing Cassandra compaction strategies (STCS, LCS, and TWCS).

It works by first consolidating sparse data (sorted runs that span the entire token range but are small) until it reaches a configurable target amount of data (by default 16 * target size = 8 GiB). Then it promotes these sorted runs to be "dense" sorted runs.

Dense runs are considered for compaction via density tiering rather than size tiering. The node's range is broken up into splits of the target work size and then compactions are performed "across" the levels. This stage uses a different tiering factor because you often want control over how young versus old data compacts with itself.
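
To make the two phases concrete, here is a minimal Python sketch of the promotion rule, based only on the description above. SortedRun, consolidate_sparse, and the exact promotion check are illustrative assumptions, not the strategy's actual Java implementation.

from dataclasses import dataclass
from typing import List

TARGET_SSTABLE_SIZE_MB = 512
TARGET_WORK_UNIT_MB = 16 * TARGET_SSTABLE_SIZE_MB  # 8192 MiB = 8 GiB by default

@dataclass
class SortedRun:
    size_in_mb: float
    dense: bool = False

def consolidate_sparse(runs: List[SortedRun]) -> SortedRun:
    """Merge sparse runs; promote the result to "dense" once it reaches
    the target work unit (the promotion check here is a guess)."""
    merged = SortedRun(size_in_mb=sum(r.size_in_mb for r in runs))
    if merged.size_in_mb >= TARGET_WORK_UNIT_MB:
        merged.dense = True
    return merged

# Four 2.5 GiB sparse runs merge into a 10 GiB run, which gets promoted to dense.
print(consolidate_sparse([SortedRun(2560) for _ in range(4)]))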

Phase 1: Flushes

  • Flush products are very small and very sparse. They overlap with everything and are generally a bummer.
  • Output: Attempt to split the flush products into split_range sized sorted runs (this will probably not succeed).

Phase 2: Sparse Run Consolidation

  • Compact across the entire token range.
  • More or less straight up size tiering, with some automatic min_threshold adjustment based on flush rate (see the sketch after this list).
  • Output: Split the sparse products into split_range sized sorted runs.
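
As a rough illustration of this step, the Python sketch below size-tiers whole sorted runs. The 1.5x bucket bound and the helper names are assumptions for illustration, not the strategy's real grouping logic.

MIN_THRESHOLD = 4
MAX_THRESHOLD = 32

def bucket_by_size(sizes_mb, bucket_high=1.5):
    """Group ascending sizes: a run joins the current bucket while it stays
    within bucket_high times the bucket's smallest member."""
    buckets = []
    for size in sorted(sizes_mb):
        if buckets and size <= bucket_high * buckets[-1][0]:
            buckets[-1].append(size)
        else:
            buckets.append([size])
    return buckets

def pick_sparse_compaction(sizes_mb, min_threshold=MIN_THRESHOLD, max_threshold=MAX_THRESHOLD):
    """Return the first bucket with at least min_threshold similarly sized runs."""
    for bucket in bucket_by_size(sizes_mb):
        if len(bucket) >= min_threshold:
            return bucket[:max_threshold]
    return []

# Four ~50-60 MiB flush products tier together; the 900 MiB run is left alone.
print(pick_sparse_compaction([60, 55, 50, 58, 900]))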

Phase 3: Dense Run De-overlapping

  • Compact across pieces of sorted runs split evenly into work_size splits of the node's data.
  • Sorted runs are tiered with a factor of min_threshold_levels (4), subject to the constraint that the overlaps remain under max_read_per_read (10). If we do exceed max_read_per_read, the youngest min_threshold_levels levels compact (see the sketch after this list).
  • Major compaction across dense sorted runs occurs in two cases. Any full compaction is spread out over a target_rewrite_interval_in_seconds window of time (4 hours) and involves target_work_unit_in_mb of data (8 GiB) at a time. This allows us to make progress even with only target_work_unit_in_mb of free disk space.
  • Major Compaction Case #1: Old dense runs are found (by default 10 days old). This ensures that tombstones find their data within 10 days. This can be disabled by setting max_level_age_in_seconds to zero.
  • Major Compaction Case #2: A user has triggered a full compaction. Sparse runs are compacted immediately and dense runs will proceed along the interval (target_rewrite_interval_in_seconds).
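
The sketch below (plain Python, with an assumed density metric and grouping rule, not the real implementation) shows the shape of the dense-phase decision for a single split: fall back to compacting the youngest runs when max_read_per_read is exceeded, otherwise look for min_threshold_levels runs of similar density.

MIN_THRESHOLD_LEVELS = 4
MAX_READ_PER_READ = 10
LEVEL_BUCKET_LOW = 0.75
LEVEL_BUCKET_HIGH = 1.25

def pick_dense_compaction(runs):
    """runs: list of (density_mb, age_seconds) tuples for one work-sized split."""
    if len(runs) > MAX_READ_PER_READ:
        # Overlap bound exceeded: compact the youngest min_threshold_levels runs.
        return sorted(runs, key=lambda r: r[1])[:MIN_THRESHOLD_LEVELS]
    # Otherwise look for min_threshold_levels runs of similar density.
    by_density = sorted(runs, key=lambda r: r[0])
    for i in range(len(by_density) - MIN_THRESHOLD_LEVELS + 1):
        group = by_density[i:i + MIN_THRESHOLD_LEVELS]
        lo, hi = group[0][0], group[-1][0]
        avg = (lo + hi) / 2
        if lo >= LEVEL_BUCKET_LOW * avg and hi <= LEVEL_BUCKET_HIGH * avg:
            return group
    return []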

Note: The defaults are fine in almost all cases. You only need to tune these things if you want to disable some behavior or really want to optimize. Changes in write throughput are handled gracefully, e.g. for bulk loading (once we observe 4 rapid flushes we adjust the min threshold automatically).
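
For example, the flush-rate adaptation could be tracked along these lines. This is a sketch only: the window length and the doubling rule are assumptions; only the "4 rapid flushes" trigger comes from the description above.

import time

RAPID_FLUSHES = 4  # the trigger mentioned above

class FlushRateTracker:
    """Track recent flushes and raise the effective min_threshold during bulk loads."""

    def __init__(self, window_seconds=600, base_min_threshold=4, max_threshold=32):
        self.window_seconds = window_seconds
        self.base_min_threshold = base_min_threshold
        self.max_threshold = max_threshold
        self.flush_times = []

    def record_flush(self, now=None):
        now = time.time() if now is None else now
        # Keep only flushes inside the sliding window.
        self.flush_times = [t for t in self.flush_times if now - t < self.window_seconds]
        self.flush_times.append(now)

    def effective_min_threshold(self):
        # Assumed adjustment: double the threshold while flushes are rapid.
        if len(self.flush_times) >= RAPID_FLUSHES:
            return min(self.max_threshold, self.base_min_threshold * 2)
        return self.base_min_threshold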

min_threshold (4) : Consolidate this number of sparse runs of similar size.
max_threshold (32) : Do not involve more than this number of sorted runs in a consolidation. This is also the maximum overlap allowed in the sparse runs.
target_work_unit_in_mb (16x target) : How much data to consolidate before promoting to the "levels". Also the unit of full compaction.
min_sstable_size_in_mb (50 MiB) : The minimum size sstables can split into during consolidation.
target_sstable_size_in_mb (512 MiB) : Target chunk size for sorted runs.
target_consolidate_interval_in_seconds (1 hour) : How long before consolidating sparse runs regardless of overlap.
target_rewrite_interval_in_seconds (4 hours) : When doing full compactions, how long to spread them out over.
max_read_per_read (10) : How many dense runs to allow in a range slice before compacting.
max_sstable_count (2000) : Max number of sstables to aim for on a node. The target sstable size may be increased automatically to meet this goal (see the worked example after this list).
min_threshold_levels (4) : How many similarly dense sorted runs in a work unit must overlap before compacting.
max_level_age_in_seconds (10 days) : If a sorted run has lived longer than this duration, fully compact across the run (doing roughly target_work_unit_in_mb of work at a time).
level_bucket_low (0.75) / level_bucket_high (1.25) : How similar in density sorted runs have to be to compact together.
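
A small worked example of how these defaults relate (illustrative Python arithmetic, not the strategy's exact adjustment code): target_work_unit_in_mb defaults to 16x the target sstable size, and the target size is assumed to grow if the node would otherwise exceed max_sstable_count.

TARGET_SSTABLE_SIZE_MB = 512
TARGET_WORK_UNIT_MB = 16 * TARGET_SSTABLE_SIZE_MB  # 8192 MiB = 8 GiB
MAX_SSTABLE_COUNT = 2000

def effective_target_size_mb(node_data_mb,
                             target_mb=TARGET_SSTABLE_SIZE_MB,
                             max_count=MAX_SSTABLE_COUNT):
    """Assumed adjustment: grow the target sstable size until the node
    stays under max_sstable_count sstables."""
    return max(target_mb, node_data_mb / max_count)

# A 4 TiB node: 4 * 1024 * 1024 MiB / 2000 ~= 2097 MiB per sstable.
print(effective_target_size_mb(4 * 1024 * 1024))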

# We want a lower read amplification bound and are willing to pay more write
# amplification to achieve it.
max_read_per_read = 6
min_threshold_levels = 2

# Disable density tiering in the levels. Rely on either max_read_per_read compactions or
# tombstone compactions to clean up data.
min_threshold_levels = 0
max_read_per_read = 64
# Note that if we hit the max_read_per_read (which will happen if we have 64 * 16GiB = 1TiB of data)
# we will consolidate min(4, min_threshold_levels) youngest runs. So in this case if we go above 1 TiB
# of data we'll collect 4 young dense runs together which moves to a 4 hour drop window from a 1 hour
# drop window. If you want to drop fully expired sstables, generally speaking max_read_per_read should be
# 2 * (node dataset size / target_work_unit_in_mb)

# Work towards 16 GiB "dense" runs, consolidating every hour
target_work_unit_in_mb = 16384
target_consolidate_interval_in_seconds = 3600 (1 hour)
# Still do density tiering, but require a high tiering factor so that we do very few consolidations of
# the dense data, reducing our write amplification
min_threshold_levels = 32
max_read_per_read = 64
# Optional, if you want to turn off the default 10 day full compactions
# For most long TTL use cases you don't have to do this because 10 days is
# much lower than the TTL.
# max_level_age_in_seconds = 0

# Work towards 16 GiB "dense" runs, consolidating at least once every day
target_work_unit_in_mb = 16384
target_consolidate_interval_in_seconds = 86400 (1 day)
# Raise our read amplification bounds to reduce write amplification.
max_read_per_read = 16
min_threshold_levels = 8

# Work towards larger work units since we're writing heavily
target_work_unit_in_mb = 16384
target_consolidate_interval_in_seconds = 14400 (4 hours)