Test of new Accumulo compaction code

Introduction

Accumulo users sometimes filter or transform data via compactions. In current releases of Accumulo, these user initiated compactions can be disruptive to data currently being written. To improve this situation, PR #1605 was created for the next release of Accumulo. This PR enables dedicating resources to user initiated compactions. To verify whether the PR is effective, tests with heavy ingest and concurrent user compactions were run on two Azure clusters. One cluster had a version of Accumulo containing the changes in #1605. The other cluster had Accumulo 2.0.0. This document describes the tests and their outcomes and shows that the changes in #1605 were beneficial in this scenario.

Terminology

  • Tablet : Each Accumulo table is divided into tablets. Each tablet has a list of files in DFS where it stores the data in its range.
  • Minor compaction : When data is written to an Accumulo tablet it is buffered in memory. Periodically this in-memory data is dumped to a file and added to the tablet's list of files. This is called a minor compaction.
  • System compaction : While Accumulo is being written to, minor compactions are constantly adding new files to tablets. When a tablet is queried, all of its files must be read. To make queries faster, a tablet's files are periodically combined into fewer files by system compactions.
  • User compaction : Users may need to periodically filter or transform data in an Accumulo table. One way to do this is to launch a user compaction with custom user code that filters or transforms the data. These user compactions read all of each tablet's files, apply the user's custom code, and write out a single file.
  • Compaction ratio : System compactions have to decide what subset of a tablet's files to compact. This selection is done using the compaction ratio, a per-table configuration. Roughly, a set of files qualifies for compaction when the sum of their sizes is at least the ratio times the size of the largest file in the set, so increasing the compaction ratio results in less compaction work and more files per tablet. The amount of system compaction work done is logarithmic w.r.t. the compaction ratio. A small sketch of the selection rule follows this list.
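To make the ratio concrete, below is a minimal sketch of the selection rule it drives. This is plain bash with made-up file sizes, not Accumulo code, and only illustrates the "sum of sizes >= ratio * largest file" criterion described above.

#!/bin/bash
# Illustrative sketch only: with ratio 2 and some made-up file sizes in MB,
# a set of files is compacted when the sum of the sizes is at least
# ratio * size of the largest file in the set.
ratio=2
sizes="20 22 25 19"
sum=0
largest=0
for s in $sizes; do
  sum=$((sum + s))
  if [ "$s" -gt "$largest" ]; then
    largest=$s
  fi
done
if [ "$sum" -ge $((ratio * largest)) ]; then
  echo "compact: sum ${sum}M >= ${ratio} * largest ${largest}M"
else
  echo "do not compact: sum ${sum}M < ${ratio} * largest ${largest}M"
fi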

Test Description

The tests were carried out on two 11 node Azure clusters set up using Muchos. One cluster was set up using a version of Accumulo built from #1605. The other cluster was set up using Accumulo 2.0.0. The following was done on both clusters.

  • Configured use of snappy for system compactions less than 100M and gz otherwise.
  • Created 160 initial splits for continuous ingest table, resulting in 16 tablets per tablet server initially.
  • Ran 10 continuous ingest clients, each writing 6 billion key values (a rough sketch of launching these clients follows this list).
  • Continually forced a full table user compaction of the continuous ingest table, sleeping 3 hours between each compaction.
  • Configured the compaction ratio to 2.
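For reference, here is a rough sketch of how the 10 ingest clients might have been launched, assuming the cingest script from the apache/accumulo-testing project is used. The 6 billion entry limit per client is set in accumulo-testing's properties and is not shown here.

#!/bin/bash
# Hypothetical sketch: start 10 continuous ingest clients in the background.
for i in {1..10}; do
  nohup ./bin/cingest ingest > ci-client-$i.log 2>&1 &
done
wait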

The following was only done on the cluster running the #1605 code.

  • For user compactions, configured a compaction service with 2 threads.
  • For system compactions, configured a compaction service with 2 threads for compactions <32MB and 2 threads for everything else.

The following was only done on the cluster running Accumulo 2.0.0.

  • Configured the number of compaction threads to 6. This was done because the #1605 cluster had a total of 6 threads.
  • Configured system compactions to not include files over 250M (a sketch of the corresponding properties follows this list). This was not desired because the #1605 cluster did not do this, but it was not noticed that this was configured until the test was well under way.
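For reference, the two settings above roughly correspond to properties like the following. The exact strategy option names are an assumption on my part rather than values copied from the test cluster, so treat this as a sketch of the configuration, not the exact configuration used.

# Assumed property names for the 2.0.0 cluster, shown for illustration only.
tserver.compaction.major.concurrent.max=6
table.majc.compaction.strategy=org.apache.accumulo.tserver.compaction.strategies.BasicCompactionStrategy
table.majc.compaction.strategy.opts.filter.size=250M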

Configuration for #1605 cluster

The following configuration options were set for the #1605 cluster at the system level to configure the compaction services described above. This created a compaction service named custom with two threads that could compact up to 30 files in a single compaction. It also reconfigured the default compaction service to have 2 threads for compactions less than 32M and 2 threads for all other compactions. The default compaction service was reconfigured to compact at most 10 files in a single compaction. The reason it was set to 10 is that this helps when system compactions are falling behind: it forces the service to focus on the 10 smallest files of each tablet in a breadth first manner. Tablet servers were restarted after these properties were set.

tserver.compaction.service.custom.planner=org.apache.accumulo.core.spi.compaction.DefaultCompactionPlanner
tserver.compaction.service.custom.planner.opts.executors=[{"name":"all","numThreads":2}]
tserver.compaction.service.custom.planner.opts.maxOpen=30
tserver.compaction.service.default.planner=org.apache.accumulo.core.spi.compaction.DefaultCompactionPlanner
tserver.compaction.service.default.planner.opts.executors=[{"name":"small","maxSize":"32M","numThreads":2},{"name":"large","numThreads":2}]
tserver.compaction.service.default.planner.opts.maxOpen=10
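After restarting the tablet servers, the new service configuration can be spot checked from the Accumulo shell, for example:

accumulo shell -u root -p secret -e 'config -f tserver.compaction.service'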

The following table properties were set for the continuous ingest test table. These settings cause user compactions to go to the custom compaction service while all other compactions go to the default compaction service.

table.compaction.configurer=org.apache.accumulo.core.client.admin.compaction.CompressionConfigurer
table.compaction.configurer.opts.large.compress.threshold=100M
table.compaction.configurer.opts.large.compress.type=gz
table.file.compress.type=snappy
table.compaction.major.ratio=2
table.compaction.dispatcher.opts.service.user=custom
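For reference, these can be set from the Accumulo shell, assuming the continuous ingest table is named ci as in the compaction script at the end of this document, for example:

accumulo shell -u root -p secret -e 'config -t ci -s table.compaction.dispatcher.opts.service.user=custom'
accumulo shell -u root -p secret -e 'config -t ci -s table.compaction.major.ratio=2'
# ... repeat for the remaining properties listed above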

Cluster Description

For each test, 11 Standard_D8s_v3 nodes in a VM scale set were used. Each node had 3 256G Standard_LRS managed disks. Ten of the nodes were worker nodes running data node and tablet server processes. One node was a leader node running ZooKeeper, Namenode, Accumulo Master, InfluxDB, Grafana, and Resource Manager.

Results for #1605 cluster

The following are metrics from the test. It took around 27 hours to ingest the 60 billion key values.

Metrics from test run

The test went well: as user compactions ran, the ingest rate and files per tablet remained reasonable. The following are some insights from the test.

  • The spikes in queued compactions above show when user compactions were forced.
  • The ingest rate drops as the number of tablets goes up and when user compactions run.
  • In the beginning the number of files was higher because the initial files produced by minor compactions were larger than 32M. This left the 2 threads configured for system compactions less than 32M idle, leaving only 2 threads for system compactions. As the table split, minor compactions produced smaller files and these threads started being used.
  • When the user compactions started, all the existing files for all tablets were reserved. This resulted in those files not being available for system compactions. However, new files coming in were still compacted by the system compaction service. A follow-on issue needs to be opened to consider delayed selection and explain the nuances.

The following is an example of a tablet's files early on, when there were not many tablets. Each line shows the file path followed by its size in bytes and number of entries. Notice the flush files were ~20M, so compacting two of them would be a ~40M system compaction, which is greater than 32M.

3< file:hdfs://ctme3-0:8020/accumulo/tables/3/default_tablet/A00002m4.rf []    115579980,2857759
3< file:hdfs://ctme3-0:8020/accumulo/tables/3/default_tablet/F00002mb.rf []    24290599,393661
3< file:hdfs://ctme3-0:8020/accumulo/tables/3/default_tablet/F00002mu.rf []    22150678,355830
3< file:hdfs://ctme3-0:8020/accumulo/tables/3/default_tablet/F00002nc.rf []    18506358,294393
3< file:hdfs://ctme3-0:8020/accumulo/tables/3/default_tablet/F00002nu.rf []    19399829,306717
3< file:hdfs://ctme3-0:8020/accumulo/tables/3/default_tablet/F00002oc.rf []    25274863,400557
3< file:hdfs://ctme3-0:8020/accumulo/tables/3/default_tablet/F00002ov.rf []    21563605,355772
3< file:hdfs://ctme3-0:8020/accumulo/tables/3/default_tablet/F00002pe.rf []    26838046,424897

Below is an animated gif of a single tablet's files over the duration of this test. While the test was running, the metadata table was dumped to a file every 30 seconds. The dump file name included a timestamp, and this file name is shown in the animated gif. This can help one correlate the animated gif with the plots. There are a few things to observe.

  • In the beginning, the minor compactions (which create files with an F prefix) produce large files that cannot be processed by the 32M thread pool. This leads to more files per tablet in the beginning.
  • In the middle, two user compactions ran for a while because the table has a lot of data. These user compactions reserved files on each tablet, and those files remained reserved until the user compaction ran on that tablet. In the animated gif this can be seen as a set of stable files for a while as system compactions run on new files (files arriving after the user compaction reserved files).
  • Towards the end, ingest has finished and the files are stable. The last user compaction was allowed to complete before tearing down the cluster.

tablet-files

Below are plots of more statistics for files per tablet than just the average collected by the metrics system. They show the standard deviation from the mean is low. They also show that the max is low between forced compactions, except for one spike that needs investigation (I suspect it corresponds with lots of tablet splitting).

stats

At the end of the test 2.4 TB of data was in DFS. Over the course of the test, compactions wrote 90.7 TB of data.

azure-portal

Results for Accumulo 2.0.0 cluster

The Accumulo 2.0.0 cluster did not do well on this test. When compactions ran, the ingest rate dropped very low (more on this later). Ingest hung after 41B entries were written, and the test did not complete. Below are the metrics from the test run.

Metrics from the 2.0.0 test run

The reason ingest hung is that tablet servers could not write to HDFS. The Namenode web page was consulted to see what the problem was, and it turned out HDFS was full. Below is a screenshot from the namenode.

Namenode web page showing HDFS out of space

There should have been enough space if the Accumulo GC was running, so it was checked. The Accumulo GC periodically compacts the metadata table. These metadata compactions hung because of the user compactions running on all tservers. Below is the GC log showing it stuck. The Accumulo GC got stuck waiting on user compactions, and user compactions got stuck because there was no free space: a case of distributed deadlock. The changes made in #1605 set up dedicated compaction services for the metadata table, which would avoid this problem. A potential way to avoid this problem in 1.10 is via #1352 and setting flush instead of compact.

Accumulo GC log showing the metadata table compaction stuck

Before things crashed and burned, ingest ran slowly during user compactions. Why? Not quite sure, but one possibility is merging minor compactions. The user initiated compactions were monopolizing all of the compaction threads, which left no threads to compact new incoming files. When a tablet has too many files, data flushed from memory is merged with the tablet's smallest file; this is called a merging minor compaction. These merging minor compactions are slower, which prevents freeing memory, which causes writes to block. Below is an example of a tablet that experienced three merging minor compactions while it was waiting on a compaction thread. Files created by a merging minor compaction have an M prefix. This needs more investigation.

$ zcat dumps/dump.20200529120528.gz | grep '2;783342'
2;783342 file:hdfs://ctse2-0:8020/accumulo/tables/2/t-00011kl/A0004f4h.rf []    399467520,6186621
2;783342 file:hdfs://ctse2-0:8020/accumulo/tables/2/t-00011kl/C0007fyg.rf []    88578501,2146107
2;783342 file:hdfs://ctse2-0:8020/accumulo/tables/2/t-00011kl/C0007vrc.rf []    21274180,331188
2;783342 file:hdfs://ctse2-0:8020/accumulo/tables/2/t-00011kl/C00084ff.rf []    10651834,166333
2;783342 file:hdfs://ctse2-0:8020/accumulo/tables/2/t-00011kl/C00087q4.rf []    5250892,82548
2;783342 file:hdfs://ctse2-0:8020/accumulo/tables/2/t-00011kl/F0008a5c.rf []    1765929,28074
2;783342 file:hdfs://ctse2-0:8020/accumulo/tables/2/t-00011kl/F0008eel.rf []    1547682,24294
2;783342 file:hdfs://ctse2-0:8020/accumulo/tables/2/t-00011kl/F0008eya.rf []    1584400,25060
2;783342 file:hdfs://ctse2-0:8020/accumulo/tables/2/t-00011kl/F0008fxj.rf []    1676563,26242
2;783342 file:hdfs://ctse2-0:8020/accumulo/tables/2/t-00011kl/F0008h5w.rf []    1384477,22070
2;783342 file:hdfs://ctse2-0:8020/accumulo/tables/2/t-00011kl/F0008ipx.rf []    1340951,21203
2;783342 file:hdfs://ctse2-0:8020/accumulo/tables/2/t-00011kl/F0008k3h.rf []    1580577,24865
2;783342 file:hdfs://ctse2-0:8020/accumulo/tables/2/t-00011kl/M0008n0a.rf []    2481562,38493
2;783342 file:hdfs://ctse2-0:8020/accumulo/tables/2/t-00011kl/M0008obh.rf []    1343139,21603
2;783342 file:hdfs://ctse2-0:8020/accumulo/tables/2/t-00011kl/M0008obn.rf []    1319197,20760
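One quick way to summarize a listing like the one above is to count the files by their prefix letter (A, C, F, and M); for example:

$ zcat dumps/dump.20200529120528.gz | grep '2;783342' | sed 's|.*/\(.\)[^/ ]*\.rf.*|\1|' | sort | uniq -c

The M lines in the output correspond to the merging minor compactions described above.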

Below are stats for the number of files per tablet over the course of the test. In 2.0.0, when a tablet reaches 15 files, merging minor compactions start to occur. The stats below for average files per tablet are pretty good. Before running this test I expected the average to be much higher during user compactions; however, I think the low ingest rate prevented this from happening.

stats

The extreme slowdowns during user compactions were not seen in the cluster running #1605. However, that cluster only had 2 threads dedicated to user compactions. The cluster in this test had 6 compaction threads, and when a user compaction ran it used all of them. So the much higher CPU use is another possible cause of the extreme drop in ingest rates. However, the #1605 cluster was using between 2 and 6 cores while user compactions were running, and it never stayed at 6. The metrics screenshots show running compactions, but they are overshadowed by the queued compactions in the same plot. A plot of only running compactions from the metrics system would have been useful.

Conclusion

The changes in #1605 will enable users to manage how disruptive user initiated compactions are to ingest and query. Getting the follow-on work for metrics done will be critical to helping users successfully manage and tune multiple compaction executors. This insight will help them adjust configuration based on their usage patterns and the metrics coming from Accumulo. In addition to metrics, enabling runtime reconfiguration as described in #1609 will be important so users do not need to restart tablet servers.

Further cluster test against the #1605 changes would be useful. It would be nice to run a simple multi-day test of bulk and/or live ingest w/o user initiated compactions.

Another test that may not fare well for 2.0.0 would be user compactions against one table and ingest against another. This test ran ingest and user compactions against the same table, so the user compactions helped somewhat to lower the number of files per tablet from ingest. If the actions were against separate tables, then the user compactions would completely starve the table doing ingest from running system compactions on its new files, probably leading to more merging minor compactions.

Two factors caused problems for the 2.0.0 cluster in these tests. One is that there is a single thread pool. The second is the way queued work for the thread pool is prioritized. The queue is first prioritized by compaction type, with user compactions taking priority over system compactions. Second, it is prioritized by the total number of files a tablet has, with tablets that have more files coming first. In 2.0.0 and earlier this prioritization scheme is not configurable. The changes in #1605 make prioritization configurable by the new user pluggable compaction planner. The default compaction planner in #1605 prioritizes the same way as 2.0.0, but per thread pool.

#!/bin/bash
# This is the script I ran on the cluster to dump the metadata table.
while true
do
  accumulo shell -u root -p secret -e 'scan -t accumulo.metadata -b 3; -e 3< -c file -np' 2>/dev/null | grep file: | gzip > dumps/dump.$(date +%Y%m%d%H%M%S).gz
  sleep 30
done
#!/bin/bash
# This is the script I ran on the clusters to force compactions, sleeping 3 hours between them.
while true
do
  sleep 10800
  echo "Starting compaction $(date)"
  accumulo shell -u root -p secret -e 'compact -t ci -w'
  echo "Finished compaction $(date)"
done
#!/bin/bash
# This is the little script I used to create the animation of a tablet's files from the metadata table dumps.
counter=0
for f in dumps/dump.[0-9]*; do
  (echo $f; echo; zcat $f | grep '3<') | convert -size 850x330 xc:black -font /usr/share/fonts/truetype/noto/NotoMono-Regular.ttf -pointsize 14 -fill white -annotate +25+25 "@-" frames/$(printf '%05d.png' $counter) &
  let counter=counter+1
  if [ $(($counter%20)) == 0 ]; then
    wait
  fi
done
wait
for i in {0..60}; do
  echo "That's all folks." | convert -size 850x330 xc:black -font /usr/share/fonts/truetype/noto/NotoMono-Regular.ttf -pointsize 24 -fill white -annotate +25+25 "@-" frames/$(printf '%05d.png' $counter)
  let counter=counter+1
done
cd frames
ffmpeg -framerate 30 -i '%05d.png' tablet-files.gif
#!/bin/bash
# This is the little script that I used to create the per tablet stats data to plot from the
# metadata dump files. datamash is awesome.
for f in dumps/dump.[0-9]*; do
  echo -e -n "$f\t"
  zcat $f | datamash -g 1 -W count 2 | datamash -W mean 2 sstdev 2 min 2 max 2
done
# I piped the output of this script to a stats.tsv file and used gnuplot with the following commands to create plots.
#
# set terminal png size 1024,640
# set output 'stats.png'
# plot 'stats.tsv' using 2 with lines title "mean", '' using 3 with lines title "stdev", '' using 4 with lines title "min", '' using 5 with lines title "max"