Keith Turner keith-turner
👍 17.9 % chance that I am coding
@keith-turner
keith-turner / experiments.md
Last active September 22, 2023 01:17
3761 Experiments

Experiments with the changes from #3761

Below is Accumulo shell output

root@uno> createtable foo
root@uno foo> insert 1 f q 1
root@uno foo> insert 2 f q 2
root@uno foo> insert 3 f q 3
root@uno foo> insert 4 f q 4
@keith-turner
keith-turner / AccumuloOfflineScanTest.md
Last active December 30, 2022 20:39
Scan server test programs

Testing Accumulo offline scans

Wrote some test programs to exercise offline scans in Accumulo. These programs are expected to run in separate processes; this prevents table operations from clearing the client-side tablet cache used by scans.

# run the test programs in the background, in separate processes
accumulo WriteRead accumulo-client.properties &> writeread.log &
accumulo ModifyTable accumulo-client.properties &> modifytable.log &
@keith-turner
keith-turner / experiment1.md
Last active December 7, 2022 12:15
Accumulo compaction drop behind experiment

Accumulo compaction drop behind experiment

This is a summary of a test run to see if the drop-behind settings make a noticeable difference for Accumulo compactions. No differences were seen. A test with C code was run and differences were seen. One difference between the C and Accumulo code is that the C code only reads data. Further investigation is needed; it is not clear whether there is a bug in Hadoop/Accumulo or a problem with the test.

Setup

These tests were run using this commit from this branch, which is a modified version of #3083

To generate data for Accumulo to compact, the following accumulo-testing command was run. Tests were conducted on a laptop with 16G of RAM and a single DataNode and tserver set up by Uno.

@keith-turner
keith-turner / Ecoji2-proposal-analysis.md
Last active June 21, 2021 01:09
Ecoji 2 proposal analysis

This document represents an analysis of the Emojis proposed in ecoji#29.

| Column | Description |
| --- | --- |
| Code Point | |
| Emoji | |
| Candidate | True if the emoji exists in emoji-test.txt and is a single code point when fully qualified. |
| v1 ord | The 10-bit code that Ecoji V1 assigns to this emoji. It is -1 when Ecoji V1 does not use the emoji. |
| v2 ord | The 10-bit code that Ecoji V2 assigns to this emoji. It is -1 when Ecoji V2 does not use the emoji. |
@keith-turner
keith-turner / compaction_comp.md
Last active June 1, 2020 18:07
Test of new Accumulo compaction code

Introduction

Accumulo users sometimes filter or transform data via compactions. In current releases of Accumulo, these user-initiated compactions can be disruptive to data currently being written. To improve this situation, PR #1605 was created for the next release of Accumulo. This PR enables dedicating resources to user-initiated compactions. To verify that the PR is effective, tests with heavy ingest and concurrent user compactions were run on two Azure clusters. One cluster had a version of Accumulo containing the changes in #1605; the other cluster had Accumulo 2.0.0. This document describes the tests and their outcomes, and shows that the changes in #1605 were beneficial in this scenario.

Terminology

  • Tablet : Each Accumulo table is divided into tablets. Each tablet has a list of files in DFS where it stores the data in its range.
  • Minor compaction : When data is written to an Accumulo tablet, it is buffered in memory.
@keith-turner
keith-turner / usecases.md
Last active May 14, 2020 03:50
Accumulo Compaction Use Cases

This document is a work in progress and goes with #1605

Different compression algorithms

Users can get better throughput without sacrificing storage space by using snappy for small compactions and gzip for large compactions. This can be achieved by configuring the CompactionConfigurer implementation CompressionConfigurer for a table. Once configured, it is used for all compactions, unless a user-initiated compaction specifies its own CompactionConfigurer.
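As a sketch, the table configuration might look like the following properties (names follow CompressionConfigurer's documented options; the 100M threshold is illustrative, not a recommendation):

```
# default compression used for small compactions
table.file.compress.type=snappy
# switch to gzip when a compaction's input exceeds the threshold
table.compaction.configurer=org.apache.accumulo.core.client.admin.compaction.CompressionConfigurer
table.compaction.configurer.opts.large.compress.threshold=100M
table.compaction.configurer.opts.large.compress.type=gz
```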

Selectively filtering data

For many reasons users may wish to filter data from an Accumulo table. One example use case would be that unwanted data was erroneously written to a table.

@keith-turner
keith-turner / runIT.sh
Created February 18, 2020 17:27
Script to run individual integration test for Accumulo
#!/bin/bash
# Run the single Accumulo integration test matching the given class-name prefix.
# -Dtest is set to a nonsense pattern so no unit tests match; combined with
# -DfailIfNoTests=false, this skips unit tests and runs only the requested IT.
mvn -Dit.test="$1*" -Dtest=94w5up8qtweh -PskipQA -DskipITs=false -DskipTests=false -DfailIfNoTests=false verify
@keith-turner
keith-turner / compaction-algorithm.md
Last active December 11, 2019 23:21
A Proposed Modification to Accumulo's Compaction Algorithm

A Proposed Modification to Accumulo's Compaction Algorithm

By default, [compactions][1] in Accumulo are driven by a configurable compaction ratio using the following algorithm.

  1. If LF * CR < SUM then compact this set of files. LF is the size of the largest file in the set, CR is the compaction ratio, and SUM is the total size of all files in the set.
  2. Remove largest file from set.
  3. If set is empty, then compact no files.
  4. Go to 1.
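The loop above can be sketched in plain Java. This is a simplified illustration of the steps as stated, not Accumulo's actual implementation; `chooseFiles` is a hypothetical helper operating on file sizes only.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class CompactionSketch {
  // Returns the set of file sizes the algorithm would compact (empty if none).
  static List<Long> chooseFiles(List<Long> fileSizes, double compactionRatio) {
    List<Long> set = new ArrayList<>(fileSizes);
    set.sort(Comparator.reverseOrder()); // largest file first
    while (!set.isEmpty()) {
      long largest = set.get(0);
      long sum = set.stream().mapToLong(Long::longValue).sum();
      if (largest * compactionRatio < sum) {
        return set; // step 1: the set satisfies the ratio, compact it
      }
      set.remove(0); // step 2: remove the largest file and re-test
    }
    return set; // step 3: set is empty, compact no files
  }
}
```

For example, with sizes {10, 10, 10, 10} and a ratio of 3, all four files are chosen (10 * 3 < 40), while with {100, 10, 10, 10} no subset satisfies the condition and nothing is compacted; this captures why one oversized file can suppress compaction of the whole set.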
@keith-turner
keith-turner / CBI.java
Created March 13, 2019 19:23
Accumulo client code to create continuous bulk import load. Created to test changes for apache/accumulo#979
package cmd;
import java.net.URI;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;
import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.client.rfile.RFile;
@keith-turner
keith-turner / accumulo-s3-notes.md
Last active August 20, 2019 09:37
Notes from testing Accumulo 2.0.0-alpha-2 with S3.

These are notes from testing Accumulo 2.0.0-alpha-2 on S3. Accumulo was set up following these instructions. Used 10 m5d.2xlarge workers and one m5d.2xlarge master. Used HDFS running on the cluster's ephemeral storage for write-ahead logs and metadata table files. Used a two-tier compaction strategy: snappy for small files (<100M) and gzip for larger files.

Ran continuous ingest for ~24hr. During this time 74 billion key values were ingested. I adjusted compaction settings towards the end of the test and the ingest speed jumped. Opened #930 about this issue; need to describe the issue better.

After stopping ingest there were around 5,120 tablets, each with about 14 files. I tried running some queries at this time, and it seemed like a lookup took 3 to 4 seconds.

I let the cluster compact all the tablets down. It settled around 4 files per tablet and stopped compacting. I started doing