Notes from testing Accumulo 2.0.0-alpha-2 with S3.

These are notes from testing Accumulo 2.0.0-alpha-2 on S3. Accumulo was set up following these instructions. The test used 10 m5d.2xlarge workers and one m5d.2xlarge master. HDFS, running on the cluster's ephemeral storage, held the write-ahead logs and metadata table files. A two-tier compaction strategy was used: snappy for small files (<100M) and gz for larger files.
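The setup above can be sketched as configuration, roughly as follows. This is a hedged sketch, not the exact config used in the test: `MY_BUCKET` and `namenode:8020` are placeholders, and the preferred-volume-chooser and compaction-strategy property names are from memory of the Accumulo 2.0 S3 instructions, so they should be checked against the 2.0 documentation before use.

```properties
# accumulo.properties (sketch; MY_BUCKET and namenode:8020 are placeholders)
instance.volumes=hdfs://namenode:8020/accumulo,s3a://MY_BUCKET/accumulo

# Send table files to S3 by default, but keep write-ahead logs on local HDFS.
general.volume.chooser=org.apache.accumulo.server.fs.PreferredVolumeChooser
general.custom.volume.preferred.default=s3a://MY_BUCKET/accumulo
general.custom.volume.preferred.logger=hdfs://namenode:8020/accumulo

# Two-tier compaction: snappy by default, gz once an output file exceeds 100M.
table.file.compress.type=snappy
table.majc.compaction.strategy=org.apache.accumulo.tserver.compaction.strategies.TwoTierCompactionStrategy
table.majc.compaction.strategy.opts.file.large.compress.threshold=100M
table.majc.compaction.strategy.opts.file.large.compress.type=gz
```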

Ran continuous ingest for ~24hr. During this time 74 billion key values were ingested. I adjusted compaction settings toward the end of the test and the ingest speed jumped. Opened #930 about this; the issue still needs a better description.

After stopping ingest there were around 5120 tablets, each with about 14 files. I tried running some queries at this point, and a lookup took 3 to 4 seconds.

I let the cluster compact all the tablets down. It settled around 4 files per tablet and stopped compacting. I started doing random lookups, and at first these took around 400ms. After running for a while, all of the file indexes were loaded in cache. With the indexes in cache, lookups took around 125ms. I configured a 24G accumulo-ohc data cache, which seemed to hold around 13% of the data on each node. After this data cache was full, lookups took around 100ms.
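The off-heap data cache would be enabled with properties along these lines. This is an assumption-heavy sketch: `tserver.cache.data.size` is a standard Accumulo property, but the exact cache-manager class name for the accumulo-ohc plugin is not given in these notes, so the placeholder below must be replaced per the accumulo-ohc README.

```properties
# sketch; replace the placeholder with the class name from the accumulo-ohc README
tserver.cache.manager.class=<accumulo-ohc cache manager class>
tserver.cache.data.size=24G
```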

Checked the total size via the AWS web API before clearing the bucket; it reported 3.2TB.

One interesting thing about this test was that only a small HDFS was needed. Each node had 300G of NVMe SSD ephemeral storage, and this was used for HDFS. HDFS only needed enough space to keep 6 or 7 1GB write-ahead logs per tserver. I strongly suspect having the write-ahead logs on SSD made the writes faster.
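A back-of-envelope calculation shows why 300G per node was plenty. The WAL counts and sizes are from the notes above; the HDFS replication factor of 3 is an assumption (it is the HDFS default, but the test's actual setting is not stated here).

```python
# Rough HDFS sizing for write-ahead logs only.
# Assumptions: 10 tservers, ~7 active 1 GB WALs each, HDFS replication 3.
tservers = 10
wals_per_tserver = 7
wal_size_gb = 1
replication = 3  # assumed HDFS default, not stated in the notes

total_gb = tservers * wals_per_tserver * wal_size_gb * replication
per_node_gb = total_gb / tservers
print(total_gb, per_node_gb)  # 210 GB cluster-wide, ~21 GB per node
```

Even with 3x replication, WALs consume only about 21 GB of each node's 300G of ephemeral storage.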

The only drawback of the small amount of local space on each node was that I could not run the continuous ingest verification MapReduce job. The continuous ingest test writes linked lists; the verification MapReduce job looks for holes in the linked lists, which would indicate data loss. Since the files were in S3, I could have cloned and offlined the table, then stood up another cluster to run a MapReduce verification job against the offline clone, reading the files directly from S3. Unfortunately, I did not have time to try that.
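The clone-and-offline step would look roughly like this in the Accumulo shell. This is a sketch: `ci` and `ci_verify` are hypothetical table names, the exact flags should be checked with `help offline`, and the verification job itself comes from the accumulo-testing repo.

```
clonetable ci ci_verify
offline ci_verify -w
```

Because a clone shares the original table's files and offlining freezes it, another cluster could then scan those S3 files without touching the live table.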

While setting this test up to use ephemeral storage for HDFS, I wondered whether EBS could be used for HDFS, and if so, whether it would be safe to keep fewer HDFS replicas when using EBS.
