Benchmark recovery from remote
We will use the existing workflow in: https://github.com/dliappis/stack-terraform/tree/ccr/projects/ccr
- ES: 3 node clusters, security enabled
- GCP: ES: custom-16-32768 16cpu / 32GB ram / min Skylake processor, Loaddriver: n1-standard-16 (16vcpu 60GB ram)
- AWS: ES: c5d.4xlarge 16vcpu / 32GB ram, Loaddriver: m5d.4xlarge (16vcpu 64GB ram)
- Locations:
- AWS: Leader: eu-central-1 (Frankfurt) Follower: us-east-2 (Ohio)
- GCP: Leader: eu-west-2 (Netherlands) Follower: us-central-1 (Iowa)
- Rally tracks: geopoint / http_logs / pmc
- Index settings (unless specifically changed): 3P/1R
- OS: Ubuntu 16.04, kernel 4.4.0-1061-aws
- Java version (unless specifically changed): OpenJDK 8 1.8.0_191
- metricbeat from each node (metricbeat enabled everywhere)
- recovery-stats (from _recovery) every 1s
- for some experiments: node-stats only for jvm/mem to provide heap usage and GC insights
- median indexing throughput and the usual results Rally provides in the summary
Criteria to compare runs:
- time taken for follower to complete recovery
- indexing throughput
- overall system usage (CPU, IO, Network)
- remote tcp compression off vs on
- Telemetry device collection frequency
- "recovery-stats-sample-interval": 1
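With recovery-stats sampled every second, the time taken for the follower to complete recovery can be read off the samples as the first one in which every shard reports stage DONE. A minimal Python sketch, assuming each sample is a timestamp in seconds paired with the parsed JSON body of a _recovery response; the helper names and the sample layout are hypothetical and not part of Rally:

```python
from typing import Dict, Iterable, Optional, Tuple


def all_shards_done(recovery: Dict) -> bool:
    """Return True if every shard in a _recovery response has stage DONE."""
    shards = [
        shard
        for index in recovery.values()
        for shard in index.get("shards", [])
    ]
    # An empty response means recovery has not started yet.
    return bool(shards) and all(s.get("stage") == "DONE" for s in shards)


def recovery_duration(samples: Iterable[Tuple[float, Dict]]) -> Optional[float]:
    """Seconds from the first sample to the first fully recovered sample."""
    start = None
    for timestamp, recovery in samples:
        if start is None:
            start = timestamp
        if all_shards_done(recovery):
            return timestamp - start
    return None  # recovery never completed within the samples
```

The resolution of the result is bounded by the sample interval, so with a 1s interval the measured duration can be off by up to about a second.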
Tracks: geopoint, http_logs, pmc
Challenge: new challenge (to be created) executing in the following order:
1. Delete and create indices on leader
2. Index entire corpora with 8 indexing clients at max performance (no target-throughput set), then stop (don't index any more)
3. Join follower, start recovery
4. Consider benchmark over when follower has recovered completely
Experiment 3: AWS, fully index corpora, initiate recovery, keep indexing at lower performance during recovery, remote tcp compression off
Tracks: geopoint, http_logs, pmc
Challenge: new challenge (to be created) executing in the following order:
1. Delete and create indices on leader
2. Index some % of corpora with 8 indexing clients at max performance (no target-throughput set)
3. Join follower, start recovery and keep indexing on leader at a lower throughput
4. Consider benchmark over when follower has recovered completely
Experiment 5 (Adhoc)
Purpose
Compare remote compression off/on on small nodes with limited CPU power
Configuration
Benchmark using 1 node clusters, 1 shard, using c5d.xlarge instances.
ES node specs:
All experiments included: 38416aa
Iteration 1, http_logs, remote_compress: off
recovery took: 0:08:30.843000
Iteration 2, http_logs, remote_compress: on
recovery took: 0:11:09.063000
Analysis between iteration 1 / iteration 2 (remote compress off vs on)
CPU user usage: there is a peak phase while all three indices are being recovered, after which usage stabilizes while only the remaining large index (leader3) is being recovered.
Network usage:
Chunk settings (1MB / 5 max concurrent file chunks) aren't large enough to saturate the network and compression ends up being slower.
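One rough way to see why these chunk settings may not saturate the network is a back-of-the-envelope bound: if each in-flight chunk needs one round trip before the next can be requested, throughput cannot exceed the bytes in flight divided by the round-trip time. A sketch in Python, where the 100 ms round-trip time is a hypothetical value chosen for illustration and not a measurement from these runs:

```python
def max_chunk_throughput(chunk_bytes: int, concurrent_chunks: int, rtt_seconds: float) -> float:
    """Upper bound in bytes/s when each in-flight chunk needs one round trip."""
    return chunk_bytes * concurrent_chunks / rtt_seconds


CHUNK_BYTES = 1024 * 1024   # 1MB chunk size, from the settings above
CONCURRENT_CHUNKS = 5       # max concurrent file chunks, from the settings above
ASSUMED_RTT_SECONDS = 0.1   # hypothetical cross-region round trip, not measured

bound = max_chunk_throughput(CHUNK_BYTES, CONCURRENT_CHUNKS, ASSUMED_RTT_SECONDS)
print(f"{bound / 1e6:.1f} MB/s, {bound * 8 / 1e9:.2f} Gbit/s")
```

Under that assumed round-trip time the bound is roughly 52 MB/s. The model ignores disk and CPU cost, so it only indicates an upper limit on what the chunk settings allow.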
Recovery with remote compress off took 0:08:30.843000. Recovery with remote compress on took 0:11:09.063000.
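To put the two recovery times side by side, the difference and relative slowdown can be computed directly from the reported durations. A small Python sketch; the helper name `parse_duration` is hypothetical and simply parses the H:MM:SS.ffffff strings quoted above:

```python
from datetime import timedelta


def parse_duration(text: str) -> timedelta:
    """Parse an H:MM:SS.ffffff string as reported in the results above."""
    hours, minutes, seconds = text.split(":")
    return timedelta(hours=int(hours), minutes=int(minutes), seconds=float(seconds))


compress_off = parse_duration("0:08:30.843000")
compress_on = parse_duration("0:11:09.063000")

extra = compress_on - compress_off
slowdown = compress_on / compress_off - 1  # timedelta / timedelta gives a float

print(f"extra time: {extra.total_seconds():.3f}s, slowdown: {slowdown:.1%}")
```

For these two iterations that is about 158 seconds of extra recovery time with compression on, or roughly 31% slower.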