CCR Recovery from Remote Benchmark plan

Purpose

Benchmark recovery from remote

Environment

We will use the existing workflow in: https://github.com/dliappis/stack-terraform/tree/ccr/projects/ccr

  • ES: 3-node clusters, security enabled
  • GCP: ES: custom-16-32768 (16 vCPU / 32 GB RAM, min. Skylake), load driver: n1-standard-16 (16 vCPU / 60 GB RAM)
  • AWS: ES: c5d.4xlarge (16 vCPU / 32 GB RAM), load driver: m5d.4xlarge (16 vCPU / 64 GB RAM)
  • Locations:
    • AWS: Leader: eu-central-1 (Frankfurt) Follower: us-east-2 (Ohio)
    • GCP: Leader: eu-west-2 (Netherlands) Follower: us-central-1 (Iowa)
  • Rally tracks: geopoint / http_logs / pmc
  • Index settings (unless specifically changed): 3P/1R
  • OS: Ubuntu 16.04 (kernel 4.4.0-1061-aws)
  • Java version (unless specifically changed): OpenJDK 8 1.8.0_191

Collected metrics

  • metricbeat from every node (enabled on all hosts)
  • recovery-stats (from _recovery) every 1s
  • for some experiments: node-stats restricted to jvm/mem, to provide heap usage and GC insight
  • median indexing throughput and the usual results Rally provides in the summary

Criteria to compare runs:

  • time taken for the follower to complete recovery
  • indexing throughput
  • overall system usage (CPU, IO, Network)
  • remote tcp compression off vs on
  • Telemetry device collection frequency
    • "recovery-stats-sample-interval": 1

Benchmark combinations

Experiment 1: AWS, fully index corpora, initiate recovery, no tcp remote compression

Tracks: geopoint, http_logs, pmc
Challenge: new challenge (to be created) executing the following steps in order (see the follow-request sketch below):
  1. Delete and create the indices on the leader
  2. Index the entire corpora with 8 indexing clients at maximum performance (no target-throughput set), then stop indexing
  3. Join the follower and start recovery
  4. Consider the benchmark over when the follower has recovered completely
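A minimal sketch of step 3 (joining the follower cluster to the leader and starting recovery from remote) via the remote cluster settings and the _ccr/follow API; hosts, the cluster alias leader, and index names are placeholders, and auth/TLS options are omitted:

# register the leader cluster on the follower (placeholder seed host)
curl -XPUT 'http://<follower-node-1>:9200/_cluster/settings' -H 'Content-Type: application/json' -d '
{
  "persistent": {
    "cluster.remote.leader.seeds": [ "<leader-node-1>:9300" ]
  }
}'

# create the follower index; this triggers recovery from remote of the leader index
curl -XPUT 'http://<follower-node-1>:9200/follower/_ccr/follow' -H 'Content-Type: application/json' -d '
{
  "remote_cluster": "leader",
  "leader_index": "leader"
}'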

Experiment 2: same as experiment 1, with remote tcp compression enabled

Tracks and challenge identical to experiment 1 (index the entire corpora, then recover); the only difference is that remote tcp compression is enabled (see the settings sketch below).
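A sketch of the settings change that differentiates this experiment, assuming the per-remote-cluster transport.compress setting and the cluster alias leader used above (host is a placeholder):

# enable tcp compression for traffic to the remote cluster "leader" (applied on the follower cluster)
curl -XPUT 'http://<follower-node-1>:9200/_cluster/settings' -H 'Content-Type: application/json' -d '
{
  "persistent": {
    "cluster.remote.leader.transport.compress": true
  }
}'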

Experiment 3: AWS, index part of the corpora, initiate recovery, keep indexing at a lower throughput during recovery, remote tcp compression off

Tracks: geopoint, http_logs, pmc
Challenge: new challenge (to be created) executing the following steps in order (see the recovery-progress sketch below):
  1. Delete and create the indices on the leader
  2. Index some % of the corpora with 8 indexing clients at maximum performance (no target-throughput set)
  3. Join the follower, start recovery, and keep indexing on the leader at a lower throughput
  4. Consider the benchmark over when the follower has recovered completely
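For step 4, recovery completion can be detected by polling the recovery API on the follower until every shard reports stage done; a sketch with placeholder host and index names:

# full per-shard recovery details (what the recovery-stats telemetry device samples every 1s)
curl 'http://<follower-node-1>:9200/follower/_recovery?human'

# compact view: recovery is complete when every shard shows stage=done
curl 'http://<follower-node-1>:9200/_cat/recovery/follower?v&h=index,shard,stage,type,bytes_percent'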

Experiment 4: same as experiment 3, with remote tcp compression enabled

Tracks and challenge identical to experiment 3 (index part of the corpora, then recover while indexing at a lower throughput); the only difference is that remote tcp compression is enabled.

@dliappis commented Feb 14, 2019

Experiment 1 results

All experiments included:

Conclusions:

  • Parallel recovery is essential for achieving good recovery from remote performance.
  • 1MB chunk size, 5 max concurrent file chunks (using smart file chunk fetching) is a good general default when no tcp compression has been enabled.

Detailed results per track below.

Geopoint

Indexed corpus with 3 shards, 1 replica:

green open leader tzH5IiTRQEKEGOQTRkuIiQ 3 1 60844404 0 8.3gb 4.9gb

1. Using non-configurable chunk_size: 64KB

Replication took 3:59:44.90000

2. Using PR for configurable chunk size

Tried with 1MB chunk size and 5MB chunk size with 0 replicas.

Detailed results and graphs in the links above, in summary:

  • 1MB: recovery took: 0:14:04.368000
  • 5MB: recovery took: 0:04:47.406000

3. Using parallel recovery PR

Given the slowness of the results in 1. and 2., it became evident that the high latency (~100ms) requires parallelizing operations: with one chunk request in flight at a time and ~100ms round trips, transfer is capped at roughly chunk_size / RTT, i.e. about 0.64 MB/s for 64KB chunks and about 10 MB/s for 1MB chunks.

Used the PR defaults:

"ccr_recovery_chunk_size": "1MB",
"ccr_recovery_max_concurrent_file_chunks": 5

Results

Recovery from remote took: 0:01:13.300900

http_logs

All benchmarks used 3 primary shards and 0 replicas and targeted the 6.7 Elasticsearch branch.
Indexed corpus:

green open leader1 2ueZ8NwBT0mqWM6SJ-b9ZQ 3 0  12406628 0 633.5mb 633.5mb
green open leader2 Li7jlD5eSc6CR218hKX7Vw 3 0  41417502 0     2gb     2gb
green open leader3 0DevT85aRUut7voO6wh4WA 3 0 193424966 0   9.9gb   9.9gb
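For completeness, a sketch of the leader index settings behind the corpus above (index name and host are placeholders; the challenge deletes and recreates these indices in step 1):

curl -XPUT 'http://<leader-node-1>:9200/leader1' -H 'Content-Type: application/json' -d '
{
  "settings": {
    "index.number_of_shards": 3,
    "index.number_of_replicas": 0
  }
}'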

Iteration 1: using commit 808db1f

Results

Recovery from remote took: 0:05:53.016000

Network usage:

httplogs_baseline_defaults

JVM Heap usage:

httpslogs_baseline_jvm_heap_usage

GC Usage:

httplogs_baseline_gc_usage

Iteration 2: same commit as iter. 1, larger chunk size: 5MB

Results

Recovery from remote took: 0:05:12.449000 (~40 seconds faster than iteration 1)

Network usage comparison against 1MB chunk size:

httplogs_5mb_vs_1mb_baseline_comparison_network

Iteration 3: using commit 38416aa

Used the changes introduced in PR#38841 for smarter CCR concurrent file fetch.

Used defaults: 1MB chunk size, 5 max concurrent file chunks

Results

Recovery from remote took: 0:03:05.924000, 47.3% drop compared to iteration 1

Network usage comparison against iteration 1:

http_logs_compare_smart_fetch_vs_before_using_defaults

Iteration 4: same commit as iter 3, larger chunk size

Used: 5MB chunk size, default max concurrent file chunks: 5

Recovery from remote took: 0:02:40.525000

Network usage comparison against iteration 3:

http_logs_compare_smart_fetch_5mb_vs_1mb_chunk_size

Iteration 5: same commit as iter 3, more concurrent file chunks

Used: 1MB chunk size, increased max concurrent file chunks: 10

Recovery from remote took: 0:02:34.672000

Network usage comparison against iteration 3:

http_logs_compare_smart_fetch_5_vs_10_max_concurrent_file_chunks

Iteration 6: same commit as iter 3, larger chunk size and more concurrent file chunks

Used: 5MB chunk size, max concurrent file chunks: 8

Recovery from remote took: 0:03:38.003000

Iteration 7: same commit and params as iteration 6, using java-11 with AVX enabled

Used: 5MB chunk size, max concurrent file chunks: 8

Recovery from remote took: 0:02:36.914000

@dliappis commented Feb 15, 2019

Experiment 2 results

All experiments included:

  • on branch 6.7
  • using commit: 38416aa
  • including changes introduced in PR#38841 for smarter CCR concurrent file fetch
  • 3 primary shards, 0 replicas

http_logs

1. Using defaults 1MB chunk size, 5 max concurrent file chunks

Replication took: 0:03:13.690000

2. Using 5MB chunk size, 8 max concurrent file chunks

Replication took: 0:02:24.404000

Network usage comparison between iteration 2 and iteration 1:

http_logs_remotecompress_network_usage_1mb_5conc-vs-5mb_8conc

CPU usage comparison between iteration 2 and iteration 1:

http_logs_cpu_usage_remotecompress_1mb_5conc-vs-5mb_8conc

CPU usage comparison between this and same parameters without compression (experiment 1, iteration 6):

http_logs_compare_cpu_user_between_compress_off_on_with_5mb_8parallelchunks

Focusing further on the first node of the leader cluster (where compression takes place), we see a large difference: peak user cpu usage goes from 2% with compression off to 22.3% with compression on:

http_logs_compare_user_cpu_cluster_b_0_between_compress_off_on_with_5mb_8parallelchunks

Network comparison between compression off (experiment 1, iteration 6) and compression on

http_logs_compare_network_between_compress_off_on_with_5mb_8parallelchunks

3. same commit and params as in 2., using java-11 with AVX enabled

The idea here is to check whether java-11 decreases cpu usage when remote compression is enabled.

Conclusion: Using java-11 with AVX enabled didn't show any significant change in cpu usage on the follower, when remote compression is enabled.

Using 5MB chunk size, 8 max concurrent file chunks

Replication took: 0:02:27.299000

user cpu (all nodes) comparison between remote_compress:off (experiment 1, iter 7) and this one:

http_logs_java11_remote_compress_off_vs_on_5mb_8

same chart, displaying user cpu only for follower node 3

http_logs_java11_cpu_user_follower_2_remote_compress_off_vs_on_5mb_8

compare user cpu between 2. and this (java 8 vs java 11)

http_logs_cpu_use_compare_java11_vs_java8_remote_compress_true_5mb_8conc

network usage comparison between remote_compress:off (experiment 1, iter 7) and this one:

http_logs_compare_network_java11_compress_on_vs_off_5mb_8_concurrent

network usage comparison between remote_compress:off and this one, focused on the first leader node and the first follower node

http_logs_per_node_comparison_java11_remote_compress_off_vs_on_5mb_8_chunks

@dliappis commented Feb 22, 2019

Experiment 3 results (index up to a % of the corpus, then recover while continuing to index at a lower throughput)

All experiments included:

  • on branch 6.7
  • using commit: 38416aa
  • including changes introduced in PR#38841 for smarter CCR concurrent file fetch.
  • 3 primary shards, 0 replicas

http_logs

  1. Indexed at maximum indexing throughput with 8 clients for 700s with no replication
  2. Joined clusters and initiated recovery from remote 720s after start of benchmark
  3. From the point in step 2 onwards, reduced indexing throughput to 80,000 docs/s until the end of the benchmark (see the task sketch below).
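A minimal sketch of what the throttled indexing task in step 3 could look like inside the custom Rally challenge. Rally's target-throughput counts operations (here: bulk requests) per second, so with a hypothetical bulk size of 1,000 documents, 80 bulk requests/s corresponds to roughly 80,000 docs/s; the operation name and bulk size are placeholders:

{
  "operation": {
    "name": "bulk-index-during-recovery",
    "operation-type": "bulk",
    "bulk-size": 1000
  },
  "clients": 8,
  "target-throughput": 80
}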

Performance Charts

http_logs_all_usages_corrected

@dliappis commented Feb 26, 2019

Experiment 5 (ad hoc)

Purpose

Compare remote compression off/on on small nodes with limited CPU power

Configuration

Benchmarks used 1-node clusters with 1 shard, on c5d.xlarge instances:

ES node specs:

  • 4 vcpus
  • 7.5 GB ram
  • 92GB SSD
  • Heap: 3GB
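A sketch of how the 3GB heap would be pinned on these small nodes (path assumes a standard deb/rpm install of Elasticsearch):

# /etc/elasticsearch/jvm.options
-Xms3g
-Xmx3g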

All experiments included:

  • on branch 6.7
  • using commit: 38416aa
  • including changes introduced in PR#38841 for smarter CCR concurrent file fetch.
  • 1 primary shard, 0 replicas
  • 1MB chunk size, 5 max concurrent file chunks

Iteration 1, http_logs, remote_compress: off

recovery took: 0:08:30.843000

http_logs_remote_compress_off

Iteration 2, http_logs, remote_compress: on

recovery took: 0:11:09.063000

http_logs_remote_compress_on

Analysis of iteration 1 vs iteration 2 (remote compress off vs on)

  1. CPU user usage: there is a peak phase while all three indices are being recovered, which stabilizes once only the remaining large index (leader3) is still being recovered.

    remote.compress    Peak cpu % (all indices)    avg cpu % while recovering the last index
    OFF, leader        21%                         5%
    ON,  leader        66.18%                      19%
    ON,  follower      18.5%                       5%
    OFF, follower      20.4%                       5%
  2. Network usage:

    remote.compress     Peak MB/s (all indices)    avg MB/s while recovering the last index
    OFF, leader-out     91.6 MB/s                  32 MB/s
    ON,  leader-out     48.5 MB/s                  16 MB/s
    OFF, follower-in    67.2 MB/s                  33 MB/s
    ON,  follower-in    51.6 MB/s                  13 MB/s
  3. The chunk settings (1MB / 5 max concurrent file chunks) aren't large enough to saturate the network, and on these CPU-limited nodes compression ends up making recovery slower.

  4. Recovery with remote compress off took 0:08:30.843000; with compression on it took 0:11:09.063000 (about 31% slower).
