CCR Recovery from Remote Benchmark plan

Purpose

Benchmark recovery from remote

Environment

We will use the existing workflow in: https://github.com/dliappis/stack-terraform/tree/ccr/projects/ccr

  • ES: 3-node clusters, security enabled
  • GCP: ES: custom-16-32768 (16 vCPU / 32 GB RAM / min. Skylake processor), load driver: n1-standard-16 (16 vCPU / 60 GB RAM)
  • AWS: ES: c5d.4xlarge (16 vCPU / 32 GB RAM), load driver: m5d.4xlarge (16 vCPU / 64 GB RAM)
  • Locations:
    • AWS: Leader: eu-central-1 (Frankfurt) Follower: us-east-2 (Ohio)
    • GCP: Leader: europe-west4 (Netherlands) Follower: us-central1 (Iowa)
  • Rally tracks: geopoint / http_logs / pmc
  • Index settings (unless specifically changed): 3 primaries / 1 replica (3P/1R)
  • OS: Ubuntu 16.04 (kernel 4.4.0-1061-aws)
  • Java version (unless specifically changed): OpenJDK 8 1.8.0_191

Collected metrics

  • Metricbeat from each node (enabled everywhere)
  • recovery-stats (from _recovery) every 1s
  • for some experiments: node-stats restricted to jvm/mem, to provide heap usage and GC insights
  • median indexing throughput and the usual results Rally provides in the summary

Criteria to compare runs:

  • time taken for the follower to complete recovery
  • indexing throughput
  • overall system usage (CPU, IO, Network)
  • remote tcp compression off vs on
  • Telemetry device collection frequency
    • "recovery-stats-sample-interval": 1

Benchmark combinations

Experiment 1: AWS, fully index corpora, initiate recovery, no remote TCP compression

Tracks: geopoint, http_logs, pmc

Challenge: new challenge (to be created) executing in the following order:

  1. Delete and create indices on the leader
  2. Index the entire corpora with 8 indexing clients at maximum performance (no target-throughput set), then stop indexing
  3. Join the follower, start recovery (see the follow API sketch below)
  4. Consider the benchmark over when the follower has recovered completely
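
Step 3 boils down to pointing the follower cluster at the leader and creating follower indices. A minimal sketch using the standard remote cluster settings and CCR follow API; hosts, credentials, the `leader` alias and index names are placeholders:

```sh
# Sketch only: register the leader as a remote cluster on the follower, then
# start following one index, which triggers recovery from remote.
# Hosts, credentials, the remote cluster alias and index names are placeholders.
curl -sk -u elastic:changeme -H 'Content-Type: application/json' \
  -X PUT 'https://follower-node-1:9200/_cluster/settings' -d '
{
  "persistent": {
    "cluster.remote.leader.seeds": ["leader-node-1:9300"]
  }
}'

curl -sk -u elastic:changeme -H 'Content-Type: application/json' \
  -X PUT 'https://follower-node-1:9200/logs-follower/_ccr/follow' -d '
{
  "remote_cluster": "leader",
  "leader_index": "logs-leader"
}'
```

Progress of the resulting recovery is what the 1s `_recovery` sampling above tracks until the follower shards are fully recovered.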

Experiment 2: same as experiment 1, with remote TCP compression enabled

Tracks: geopoint, http_logs, pmc

Challenge: new challenge (to be created) executing in the following order:

  1. Delete and create indices on the leader
  2. Index the entire corpora with 8 indexing clients at maximum performance (no target-throughput set), then stop indexing
  3. Join the follower, start recovery
  4. Consider the benchmark over when the follower has recovered completely
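
A sketch of the only difference from experiment 1: toggling transport compression on the remote cluster connection, assuming the per-remote-cluster `transport.compress` setting available in 6.7 (host, credentials and the `leader` alias are placeholders):

```sh
# Sketch only: enable TCP/transport compression for the remote cluster
# connection named "leader", as configured on the follower cluster.
curl -sk -u elastic:changeme -H 'Content-Type: application/json' \
  -X PUT 'https://follower-node-1:9200/_cluster/settings' -d '
{
  "persistent": {
    "cluster.remote.leader.transport.compress": true
  }
}'
```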

Experiment 3: AWS, fully index corpora, initiate recovery, keep indexing at lower throughput during recovery, remote TCP compression off

Tracks: geopoint, http_logs, pmc

Challenge: new challenge (to be created; a rough sketch of its shape follows below) executing in the following order:

  1. Delete and create indices on the leader
  2. Index some % of the corpora with 8 indexing clients at maximum performance (no target-throughput set)
  3. Join the follower, start recovery, and keep indexing on the leader at a lower throughput
  4. Consider the benchmark over when the follower has recovered completely
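
Since the challenge does not exist yet, here is a rough sketch of how the indexing phases could be shaped as a Rally challenge schedule. Operation names and the file path are illustrative (the real tracks define their own operations), the "join follower / wait for recovery" steps would still need custom operations not shown here, and the 80000 docs/s throttle matches the rate used later in the experiment 3 results:

```sh
# Sketch only: a possible shape for the indexing part of the challenge.
# Operation names, the file path, and the throttled rate are illustrative.
cat > challenges/recovery-while-indexing.json <<'EOF'
{
  "name": "recovery-while-indexing",
  "description": "Index at max throughput, then keep indexing slowly while the follower recovers",
  "schedule": [
    { "operation": "delete-index" },
    { "operation": "create-index" },
    { "operation": "index-append", "clients": 8 },
    { "operation": "index-append", "clients": 8, "target-throughput": 80000 }
  ]
}
EOF
```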

Experiment 4: same as experiment 3, with remote TCP compression enabled

Tracks: geopoint, http_logs, pmc

Challenge: new challenge (to be created) executing in the following order:

  1. Delete and create indices on the leader
  2. Index some % of the corpora with 8 indexing clients at maximum performance (no target-throughput set)
  3. Join the follower, start recovery, and keep indexing on the leader at a lower throughput
  4. Consider the benchmark over when the follower has recovered completely


dliappis commented Feb 22, 2019

Experiment 3 results (index up to a %, then recover while indexing at a lower rate)

All experiments included:

  • on branch 6.7
  • using commit: 38416aa
  • including changes introduced in PR#38841 for smarter CCR concurrent file fetch.
  • 3 primary shards, 0 replicas

http_logs

  1. Indexed at maximum indexing throughput with 8 clients for 700s with no replication
  2. Joined clusters and initiated recovery from remote 720s after start of benchmark
  3. Reduced indexing throughput at the time point of step 2 to 80,000 docs/s until the end of the benchmark.

Performance Charts

[Chart: http_logs_all_usages_corrected]


dliappis commented Feb 26, 2019

Experiment 5 (Adhoc)

Purpose

Compare remote compression off/on on small nodes with limited CPU power

Configuration

Benchmark using 1-node clusters with 1 shard, on c5d.xlarge instances:

ES node specs:

  • 4 vCPUs
  • 7.5 GB RAM
  • 92 GB SSD
  • Heap: 3 GB

All experiments included:

  • on branch 6.7
  • using commit: 38416aa
  • including changes introduced in PR#38841 for smarter CCR concurrent file fetch.
  • 1 primary shard, 0 replicas
  • 1MB chunk size, 5 max concurrent file chunks
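
A sketch of how the chunk size and concurrent file chunk values listed above could be applied, assuming the `ccr.indices.recovery.*` settings; host and credentials are placeholders, and depending on the exact version these may need to go into elasticsearch.yml instead of the cluster settings API:

```sh
# Sketch only: apply the 1MB chunk size and 5 max concurrent file chunks used
# in this experiment via cluster settings on the follower.
# If these settings are not dynamic in this version, set them in elasticsearch.yml.
curl -sk -u elastic:changeme -H 'Content-Type: application/json' \
  -X PUT 'https://follower-node-1:9200/_cluster/settings' -d '
{
  "persistent": {
    "ccr.indices.recovery.chunk_size": "1mb",
    "ccr.indices.recovery.max_concurrent_file_chunks": 5
  }
}'
```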

Iteration 1, http_logs, remote_compress: off

recovery took: 0:08:30.843000

[Chart: http_logs_remote_compress_off]

Iteration 2, http_logs, remote_compress: on

recovery took: 0:11:09.063000

[Chart: http_logs_remote_compress_on]

Analysis of iteration 1 vs iteration 2 (remote compress off vs on)

  1. CPU user usage: there is a peak phase while all three indices are being recovered, which then stabilizes once only the remaining large index (leader3) is still being recovered.

     | remote.compress | Peak CPU % (all indices) | Avg CPU % during indexing of last index |
     | --- | --- | --- |
     | OFF, leader | 21% | 5% |
     | ON, leader | 66.18% | 19% |
     | ON, follower | 18.5% | 5% |
     | OFF, follower | 20.4% | 5% |
  2. Network usage:

     | remote.compress | Peak MB/s (all indices) | Avg MB/s during indexing of last index |
     | --- | --- | --- |
     | OFF, leader-out | 91.6 MB/s | 32 MB/s |
     | ON, leader-out | 48.5 MB/s | 16 MB/s |
     | OFF, follower-in | 67.2 MB/s | 33 MB/s |
     | ON, follower-in | 51.6 MB/s | 13 MB/s |
  3. Chunk settings (1MB / 5 max concurrent file chunks) aren't large enough to saturate the network, so the extra CPU cost of compression ends up making recovery slower.

  4. Recovery with remote compression off took 0:08:30.843000; with remote compression on it took 0:11:09.063000 (roughly 31% longer).
