CCR Recovery from Remote Benchmark plan

Purpose

Benchmark recovery from remote

Environment

We will use the existing workflow in: https://github.com/dliappis/stack-terraform/tree/ccr/projects/ccr

  • ES: 3-node clusters, security enabled
  • GCP: ES: custom-16-32768 (16 vCPU / 32 GB RAM, min. Skylake), load driver: n1-standard-16 (16 vCPU / 60 GB RAM)
  • AWS: ES: c5d.4xlarge (16 vCPU / 32 GB RAM), load driver: m5d.4xlarge (16 vCPU / 64 GB RAM)
  • Locations:
    • AWS: Leader: eu-central-1 (Frankfurt) Follower: us-east-2 (Ohio)
    • GCP: Leader: eu-west-2 (Netherlands) Follower: us-central-1 (Iowa)
  • Rally tracks: geopoint / http_logs / pmc
  • Index settings (unless specifically changed): 3P/1R
  • OS: Ubuntu 16.04 (kernel 4.4.0-1061-aws)
  • Java version (unless specifically changed): OpenJDK 8 1.8.0_191

Collected metrics

  • metricbeat from every node (enabled on all hosts)
  • recovery-stats (from _recovery) every 1s
  • for some experiments: node-stats restricted to jvm/mem, to provide heap usage and GC insight
  • median indexing throughput and the usual results Rally provides in the summary

Criteria to compare runs:

  • time taken for the follower to complete recovery
  • indexing throughput
  • overall system usage (CPU, IO, Network)
  • remote tcp compression off vs on
  • Telemetry device collection frequency
    • "recovery-stats-sample-interval": 1

Benchmark combinations

Experiment 1: AWS, fully index corpora, initiate recovery, no tcp remote compression

Tracks: geopoint, http_logs, pmc
Challenge: new challenge (to be created) executing the following steps in order (see the follow-request sketch below):
  1. Delete and create the indices on the leader
  2. Index the entire corpora with 8 indexing clients at maximum performance (no target-throughput set), then stop indexing
  3. Join the follower and start recovery
  4. Consider the benchmark over when the follower has recovered completely
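A minimal sketch of step 3 (joining the follower cluster to the leader and starting recovery from remote) via the remote cluster settings and the _ccr/follow API; hosts, the cluster alias leader, and index names are placeholders, and auth/TLS options are omitted:

# register the leader cluster on the follower (placeholder seed host)
curl -XPUT 'http://<follower-node-1>:9200/_cluster/settings' -H 'Content-Type: application/json' -d '
{
  "persistent": {
    "cluster.remote.leader.seeds": [ "<leader-node-1>:9300" ]
  }
}'

# create the follower index; this triggers recovery from remote of the leader index
curl -XPUT 'http://<follower-node-1>:9200/follower/_ccr/follow' -H 'Content-Type: application/json' -d '
{
  "remote_cluster": "leader",
  "leader_index": "leader"
}'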

Experiment 2: same as experiment 1, with remote tcp compression enabled

Tracks and challenge identical to experiment 1 (index the entire corpora, then recover); the only difference is that remote tcp compression is enabled (see the settings sketch below).
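A sketch of the settings change that differentiates this experiment, assuming the per-remote-cluster transport.compress setting and the cluster alias leader used above (host is a placeholder):

# enable tcp compression for traffic to the remote cluster "leader" (applied on the follower cluster)
curl -XPUT 'http://<follower-node-1>:9200/_cluster/settings' -H 'Content-Type: application/json' -d '
{
  "persistent": {
    "cluster.remote.leader.transport.compress": true
  }
}'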

Experiment 3: AWS, index part of the corpora, initiate recovery, keep indexing at a lower throughput during recovery, remote tcp compression off

Tracks: geopoint, http_logs, pmc
Challenge: new challenge (to be created) executing the following steps in order (see the recovery-progress sketch below):
  1. Delete and create the indices on the leader
  2. Index some % of the corpora with 8 indexing clients at maximum performance (no target-throughput set)
  3. Join the follower, start recovery, and keep indexing on the leader at a lower throughput
  4. Consider the benchmark over when the follower has recovered completely
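For step 4, recovery completion can be detected by polling the recovery API on the follower until every shard reports stage done; a sketch with placeholder host and index names:

# full per-shard recovery details (what the recovery-stats telemetry device samples every 1s)
curl 'http://<follower-node-1>:9200/follower/_recovery?human'

# compact view: recovery is complete when every shard shows stage=done
curl 'http://<follower-node-1>:9200/_cat/recovery/follower?v&h=index,shard,stage,type,bytes_percent'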

Experiment 4: same as experiment 3, with remote tcp compression enabled

Tracks and challenge identical to experiment 3 (index part of the corpora, then recover while indexing at a lower throughput); the only difference is that remote tcp compression is enabled.

@dliappis commented Feb 14, 2019

Experiment 1 results

All experiments included:

Conclusions:

  • Parallel recovery is essential for achieving good recovery from remote performance.
  • 1MB chunk size, 5 max concurrent file chunks (using smart file chunk fetching) is a good general default when no tcp compression has been enabled.

Detailed results per track below.

Geopoint

Indexed corpus with 3 shards, 1 replica:

green open leader tzH5IiTRQEKEGOQTRkuIiQ 3 1 60844404 0 8.3gb 4.9gb

1. Using non-configurable chunk_size: 64KB

Replication took 3:59:44.90000

2. Using PR for configurable chunk size

Tried with 1MB chunk size and 5MB chunk size with 0 replicas.

Detailed results and graphs in the links above, in summary:

  • 1MB: recovery took: 0:14:04.368000
  • 5MB: recovery took: 0:04:47.406000

3. Using parallel recovery PR

Given the slowness of the results in 1. and 2., it became evident that the high latency (~100ms) requires parallelizing operations: with one chunk request in flight at a time and ~100ms round trips, transfer is capped at roughly chunk_size / RTT, i.e. about 0.64 MB/s for 64KB chunks and about 10 MB/s for 1MB chunks.

Used the PR defaults:

"ccr_recovery_chunk_size": "1MB",
"ccr_recovery_max_concurrent_file_chunks": 5

Results

Recovery from remote took: 0:01:13.300900

http_logs

All benchmarks used 3 primary shards and 0 replicas and targeted the 6.7 Elasticsearch branch.
Indexed corpus:

green open leader1 2ueZ8NwBT0mqWM6SJ-b9ZQ 3 0  12406628 0 633.5mb 633.5mb
green open leader2 Li7jlD5eSc6CR218hKX7Vw 3 0  41417502 0     2gb     2gb
green open leader3 0DevT85aRUut7voO6wh4WA 3 0 193424966 0   9.9gb   9.9gb
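For completeness, a sketch of the leader index settings behind the corpus above (index name and host are placeholders; the challenge deletes and recreates these indices in step 1):

curl -XPUT 'http://<leader-node-1>:9200/leader1' -H 'Content-Type: application/json' -d '
{
  "settings": {
    "index.number_of_shards": 3,
    "index.number_of_replicas": 0
  }
}'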

Iteration 1: using commit 808db1f

Results

Recovery from remote took: 0:05:53.016000

Network usage:

httplogs_baseline_defaults

JVM Heap usage:

httpslogs_baseline_jvm_heap_usage

GC Usage:

httplogs_baseline_gc_usage

Iteration 2: same commit as iter. 1, larger chunk size: 5MB

Results

Recovery from remote took: 0:05:12.449000 (~40 seconds faster than iteration 1)

Network usage comparison against 1MB chunk size:

httplogs_5mb_vs_1mb_baseline_comparison_network

Iteration 3: using commit 38416aa

Used the changes introduced in PR#38841 for smarter CCR concurrent file fetch.

Used defaults: 1MB chunk size, 5 max concurrent file chunks

Results

Recovery from remote took: 0:03:05.924000, 47.3% drop compared to iteration 1

Network usage comparison against iteration 1:

http_logs_compare_smart_fetch_vs_before_using_defaults

Iteration 4: same commit as iter 3, larger chunk size

Used: 5MB chunk size, default max concurrent file chunks: 5

Recovery from remote took: 0:02:40.525000

Network usage comparison against iteration 3:

http_logs_compare_smart_fetch_5mb_vs_1mb_chunk_size

Iteration 5: same commit as iter 3, more concurrent file chunks

Used: 1MB chunk size, increased max concurrent file chunks: 10

Recovery from remote took: 0:02:34.672000

Network usage comparison against iteration 3:

http_logs_compare_smart_fetch_5_vs_10_max_concurrent_file_chunks

Iteration 6: same commit as iter 3, larger chunk size and more concurrent file chunks

Used: 5MB chunk size, max concurrent file chunks: 8

Recovery from remote took: 0:03:38.003000

Iteration 7: same commit and params as iteration 6, using java-11 with AVX enabled

Used: 5MB chunk size, max concurrent file chunks: 8

Recovery from remote took: 0:02:36.914000

@dliappis commented Feb 15, 2019

Experiment 2 results

All experiments included:

  • on branch 6.7
  • using commit: 38416aa
  • including changes introduced in PR#38841 for smarter CCR concurrent file fetch
  • 3 primary shards, 0 replicas

http_logs

1. Using defaults 1MB chunk size, 5 max concurrent file chunks

Replication took: 0:03:13.690000

2. Using 5MB chunk size, 8 max concurrent file chunks

Replication took: 0:02:24.404000

Network usage comparison between iteration 2 and iteration 1:

http_logs_remotecompress_network_usage_1mb_5conc-vs-5mb_8conc

CPU usage comparison between iteration 2 and iteration 1:

http_logs_cpu_usage_remotecompress_1mb_5conc-vs-5mb_8conc

CPU usage comparison between this and same parameters without compression (experiment 1, iteration 6):

http_logs_compare_cpu_user_between_compress_off_on_with_5mb_8parallelchunks

Focusing further on the first node of the leader cluster (where compression takes place), we see a large difference: peak user cpu usage goes from 2% with compression off to 22.3% with compression on:

http_logs_compare_user_cpu_cluster_b_0_between_compress_off_on_with_5mb_8parallelchunks

Network comparison between compression off (experiment 1, iteration 6) and compression on

http_logs_compare_network_between_compress_off_on_with_5mb_8parallelchunks

3. same commit and params as in 2., using java-11 with AVX enabled

The idea here is to check whether java-11 decreases cpu usage when remote compression is enabled.

Conclusion: Using java-11 with AVX enabled didn't show any significant change in cpu usage on the follower, when remote compression is enabled.

Using 5MB chunk size, 8 max concurrent file chunks

Replication took: 0:02:27.299000

user cpu (all nodes) comparison between remote_compress:off (experiment 1, iter 7) and this one:

http_logs_java11_remote_compress_off_vs_on_5mb_8

same chart, displaying user cpu only for follower node 3

http_logs_java11_cpu_user_follower_2_remote_compress_off_vs_on_5mb_8

compare user cpu between 2. and this (java 8 vs java 11)

http_logs_cpu_use_compare_java11_vs_java8_remote_compress_true_5mb_8conc

network usage comparison between remote_compress:off (experiment 1, iter 7) and this one:

http_logs_compare_network_java11_compress_on_vs_off_5mb_8_concurrent

network usage comparison between remote_compress:off and this one, focused on the first leader node and the first follower node

http_logs_per_node_comparison_java11_remote_compress_off_vs_on_5mb_8_chunks

@dliappis commented Feb 22, 2019

Experiment 3 results (index up to a % of the corpus, then recover while continuing to index at a lower throughput)

All experiments included:

  • on branch 6.7
  • using commit: 38416aa
  • including changes introduced in PR#38841 for smarter CCR concurrent file fetch.
  • 3 primary shards, 0 replicas

http_logs

  1. Indexed at maximum indexing throughput with 8 clients for 700s with no replication
  2. Joined clusters and initiated recovery from remote 720s after start of benchmark
  3. From the point in step 2 onwards, reduced indexing throughput to 80,000 docs/s until the end of the benchmark (see the task sketch below).
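A minimal sketch of what the throttled indexing task in step 3 could look like inside the custom Rally challenge. Rally's target-throughput counts operations (here: bulk requests) per second, so with a hypothetical bulk size of 1,000 documents, 80 bulk requests/s corresponds to roughly 80,000 docs/s; the operation name and bulk size are placeholders:

{
  "operation": {
    "name": "bulk-index-during-recovery",
    "operation-type": "bulk",
    "bulk-size": 1000
  },
  "clients": 8,
  "target-throughput": 80
}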

Performance Charts

http_logs_all_usages_corrected

@dliappis commented Feb 26, 2019

Experiment 5 (ad hoc)

Purpose

Compare remote compression off/on on small nodes with limited CPU power

Configuration

Benchmarks used 1-node clusters with 1 shard, on c5d.xlarge instances:

ES node specs:

  • 4 vcpus
  • 7.5 GB ram
  • 92GB SSD
  • Heap: 3GB
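A sketch of how the 3GB heap would be pinned on these small nodes (path assumes a standard deb/rpm install of Elasticsearch):

# /etc/elasticsearch/jvm.options
-Xms3g
-Xmx3g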

All experiments included:

  • on branch 6.7
  • using commit: 38416aa
  • including changes introduced in PR#38841 for smarter CCR concurrent file fetch.
  • 1 primary shard, 0 replicas
  • 1MB chunk size, 5 max concurrent file chunks

Iteration 1, http_logs, remote_compress: off

recovery took: 0:08:30.843000

http_logs_remote_compress_off

Iteration 2, http_logs, remote_compress: on

recovery took: 0:11:09.063000

http_logs_remote_compress_on

Analysis of iteration 1 vs iteration 2 (remote compress off vs on)

  1. CPU user usage: there is a peak phase while all three indices are being recovered, which stabilizes once only the remaining large index (leader3) is still being recovered.

    remote.compress    Peak cpu % (all indices)    avg cpu % while recovering the last index
    OFF, leader        21%                         5%
    ON,  leader        66.18%                      19%
    ON,  follower      18.5%                       5%
    OFF, follower      20.4%                       5%
  2. Network usage:

    remote.compress     Peak MB/s (all indices)    avg MB/s while recovering the last index
    OFF, leader-out     91.6 MB/s                  32 MB/s
    ON,  leader-out     48.5 MB/s                  16 MB/s
    OFF, follower-in    67.2 MB/s                  33 MB/s
    ON,  follower-in    51.6 MB/s                  13 MB/s
  3. The chunk settings (1MB / 5 max concurrent file chunks) aren't large enough to saturate the network, and on these CPU-limited nodes compression ends up making recovery slower.

  4. Recovery with remote compress off took 0:08:30.843000; with compression on it took 0:11:09.063000 (about 31% slower).
