@dliappis
Last active June 30, 2020 15:12
Linux Multi-queue and Elasticsearch corrupt index exception bug reproduction scripts

Reproduction steps for Elasticsearch CorruptIndexException

This workflow spins up 3 GCP VMs in us-central1-a (by default) using the image ubuntu-1604-xenial-v20190807 from the image project ubuntu-os-cloud.

The first VM is used as a load driver and the other two host a two-node Elasticsearch cluster.

Prerequisites

  1. A Linux or macOS workstation and a bash shell.
  2. The gcloud CLI tool, installed and configured for your account.
  3. export SSH_PUB_KEY=<path to your ssh public key>; if unset, it defaults to ~/.ssh/id_rsa.pub
  4. export GCP_PROJECT=<your gcp project>
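
The two variables can be checked before anything is launched. The following is a minimal sketch of such a preflight check; check_prereqs is a hypothetical helper that is not part of the scripts. It only mirrors the defaults listed above and does not verify the gcloud configuration.

```shell
#!/usr/bin/env bash
# Hypothetical preflight helper (not part of the original scripts).
# Succeeds only if GCP_PROJECT is set and the SSH public key is readable.
check_prereqs() {
  # Same default as described in the prerequisites.
  local key="${SSH_PUB_KEY:-$HOME/.ssh/id_rsa.pub}"
  if [ -z "${GCP_PROJECT:-}" ]; then
    echo "GCP_PROJECT is not set" >&2
    return 1
  fi
  if [ ! -r "$key" ]; then
    echo "SSH public key not readable: $key" >&2
    return 1
  fi
  echo "project=${GCP_PROJECT} key=${key}"
}

# Usage: check_prereqs && ./start_vms.sh
```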

Execute

  1. Run ./start_vms.sh

This prints the public IP address of the load driver machine and the public IPs of the Elasticsearch nodes. To ssh into one of the Elasticsearch nodes, use ssh <public_ip_of_es_node> (no username is needed, because the VMs are provisioned with your local username by default).

It will also start a stress-test benchmark on the load driver machine inside a tmux session.
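
To watch the benchmark, attach to the tmux session on the load driver. configure_rally.sh names the session Rally; the helper name below is an assumption made for illustration, and the sketch only prints the command so it can be reviewed before it is run.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the command that attaches to the Rally tmux session.
# ssh -t forces terminal allocation, which tmux needs when started through ssh.
rally_attach_cmd() {
  local loaddriver_ip="$1"
  printf 'ssh -t %s tmux attach-session -t Rally\n' "$loaddriver_ip"
}

# Example with a documentation placeholder address (not a real load driver):
rally_attach_cmd 203.0.113.10
```

Replace the placeholder with the load driver IP printed by start_vms.sh.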

The reproduction may take 2 days. When the corruption occurs, the benchmark in the load driver's tmux session stops with an error (Rally runs with --on-error=abort), and an entry like the following appears in /var/log/elasticsearch/elasticsearch.log on one of the two nodes:

org.apache.lucene.index.CorruptIndexException: checksum failed (hardware problem?)
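
Since the failure can take a long time to appear, it may be convenient to check the log periodically on each node. The sketch below assumes the default log path mentioned above; find_corruption is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print matching log lines (with line numbers) and
# return grep's status, i.e. 0 if the exception was logged, 1 if not.
find_corruption() {
  local log_file="${1:-/var/log/elasticsearch/elasticsearch.log}"
  grep -nF 'CorruptIndexException' "$log_file"
}

# Usage on an Elasticsearch node:
#   find_corruption && echo "corruption reproduced"
```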

Teardown

  1. Run ./stop_vms.sh
#!/usr/bin/env bash
# configure_elasticsearch.sh: runs on each Elasticsearch node (piped over ssh by start_vms.sh)
set -eo pipefail
echo "debconf debconf/frontend select Noninteractive" | sudo debconf-set-selections
wget -qO - https://artifacts.elastic.co/GPG-KEY-elasticsearch | sudo apt-key add -
sudo apt-get install -y apt-transport-https
echo "deb https://artifacts.elastic.co/packages/7.x/apt stable main" | sudo tee -a /etc/apt/sources.list.d/elastic-7.x.list
sudo apt-get update && sudo apt-get install -y unzip && sudo apt-get install -y elasticsearch=7.6.2
sudo mdadm --create /dev/md0 --name md0 --level=0 --raid-devices=2 /dev/sdb /dev/sdc
sudo mkfs.ext4 -L md0 /dev/md0
sudo mount /dev/md0 /var/lib/elasticsearch
sudo bash -c "cat >>/etc/fstab" <<EOF
LABEL=md0 /var/lib/elasticsearch ext4 defaults,nofail 0 1
EOF
sudo chown -R elasticsearch:elasticsearch /var/lib/elasticsearch
export CWD="$(pwd)"
mkdir -p "$CWD/certificates"
cat >"$CWD/certificates/ca.crt" <<EOF
-----BEGIN CERTIFICATE-----
MIIDnTCCAoWgAwIBAgIUDG8qaJy82FTuJEiL5a7EnYhITHMwDQYJKoZIhvcNAQEL
BQAwNDEyMDAGA1UEAxMpRWxhc3RpYyBDZXJ0aWZpY2F0ZSBUb29sIEF1dG9nZW5l
cmF0ZWQgQ0EwIBcNMTcwNDIxMDgzNzE2WhgPMjI5MTAyMDMwODM3MTZaMDQxMjAw
BgNVBAMTKUVsYXN0aWMgQ2VydGlmaWNhdGUgVG9vbCBBdXRvZ2VuZXJhdGVkIENB
MIIBIjANBgkqhkiG9w0BAQEFAAOCAQ8AMIIBCgKCAQEAxLYZs3QX/fZ4ikxAZhBe
Q1QayqsHfU8A3P9Q2BoEOtP3goPDu9r9Lpaj1yoq2KEy59VW9jYFgbi8mlrL+tWP
3ChdM+skhHxoFku3cIs2+U5qtzBFa717EAvbQcWvSKQyHIkmckTo0InlLk0QS49Z
dyhU6sw7L5jxZTH5t4Ix8STwi9K3x4D2pfrcW3jI4yAd09d9jpf+szR3MimwyX+Q
qHSWdqbSPNFxrd7wEzAj/hcoouqL7tPKq2aLfwL1qSWMfCFHFp+H/X8PxNVjzyHX
9WI0u6nQekPgocCxU/JO/Q0AhHKhz4m0a9InpIop/c2GssS4FEfWvSbbvfrtz47D
2wIDAQABo4GkMIGhMB0GA1UdDgQWBBQEvbcIhMaWXp2HgipoIWFjUrHj5zBvBgNV
HSMEaDBmgBQEvbcIhMaWXp2HgipoIWFjUrHj56E4pDYwNDEyMDAGA1UEAxMpRWxh
c3RpYyBDZXJ0aWZpY2F0ZSBUb29sIEF1dG9nZW5lcmF0ZWQgQ0GCFAxvKmicvNhU
7iRIi+WuxJ2ISExzMA8GA1UdEwEB/wQFMAMBAf8wDQYJKoZIhvcNAQELBQADggEB
AAYwh6vNb0ata3HPY2iRmJqLu61p5L7wa/XaEE+pBGLoi/0FWhKYFbL0BhlaTp2O
xJn5irAAA7bVK2aNykhCVKbparJpSb+lSMtkPNSzvSODv7uUGwmU8+KU0pTZDki2
LVm6zNbrRROAjJcPllghqfKcGLjFEaX1D87XmQJJfzJymaIeHtT/xQMfM7Roi6TE
hB4gXLZTs1pY1fWhcsS6wvrNdTlGTaipABMUPFtf3K/GQnnT87YOS+Ce+dfGN+SI
hVw2jukDZwZxmL3GtddVkPP4fS/OEkUC8l3/QvgL2NEKMAZV3QISnEAbXNHzrLxQ
5yoUeq1gOEyquAzlasQD1/U=
-----END CERTIFICATE-----
EOF
cat >"$CWD/certificates/ca.key" <<EOF
-----BEGIN RSA PRIVATE KEY-----
MIIEowIBAAKCAQEAxLYZs3QX/fZ4ikxAZhBeQ1QayqsHfU8A3P9Q2BoEOtP3goPD
u9r9Lpaj1yoq2KEy59VW9jYFgbi8mlrL+tWP3ChdM+skhHxoFku3cIs2+U5qtzBF
a717EAvbQcWvSKQyHIkmckTo0InlLk0QS49ZdyhU6sw7L5jxZTH5t4Ix8STwi9K3
x4D2pfrcW3jI4yAd09d9jpf+szR3MimwyX+QqHSWdqbSPNFxrd7wEzAj/hcoouqL
7tPKq2aLfwL1qSWMfCFHFp+H/X8PxNVjzyHX9WI0u6nQekPgocCxU/JO/Q0AhHKh
z4m0a9InpIop/c2GssS4FEfWvSbbvfrtz47D2wIDAQABAoIBADlIwnFE9Jurg+za
ScKvL5Qx0N+GMNcoA5tX6qYT5XlwMtraHkz9d89yZOIK0JFnWBi1Qu7OSoo9Twcw
O8ifGpbFVmcBKhA+3lznzdLDZ83wLRmNwBmhA05n9YDQ3busvT8cHYsXUCkyjwAN
xxoJ88bEgv4hXXb99gY/KHZtPrf3Q4nSUn2k+aIZDRrkpg9UcxPTXIsNTrHXbDeT
v9GHOYWo1i2O3xAaowN2P57QV4JsCH23Ub5l2EZDxhh4+NT4bsfD/BiZe8UNceUJ
BMLtCPwDQjjyzCBtoIG4c4Dt6jQjFZSV28TAr7ZcK8OeUfDdlCr3bCTow82LVe4M
y7sa8BECgYEA+GF4JyiCtZBb0z8yZTrG6R5nKZIY7KnuXuf6ukl46MZGo9OvOQlZ
9ovB1NgCm7ENhRs0R9fdOanDisvQ6dIJkkmqqRMenKxG01qyJDIYs6MLG5s84otC
xmz/S2suKT24/ptjVw/oHA6PeKgkYKIlEBZhMqoFYMoppiSuNhzF9uMCgYEAyr7f
MfYRihSo4OWhhb32568I2ZuZFT13ABdtMnEtLd/QWW8HXVYNG+UT9Qc2utjJlvOS
6i5/CqPf3oFWARVRgFFbsqZFUToon4kEYufR1qfjrwj91hufGllVymLPxnuV0ZlI
YDc1QpVN7CskpGoOsEFsNjPveWRwqv16CCsYmKkCgYEAk6nOtuj8nFiQXsxpd4k0
DA+JIUu8CacVEdM0Wl+nxCtsf6UvvOb0VwDLYXByTIE8GnAL6tJIsSleGTwGnZvD
GPc2wIGfZ2F8UdbPpXkq+lDqH6Vw0vYb4r+WHw4/SUFqo+NZcb8BLPzzCrZbuh9r
jV7gtjAiNmK51A5mi8EbaCUCgYA8nbiJbXJtACRFqSITpGoPdsuEk/q+2POdOWPS
cvf5ATN/qaxgAXxF3MWMuq1oS6xpz0Ubcu9UtQ4Xrj+Sb1dAsBJkZUXQNT00BXkk
QP8B2IxAJsYNn5CABjmaGtTYGNcAJX34Fkl8MLttYrC/312o4MaDph9xAdCVrtcv
XgMqkQKBgF+BpjaU/DqKje3GGuLDv/6uCHxmKYfq4rnQ5YQYPEFbQpPVbKlINY99
hC8f/Y2qLKPlc9ikotAlvxgfXy+v1dlyrpnRWXtSeMAPxDtNiCsoGByQY5/zrfH7
upaN3ad/n/X4xGPyx6Ri2D+mbLYzWOXIrDLuPuJVhMxSi1feb9M/
-----END RSA PRIVATE KEY-----
EOF
cat >"$CWD/certificates/instances.yml" <<EOF
---
instances:
  - name: ${ES_NODE_NAME}
    ip: ${PRIVATE_IP_ES_NODE}
EOF
sudo CWD=$CWD su -s /bin/bash -c '/usr/share/elasticsearch/bin/elasticsearch-certutil \
cert \
--silent \
--ca-cert $CWD/certificates/ca.crt \
--ca-key $CWD/certificates/ca.key \
--in $CWD/certificates/instances.yml \
--out /etc/elasticsearch/node-certs.zip \
--pass ""'
sudo su -s /bin/bash -c 'cd /etc/elasticsearch; unzip /etc/elasticsearch/node-certs.zip; rm /etc/elasticsearch/node-certs.zip'
sudo chown -R root:elasticsearch /etc/elasticsearch/$(hostname)
sudo su -s /bin/bash -c "
cd /etc/elasticsearch;
cat >>elasticsearch.yml <<EOF_ES
node.name: ${ES_NODE_NAME}
network.host: $PRIVATE_IP_ES_NODE
cluster.initial_master_nodes: $ES_NODE_PRIVATE_IPS
discovery.seed_hosts: $ES_OTHER_NODE_IPS
xpack.ml.enabled: false
xpack.monitoring.enabled: false
xpack.security.enabled: true
xpack.watcher.enabled: false
xpack.security.transport.ssl.enabled: true
xpack.security.transport.ssl.verification_mode: full
xpack.security.transport.ssl.keystore.path: ${ES_NODE_NAME}/${ES_NODE_NAME}.p12
xpack.security.transport.ssl.truststore.path: ${ES_NODE_NAME}/${ES_NODE_NAME}.p12
xpack.security.http.ssl.enabled: true
xpack.security.http.ssl.keystore.path: ${ES_NODE_NAME}/${ES_NODE_NAME}.p12
xpack.security.authc.accept_default_password: false
xpack.security.authc.token.enabled: false
EOF_ES
cat >>jvm.options <<EOF_ES
-Xms16g
-Xmx16g
EOF_ES
"
sudo su -s /bin/bash -c "
cd /usr/share/elasticsearch
if [[ ! -f /etc/elasticsearch/elasticsearch.keystore ]]; then
  bin/elasticsearch-keystore create
fi
echo 'some-secret-password' | bin/elasticsearch-keystore add -x 'bootstrap.password'
"
sudo systemctl enable elasticsearch.service
sudo systemctl start elasticsearch.service
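
Once both nodes are up, the cluster state can be checked over HTTPS with the CA certificate and the bootstrap password set above. This is a minimal sketch: the helper names and the CA_CERT variable are assumptions, and the node's private IP must be used because the node certificate is issued for that address.

```shell
#!/usr/bin/env bash
# Hypothetical helpers (not part of the original scripts).

# Build the cluster health URL for a node's private IP.
es_health_url() {
  printf 'https://%s:9200/_cluster/health?pretty\n' "$1"
}

# Query cluster health as the built-in elastic user.
# CA_CERT is an assumed override; the default matches the path written above.
es_health() {
  curl --silent --show-error \
    --cacert "${CA_CERT:-certificates/ca.crt}" \
    --user "elastic:some-secret-password" \
    "$(es_health_url "$1")"
}

# Usage from a machine inside the VM network:
#   es_health <private_ip_of_es_node>
```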
#!/usr/bin/env bash
# configure_rally.sh: runs on the load driver (piped over ssh by start_vms.sh)
set -eo pipefail
echo "debconf debconf/frontend select Noninteractive" | sudo debconf-set-selections
sudo mdadm --create /dev/md0 --name md0 --level=0 --raid-devices=2 /dev/sdb /dev/sdc
sudo mkfs.ext4 -L md0 /dev/md0
RALLY_DIR="$HOME/.rally"
mkdir -p "$RALLY_DIR"
sudo mount /dev/md0 "$RALLY_DIR"
sudo bash -c 'cat >>/etc/fstab' <<EOF
LABEL=md0 ${RALLY_DIR} ext4 defaults,nofail 0 1
EOF
sudo chown -R "$(id -u):$(id -g)" "$RALLY_DIR"
sudo apt-get -y update
sudo apt-get -y install gcc python3-pip python3-dev git tmux
sudo pip3 install esrally==${RALLY_VERSION}
esrally configure
# Add eventdata track in rally.ini [tracks] section
sed -i '/^\[tracks\]$/a\eventdata.url = https://github.com/elastic/rally-eventdata-track' ~/.rally/rally.ini
mkdir -p certificates
cat >certificates/ca.crt <<EOF
-----BEGIN CERTIFICATE-----
MIIDnTCCAoWgAwIBAgIUDG8qaJy82FTuJEiL5a7EnYhITHMwDQYJKoZIhvcNAQEL
BQAwNDEyMDAGA1UEAxMpRWxhc3RpYyBDZXJ0aWZpY2F0ZSBUb29sIEF1dG9nZW5l
cmF0ZWQgQ0EwIBcNMTcwNDIxMDgzNzE2WhgPMjI5MTAyMDMwODM3MTZaMDQxMjAw
BgNVBAMTKUVsYXN0aWMgQ2VydGlmaWNhdGUgVG9vbCBBdXRvZ2VuZXJhdGVkIENB
MIIBIjANBgkqhkiG9w0BAQEFAAOCAQ8AMIIBCgKCAQEAxLYZs3QX/fZ4ikxAZhBe
Q1QayqsHfU8A3P9Q2BoEOtP3goPDu9r9Lpaj1yoq2KEy59VW9jYFgbi8mlrL+tWP
3ChdM+skhHxoFku3cIs2+U5qtzBFa717EAvbQcWvSKQyHIkmckTo0InlLk0QS49Z
dyhU6sw7L5jxZTH5t4Ix8STwi9K3x4D2pfrcW3jI4yAd09d9jpf+szR3MimwyX+Q
qHSWdqbSPNFxrd7wEzAj/hcoouqL7tPKq2aLfwL1qSWMfCFHFp+H/X8PxNVjzyHX
9WI0u6nQekPgocCxU/JO/Q0AhHKhz4m0a9InpIop/c2GssS4FEfWvSbbvfrtz47D
2wIDAQABo4GkMIGhMB0GA1UdDgQWBBQEvbcIhMaWXp2HgipoIWFjUrHj5zBvBgNV
HSMEaDBmgBQEvbcIhMaWXp2HgipoIWFjUrHj56E4pDYwNDEyMDAGA1UEAxMpRWxh
c3RpYyBDZXJ0aWZpY2F0ZSBUb29sIEF1dG9nZW5lcmF0ZWQgQ0GCFAxvKmicvNhU
7iRIi+WuxJ2ISExzMA8GA1UdEwEB/wQFMAMBAf8wDQYJKoZIhvcNAQELBQADggEB
AAYwh6vNb0ata3HPY2iRmJqLu61p5L7wa/XaEE+pBGLoi/0FWhKYFbL0BhlaTp2O
xJn5irAAA7bVK2aNykhCVKbparJpSb+lSMtkPNSzvSODv7uUGwmU8+KU0pTZDki2
LVm6zNbrRROAjJcPllghqfKcGLjFEaX1D87XmQJJfzJymaIeHtT/xQMfM7Roi6TE
hB4gXLZTs1pY1fWhcsS6wvrNdTlGTaipABMUPFtf3K/GQnnT87YOS+Ce+dfGN+SI
hVw2jukDZwZxmL3GtddVkPP4fS/OEkUC8l3/QvgL2NEKMAZV3QISnEAbXNHzrLxQ
5yoUeq1gOEyquAzlasQD1/U=
-----END CERTIFICATE-----
EOF
cat >client-options.json <<EOF
{
  "default": {
    "use_ssl": true,
    "ca_certs": "$PWD/certificates/ca.crt",
    "verify_certs": true,
    "basic_auth_user": "elastic",
    "basic_auth_password": "some-secret-password",
    "timeout": 240,
    "request_timeout": 240
  }
}
EOF
cat >track-params.json <<EOF
{
  "bulk_indexing_clients": 8,
  "bulk_indexing_iterations": 2200000,
  "target_throughput": 16,
  "bulk_size": 1000,
  "number_of_shards": 2,
  "number_of_replicas": 1,
  "index_refresh_interval": -1
}
EOF
cat >target-hosts.json <<EOF
{
  "default": [
    ${TARGET_HOSTS}
  ]
}
EOF
# And finally ... start rally in a tmux session called "Rally"
tmux new-session -d -s "Rally" \
esrally \
--on-error=abort \
--track-repository=eventdata \
--track=eventdata \
--track-revision="$EVENTDATA_TRACK_REVISION" \
--challenge=bulk-update \
--track-params=./track-params.json \
--pipeline=benchmark-only \
--target-hosts=./target-hosts.json \
--client-options=./client-options.json
#!/usr/bin/env bash
# start_vms.sh: creates the three VMs and configures them over ssh
source variables.sh
# Load driver
gcloud compute \
  --project=$GCP_PROJECT \
  instances create ${LOADDRIVER_NAME} \
  --zone=${GCP_ZONE} \
  --machine-type=n1-standard-16 \
  --subnet=default \
  --network-tier=PREMIUM \
  --metadata=node-number=0,ssh-keys="${SSH_KEYS}" \
  --no-restart-on-failure \
  --maintenance-policy=MIGRATE \
  --scopes=https://www.googleapis.com/auth/cloud-platform \
  --min-cpu-platform="Intel Skylake" \
  --tags=es-node \
  --image=${IMAGE_NAME} \
  --image-project=${IMAGE_PROJECT} \
  --boot-disk-type=pd-ssd \
  --boot-disk-device-name=${LOADDRIVER_NAME} \
  --local-ssd=interface=SCSI \
  --local-ssd=interface=SCSI
# ES nodes
for es_node_name in "${ES_NODE_NAMES[@]}"
do
  gcloud compute \
    --project=$GCP_PROJECT \
    instances create ${es_node_name} \
    --zone=${GCP_ZONE} \
    --machine-type=custom-16-32768 \
    --subnet=default \
    --network-tier=PREMIUM \
    --metadata=node-number=0,ssh-keys="${SSH_KEYS}" \
    --no-restart-on-failure \
    --maintenance-policy=MIGRATE \
    --scopes=https://www.googleapis.com/auth/cloud-platform \
    --min-cpu-platform="Intel Skylake" \
    --tags=es-node \
    --image=${IMAGE_NAME} \
    --image-project=${IMAGE_PROJECT} \
    --boot-disk-type=pd-ssd \
    --boot-disk-device-name=${es_node_name} \
    --local-ssd=interface=SCSI \
    --local-ssd=interface=SCSI
done
PRIVATE_IP_LOADDRIVER=$(gcloud compute --project=$GCP_PROJECT instances describe --zone=${GCP_ZONE} ${LOADDRIVER_NAME} --format='get(networkInterfaces[0].networkIP)')
PRIVATE_IP_ES_NODE_0=$(gcloud compute --project=$GCP_PROJECT instances describe --zone=${GCP_ZONE} ${ES_NODE_0_NAME} --format='get(networkInterfaces[0].networkIP)')
PRIVATE_IP_ES_NODE_1=$(gcloud compute --project=$GCP_PROJECT instances describe --zone=${GCP_ZONE} ${ES_NODE_1_NAME} --format='get(networkInterfaces[0].networkIP)')
PUBLIC_IP_LOADDRIVER=$(gcloud compute --project=$GCP_PROJECT instances describe --zone=${GCP_ZONE} ${LOADDRIVER_NAME} --format='get(networkInterfaces[0].accessConfigs[0].natIP)')
PUBLIC_IP_ES_NODE_0=$(gcloud compute --project=$GCP_PROJECT instances describe --zone=${GCP_ZONE} ${ES_NODE_0_NAME} --format='get(networkInterfaces[0].accessConfigs[0].natIP)')
PUBLIC_IP_ES_NODE_1=$(gcloud compute --project=$GCP_PROJECT instances describe --zone=${GCP_ZONE} ${ES_NODE_1_NAME} --format='get(networkInterfaces[0].accessConfigs[0].natIP)')
ES_NODE_PRIVATE_IPS="${PRIVATE_IP_ES_NODE_0},${PRIVATE_IP_ES_NODE_1}"
TARGET_HOSTS="\\\"${PRIVATE_IP_ES_NODE_0}:9200\\\",\\\"${PRIVATE_IP_ES_NODE_1}:9200\\\""
cat configure_elasticsearch.sh | ssh -o "UserKnownHostsFile=/dev/null" -o "StrictHostKeyChecking=no" $PUBLIC_IP_ES_NODE_0 PUBLIC_IP_ES_NODE=$PUBLIC_IP_ES_NODE_0 PRIVATE_IP_ES_NODE=$PRIVATE_IP_ES_NODE_0 ES_NODE_NAME=$ES_NODE_0_NAME ES_NODE_PRIVATE_IPS="$ES_NODE_PRIVATE_IPS" ES_OTHER_NODE_IPS="$PRIVATE_IP_ES_NODE_1" "bash -s"
cat configure_elasticsearch.sh | ssh -o "UserKnownHostsFile=/dev/null" -o "StrictHostKeyChecking=no" $PUBLIC_IP_ES_NODE_1 PUBLIC_IP_ES_NODE=$PUBLIC_IP_ES_NODE_1 PRIVATE_IP_ES_NODE=$PRIVATE_IP_ES_NODE_1 ES_NODE_NAME=$ES_NODE_1_NAME ES_NODE_PRIVATE_IPS="$ES_NODE_PRIVATE_IPS" ES_OTHER_NODE_IPS="$PRIVATE_IP_ES_NODE_0" "bash -s"
cat configure_rally.sh | ssh -o "UserKnownHostsFile=/dev/null" -o "StrictHostKeyChecking=no" $PUBLIC_IP_LOADDRIVER TARGET_HOSTS="$TARGET_HOSTS" EVENTDATA_TRACK_REVISION="$EVENTDATA_TRACK_REVISION" RALLY_VERSION="$RALLY_VERSION" "bash -s"
printf "\n"
printf "\e[32mLoaddriver IP:\e[m $PUBLIC_IP_LOADDRIVER\n"
printf "\e[32mElasticsearch IPs:\e[m $PUBLIC_IP_ES_NODE_0 $PUBLIC_IP_ES_NODE_1\n"
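
The triple-backslash quoting of TARGET_HOSTS in start_vms.sh can look odd. ssh joins its arguments into one string that the remote shell parses a second time, so the quotes need one extra level of escaping to reach target-hosts.json intact. The sketch below illustrates this locally, with bash -c standing in for the remote shell; the IP addresses are made-up examples.

```shell
#!/usr/bin/env bash
# Example private IPs (assumptions for illustration only).
ip0="10.0.0.2"
ip1="10.0.0.3"

# Built exactly as in start_vms.sh: the local value keeps backslash-escaped quotes.
TARGET_HOSTS="\\\"${ip0}:9200\\\",\\\"${ip1}:9200\\\""
printf 'local value:  %s\n' "$TARGET_HOSTS"

# bash -c re-parses the string, as the remote shell does for the ssh command line.
# The backslashes are consumed there and plain JSON string quotes remain.
remote_value="$(bash -c "TARGET_HOSTS=$TARGET_HOSTS printenv TARGET_HOSTS")"
printf 'remote value: %s\n' "$remote_value"
```

The remote value is what configure_rally.sh substitutes into the "default" array of target-hosts.json.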
#!/usr/bin/env bash
# stop_vms.sh: deletes the three VMs and their disks
source variables.sh
gcloud compute \
  --project=$GCP_PROJECT \
  instances delete \
  $LOADDRIVER_NAME \
  "${ES_NODE_NAMES[@]}" \
  --zone $GCP_ZONE \
  --delete-disks=all \
  --quiet
# variables.sh: shared settings sourced by start_vms.sh and stop_vms.sh
GCP_ZONE=${GCP_ZONE:-us-central1-a}
GCP_PROJECT=${GCP_PROJECT:?}
SSH_PUB_KEY=${SSH_PUB_KEY:-~/.ssh/id_rsa.pub}
SSH_USERNAME=${SSH_USERNAME:-$USER}
SSH_KEYS="${SSH_USERNAME}:$(cat "${SSH_PUB_KEY}")"
IMAGE_NAME="ubuntu-1604-xenial-v20190807"
IMAGE_PROJECT="ubuntu-os-cloud"
LOADDRIVER_NAME=${LOADDRIVER_NAME:-${SSH_USERNAME}-corrupt-index-repro-loaddriver}
ES_NODE_0_NAME=${ES_NODE_0_NAME:-${SSH_USERNAME}-corrupt-index-repro-es-node-0}
ES_NODE_1_NAME=${ES_NODE_1_NAME:-${SSH_USERNAME}-corrupt-index-repro-es-node-1}
declare -a ES_NODE_NAMES=("$ES_NODE_0_NAME" "$ES_NODE_1_NAME")
RALLY_VERSION="1.3.0"
EVENTDATA_TRACK_REVISION="b57c97b671010ab9a45b65850f76014464902ab4"
ELASTIC_PASSWORD="some-secret-password"
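
Most settings in variables.sh use the ${VAR:-default} form, so they can be overridden from the environment, while ${GCP_PROJECT:?} makes the scripts stop when the project is missing. The following sketch shows both behaviors in isolation; the function names are hypothetical and europe-west1-b is only an example zone.

```shell
#!/usr/bin/env bash
# Same expansion forms as variables.sh, wrapped in hypothetical functions.

# ${VAR:-default}: use the environment value if set and non-empty, else the default.
zone_or_default() {
  echo "${GCP_ZONE:-us-central1-a}"
}

# ${VAR:?message}: fail when VAR is unset or empty.
# Run in a subshell, because this expansion makes a non-interactive shell exit.
require_project() {
  ( : "${GCP_PROJECT:?GCP_PROJECT must be set}" )
}

# Usage, overriding the zone for a single run:
#   GCP_ZONE=europe-west1-b ./start_vms.sh
```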