Skip to content

Instantly share code, notes, and snippets.

@labrown
Last active September 22, 2023 12:11
Show Gist options
  • Star 50 You must be signed in to star a gist
  • Fork 15 You must be signed in to fork a gist
  • Save labrown/5341ebec47bfba6dd7d4 to your computer and use it in GitHub Desktop.
Save labrown/5341ebec47bfba6dd7d4 to your computer and use it in GitHub Desktop.
Ansible rolling restart of Elasticsearch Cluster
---
###
# Elasticsearch Rolling restart using Ansible
###
##
## Why is this needed?
##
#
# Even if you use a serial setting to limit the number of nodes processed at one
# time, Ansible will restart elasticsearch nodes and continue processing as soon as
# the elasticsearch service restart reports itself complete. This pushes the cluster
# into a red state due to muliple data nodes being restarted at once, and can cause
# performance problems.
#
##
## Solution:
##
#
# Perform a rolling restart of the elasticsearch nodes and wait for the cluster
# to stabilize before continuing processing nodes.
#
##
## NOTES
##
#
# Our elasticsearch role and these handlers assume Ansible is running on one
# of the Elasticsearch cluster nodes (see the use of "localhost:9200" below)
#
# Variable definitions
#
# service - Our cluster uses three types of nodes:
# 'search' - Provides the front end Elasticsearch api to our applications
# 'ossec' - Runs on our ossec masters to for input of ossec-hids logs and alerts
# 'data' - Holds the cluster data
#
# 'search' and 'ossec' nodes are eligible to be masters.
#
# The elasticsearch role is called with the service variable set appropriately.
#
# Each node in our cluster is 'node.name'd {{ansible_hostname}}-{{service}} in its
# elasticsearch.yml configuration file.
#
# The handlers are chained together using notify to perform the following process:
#
# 1. Disable shard allocation on the cluster
# 2. Restart the elasticsearch node process
# 3. Wait for the node to rejoin the cluster
# 4. Enable shard allocation on the cluster
# 5. Wait for the cluster state to become green
#
# This keeps the cluster from entering a red state.
# Restart Elasticsearch node
# This handler is the entrance point for the chained set of handers
- name: restart elasticsearch-{{service}}
debug: msg="Rolling restart of node {{ansible_hostname}}-{{service}} initiated"
changed_when: true
notify:
- cluster routing none {{ansible_hostname}}-{{service}}
# Turn off shard allocation
- name: cluster routing none {{ansible_hostname}}-{{service}}
local_action: "shell curl -XPUT localhost:9200/_cluster/settings -d '{\"transient\" : {\"cluster.routing.allocation.enable\" : \"none\" }}'"
register: result
until: result.stdout.find('"acknowledged"') != -1
retries: 200
delay: 3
changed_when: result.stdout.find('"acknowledged":true') != -1
notify:
- do restart {{ansible_hostname}}-{{service}}
# restart elasticsearch process
- name: do restart {{ansible_hostname}}-{{service}}
service: name=elasticsearch-{{service}} state=restarted
notify:
- wait node {{ansible_hostname}}-{{service}}
# Wait for Elasticsearch node to come back into cluster
- name: wait node {{ansible_hostname}}-{{service}}
local_action: "shell curl -s -m 2 'localhost:9200/_cat/nodes?h=name' | tr -d ' ' | grep -E '^{{ansible_hostname}}-{{service}}$' "
register: result
until: result.rc == 0
retries: 200
delay: 3
notify:
- cluster routing all {{ansible_hostname}}-{{service}}
# Turn on shard allocation
- name: cluster routing all {{ansible_hostname}}-{{service}}
local_action: "shell curl -s -m 2 -XPUT localhost:9200/_cluster/settings -d '{\"transient\" : {\"cluster.routing.allocation.enable\" : \"all\" }}'"
register: result
until: result.stdout.find("acknowledged") != -1
retries: 200
delay: 3
changed_when: result.stdout.find('"acknowledged":true') != -1
notify:
- wait green {{ansible_hostname}}-{{service}}
tags:
- service
# Wait until cluster status is green
- name: wait green {{ansible_hostname}}-{{service}}
local_action: shell curl -s -m 2 localhost:9200/_cat/health | cut -d ' ' -f 4
register: result
until: result.stdout.find("green") != -1
retries: 200
delay: 3
@labrown
Copy link
Author

labrown commented Sep 27, 2021

To be honest, I don't remember why I put this gist up. At the time, I was working heavily with elasticsearch and needed an ansible role that would reliably restart a cluster without downtime. I no longer work at the firm where I did this and don't have access to any of the other source code.

@ginolegigot
Copy link

All right thank you for your answer!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment