Chef cluster recovery notes

Intro

Chef High Availability: Backend Cluster, and a list of its not-so-common problems.

Notes:

  • In my case, the Chef HA setup runs entirely on AWS, but the steps translate to other vendors too
  • chef-backend-ctl commands are for backend nodes
  • chef-server-ctl commands are for frontend nodes

Index

  • Nodes hostname
  • Restoring Chef cluster after total node failure
  • Chef Elasticsearch red status

Nodes hostname

Chef is known to be delicate about hostname configuration, so I put together this list of actions you can take to sort out hostname issues.

This should be done for all your Chef nodes (a consolidated sketch follows the list).

  • Manual check to see if the hostname matches its FQDN
hostname; hostname -A
  • Query the AWS API for the instance's hostname and set the hostname to the returned value
AWS_HOSTNAME=$(curl -s http://169.254.169.254/latest/meta-data/hostname)

echo ${AWS_HOSTNAME}

hostname ${AWS_HOSTNAME}
  • Adjust /etc/hosts if incomplete
EXT_IP=$(ifconfig eth0 | grep "inet addr" | awk '{print $2}' | awk -F: '{print $2}')

grep -qE "${EXT_IP}\s+${AWS_HOSTNAME}" /etc/hosts &&\
echo "hostname already in there" || echo "${EXT_IP} ${AWS_HOSTNAME}" >> /etc/hosts
  • Check hostnames again
hostname; hostname -A
  • Reconfigure Chef with chef-backend-ctl reconfigure (Backend) or chef-server-ctl reconfigure (Frontend)
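For convenience, here is a minimal sketch that combines the steps above into one idempotent script. It assumes the node can reach the AWS instance metadata endpoint and uses the metadata local-ipv4 value instead of parsing ifconfig; adjust to your environment.

#!/bin/bash
# Sketch: align the node's hostname with the value AWS reports for the instance
set -euo pipefail

AWS_HOSTNAME=$(curl -s http://169.254.169.254/latest/meta-data/hostname)
EXT_IP=$(curl -s http://169.254.169.254/latest/meta-data/local-ipv4)

# Set the hostname only if it differs from what AWS reports
[ "$(hostname)" = "${AWS_HOSTNAME}" ] || hostname "${AWS_HOSTNAME}"

# Append the IP/hostname pair to /etc/hosts only if it is not already there
grep -qE "${EXT_IP}\s+${AWS_HOSTNAME}" /etc/hosts || echo "${EXT_IP} ${AWS_HOSTNAME}" >> /etc/hosts

# Verify
hostname; hostname -A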

Restoring Chef cluster after total node failure

In the scenario where you had a total follower node failure (all follower nodes crashing) causing a loss of quorum, but you still have the leader "operational", you can recover by following this process.

Note: My follower nodes are part of an AWS ASG. To get the cluster operational again I have to set the ASG size to 1, and only once the first follower has rejoined do I scale it back up to the desired number of nodes; this gives a smooth join-cluster procedure and avoids a race condition (see the AWS CLI sketch below).
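If you drive the ASG from the AWS CLI, the resize looks roughly like this. The group name chef-backend-followers is a placeholder for illustration; use your own ASG name and desired size.

# Shrink the follower ASG to a single node while recovering quorum (placeholder group name)
aws autoscaling update-auto-scaling-group --auto-scaling-group-name chef-backend-followers --min-size 1 --desired-capacity 1

# Once the first follower has rejoined the cluster, scale back out
aws autoscaling update-auto-scaling-group --auto-scaling-group-name chef-backend-followers --min-size 2 --desired-capacity 2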

Leader

rm /var/opt/chef-backend/leaderl/data/no-start-pgsql
chef-backend-ctl create-cluster --quorum-loss-recovery

Follower1

chef-backend-ctl join-cluster; chef-backend-ctl reconfigure

Leader

chef-backend-ctl reconfigure

FollowerN

chef-backend-ctl join-cluster ...
chef-backend-ctl reconfigure
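Once all followers have rejoined, it's worth confirming that the cluster is healthy before moving on; the same commands used later in this document work here:

# Run from any backend node: all nodes should be listed and services should report healthy
chef-backend-ctl cluster-status
chef-backend-ctl status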

Then you'll most certainly need to run through the next section of this document (fixing Elasticsearch).

Chef Elasticsearch red status

Symptoms

A broken Chef ES index can cause all sorts of funky things when you try to query data via knife or the UI. Some common symptoms are:

  • Searching for windows in the UI returns no or mixed results
  • knife node search 'platform:redhat' returns all results
  • curl localhost:9200/chef/_search -d '{"query":{"query_string":{"lowercase_expanded_terms":false,"query":"content:platform__=__redhat"}}}' on any of your Chef servers will return all results too

Additionally, checking the Elasticsearch status will return red:

chef-backend-ctl status elasticsearch
          Role:  Leader
  Local Status:  running (pid 32744)
       Logging:  running (pid 1892)
       Time up:  0d 0h 1m 17s
Cluster Status:  red
 Active Shards:  70.0%
                 ** Nodes **

Solution

Seems that the way to fix this issue is to wipe out the index and rebuild it from scratch, but a word of warning: YOU WILL NEED TO BLOCK ALL ACCESS to Chef for a short while, which means downtime and inability to use the service during that time.

This is the only way to ensure a full and healthy rebuild of the index!

The access point is usually the Frontend node, and you can use one of the following methods to block traffic to it:

  • Block incoming traffic via iptables on the machine (see the sketch after this list)
  • Use a Firewall/Security Group vendor service to block incoming traffic
  • If you're using a Load Balancer, unassign the nodes from it

Do not stop the Frontend node because you'll need it to rebuild the index.
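As a sketch of the iptables option: the Chef API and UI are normally served over HTTPS (and optionally HTTP) on the Frontend node, so dropping external traffic to those ports while leaving localhost alone is usually enough. This assumes default ports; adjust to your setup.

# Block external access to the Chef Frontend while the index is rebuilt
iptables -I INPUT -p tcp --dport 443 ! -s 127.0.0.1 -j DROP
iptables -I INPUT -p tcp --dport 80 ! -s 127.0.0.1 -j DROP

# Remove the rules again once the reindex has finished
iptables -D INPUT -p tcp --dport 443 ! -s 127.0.0.1 -j DROP
iptables -D INPUT -p tcp --dport 80 ! -s 127.0.0.1 -j DROP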

In a nutshell, we will be doing the following:

  • Block traffic on the Frontend node
  • Delete the index from one of the Backends (ideally the leader)
  • Reconfigure services and reindex all data on the Frontend node
  • Resume traffic on the Frontend node

Implementation

Checking Cluster status

Check that the cluster is still operative even if ES has a red status; if not, have a look at Restoring Chef cluster after total node failure above.

$ chef-backend-ctl cluster-status
Name               IP              GUID                              Role      PG        ES
ip-10-10-181-87   10.10.181.87   c5bbb54df8f74213cac49b605404583e  follower  follower  not_master
ip-10-10-183-242  10.10.183.242  92bcc24ea62b8c2a492205ead2770eeb  leader    leader    not_master
ip-10-10-183-76   10.10.183.76   701581deb012cbcdbcca1a1c2e7f8edd  follower  follower  master
$ chef-backend-ctl status
Service        Local Status         Time in State  Distributed Node Status
leaderl        running (pid 9889)   0d 1h 37m 19s  leader: 1; waiting: 0; follower: 2; total: 3
etcd           running (pid 9685)   0d 1h 37m 52s  health: green; healthy nodes: 3/3
postgresql     running (pid 9974)   0d 1h 37m 16s  leader: 1; offline: 0; syncing: 0; synced: 2
elasticsearch  running (pid 10184)  0d 0h 22m 13s  state: red; nodes online: 3/3

List the indices on Elasticsearch; the chef index should come up red

$ curl 'http://localhost:9200/_cat/indices?v'
health status index pri rep docs.count docs.deleted store.size pri.store.size
red    open   chef  5   1   575        5            48.9mb     24.4mb

Additionally, you can query Elasticsearch for other relevant information with the following commands:

  • Chef cluster health
    • For a detailed output: curl 'localhost:9200/_cluster/health/chef?pretty&level=shards'
    • For a simplified output: curl 'localhost:9200/_cat/health?v'
  • Cluster State
    • curl 'http://localhost:9200/_cluster/state?pretty'
  • Shard states
    • List all shards: curl 'localhost:9200/_cat/shards?v'
    • Filter for unassigned shards: curl 'localhost:9200/_cat/shards?h=index,shard,prirep,state,unassigned.reason' | grep UNASSIGNED

Deleting and restoring the index

Remember that traffic should be blocked on the Frontend node

On a Backend node (ideally the leader)

  • Delete the chef index
curl -XDELETE 'http://localhost:9200/chef'

{"acknowledged":true}

On a Frontend node

  • Reconfigure the Frontend node
chef-server-ctl reconfigure
  • Reindex all the data
chef-server-ctl reindex -a
  • Run a final check to see the size of the index
curl 'http://localhost:9200/_cat/indices?v'

health status index pri rep docs.count docs.deleted store.size pri.store.size
green  open   chef    5   1       1574           58    547.3mb        273.9mb
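If you want to wait for shard allocation to finish before unblocking traffic, a simple polling loop against the health endpoint does the job; this is just a convenience sketch.

# Poll Elasticsearch until the cluster health reports green
until curl -s 'http://localhost:9200/_cat/health' | grep -q green; do
  echo "waiting for the chef index to go green..."
  sleep 10
done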