ES Playbooks

Some quick snippets for how to keep ES alive

Check ES Status

To check the general status of ES, hit this URL (while on the VPN):

https://egw0/_cluster/health?pretty

You'll get a JSON response like this:

{
    "cluster_name": "brdprod0",
    "status": "green",
    "timed_out": false,
    "number_of_nodes": 10,
    "number_of_data_nodes": 9,
    "active_primary_shards": 373,
    "active_shards": 1118,
    "relocating_shards": 0,
    "initializing_shards": 0,
    "unassigned_shards": 0
}

That's an example of a healthy cluster - notice the "green" status. When something is going wrong, the status will likely be "yellow" or "red", depending on how bad things are.
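The same check works from a terminal. A minimal sketch, assuming curl and jq are installed on a machine on the VPN (the -k flag is only there in case egw0 serves a self-signed cert - drop it if yours doesn't):

    # Print just the overall status: green / yellow / red
    curl -sk https://egw0/_cluster/health | jq -r '.status'

    # Or dump the full health document, same as hitting the URL in a browser
    curl -sk "https://egw0/_cluster/health?pretty"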

Dealing with a node failure

Generally there are two important metrics to pay attention to: number_of_nodes and unassigned_shards. Often the failure mode we see is that one of the nodes drops out of circulation for whatever reason, causing the number_of_nodes metric to fall below 10 (we have 10 nodes, es0-es7 + egw0 + egw1). That will generally also cause the number of unassigned_shards to jump upwards, and the status to flip to yellow or red.
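A quick way to eyeball just those two numbers (same curl/jq assumptions as above):

    # Pull out the status plus the two metrics we care about
    curl -sk https://egw0/_cluster/health | jq '{status, number_of_nodes, unassigned_shards}'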

How to fix:

  1. Figure out which node is broken (a loop that does this for you is sketched after this list)
     a. SSH into each node and run curl localhost:9200/_cluster/health?pretty
     b. If the request takes more than 10 seconds, or generally times out, this is one of the broken nodes
  2. Restart Elasticsearch on that node (stop and start the service, do NOT run restart)
     a. sudo su root
     b. service elasticsearch stop
     c. service elasticsearch start
     d. Tail the log until the node comes back online: tail /var/log/elasticsearch/brdprod0.log -f
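To save some manual SSHing, step 1 can be run as a loop from a box on the VPN. This is a sketch, not gospel: the es0-es7/egw0/egw1 hostnames come from the node list above, the 10-second cutoff mirrors step 1b, and it assumes you have ssh access and curl is present on each node.

    # Hit each node's local health endpoint; anything that times out or errors is a suspect
    for host in es0 es1 es2 es3 es4 es5 es6 es7 egw0 egw1; do
      echo -n "$host: "
      ssh "$host" "curl -s --max-time 10 localhost:9200/_cluster/health?pretty | grep status" \
        || echo "TIMED OUT or unreachable"
    done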

Once the node comes back online the number of unassigned shards should gradually fall. Often it takes some time - on the order of 10 minutes - for nodes to come fully online and assign all shards.
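To watch the recovery rather than refreshing by hand, something like this works (assuming the watch utility is available wherever you're running it):

    # Re-poll the health endpoint every 10 seconds; unassigned_shards should trend toward 0
    watch -n 10 "curl -sk https://egw0/_cluster/health?pretty"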

Dealing with total cluster outage

Sometimes we lose the entire cluster, meaning there are 0 shards showing up in the stats (0 unassigned, 0 active, 0 initializing, etc). In this case it's generally one of the master nodes (egw0 or egw1) that's having an issue.

Generally, we need to restart (stop + start) ES on one or both of the master nodes, and then do the same on all of the data nodes. The logs may have additional information about what's going wrong - maybe we're out of disk space on one or more of the nodes, etc.
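Disk pressure is a common culprit, so it's worth a quick sweep before digging further. A rough check, with the same hostname and ssh-access assumptions as above (the ES data directory is wherever path.data points, so df -h on the whole box is the lazy version):

    # Show disk usage on every node; full data disks will take ES down hard
    for host in es0 es1 es2 es3 es4 es5 es6 es7 egw0 egw1; do
      echo "== $host =="
      ssh "$host" "df -h"
    done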

If all else fails, it may be worth trying to stop ES on ALL nodes, and then starting things back up in sequence - master nodes first, then data nodes.
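A full restart-in-sequence might look roughly like the sketch below. Treat it as a starting point rather than a paste-and-run script: it assumes the same hostnames and ssh + sudo access as above, and it's worth tailing the master logs between steps instead of racing through.

    # 1. Stop ES everywhere
    for host in egw0 egw1 es0 es1 es2 es3 es4 es5 es6 es7; do
      ssh "$host" "sudo service elasticsearch stop"
    done

    # 2. Start the master nodes first and let them settle
    for host in egw0 egw1; do
      ssh "$host" "sudo service elasticsearch start"
    done

    # 3. Then bring the data nodes back
    for host in es0 es1 es2 es3 es4 es5 es6 es7; do
      ssh "$host" "sudo service elasticsearch start"
    done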
