Elasticsearch exposes many metrics that can be used to determine whether a cluster is healthy. Listed below are the metrics that are currently worth monitoring, the reason(s) why they should be monitored, and possible recourse when issues arise.

Version

Unless otherwise noted, all of the API requests work starting with Elasticsearch 1.0.0. If a newer version is required for a given metric, it is noted next to the metric's name.

Metrics

Metrics are an easy way to monitor the health of a cluster, and they can be accessed readily from the HTTP API. Each metrics table below is broken down by its source.

Each metric has an associated warning level and error level. These levels indicate when the associated metric should simply be watched closely versus acted upon as soon as possible.

Further details can be found in the Elasticsearch guide's section on monitoring individual nodes, which is kept relevant to the latest version of Elasticsearch.

Human Readable Metrics /_cat

With the release of 1.0.0 came the introduction of the _cat APIs. These APIs make metrics easier for humans to read by presenting them in a tabular format. At any time, you can quickly find all available _cat APIs by making a request to the top-level _cat endpoint:

# Find all _cat APIs available to you
# (this was run on Elasticsearch 1.2.2):
curl -XGET localhost:9200/_cat
=^.^=
/_cat/allocation
/_cat/shards
/_cat/shards/{index}
/_cat/master
/_cat/nodes
/_cat/indices
/_cat/indices/{index}
/_cat/segments
/_cat/segments/{index}
/_cat/count
/_cat/count/{index}
/_cat/recovery
/_cat/recovery/{index}
/_cat/health
/_cat/pending_tasks
/_cat/aliases
/_cat/aliases/{alias}
/_cat/thread_pool
/_cat/plugins
/_cat/fielddata
/_cat/fielddata/{fields}

As you read through the rest of this document, feel free to cross-reference those requests with the _cat endpoints listed above. For example:

# Ensure that both data nodes see the same master
curl localhost:9200/_cat/master?v
id                     host        ip           node    
8Y_a2JvMRnGGX3ypdL85Dw my-hostname 192.168.1.31 Amergin

# Check other data nodes
curl localhost:9201/_cat/master?v
id                     host        ip           node    
8Y_a2JvMRnGGX3ypdL85Dw my-hostname 192.168.1.31 Amergin

# Check all nodes in the cluster
curl localhost:9200/_cat/nodes?v
host        ip           heap.percent ram.percent load node.role master name     
my-hostname 192.168.1.31           12          75 3.84 d         *      Amergin  
my-hostname 192.168.1.31           18          75 3.84 d         m      Deathlok 

The _cat/master API tells you, on a per-node basis, which node it sees as its master (notice the port changes between the requests). This is generally much easier to check, as a human, than the equivalent /_cluster/state/master_node request. Details on the various _cat APIs can be found in the official Elasticsearch documentation.
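
As a rough illustration only, the per-node check above can be scripted. The sketch below assumes curl is available, that the two example nodes listen on ports 9200 and 9201 as shown earlier, and that your version supports the _cat h parameter for selecting columns:

# A minimal sketch, assuming curl and the two example nodes above.
# Each node reports the master id it currently sees; the output should
# contain exactly one unique id.
for port in 9200 9201; do
  curl -s "localhost:$port/_cat/master?h=id"
done | sort -u

If more than one id appears, the nodes disagree about the master; see the split brain discussion in the Cluster Details section below.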

Node Info /_nodes/

Node information represents details about individual nodes in a cluster that are critical to stability.

All API calls in this table must be prefixed by /_nodes/{api-call}. Each JSON Path represents the relevant field starting at the returned node sub-object (nodes.{node}.{json-path}) from the specified API Call.

These metrics are currently only concerned with versions. Except during upgrades (when a mismatch is temporarily expected), you should always run the same version on each node.

# This is how all of these commands should look:
curl -XGET localhost:9200/_nodes/{api-call}?human\&pretty

For example:

# Java Version
curl -XGET localhost:9200/_nodes/jvm?human\&pretty
{
  "cluster_name" : "elasticsearch-cluster-name",
  "nodes" : {
    "8Y_a2JvMRnGGX3ypdL85Dw" : {
      "name" : "Amergin",
      "transport_address" : "inet[my-hostname/192.168.1.31:9300]",
      "host" : "my-hostname",
      "ip" : "192.168.1.31",
      "version" : "1.2.2",
      "build" : "9902f08",
      "http_address" : "inet[/192.168.1.31:9200]",
      "jvm" : {
        "pid" : 16669,
        "version" : "1.8.0_20",
        "vm_name" : "Java HotSpot(TM) 64-Bit Server VM",
        "vm_version" : "25.20-b23",
        "vm_vendor" : "Oracle Corporation",
        "start_time" : "2014-11-18T20:24:00.063Z",
        "start_time_in_millis" : 1416342240063,
        "mem" : {
          "heap_init" : "256mb",
          "heap_init_in_bytes" : 268435456,
          "heap_max" : "990.7mb",
          "heap_max_in_bytes" : 1038876672,
          "non_heap_init" : "2.4mb",
          "non_heap_init_in_bytes" : 2555904,
          "non_heap_max" : "0b",
          "non_heap_max_in_bytes" : 0,
          "direct_max" : "990.7mb",
          "direct_max_in_bytes" : 1038876672
        },
        "gc_collectors" : [ "ParNew", "ConcurrentMarkSweep" ],
        "memory_pools" : [ "Code Cache", "Metaspace", "Compressed Class Space", "Par Eden Space", "Par Survivor Space", "CMS Old Gen" ]
      }
    }
  }
}

To access the Java Version, nodes.8Y_a2JvMRnGGX3ypdL85Dw.jvm.version would be the expansion for nodes.{node}.{json-path}.

Each entry below lists the metric, its API call and JSON path, an explanation, and a solution.

Java Version (API Call: jvm; JSON Path: jvm.version)
Explanation: Elasticsearch is written in Java. In rare cases, intercommunication between nodes may rely on features that change between Java releases, which can cause issues when the Java version does not match on both sides of the communication. As such, not maintaining the same version can lead to unexpected failures at this level.
Solution: Keep the same version of Java installed on all servers running Elasticsearch.

Elasticsearch Version (JSON Path: version)
Explanation: Elasticsearch is written with backward compatibility in mind, but sometimes non-backward-compatible changes are made to support desired features and improve performance.
Solution: Keep the same version of Elasticsearch installed on all nodes.
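
As a rough illustration of checking version consistency, the sketch below assumes curl and jq are installed and that the cluster listens on localhost:9200; it prints the Elasticsearch and Java version reported by every node so a mismatch is easy to spot:

# A minimal sketch, assuming curl and jq are installed and the cluster
# listens on localhost:9200. Every line should show the same pair of
# Elasticsearch and Java versions.
curl -s localhost:9200/_nodes/jvm | \
  jq -r '.nodes[] | "\(.name)\t\(.version)\tjava \(.jvm.version)"' | sort

If more than one version appears in the output, schedule upgrades so that the cluster converges on a single version.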

Stats /_nodes/stats/

Node statistics expose runtime details about each node, such as memory usage, file descriptors, and thread pools. In the case of memory and file operations, distributing the workload across more nodes is also a potential solution to capacity constraints.

All API calls in this table must be prefixed by /_nodes/stats/{api-call}. Each JSON Path represents the relevant field starting at the returned node sub-object (nodes.{node}.{json-path}) from the specified API Call.

# This is how all of these commands should look:
curl -XGET localhost:9200/_nodes/stats/{api-call}?human\&pretty

For example:

# File Descriptors
curl -XGET localhost:9200/_nodes/stats/process?human\&pretty
{
  "cluster_name" : "elasticsearch-cluster-name",
  "nodes" : {
    "8Y_a2JvMRnGGX3ypdL85Dw" : {
      "timestamp" : 1416458421494,
      "name" : "Amergin",
      "transport_address" : "inet[my-hostname/192.168.1.31:9300]",
      "host" : "my-hostname",
      "ip" : [ "inet[my-hostname/192.168.1.31:9300]", "NONE" ],
      "process" : {
        "timestamp" : 1416458421494,
        "open_file_descriptors" : 435,
        "cpu" : {
          "percent" : 1,
          "sys" : "7m",
          "sys_in_millis" : 420451,
          "user" : "32.1m",
          "user_in_millis" : 1927525,
          "total" : "39.1m",
          "total_in_millis" : 2347976
        },
        "mem" : {
          "resident" : "344.9mb",
          "resident_in_bytes" : 361709568,
          "share" : "-1b",
          "share_in_bytes" : -1,
          "total_virtual" : "4.8gb",
          "total_virtual_in_bytes" : 5162328064
        }
      }
    }
  }
}

To access the File Descriptors, nodes.8Y_a2JvMRnGGX3ypdL85Dw.process.open_file_descriptors would be the expansion for nodes.{node}.{json-path}.

Each entry below lists the metric, its API call and JSON path, warning and error levels, an explanation, and a solution.

Java GC CMS (API Call: jvm; JSON Path: jvm.gc.old.collection_count)
Warning: Increased by greater than X per minute. Error: Increased by greater than X per minute.
Explanation: The number of Concurrent Mark Sweep collections that run per minute should stay roughly the same on a healthy cluster, which should be used to determine X. Bursts in server load should be reflected by bursts in collections, but they should stabilize with the load. On a cluster that constantly needs to do more and more collections, the risk is that more and more time is spent doing garbage collection rather than processing. In worst-case scenarios, this can lead to slow responses and out-of-memory issues that eventually lead to failed nodes.
Solution: Increase the maximum heap setting for Elasticsearch. In some cases, this may require actually adding more memory to the server.

Available Disk Space (API Call: fs/data; JSON Path: fs.data.available_in_bytes)
Warning: 20% of total disk space left. Error: 10% of total disk space left.
Explanation: Running out of disk space means that nothing can be inserted or updated. As a result, the node will fail.
Solution: Add more disk space.

File Descriptors (API Call: process; JSON Path: process.open_file_descriptors)
Warning: 70% of the maximum number of file descriptors. Error: 90% of the maximum number of file descriptors.
Explanation: File descriptors are used for connections and file operations. As Elasticsearch grows and scales, this number will increase, particularly when it is under heavy load. If this number reaches the maximum, then new connections and file operations cannot occur until old ones have closed, which will cause intermittent node failures.

The current maximum can be read by calling /_nodes/process and reading nodes.{node}.process.max_file_descriptors; the maximum will not change after the node has started.

Solution: Increase the system's maximum file descriptor count, which is OS specific (see ulimit for many Linux distributions). A monitoring sketch that compares open descriptors against the maximum appears after this table.

Java Heap Size (API Call: jvm; JSON Path: jvm.mem.heap_used_percent)
Warning: 80% of total heap for 10 minutes. Error: 90% of total heap for 10 minutes.
Explanation: The Java Virtual Machine (JVM) heap is the main memory used by the Java processes, which includes Elasticsearch. Like any other process, if it runs out of memory, then it will crash and lead to node failures.
Solution: Increase the maximum heap setting for Elasticsearch. In some cases, this may require actually adding more memory to the server.

HTTP Connections (API Call: http; JSON Path: http.total_opened)
Warning: Increases by greater than 50 per minute. Error: Increases by greater than 100 per minute.
Explanation: The number of HTTP connections can be indicative of server demand by request, but not by content. Some requests are much easier to fulfill than others, but having too many requests, simple or complex, can cause a node issues.
Solution: Better distributing the workload (adding more nodes) is the easiest way to reduce the number of connections to a particular node. In some cases, it may be possible to tune client software to send fewer requests per minute if the software is written to poll Elasticsearch on an interval. Naturally, avoiding the need to make requests in the first place, such as by eliminating unnecessarily duplicated requests, is the easiest way to reduce the number that come in.

Many clients, such as the PHP and JavaScript clients, create a new connection for every request. In those cases it cannot be avoided, but in other cases it can. For instance, the .NET client defaults to HTTP pipelining, which allows a single HTTP connection to be reused. Using appropriate keep-alive times, persistent connections, and pipelining from the client can greatly reduce the number of connections, which in turn reduces network overhead.

For clients that do not allow those features to be controlled, it can help to set up a proxy in front of Elasticsearch (e.g., using Nginx) so that the proxy can use those features, which allows you to continue using "harmful" clients while still reducing the overall number of connections.

Thread Pool Rejections (API Call: thread_pool; JSON Path: thread_pool.POOL.rejected)
Warning: Increased by greater than X per minute. Error: Increased by greater than X per minute.
Explanation: The number of rejected threads per minute should stay roughly the same on a healthy cluster, which should be used to determine X. A rejected thread means the requested action did not occur at all (e.g., nodes.NODE.thread_pool.get.rejected indicates failed get requests) because the associated thread pool was full; it does not mean that the action started and then failed later.
Solution: Adding more threads to the problematic thread pool can lower the number of rejections, but, in general, this should only be done when directed by support. Providing access to more processing power or processors can ease thread pool congestion by allowing threads to finish more quickly or more in parallel.

Thread Pool Queue (API Call: thread_pool; JSON Path: thread_pool.POOL.queue)
Warning: Increased by greater than X per minute. Error: Increased by greater than X per minute.
Explanation: The number of queued threads per minute should stay roughly the same on a healthy cluster, which should be used to determine X. Optimally, this value should be 0, but peak periods may reasonably see threads queued. A queued thread means the requested action has not occurred yet (e.g., nodes.NODE.thread_pool.get.queue indicates delayed get requests) because the associated thread pool was full; it is waiting to be processed and has not yet been rejected.
Solution: Adding more threads to the problematic thread pool can lower the number of queued threads, but, in general, this should only be done when directed by support. Providing access to more processing power or processors can ease thread pool congestion by allowing threads to finish more quickly or more in parallel.

Load Average (API Call: os; JSON Path: os.load_average.AVERAGE)
Warning: Relative to nodes.NODE.os.available_processors from /_nodes/os. Error: Relative to nodes.NODE.os.available_processors from /_nodes/os.
Explanation: The average processor load on the node. As the load approaches complete utilization of every processor, other server processes are starved for CPU time, and some threads within the Elasticsearch process will likely execute more slowly.

The load_average value is an array: the first element is the average load for the past minute, the second is the average load for the past 5 minutes, and the third (and last) is the average load for the past 15 minutes.

Solution: Add more processors to the server or virtual machine. Providing access to more processing power or processors can ease thread pool congestion by allowing threads to finish more quickly or more in parallel.

Filter Cache Size (API Call: indices/filter_cache; JSON Path: indices.filter_cache.memory_size_in_bytes)
ID Cache Size (API Call: indices/id_cache; JSON Path: indices.id_cache.memory_size_in_bytes)
Field Data Size (API Call: indices/fielddata; JSON Path: indices.fielddata.memory_size_in_bytes)
Percolate Size (API Call: indices/percolate; JSON Path: indices.percolate.memory_size_in_bytes)
Query Cache Size (>= 1.4.0) (API Call: indices/query_cache; JSON Path: indices.query_cache.memory_size_in_bytes)

Warning: Total sum of cache sizes is greater than 60% of heap size. Error: Total sum of cache sizes is greater than 70% of heap size.
Explanation: Elasticsearch uses caches to speed up frequently performed actions. If the caches take up too much memory, then it is possible to get into situations where the rest of the Elasticsearch process is waiting for memory to become available, which may cause actions to run more slowly.

Note: The ID Cache is the in-memory join table maintaining parent/child relationships. There is currently no setting to control the amount of memory used to maintain this relationship, and not much can be done to affect its footprint. Because it resides on the heap, it is still a good idea to monitor its usage. Starting in 1.1.0, the ID Cache is actually stored in the Field Data Cache, but both metrics are still reported separately.

Solution: If the warning or error levels are reached, then lowering the size of the worst-offending or least-used caches can help to speed up Elasticsearch. Each cache can be independently cleared, or you may choose to clear them all at the same time using the clear cache API. If levels are set to allow the behavior causing concern, then clearing the cache may just be delaying the problem. Updating the specific settings for each cache can control situations that consistently cause issues. For example, setting indices.cache.filter.size at startup can limit the memory used by the filter cache.
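
As referenced in the File Descriptors entry above, here is a rough sketch of checking the file descriptor and heap metrics against their warning levels. It assumes curl and jq are installed and that the cluster listens on localhost:9200; the percentages printed are for comparison against the warning levels in the table, not anything enforced by Elasticsearch:

# A minimal sketch, assuming curl and jq are installed and the cluster
# listens on localhost:9200. It reports, per node, open file descriptors
# as a percentage of the maximum, and the heap usage percentage.
# For simplicity it uses the smallest per-node maximum as the divisor.
max_fds=$(curl -s localhost:9200/_nodes/process | \
  jq -r '[.nodes[].process.max_file_descriptors] | min')
curl -s localhost:9200/_nodes/stats/process | \
  jq -r --argjson max "$max_fds" \
    '.nodes[] | "\(.name): \((.process.open_file_descriptors * 100 / $max) | floor)% of file descriptors in use"'
curl -s localhost:9200/_nodes/stats/jvm | \
  jq -r '.nodes[] | "\(.name): \(.jvm.mem.heap_used_percent)% of heap in use"'

Values near the 70% (file descriptors) or 80% (heap) warning levels from the table above are a cue to investigate before the error levels are reached.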

Cluster Details /_cluster

The overall health of the cluster is an important aspect of Elasticsearch deployments that have more than a single shard. Understanding these core metrics can lead to a more stable deployment.

All API calls in this table must be prefixed by /_cluster/{api-call}. Each JSON Path represents the relevant field starting at the returned object's root (just {json-path}) from the specified API Call.

# This is how all of these commands should look:
curl -XGET localhost:9200/_cluster/{api-call}?pretty

For example:

# Status
curl -XGET localhost:9200/_cluster/health?pretty
{
  "cluster_name" : "elasticsearch-cluster-name",
  "status" : "yellow",
  "timed_out" : false,
  "number_of_nodes" : 1,
  "number_of_data_nodes" : 1,
  "active_primary_shards" : 26,
  "active_shards" : 26,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 26
}

To access the Status, status would be the expansion for {json-path}.
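
As a rough illustration, the status field lends itself to a simple scripted check. The sketch below assumes curl and jq are installed and that the cluster listens on localhost:9200; the exit codes are an arbitrary convention chosen for this sketch, not something defined by Elasticsearch:

# A minimal sketch, assuming curl and jq are installed and the cluster
# listens on localhost:9200. Exits 0 for green, 1 for yellow, and 2 for
# red or an unreachable cluster, so it can feed a cron job or alerting hook.
status=$(curl -s localhost:9200/_cluster/health | jq -r '.status')
case "$status" in
  green)  echo "cluster status: green";  exit 0 ;;
  yellow) echo "cluster status: yellow"; exit 1 ;;
  *)      echo "cluster status: ${status:-unknown}"; exit 2 ;;
esac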

Each entry below lists the metric, its API call and JSON path, warning and error levels, an explanation, and a solution.

Status (API Call: health; JSON Path: status)
Warning: The status is "yellow" for more than 5 minutes. Error: The status is "red".
Explanation: "green" is the desired status, which indicates a healthy cluster that has properly allocated all primary and replica shards. A "yellow" status indicates that at least one replica is missing, but all data is still searchable. The worst is the "red" status, which indicates that at least one primary shard, as well as its replicas, is missing; this means that searches will return partial results and indexing into the missing shard(s) will cause an exception.
Solution: Investigate the cause of any node, shard, or replica issues by checking the logs and monitoring other metrics. Once recovered, it is important to ensure that the original problem does not repeat itself. For example, if a node failed because its disk was full, then restarting it will not prevent the issue from immediately recurring.

Data Nodes (API Call: health; JSON Path: number_of_data_nodes)
Warning/Error: The value is less than expected.
Explanation: Elasticsearch clusters are only healthy if all data nodes are available. If data nodes are missing, then queries may return only partial results, and indexing of data that would otherwise go to the missing nodes may fail.

Master Node (API Call: state/master_node; JSON Path: master_node)
Warning/Error: The value is different on any node.
Explanation: If any node disagrees about the master_node, then problems can quickly occur because the cluster is not in a safe state. A disagreement about the master_node is called a "split brain".

In a split brain situation, there is effectively more than one cluster: multiple nodes are still running, but some cannot talk to the others. Each side thinks that it is _the_ cluster and behaves as though the other nodes are simply missing from it. The best way to detect a split brain is by checking /_cluster/state/master_node?local against _every_ node; each node will report the master node that it currently recognizes, and if any nodes disagree, then that represents a split brain. (Marvel does this for you automatically!) A sketch of such a check follows this table.

Under the worst circumstances, outside connections can reach _each_ side while assuming they are working together in the background. In that scenario, writes can land on different sides of the split, leaving the data inconsistent.

Solution: Stop the problematic nodes as soon as possible to avoid further issues. Before restarting any nodes, ensure that network connectivity between all nodes is working properly so that intercommunication can happen.
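
As referenced above, here is a rough sketch of a split brain check. It assumes curl and jq are installed and that the HTTP addresses of every node are known to the monitoring host; the node addresses below are placeholders:

# A minimal sketch, assuming curl and jq are installed. NODES lists the
# HTTP address of every node in the cluster (placeholder values shown).
# Each node is asked locally which node it believes is the master; more
# than one distinct answer indicates a split brain.
NODES="node1:9200 node2:9200 node3:9200"
masters=$(for node in $NODES; do
  curl -s "$node/_cluster/state/master_node?local=true" | jq -r '.master_node'
done | sort -u)
if [ "$(echo "$masters" | wc -l)" -gt 1 ]; then
  echo "WARNING: nodes disagree about the master: $masters"
else
  echo "All nodes agree on master: $masters"
fi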

Index Details /_stats?level=cluster

The overall health of an index is an important aspect of all searches, indexes (writes), and retrievals. If an index is having problems, then all users of the index will be directly affected.

All API results in this table are the result of /_stats?level=cluster (not specifying the level shows all indices by default, which is unnecessary). Each JSON Path represents the relevant field starting at the returned node sub-object (_all.total.{json-path}) from the API Call.

To determine the relative warning/error state, you must compare values between lookups and watch for relatively large variance (a delta-check sketch follows the table below).

# This is how all of these commands should look:
curl -XGET localhost:9200/_stats?level=cluster\&human\&pretty

For example:

# Total Search Requests
curl -XGET localhost:9200/_stats?level=cluster\&human\&pretty
{
  "_shards" : {
    "total" : 52,
    "successful" : 26,
    "failed" : 0
  },
  "_all" : {

    ... removed for brevity ...

    "total" : {
      "docs" : {
        "count" : 495599,
        "deleted" : 0
      },
      "store" : {
        "size" : "362.4mb",
        "size_in_bytes" : 380072547,
        "throttle_time" : "2.1m",
        "throttle_time_in_millis" : 128652
      },

      ... removed for brevity ...

      "search" : {
        "open_contexts" : 0,
        "query_total" : 8664,
        "query_time" : "16.4s",
        "query_time_in_millis" : 16458,
        "query_current" : 0,
        "fetch_total" : 8664,
        "fetch_time" : "2.4s",
        "fetch_time_in_millis" : 2484,
        "fetch_current" : 0
      },

      ... removed for brevity ...

    }
  }
}

To access the Total Search Requests, _all.total.search.query_total would be the expansion for _all.total.{json-path}.

Each entry below lists the metric, its JSON path, an explanation, and a solution.

Total Search Requests (JSON Path: search.query_total)
Explanation: The total number of queries (searches).
Solution: Determine the cause of any sudden surge or drop in queries. It is possible that a connected application has lost all network connectivity to the cluster. Surges in queries could be innocent, could be unintentionally looped queries in application code, or could be a sign of possible abuse through connected applications.

Total Search Request Time (JSON Path: search.query_time_in_millis)
Explanation: The total time spent on queries (searches) in milliseconds.
Solution: In addition to looking at the number of requests, the complexity of requests is important. The search slow log should be checked for unexpected increases in the total time. The slow log must be manually enabled.

Total Index Requests (JSON Path: indexing.index_total)
Explanation: The total number of index operations (writes).
Solution: Determine the cause of any sudden surge or drop in indexing. It is possible that a connected application has lost all network connectivity to the cluster. Surges in indexing could be innocent, could be unintentionally looped operations in application code, or could be a sign of possible abuse through connected applications.

Total Index Request Time (JSON Path: indexing.index_time_in_millis)
Explanation: The total time spent on indexing (writing) in milliseconds.
Solution: In addition to looking at the number of requests, the complexity of requests is important. The index slow log should be checked for unexpected increases in the total time. The slow log must be manually enabled.

Successful Get Requests (JSON Path: get.exists_total)
Explanation: The total number of get requests that found a document.
Solution: Determine the cause of any sudden surge or drop in get requests. It is possible that a connected application has lost all network connectivity to the cluster. Surges in get requests could be innocent, could be unintentionally looped operations in application code, or could be a sign of possible abuse through connected applications (e.g., constantly reloading a webpage that internally performs a get request for one or more documents).

Successful Get Request Time (JSON Path: get.exists_time_in_millis)
Explanation: The total time spent on successful get requests in milliseconds.
Solution: Get requests are performed in real time and should not show dramatic changes. Sudden changes should be investigated with regard to the overall health of shards, indices, and the cluster.

Missed Get Requests (JSON Path: get.missing_total)
Explanation: The total number of get requests that did not find a document.
Solution: Determine the cause of any sudden surge or drop in get requests. It is possible that a connected application has lost all network connectivity to the cluster. Surges in get requests could be innocent, could be unintentionally looped operations in application code, or could be a sign of possible abuse through connected applications (e.g., constantly reloading a webpage that internally performs a get request for one or more documents).

The most common cause of frequently missed get requests is an application that uses predictable IDs defined by a given pattern (e.g., derived from a user's ID) against indices that do not always contain those documents. Depending on the application, this may or may not be an issue.

Missed Get Request Time (JSON Path: get.missing_time_in_millis)
Explanation: The total time spent on missed get requests in milliseconds.
Solution: Get requests are performed in real time and should not show dramatic changes. Sudden changes should be investigated with regard to the overall health of shards, indices, and the cluster.
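
As referenced earlier, these counters only grow, so the interesting signal is their rate of change between lookups. The bash sketch below assumes curl and jq are installed and that the cluster listens on localhost:9200; the 60-second interval is an arbitrary choice for illustration:

# A minimal bash sketch, assuming curl and jq are installed. It samples
# the cluster-level search and indexing counters twice, 60 seconds apart,
# and prints the per-minute rate of change for each.
sample() {
  curl -s 'localhost:9200/_stats?level=cluster' | \
    jq -r '[._all.total.search.query_total, ._all.total.indexing.index_total] | @tsv'
}
first=$(sample); sleep 60; second=$(sample)
read q1 i1 <<< "$first"
read q2 i2 <<< "$second"
echo "queries/min: $((q2 - q1))  index ops/min: $((i2 - i1))"

The same approach applies to the get counters above; large or unexpected swings in any of these rates are what should be investigated.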

Elasticsearch Log Monitoring

Some information is available in Elasticsearch's logs. The following table provides some of the log output that should trigger a warning or alert, based on the assumed severity.

The log line should be loosely interpreted as a regular expression, so .* effectively represents a placeholder (a grep-based sketch using these patterns follows the table).

Each entry below lists the metric, the monitored text, its seriousness, an explanation, and a solution.

Out of Memory (Monitored Text: java.lang.OutOfMemoryError; Seriousness: Error)
Explanation: The running node ran out of memory.
Solution: Determine the cause of the error and adjust memory accordingly (e.g., add more memory to the server or adjust cache sizes). Be sure to use an appropriate Java heap size for your environment; the default Java heap size is 1 GB, which is fine for development but generally not for production.

File Descriptor Issues (Monitored Text: java.io.*Exception .* (Too many open files); Seriousness: Error)
Explanation: Elasticsearch has tried to open too many file descriptors.
Solution: This generally means that you need to adjust the OS-level file descriptor settings to increase the number available to a single process.

Internal Communication Failures (Monitored Text: java.io.StreamCorruptedException: invalid internal transport message format; Seriousness: Error)
Explanation: Elasticsearch communication failed internally between nodes.
Solution: The solution depends largely on the reason for the failure. The most common cause is nodes running on different JVM versions.

Corrupted Translog (Monitored Text: failed to retrieve translog after .* operations, ignoring the rest, considered corrupted; Seriousness: Error)
Explanation: Elasticsearch cannot parse a translog, which could mean data loss has occurred after a failed restart.
Solution: The solution depends largely on the reason for the failure. Contact support.

Lucene Merge Issues (Monitored Text: org.apache.lucene.index.MergePolicy$MergeException; Seriousness: Warning)
Explanation: An issue occurred while merging Lucene segments; this affects a single Elasticsearch shard.
Solution: The solution depends largely on the reason for the failure. Merging usually occurs automatically in the background, so this could be a non-issue. If the problem persists, then contact support.

Lucene Index Issues (Monitored Text: org.apache.lucene.index.CorruptIndexException; Seriousness: Warning)
Explanation: An issue occurred while reading a Lucene index; this affects a single Elasticsearch shard.
Solution: The solution depends largely on the reason for the failure. If the problem persists, then contact support.

Low Disk Space (Monitored Text: After allocating, node .* would have less than the required .*% free disk threshold .*, preventing allocation; Seriousness: Warning)
Explanation: The specified node is low on disk space.
Solution: Allocate more disk space to the specified node.
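
As referenced above, here is a rough sketch of a log check built from the patterns in this table. It assumes grep with extended regular expressions and a log path of /var/log/elasticsearch/*.log, which is a common default but should be adjusted for your installation:

# A minimal sketch, assuming grep -E and a typical log location
# (/var/log/elasticsearch/*.log); adjust the path for your installation.
# Error-level patterns from the table above:
ERRORS='java\.lang\.OutOfMemoryError|Too many open files|invalid internal transport message format|failed to retrieve translog'
# Warning-level patterns from the table above:
WARNINGS='MergePolicy\$MergeException|CorruptIndexException|preventing allocation'
grep -E "$ERRORS" /var/log/elasticsearch/*.log && echo "ERROR-level log lines found"
grep -E "$WARNINGS" /var/log/elasticsearch/*.log && echo "WARNING-level log lines found"

In practice this check would typically run only over new log lines (for example, via a log shipper or logrotate-aware tooling) rather than re-scanning entire files each time.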