Elasticsearch exposes many metrics that can be used to determine whether a cluster is healthy. Listed below are the metrics that are currently worth monitoring, the reason(s) why they should be monitored, and possible recourse when issues arise.

Version

Unless otherwise noted, all of the API requests work starting with Elasticsearch 1.0.0. If a newer version is required for a given metric, it is noted next to the metric's name.

Metrics

Metrics are an easy way to monitor the health of a cluster, and they can be accessed readily from the HTTP API. Each metrics table below is broken down by its source.

Each metric has an associated warning level and error level. These levels indicate when the associated metric should simply be watched closely versus acted upon as soon as possible.

Further details can be found in the Elasticsearch guide's section on monitoring individual nodes, which is kept relevant to the latest version of Elasticsearch.

Human Readable Metrics /_cat

With the release of 1.0.0 came the introduction of the _cat APIs. These APIs make metrics easier for humans to read by presenting them in a tabular format. At any time, you can quickly find all available _cat APIs by making a request to the top-level _cat endpoint:

# Find all _cat APIs available to you
# (this was run on Elasticsearch 1.2.2):
curl -XGET localhost:9200/_cat
=^.^=
/_cat/allocation
/_cat/shards
/_cat/shards/{index}
/_cat/master
/_cat/nodes
/_cat/indices
/_cat/indices/{index}
/_cat/segments
/_cat/segments/{index}
/_cat/count
/_cat/count/{index}
/_cat/recovery
/_cat/recovery/{index}
/_cat/health
/_cat/pending_tasks
/_cat/aliases
/_cat/aliases/{alias}
/_cat/thread_pool
/_cat/plugins
/_cat/fielddata
/_cat/fielddata/{fields}

As you read through the rest of this document, feel free to cross-reference those requests with the _cat endpoints listed above. For example:

# Ensure that both data nodes see the same master
curl localhost:9200/_cat/master?v
id                     host        ip           node    
8Y_a2JvMRnGGX3ypdL85Dw my-hostname 192.168.1.31 Amergin

# Check other data nodes
curl localhost:9201/_cat/master?v
id                     host        ip           node    
8Y_a2JvMRnGGX3ypdL85Dw my-hostname 192.168.1.31 Amergin

# Check all nodes in the cluster
curl localhost:9200/_cat/nodes?v
host        ip           heap.percent ram.percent load node.role master name     
my-hostname 192.168.1.31           12          75 3.84 d         *      Amergin  
my-hostname 192.168.1.31           18          75 3.84 d         m      Deathlok 

The _cat/master API tells you, on a per-node basis, which node it sees as its master (notice the port changes between the requests). This is generally much easier to check, as a human, than the equivalent /_cluster/state/master_node request. Details on the various _cat APIs can be found in the official Elasticsearch documentation.
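
As a rough illustration only, the per-node check above can be scripted. The sketch below assumes curl is available, that the two example nodes listen on ports 9200 and 9201 as shown earlier, and that your version supports the _cat h parameter for selecting columns:

# A minimal sketch, assuming curl and the two example nodes above.
# Each node reports the master id it currently sees; the output should
# contain exactly one unique id.
for port in 9200 9201; do
  curl -s "localhost:$port/_cat/master?h=id"
done | sort -u

If more than one id appears, the nodes disagree about the master; see the split brain discussion in the Cluster Details section below.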

Node Info /_nodes/

Node information represents details about individual nodes in a cluster that are critical to stability.

All API calls in this table must be prefixed by /_nodes/{api-call}. Each JSON Path represents the relevant field starting at the returned node sub-object (nodes.{node}.{json-path}) from the specified API Call.

These metrics are currently only concerned with versions. Except during upgrades (when a mismatch is temporarily expected), you should always run the same version on each node.

# This is how all of these commands should look:
curl -XGET localhost:9200/_nodes/{api-call}?human\&pretty

For example:

# Java Version
curl -XGET localhost:9200/_nodes/jvm?human\&pretty
{
  "cluster_name" : "elasticsearch-cluster-name",
  "nodes" : {
    "8Y_a2JvMRnGGX3ypdL85Dw" : {
      "name" : "Amergin",
      "transport_address" : "inet[my-hostname/192.168.1.31:9300]",
      "host" : "my-hostname",
      "ip" : "192.168.1.31",
      "version" : "1.2.2",
      "build" : "9902f08",
      "http_address" : "inet[/192.168.1.31:9200]",
      "jvm" : {
        "pid" : 16669,
        "version" : "1.8.0_20",
        "vm_name" : "Java HotSpot(TM) 64-Bit Server VM",
        "vm_version" : "25.20-b23",
        "vm_vendor" : "Oracle Corporation",
        "start_time" : "2014-11-18T20:24:00.063Z",
        "start_time_in_millis" : 1416342240063,
        "mem" : {
          "heap_init" : "256mb",
          "heap_init_in_bytes" : 268435456,
          "heap_max" : "990.7mb",
          "heap_max_in_bytes" : 1038876672,
          "non_heap_init" : "2.4mb",
          "non_heap_init_in_bytes" : 2555904,
          "non_heap_max" : "0b",
          "non_heap_max_in_bytes" : 0,
          "direct_max" : "990.7mb",
          "direct_max_in_bytes" : 1038876672
        },
        "gc_collectors" : [ "ParNew", "ConcurrentMarkSweep" ],
        "memory_pools" : [ "Code Cache", "Metaspace", "Compressed Class Space", "Par Eden Space", "Par Survivor Space", "CMS Old Gen" ]
      }
    }
  }
}

To access the Java Version, nodes.8Y_a2JvMRnGGX3ypdL85Dw.jvm.version would be the expansion for nodes.{node}.{json-path}.

Each entry below lists the metric, its API call and JSON path, an explanation, and a solution.

Java Version (API Call: jvm; JSON Path: jvm.version)
Explanation: Elasticsearch is written in Java. In rare cases, intercommunication between nodes may rely on features that change between Java releases, which can cause issues when the Java version does not match on both sides of the communication. As such, not maintaining the same version can lead to unexpected failures at this level.
Solution: Keep the same version of Java installed on all servers running Elasticsearch.

Elasticsearch Version (JSON Path: version)
Explanation: Elasticsearch is written with backward compatibility in mind, but sometimes non-backward-compatible changes are made to support desired features and improve performance.
Solution: Keep the same version of Elasticsearch installed on all nodes.
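
As a rough illustration of checking version consistency, the sketch below assumes curl and jq are installed and that the cluster listens on localhost:9200; it prints the Elasticsearch and Java version reported by every node so a mismatch is easy to spot:

# A minimal sketch, assuming curl and jq are installed and the cluster
# listens on localhost:9200. Every line should show the same pair of
# Elasticsearch and Java versions.
curl -s localhost:9200/_nodes/jvm | \
  jq -r '.nodes[] | "\(.name)\t\(.version)\tjava \(.jvm.version)"' | sort

If more than one version appears in the output, schedule upgrades so that the cluster converges on a single version.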

Stats /_nodes/stats/

Node statistics expose runtime details about each node, such as memory usage, file descriptors, and thread pools. In the case of memory and file operations, distributing the workload across more nodes is also a potential solution to capacity constraints.

All API calls in this table must be prefixed by /_nodes/stats/{api-call}. Each JSON Path represents the relevant field starting at the returned node sub-object (nodes.{node}.{json-path}) from the specified API Call.

# This is how all of these commands should look:
curl -XGET localhost:9200/_nodes/stats/{api-call}?human\&pretty

For example:

# File Descriptors
curl -XGET localhost:9200/_nodes/stats/process?human\&pretty
{
  "cluster_name" : "elasticsearch-cluster-name",
  "nodes" : {
    "8Y_a2JvMRnGGX3ypdL85Dw" : {
      "timestamp" : 1416458421494,
      "name" : "Amergin",
      "transport_address" : "inet[my-hostname/192.168.1.31:9300]",
      "host" : "my-hostname",
      "ip" : [ "inet[my-hostname/192.168.1.31:9300]", "NONE" ],
      "process" : {
        "timestamp" : 1416458421494,
        "open_file_descriptors" : 435,
        "cpu" : {
          "percent" : 1,
          "sys" : "7m",
          "sys_in_millis" : 420451,
          "user" : "32.1m",
          "user_in_millis" : 1927525,
          "total" : "39.1m",
          "total_in_millis" : 2347976
        },
        "mem" : {
          "resident" : "344.9mb",
          "resident_in_bytes" : 361709568,
          "share" : "-1b",
          "share_in_bytes" : -1,
          "total_virtual" : "4.8gb",
          "total_virtual_in_bytes" : 5162328064
        }
      }
    }
  }
}

To access the File Descriptors, nodes.8Y_a2JvMRnGGX3ypdL85Dw.process.open_file_descriptors would be the expansion for nodes.{node}.{json-path}.

Each entry below lists the metric, its API call and JSON path, warning and error levels, an explanation, and a solution.

Java GC CMS (API Call: jvm; JSON Path: jvm.gc.old.collection_count)
Warning: Increased by greater than X per minute. Error: Increased by greater than X per minute.
Explanation: The number of Concurrent Mark Sweep collections that run per minute should stay roughly the same on a healthy cluster, which should be used to determine X. Bursts in server load should be reflected by bursts in collections, but they should stabilize with the load. On a cluster that constantly needs to do more and more collections, the risk is that more and more time is spent doing garbage collection rather than processing. In worst-case scenarios, this can lead to slow responses and out-of-memory issues that eventually lead to failed nodes.
Solution: Increase the maximum heap setting for Elasticsearch. In some cases, this may require actually adding more memory to the server.

Available Disk Space (API Call: fs/data; JSON Path: fs.data.available_in_bytes)
Warning: 20% of total disk space left. Error: 10% of total disk space left.
Explanation: Running out of disk space means that nothing can be inserted or updated. As a result, the node will fail.
Solution: Add more disk space.

File Descriptors (API Call: process; JSON Path: process.open_file_descriptors)
Warning: 70% of the maximum number of file descriptors. Error: 90% of the maximum number of file descriptors.
Explanation: File descriptors are used for connections and file operations. As Elasticsearch grows and scales, this number will increase, particularly when it is under heavy load. If this number reaches the maximum, then new connections and file operations cannot occur until old ones have closed, which will cause intermittent node failures.

The current maximum can be read by calling /_nodes/process and reading nodes.{node}.process.max_file_descriptors; the maximum will not change after the node has started.

Solution: Increase the system's maximum file descriptor count, which is OS specific (see ulimit for many Linux distributions). A monitoring sketch that compares open descriptors against the maximum appears after this table.

Java Heap Size (API Call: jvm; JSON Path: jvm.mem.heap_used_percent)
Warning: 80% of total heap for 10 minutes. Error: 90% of total heap for 10 minutes.
Explanation: The Java Virtual Machine (JVM) heap is the main memory used by the Java processes, which includes Elasticsearch. Like any other process, if it runs out of memory, then it will crash and lead to node failures.
Solution: Increase the maximum heap setting for Elasticsearch. In some cases, this may require actually adding more memory to the server.

HTTP Connections (API Call: http; JSON Path: http.total_opened)
Warning: Increases by greater than 50 per minute. Error: Increases by greater than 100 per minute.
Explanation: The number of HTTP connections can be indicative of server demand by request, but not by content. Some requests are much easier to fulfill than others, but having too many requests, simple or complex, can cause a node issues.
Solution: Better distributing the workload (adding more nodes) is the easiest way to reduce the number of connections to a particular node. In some cases, it may be possible to tune client software to send fewer requests per minute if the software is written to poll Elasticsearch on an interval. Naturally, avoiding the need to make requests in the first place, such as by eliminating unnecessarily duplicated requests, is the easiest way to reduce the number that come in.

Many clients, such as the PHP and JavaScript clients, create a new connection for every request. In those cases it cannot be avoided, but in other cases it can. For instance, the .NET client defaults to HTTP pipelining, which allows a single HTTP connection to be reused. Using appropriate keep-alive times, persistent connections, and pipelining from the client can greatly reduce the number of connections, which in turn reduces network overhead.

For clients that do not allow those features to be controlled, it can help to set up a proxy in front of Elasticsearch (e.g., using Nginx) so that the proxy can use those features, which allows you to continue using "harmful" clients while still reducing the overall number of connections.

Thread Pool Rejections (API Call: thread_pool; JSON Path: thread_pool.POOL.rejected)
Warning: Increased by greater than X per minute. Error: Increased by greater than X per minute.
Explanation: The number of rejected threads per minute should stay roughly the same on a healthy cluster, which should be used to determine X. A rejected thread means the requested action did not occur at all (e.g., nodes.NODE.thread_pool.get.rejected indicates failed get requests) because the associated thread pool was full; it does not mean that the action started and then failed later.
Solution: Adding more threads to the problematic thread pool can lower the number of rejections, but, in general, this should only be done when directed by support. Providing access to more processing power or processors can ease thread pool congestion by allowing threads to finish more quickly or more in parallel.

Thread Pool Queue (API Call: thread_pool; JSON Path: thread_pool.POOL.queue)
Warning: Increased by greater than X per minute. Error: Increased by greater than X per minute.
Explanation: The number of queued threads per minute should stay roughly the same on a healthy cluster, which should be used to determine X. Optimally, this value should be 0, but peak periods may reasonably see threads queued. A queued thread means the requested action has not occurred yet (e.g., nodes.NODE.thread_pool.get.queue indicates delayed get requests) because the associated thread pool was full; it is waiting to be processed and has not yet been rejected.
Solution: Adding more threads to the problematic thread pool can lower the number of queued threads, but, in general, this should only be done when directed by support. Providing access to more processing power or processors can ease thread pool congestion by allowing threads to finish more quickly or more in parallel.

Load Average (API Call: os; JSON Path: os.load_average.AVERAGE)
Warning: Relative to nodes.NODE.os.available_processors from /_nodes/os. Error: Relative to nodes.NODE.os.available_processors from /_nodes/os.
Explanation: The average processor load on the node. As the load approaches complete utilization of every processor, other server processes are starved for CPU time, and some threads within the Elasticsearch process will likely execute more slowly.

The load_average value is an array: the first element is the average load for the past minute, the second is the average load for the past 5 minutes, and the third (and last) is the average load for the past 15 minutes.

Solution: Add more processors to the server or virtual machine. Providing access to more processing power or processors can ease thread pool congestion by allowing threads to finish more quickly or more in parallel.

Filter Cache Size (API Call: indices/filter_cache; JSON Path: indices.filter_cache.memory_size_in_bytes)
ID Cache Size (API Call: indices/id_cache; JSON Path: indices.id_cache.memory_size_in_bytes)
Field Data Size (API Call: indices/fielddata; JSON Path: indices.fielddata.memory_size_in_bytes)
Percolate Size (API Call: indices/percolate; JSON Path: indices.percolate.memory_size_in_bytes)
Query Cache Size (>= 1.4.0) (API Call: indices/query_cache; JSON Path: indices.query_cache.memory_size_in_bytes)

Warning: Total sum of cache sizes is greater than 60% of heap size. Error: Total sum of cache sizes is greater than 70% of heap size.
Explanation: Elasticsearch uses caches to speed up frequently performed actions. If the caches take up too much memory, then it is possible to get into situations where the rest of the Elasticsearch process is waiting for memory to become available, which may cause actions to run more slowly.

Note: The ID Cache is the in-memory join table maintaining parent/child relationships. There is currently no setting to control the amount of memory used to maintain this relationship, and not much can be done to affect its footprint. Because it resides on the heap, it is still a good idea to monitor its usage. Starting in 1.1.0, the ID Cache is actually stored in the Field Data Cache, but both metrics are still reported separately.

Solution: If the warning or error levels are reached, then lowering the size of the worst-offending or least-used caches can help to speed up Elasticsearch. Each cache can be independently cleared, or you may choose to clear them all at the same time using the clear cache API. If levels are set to allow the behavior causing concern, then clearing the cache may just be delaying the problem. Updating the specific settings for each cache can control situations that consistently cause issues. For example, setting indices.cache.filter.size at startup can limit the memory used by the filter cache.
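
As referenced in the File Descriptors entry above, here is a rough sketch of checking the file descriptor and heap metrics against their warning levels. It assumes curl and jq are installed and that the cluster listens on localhost:9200; the percentages printed are for comparison against the warning levels in the table, not anything enforced by Elasticsearch:

# A minimal sketch, assuming curl and jq are installed and the cluster
# listens on localhost:9200. It reports, per node, open file descriptors
# as a percentage of the maximum, and the heap usage percentage.
# For simplicity it uses the smallest per-node maximum as the divisor.
max_fds=$(curl -s localhost:9200/_nodes/process | \
  jq -r '[.nodes[].process.max_file_descriptors] | min')
curl -s localhost:9200/_nodes/stats/process | \
  jq -r --argjson max "$max_fds" \
    '.nodes[] | "\(.name): \((.process.open_file_descriptors * 100 / $max) | floor)% of file descriptors in use"'
curl -s localhost:9200/_nodes/stats/jvm | \
  jq -r '.nodes[] | "\(.name): \(.jvm.mem.heap_used_percent)% of heap in use"'

Values near the 70% (file descriptors) or 80% (heap) warning levels from the table above are a cue to investigate before the error levels are reached.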

Cluster Details /_cluster

The overall health of the cluster is an important aspect of Elasticsearch deployments that have more than a single shard. Understanding these core metrics can lead to a more stable deployment.

All API calls in this table must be prefixed by /_cluster/{api-call}. Each JSON Path represents the relevant field starting at the returned object's root (just {json-path}) from the specified API Call.

# This is how all of these commands should look:
curl -XGET localhost:9200/_cluster/{api-call}?pretty

For example:

# Status
curl -XGET localhost:9200/_cluster/health?pretty
{
  "cluster_name" : "elasticsearch-cluster-name",
  "status" : "yellow",
  "timed_out" : false,
  "number_of_nodes" : 1,
  "number_of_data_nodes" : 1,
  "active_primary_shards" : 26,
  "active_shards" : 26,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 26
}

To access the Status, status would be the expansion for {json-path}.
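
As a rough illustration, the status field lends itself to a simple scripted check. The sketch below assumes curl and jq are installed and that the cluster listens on localhost:9200; the exit codes are an arbitrary convention chosen for this sketch, not something defined by Elasticsearch:

# A minimal sketch, assuming curl and jq are installed and the cluster
# listens on localhost:9200. Exits 0 for green, 1 for yellow, and 2 for
# red or an unreachable cluster, so it can feed a cron job or alerting hook.
status=$(curl -s localhost:9200/_cluster/health | jq -r '.status')
case "$status" in
  green)  echo "cluster status: green";  exit 0 ;;
  yellow) echo "cluster status: yellow"; exit 1 ;;
  *)      echo "cluster status: ${status:-unknown}"; exit 2 ;;
esac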

Each entry below lists the metric, its API call and JSON path, warning and error levels, an explanation, and a solution.

Status (API Call: health; JSON Path: status)
Warning: The status is "yellow" for more than 5 minutes. Error: The status is "red".
Explanation: "green" is the desired status, which indicates a healthy cluster that has properly allocated all primary and replica shards. A "yellow" status indicates that at least one replica is missing, but all data is still searchable. The worst is the "red" status, which indicates that at least one primary shard, as well as its replicas, is missing; this means that searches will return partial results and indexing into the missing shard(s) will cause an exception.
Solution: Investigate the cause of any node, shard, or replica issues by checking the logs and monitoring other metrics. Once recovered, it is important to ensure that the original problem does not repeat itself. For example, if a node failed because its disk was full, then restarting it will not prevent the issue from immediately recurring.

Data Nodes (API Call: health; JSON Path: number_of_data_nodes)
Warning/Error: The value is less than expected.
Explanation: Elasticsearch clusters are only healthy if all data nodes are available. If data nodes are missing, then queries may return only partial results, and indexing of data that would otherwise go to the missing nodes may fail.

Master Node (API Call: state/master_node; JSON Path: master_node)
Warning/Error: The value is different on any node.
Explanation: If any node disagrees about the master_node, then problems can quickly occur because the cluster is not in a safe state. A disagreement about the master_node is called a "split brain".

In a split brain situation, there is effectively more than one cluster: multiple nodes are still running, but some cannot talk to the others. Each side thinks that it is _the_ cluster and behaves as though the other nodes are simply missing from it. The best way to detect a split brain is by checking /_cluster/state/master_node?local against _every_ node; each node will report the master node that it currently recognizes, and if any nodes disagree, then that represents a split brain. (Marvel does this for you automatically!) A sketch of such a check follows this table.

Under the worst circumstances, outside connections can reach _each_ side while assuming they are working together in the background. In that scenario, writes can land on different sides of the split, leaving the data inconsistent.

Solution: Stop the problematic nodes as soon as possible to avoid further issues. Before restarting any nodes, ensure that network connectivity between all nodes is working properly so that intercommunication can happen.
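
As referenced above, here is a rough sketch of a split brain check. It assumes curl and jq are installed and that the HTTP addresses of every node are known to the monitoring host; the node addresses below are placeholders:

# A minimal sketch, assuming curl and jq are installed. NODES lists the
# HTTP address of every node in the cluster (placeholder values shown).
# Each node is asked locally which node it believes is the master; more
# than one distinct answer indicates a split brain.
NODES="node1:9200 node2:9200 node3:9200"
masters=$(for node in $NODES; do
  curl -s "$node/_cluster/state/master_node?local=true" | jq -r '.master_node'
done | sort -u)
if [ "$(echo "$masters" | wc -l)" -gt 1 ]; then
  echo "WARNING: nodes disagree about the master: $masters"
else
  echo "All nodes agree on master: $masters"
fi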

Index Details /_stats?level=cluster

The overall health of an index is an important aspect of all searches, indexes (writes), and retrievals. If an index is having problems, then all users of the index will be directly affected.

All API results in this table are the result of /_stats?level=cluster (not specifying the level shows all indices by default, which is unnecessary). Each JSON Path represents the relevant field starting at the returned node sub-object (_all.total.{json-path}) from the API Call.

To determine the relative warning/error state, you must compare values between lookups and watch for relatively large variance (a delta-check sketch follows the table below).

# This is how all of these commands should look:
curl -XGET localhost:9200/_stats?level=cluster\&human\&pretty

For example:

# Total Search Requests
curl -XGET localhost:9200/_stats?level=cluster\&human\&pretty
{
  "_shards" : {
    "total" : 52,
    "successful" : 26,
    "failed" : 0
  },
  "_all" : {

    ... removed for brevity ...

    "total" : {
      "docs" : {
        "count" : 495599,
        "deleted" : 0
      },
      "store" : {
        "size" : "362.4mb",
        "size_in_bytes" : 380072547,
        "throttle_time" : "2.1m",
        "throttle_time_in_millis" : 128652
      },

      ... removed for brevity ...

      "search" : {
        "open_contexts" : 0,
        "query_total" : 8664,
        "query_time" : "16.4s",
        "query_time_in_millis" : 16458,
        "query_current" : 0,
        "fetch_total" : 8664,
        "fetch_time" : "2.4s",
        "fetch_time_in_millis" : 2484,
        "fetch_current" : 0
      },

      ... removed for brevity ...

    }
  }
}

To access the Total Search Requests, _all.total.search.query_total would be the expansion for _all.total.{json-path}.

Each entry below lists the metric, its JSON path, an explanation, and a solution.

Total Search Requests (JSON Path: search.query_total)
Explanation: The total number of queries (searches).
Solution: Determine the cause of any sudden surge or drop in queries. It is possible that a connected application has lost all network connectivity to the cluster. Surges in queries could be innocent, could be unintentionally looped queries in application code, or could be a sign of possible abuse through connected applications.

Total Search Request Time (JSON Path: search.query_time_in_millis)
Explanation: The total time spent on queries (searches) in milliseconds.
Solution: In addition to looking at the number of requests, the complexity of requests is important. The search slow log should be checked for unexpected increases in the total time. The slow log must be manually enabled.

Total Index Requests (JSON Path: indexing.index_total)
Explanation: The total number of index operations (writes).
Solution: Determine the cause of any sudden surge or drop in indexing. It is possible that a connected application has lost all network connectivity to the cluster. Surges in indexing could be innocent, could be unintentionally looped operations in application code, or could be a sign of possible abuse through connected applications.

Total Index Request Time (JSON Path: indexing.index_time_in_millis)
Explanation: The total time spent on indexing (writing) in milliseconds.
Solution: In addition to looking at the number of requests, the complexity of requests is important. The index slow log should be checked for unexpected increases in the total time. The slow log must be manually enabled.

Successful Get Requests (JSON Path: get.exists_total)
Explanation: The total number of get requests that found a document.
Solution: Determine the cause of any sudden surge or drop in get requests. It is possible that a connected application has lost all network connectivity to the cluster. Surges in get requests could be innocent, could be unintentionally looped operations in application code, or could be a sign of possible abuse through connected applications (e.g., constantly reloading a webpage that internally performs a get request for one or more documents).

Successful Get Request Time (JSON Path: get.exists_time_in_millis)
Explanation: The total time spent on successful get requests in milliseconds.
Solution: Get requests are performed in real time and should not show dramatic changes. Sudden changes should be investigated with regard to the overall health of shards, indices, and the cluster.

Missed Get Requests (JSON Path: get.missing_total)
Explanation: The total number of get requests that did not find a document.
Solution: Determine the cause of any sudden surge or drop in get requests. It is possible that a connected application has lost all network connectivity to the cluster. Surges in get requests could be innocent, could be unintentionally looped operations in application code, or could be a sign of possible abuse through connected applications (e.g., constantly reloading a webpage that internally performs a get request for one or more documents).

The most common cause of frequently missed get requests is an application that uses predictable IDs defined by a given pattern (e.g., derived from a user's ID) against indices that do not always contain those documents. Depending on the application, this may or may not be an issue.

Missed Get Request Time (JSON Path: get.missing_time_in_millis)
Explanation: The total time spent on missed get requests in milliseconds.
Solution: Get requests are performed in real time and should not show dramatic changes. Sudden changes should be investigated with regard to the overall health of shards, indices, and the cluster.
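
As referenced earlier, these counters only grow, so the interesting signal is their rate of change between lookups. The bash sketch below assumes curl and jq are installed and that the cluster listens on localhost:9200; the 60-second interval is an arbitrary choice for illustration:

# A minimal bash sketch, assuming curl and jq are installed. It samples
# the cluster-level search and indexing counters twice, 60 seconds apart,
# and prints the per-minute rate of change for each.
sample() {
  curl -s 'localhost:9200/_stats?level=cluster' | \
    jq -r '[._all.total.search.query_total, ._all.total.indexing.index_total] | @tsv'
}
first=$(sample); sleep 60; second=$(sample)
read q1 i1 <<< "$first"
read q2 i2 <<< "$second"
echo "queries/min: $((q2 - q1))  index ops/min: $((i2 - i1))"

The same approach applies to the get counters above; large or unexpected swings in any of these rates are what should be investigated.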

Elasticsearch Log Monitoring

Some information is available in Elasticsearch's logs. The following table provides some of the log output that should trigger a warning or alert, based on the assumed severity.

The log line should be loosely interpreted as a regular expression, so .* effectively represents a placeholder (a grep-based sketch using these patterns follows the table).

Each entry below lists the metric, the monitored text, its seriousness, an explanation, and a solution.

Out of Memory (Monitored Text: java.lang.OutOfMemoryError; Seriousness: Error)
Explanation: The running node ran out of memory.
Solution: Determine the cause of the error and adjust memory accordingly (e.g., add more memory to the server or adjust cache sizes). Be sure to use an appropriate Java heap size for your environment; the default Java heap size is 1 GB, which is fine for development but generally not for production.

File Descriptor Issues (Monitored Text: java.io.*Exception .* (Too many open files); Seriousness: Error)
Explanation: Elasticsearch has tried to open too many file descriptors.
Solution: This generally means that you need to adjust the OS-level file descriptor settings to increase the number available to a single process.

Internal Communication Failures (Monitored Text: java.io.StreamCorruptedException: invalid internal transport message format; Seriousness: Error)
Explanation: Elasticsearch communication failed internally between nodes.
Solution: The solution depends largely on the reason for the failure. The most common cause is nodes running on different JVM versions.

Corrupted Translog (Monitored Text: failed to retrieve translog after .* operations, ignoring the rest, considered corrupted; Seriousness: Error)
Explanation: Elasticsearch cannot parse a translog, which could mean data loss has occurred after a failed restart.
Solution: The solution depends largely on the reason for the failure. Contact support.

Lucene Merge Issues (Monitored Text: org.apache.lucene.index.MergePolicy$MergeException; Seriousness: Warning)
Explanation: An issue occurred while merging Lucene segments; this affects a single Elasticsearch shard.
Solution: The solution depends largely on the reason for the failure. Merging usually occurs automatically in the background, so this could be a non-issue. If the problem persists, then contact support.

Lucene Index Issues (Monitored Text: org.apache.lucene.index.CorruptIndexException; Seriousness: Warning)
Explanation: An issue occurred while reading a Lucene index; this affects a single Elasticsearch shard.
Solution: The solution depends largely on the reason for the failure. If the problem persists, then contact support.

Low Disk Space (Monitored Text: After allocating, node .* would have less than the required .*% free disk threshold .*, preventing allocation; Seriousness: Warning)
Explanation: The specified node is low on disk space.
Solution: Allocate more disk space to the specified node.
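
As referenced above, here is a rough sketch of a log check built from the patterns in this table. It assumes grep with extended regular expressions and a log path of /var/log/elasticsearch/*.log, which is a common default but should be adjusted for your installation:

# A minimal sketch, assuming grep -E and a typical log location
# (/var/log/elasticsearch/*.log); adjust the path for your installation.
# Error-level patterns from the table above:
ERRORS='java\.lang\.OutOfMemoryError|Too many open files|invalid internal transport message format|failed to retrieve translog'
# Warning-level patterns from the table above:
WARNINGS='MergePolicy\$MergeException|CorruptIndexException|preventing allocation'
grep -E "$ERRORS" /var/log/elasticsearch/*.log && echo "ERROR-level log lines found"
grep -E "$WARNINGS" /var/log/elasticsearch/*.log && echo "WARNING-level log lines found"

In practice this check would typically run only over new log lines (for example, via a log shipper or logrotate-aware tooling) rather than re-scanning entire files each time.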