Debugging bulk requests from Logstash to Elasticsearch

We use the ELK-stack to centralize the Hadoop logs. The logs are collected by Logstash and pushed to an Elasticsearch cluster.

  • Logstash 2.1, 2.4
  • Elasticsearch 1.7

Recently we have had issues with Logstash agents hanging when sending bulk requests to Elasticsearch. All pipelines could stall for 15 minutes or even get stuck completely, requiring a restart. Updating the Logstash agents to v2.4 improved the situation a bit, but we still had a lot of long-lasting timeouts. Let's check the logs!

{:timestamp=>"2016-10-19T04:39:08.017000+0200", :message=>"Attempted to send a bulk request to Elasticsearch configured at '[\"http://elk1:9200\", \"http://elk2:9200\", \"http://elk3:9200\", \"http://elk4:9200\", \"http://elk5:9200\"]', but Elasticsearch appears to be unreachable or down!", :error_message=>"Connection timed out", :class=>"Manticore::SocketException", :level=>:error}
{:timestamp=>"2016-10-19T04:39:17.341000+0200", :message=>"Attempted to send a bulk request to Elasticsearch configured at '[\"http://elk1:9200\", \"http://elk2:9200\", \"http://elk3:9200\", \"http://elk4:9200\", \"http://elk5:9200\"]', but Elasticsearch appears to be unreachable or down!", :error_message=>"Connection timed out", :class=>"Manticore::SocketException", :level=>:error}
{:timestamp=>"2016-10-19T04:54:39.844000+0200", :message=>"Attempted to send a bulk request to Elasticsearch configured at '[\"http://elk1:9200\", \"http://elk2:9200\", \"http://elk3:9200\", \"http://elk4:9200\", \"http://elk5:9200\"]', but Elasticsearch appears to be unreachable or down!", :error_message=>"Connection timed out", :class=>"Manticore::SocketException", :level=>:error}

Not very helpful.

  • No information on which node the request was sent to. Well, we know it was one of elk1-elk5...
  • No information on how large the request was, in bytes or in document count

The Elasticsearch logs were of no use at all: not a single error or warning event, so I focused on Logstash.

Improving log output

To improve the debugging situation I ended up hacking together some code that made the relevant details visible:

https://github.com/logstash-plugins/logstash-output-elasticsearch/compare/v2.7.1...RickardCardell:more-verbose-info-logging

https://github.com/elastic/elasticsearch-ruby/compare/v1.1.0...RickardCardell:body-size-in-logging-output
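The essence of both patches is small: just before a bulk body is handed to the HTTP client, log how big it is and what it contains. A rough sketch of the idea (not the actual diff; the actions tuple layout and the helper name are assumptions):

# Rough sketch only, not the actual patch: log a summary of every bulk
# request before it is submitted. The `actions` tuple layout and the
# helper name are assumptions.
def log_bulk_request(actions, bulk_body)
  indices = actions.map { |_action, params, _event| params[:_index] }.uniq
  @logger.info(" bulk request",
               :num_docs  => actions.size,
               :indices   => indices.join(","),
               :body_size => bulk_body.bytesize)
end

With that in place, the Logstash log looks like this: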

{:timestamp=>"2016-11-10T04:47:52.401000+0100", :message=>" bulk request", :num_docs=>1, :indices=>"flume-logs-2016.11.10", :body_size=>637, :level=>:info}
{:timestamp=>"2016-11-10T04:47:52.401000+0100", :message=>"Performing request to http://elk1:9200/_bulk. body_size: 637", :level=>:info}
{:timestamp=>"2016-11-10T05:03:18.027000+0100", :message=>"[Manticore::SocketException] Connection timed out {:scheme=>\"http\", :host=>\"es1\", :path=>\"\", :port=>9200, :protocol=>\"http\"}. body_size: 637", :level=>:error}
{:timestamp=>"2016-11-10T05:03:18.027000+0100", :message=>"Attempted to send a bulk request to Elasticsearch configured at '[\"http://es1:9200\", \"http://es2:9200\", \"es3:9200\"]', but Elasticsearch appears to be unreachable or down!", :error_message=>"Connection timed out", :class=>"Manticore::SocketException", :num_docs=>1, :indices=>"flume-logs-2016.11.10", :body_size=>637, :level=>:error}
{:timestamp=>"2016-11-10T05:03:20.385000+0100", :message=>" bulk request", :num_docs=>1, :indices=>"flume-logs-2016.11.10", :body_size=>637, :level=>:info}
{:timestamp=>"2016-11-10T05:03:20.386000+0100", :message=>"Performing request to http://es1:9200/_bulk. body_size: 637", :level=>:info}
{:timestamp=>"2016-11-10T05:03:20.394000+0100", :message=>"POST http://es1:9200/_bulk [status:200, request:0.008s, query:0.003s]. body_size: 637", :level=>:info}

We could now see:

  • exact node
  • body size
  • number of documents
  • the Elasticsearch indices

Time to dig into the logs!

Finding the cause

When the timeout occurs, it is always on a small bulk request to one of the small indices (~50k docs). Also, all nodes could time out, so there was no real pattern there.

The issue might be that the bulk requests are too small and too many: the vast majority contained only one or two documents, which is obviously far from optimal.

One probable cause is how the logstash-output-elasticsearch plugin gathers events for each bulk request: it flushes to Elasticsearch after a maximum number of events (flush_size) or after a period of time without receiving any new events (idle_flush_time). In our case we had 20+ different log locations, each with its own output, and most of them logged events at a rather slow pace: less than 60 events/minute.

The bulk requests never reached the flush_size of 500 events; instead they were flushed after only a few documents due to the slow rate of log events, resulting in a lot of small bulk requests: about 20 requests/second per node.
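To make the flush behaviour concrete, here is a minimal model of such a buffer (an illustration of the two flush triggers only, not the plugin's actual implementation):

# Minimal model of the two flush triggers (illustration only, not the
# plugin's actual code). At < 60 events/minute the idle timer fires long
# before 500 events have accumulated, so nearly every bulk is tiny.
class BulkBuffer
  def initialize(flush_size: 500, idle_flush_time: 1)
    @flush_size = flush_size
    @idle_flush_time = idle_flush_time
    @events = []
    @last_flush = Time.now
  end

  def push(event)
    @events << event
    flush if @events.size >= @flush_size            # size-based flush
  end

  def tick                                          # called periodically
    flush if !@events.empty? && (Time.now - @last_flush) >= @idle_flush_time
  end

  def flush
    # here the buffered events would be sent as a single _bulk request
    @events.clear
    @last_flush = Time.now
  end
end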

By tuning the elasticsearch output we were able to reduce the number of bulk requests:

elasticsearch {
  ...
  # Should cause fewer and larger bulk requests. Flush when there hasn't been another event for this many seconds, even if the bulk isn't full.
  idle_flush_time => 5  # default=1

  # Should be nicer to an overloaded ES cluster. Timeout retry interval.
  retry_max_interval => 6 # default=2

  # Timeout faster to release the connection resource
  timeout => 10 # default=60?
}

Result

The bulk requests are still rather small, but they have increased 2-3x in size, and the number of bulk requests has decreased by the same factor. So far the situation is a lot better, with less frequent timeouts, but the big gain is that the timeouts are less serious, i.e. only one retry is needed after a timeout. Far better than before, when the timeouts could go on for hours.

{:timestamp=>"2016-11-11T03:42:44.855000+0100", :message=>" bulk request", :num_docs=>6, :indices=>"service-logs-yarn-warnings-2016.11.11", :body_size=>3967, :level=>:info}
{:timestamp=>"2016-11-11T03:42:44.856000+0100", :message=>"Performing request to http://elk3:9200/_bulk. body_size: 3967", :level=>:info}
{:timestamp=>"2016-11-11T03:42:44.865000+0100", :message=>"POST http://elk3:9200/_bulk [status:200, request:0.007s, query:0.003s]. body_size: 3967", :level=>:info}
{:timestamp=>"2016-11-11T03:42:46.894000+0100", :message=>" bulk request", :num_docs=>5, :indices=>"service-logs-yarn-warnings-2016.11.11", :body_size=>3306, :level=>:info}
{:timestamp=>"2016-11-11T03:42:46.895000+0100", :message=>"Performing request to http://elk4:9200/_bulk. body_size: 3306", :level=>:info}
{:timestamp=>"2016-11-11T03:42:46.904000+0100", :message=>"POST http://elk4:9200/_bulk [status:200, request:0.009s, query:0.005s]. body_size: 3306", :level=>:info}
{:timestamp=>"2016-11-11T03:42:48.963000+0100", :message=>" bulk request", :num_docs=>6, :indices=>"service-logs-yarn-warnings-2016.11.11", :body_size=>3967, :level=>:info}
{:timestamp=>"2016-11-11T03:42:48.964000+0100", :message=>"Performing request to http://elk5:9200/_bulk. body_size: 3967", :level=>:info}
{:timestamp=>"2016-11-11T03:42:48.974000+0100", :message=>"POST http://elk5:9200/_bulk [status:200, request:0.009s, query:0.005s]. body_size: 3967", :level=>:info}
{:timestamp=>"2016-11-11T03:42:51.865000+0100", :message=>" bulk request", :num_docs=>3, :indices=>"service-logs-yarn-warnings-2016.11.11", :body_size=>1984, :level=>:info}
{:timestamp=>"2016-11-11T03:42:51.866000+0100", :message=>"Performing request to http://elk1:9200/_bulk. body_size: 1984", :level=>:info}
{:timestamp=>"2016-11-11T03:42:51.874000+0100", :message=>"POST http://elk1:9200/_bulk [status:200, request:0.007s, query:0.004s]. body_size: 1984", :level=>:info}
{:timestamp=>"2016-11-11T03:42:51.875000+0100", :message=>" bulk request", :num_docs=>3, :indices=>"flume-logs-2016.11.11", :body_size=>1924, :level=>:info}
{:timestamp=>"2016-11-11T03:42:51.876000+0100", :message=>"Performing request to http://elk2:9200/_bulk. body_size: 1924", :level=>:info}
{:timestamp=>"2016-11-11T03:42:51.889000+0100", :message=>"POST http://elk2:9200/_bulk [status:200, request:0.011s, query:0.003s]. body_size: 1924", :level=>:info}
{:timestamp=>"2016-11-11T03:42:55.041000+0100", :message=>" bulk request", :num_docs=>3, :indices=>"service-logs-yarn-warnings-2016.11.11", :body_size=>1984, :level=>:info}
{:timestamp=>"2016-11-11T03:42:55.042000+0100", :message=>"Performing request to http://elk2:9200/_bulk. body_size: 1984", :level=>:info}

Wrap up

This shows how important metrics are when debugging and tuning configurations. Small bulk requests at a high pace were the culprit here, but I could have saved a lot of time if there had been decent output from the Elasticsearch output plugin.

How are Logstash users supposed to tune flush_size or idle_flush_time when there aren't any metrics to base them on?

I can see that Logstash 5.0 includes some new logging features (Log4j!), so now I have high hopes that the logging situation will improve.
