ElasticSearch Tuning in Anger
So. I ran into a great deal of stress around ElasticSearch/Logstash performance lately. These are just a few lessons learned, documented so I have a chance of finding them again.
Both ElasticSearch and Logstash produce logs. On my RHEL install they're located in /var/log/elasticsearch and /var/log/logstash. These will give you some idea of problems then things go really wrong. For example, in my case, ElasticSearch got so slow that Logstash would time out sending it logs. These issues show up in the logs. Also, Elasticsearch would start logging problems when JVM Garbage collection took longer than 30 seconds, which is a good indicator of memory pressure on ElasticSearch.
ElasticSearch (and Logstash when it's joined to an ES Cluster) processes tasks in a queue, that you can peek into. Before realizing this I didn't have any way to understand what was happening in ElasticSearch besides the logs. You can look at the pending tasks queue with this command