It seems we were committing the index too often. I was passing commitWithin=4000 on every new or updated document. That value had been working great for years, but I guess the volume of changes was overwhelming the box. I've switched to commitWithin=30000, and we've not had any timeout issues.
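For reference, here is a minimal sketch of passing commitWithin on an update request, written in Python with only the standard library. It assumes Solr's JSON update handler at a hypothetical core named `mycore` on `localhost:8983`; the helper names are illustrative, not part of any Solr client. It also estimates how many commits each setting allows under continuous updates. commitWithin folds later adds into an already-scheduled commit, so at most one commit occurs per window. That works out to roughly 900 per hour at 4000 ms versus 120 per hour at 30000 ms.

```python
import json
import urllib.parse
import urllib.request

# Hypothetical endpoint: host, port, and core name are assumptions.
SOLR_UPDATE_URL = "http://localhost:8983/solr/mycore/update"
COMMIT_WITHIN_MS = 30000  # previously 4000


def build_update_request(docs, commit_within_ms=COMMIT_WITHIN_MS):
    """Build (but do not send) a JSON update asking Solr to commit within commit_within_ms."""
    query = urllib.parse.urlencode({"commitWithin": commit_within_ms})
    body = json.dumps(docs).encode("utf-8")
    return urllib.request.Request(
        f"{SOLR_UPDATE_URL}?{query}",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def max_commits_per_hour(commit_within_ms):
    # Rough upper bound under a steady stream of updates:
    # at most one commit per commitWithin window.
    return 3_600_000 // commit_within_ms
```

To actually send it, something like `urllib.request.urlopen(build_update_request([{"id": "1"}]), timeout=5)` would do. The timeout value is likewise an assumption.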
I think having our index up to date within 30 seconds is fine. If this begins to overwhelm things again, I see a few possibilities:
- put nginx in front of Solr with a short timeout so web requests won't hang. We were getting lots of "Resource temporarily unavailable - read would block" errors. Maybe there's a way to do that with Jetty, but I'm treating the Solr install like a black box. Java scares me. :)
- move to higher IOPS on the disk holding the index
- create a read slave
- move to Elasticsearch or SolrCloud
What's interesting to me is that this was a shot in the dark. I got no useful error messages out of Solr or Apache. I just read some docs, and they suggested committing less frequently. It'd