hugo53/gist:274edd2669e154e7b6a9001dada5ac4b

## gistfile1.txt
If you want, I can try and help with pointers on how to improve the indexing speed you get. It's quite easy to increase it substantially by following some simple guidelines, for example:

- Use create in the index API (assuming you can).
- Relax the real time aspect from 1 second to something a bit higher (index.engine.robin.refresh_interval).
- Increase the indexing buffer size (indices.memory.index_buffer_size). It defaults to 10%, meaning 10% of the heap.
- Increase the number of dirty operations that trigger an automatic flush (the flush is there so the translog won't get really big, even though it's FS based) by setting index.translog.flush_threshold (defaults to 5000).
- Increase the memory allocated to the elasticsearch node. By default it's 1g.
- Start with a lower replica count (even 0), and then once the bulk loading is done, increase it to the value you want using the update_settings API. This will improve things, as possibly fewer shards will be allocated to each machine.
- Increase the number of machines you have so you get fewer shards allocated per machine.
- Increase the number of shards an index has, so it can make use of more machines.
- Make sure you make full use of the concurrent aspect of elasticsearch. You might not be pushing it hard enough. For example, the map reduce job can index things concurrently. Just make sure not to overload elasticsearch.
- Make Lucene use the non-compound file format (basically, each segment gets compounded into a single file when using the compound file format). This will increase the number of open files, so make sure you have enough. Set index.merge.policy.use_compound_file to false.
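
To make a few of the settings above concrete, here is a minimal Python sketch that only builds the request data and does not talk to a cluster. The setting names are the ones mentioned in the list. The values (a 30s refresh interval, a flush threshold of 20000) and the helper names `bulk_load_index_settings` and `replica_update_request` are assumptions picked for illustration, and whether a given setting can be changed on a live index depends on your elasticsearch version.

```python
import json


def bulk_load_index_settings(refresh_interval="30s", flush_threshold=20000):
    """Index-level settings from the list above, with illustrative values.

    The default values here are assumptions, not recommendations.
    """
    return {
        "index.engine.robin.refresh_interval": refresh_interval,
        "index.translog.flush_threshold": flush_threshold,
        "index.merge.policy.use_compound_file": False,
        # Start with no replicas; raise the count once bulk loading is done.
        "index.number_of_replicas": 0,
    }


def replica_update_request(index_name, replicas):
    """Return (method, path, body) for an update_settings call that
    changes the replica count of an existing index."""
    body = {"index": {"number_of_replicas": replicas}}
    return "PUT", f"/{index_name}/_settings", json.dumps(body)
```

The first helper would be used when creating the index before the bulk load. The second gives the request to send once loading has finished, for example `replica_update_request("myindex", 1)`, where `myindex` is a placeholder name.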

If not using Java, there are more things to play with:

- Try and use the thrift client instead of HTTP.
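
Going back to the point about indexing concurrently without overloading elasticsearch, a rough sketch of the idea in Python could look like the following. The `send` callable is hypothetical: it stands in for whatever client call you use to index one document. The worker count of 4 is just an assumed starting point to tune against your cluster.

```python
from concurrent.futures import ThreadPoolExecutor


def index_concurrently(docs, send, workers=4):
    """Call send(doc) for every doc using at most `workers` threads.

    `send` is a placeholder for the real indexing call. The pool size
    caps how many requests are in flight at once, which is the knob to
    turn if the cluster starts to struggle. Results come back in the
    same order as `docs`; an exception raised by `send` propagates.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(send, docs))
```

Raise `workers` gradually while watching indexing throughput and node load, and back off when throughput stops improving.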