clintongormley/gist:0382ed3913f0c3e40d62 Secret

## gistfile1.txt
    Clinton Gormley says:
    April 17, 2011 at 7:37 am

    Hiya

    I don’t know much about Solr, but there are a few issues with the ElasticSearch side of your test:

    First, your bulk data format is incorrect – you weren’t indexing what you thought you were indexing. The format is:

       { metadata_1 }\n
       { data_1 }\n
       { metadata_2 }\n
       { data_2 }\n

    So your examples above should look like:

    {"index" : {"_index" : "test", "_id" : "1582039702", "_type" : "type1"}}
    {"field1" : "1184645701"}
    {"index" : {"_index" : "test", "_id" : "937868144", "_type" : "type1"}}
    {"field1" : "410491235"}
    {"index" : {"_index" : "test", "_id" : "1754417430", "_type" : "type1"}}
    {"field1" : "763134804"}
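
    As a rough sketch of how a file in this format could be generated: the function below reads whitespace-separated "id value" pairs and prints the two-line metadata/data form. The name to_bulk and the input layout are my own assumptions for illustration; the index, type and field names are taken from the example above.

```shell
# Hypothetical helper: turn "id value" pairs on stdin into bulk-format lines.
# Assumes ids and values contain no characters that need JSON escaping.
to_bulk() {
  awk 'NF >= 2 {
    printf "{\"index\" : {\"_index\" : \"test\", \"_id\" : \"%s\", \"_type\" : \"type1\"}}\n", $1
    printf "{\"field1\" : \"%s\"}\n", $2
  }'
}

# Two input pairs produce four output lines (metadata, data, metadata, data).
printf '1582039702 1184645701\n937868144 410491235\n' | to_bulk
```

    Every document ends up as exactly two lines, which is what matters later when the file is split into chunks.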

    By default ES has a 100MB post limit, which can be configured. That limit is the reason you got an error when you tried to post all 10,000,000 records at once.
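
    If you want to check your files against that limit before posting, something like the following might do. The function name oversized and the LIMIT variable are hypothetical, and reading 100MB as 100 * 1024 * 1024 bytes is an assumption on my part.

```shell
# Hypothetical check: print the names of files larger than a byte limit.
# The default of 104857600 assumes "100MB" means 100 * 1024 * 1024 bytes.
oversized() {
  limit=${LIMIT:-104857600}
  for f in "$@"; do
    # Arithmetic expansion strips the padding some wc implementations add.
    size=$(( $(wc -c < "$f") ))
    if [ "$size" -gt "$limit" ]; then
      echo "$f"
    fi
  done
}

# Usage, once the data has been split into files named xaa, xab, ...:
#   oversized x??
```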

    Second, in Solr, you need to commit your changes for them to become visible, which is only happening at the end of your file load.

    In ElasticSearch, the index is refreshed every second by default, which is what makes changes visible to search. This refresh interval is configurable. In 0.15.2 you have to configure it either in the config file when you start ES:

     index.refresh_interval: -1

    or when you create the index.

    However, in master (and the soon-to-be-released 0.16) there is an API for configuring the refresh interval on the fly (e.g. before doing bulk updates). See Update settings.
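
    A minimal sketch of how that could be used around a bulk load, assuming the update settings endpoint accepts refresh_interval in the same form as the create-index call. The function names and the host are hypothetical.

```shell
# Build the JSON body for an update-settings request.
refresh_body() {
  printf '{"index" : {"refresh_interval" : "%s"}}\n' "$1"
}

# Hypothetical wrapper: set refresh_interval on index $1 to value $2.
# The host and port are assumptions; adjust them to your node.
set_refresh() {
  curl -s -XPUT "http://127.0.0.1:9200/$1/_settings" -d "$(refresh_body "$2")"
}

# Usage around a bulk load:
#   set_refresh test -1     # disable automatic refresh
#   ... bulk index ...
#   set_refresh test 1s     # back to refreshing every second
```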

    Third, you state that in Solr you are indexing your field as a string. However, your data looks like a number, and in ES it will thus be mapped as a number. Numbers are not analyzed as text, just stored as a single term.

    Fourth, ES is built to scale automatically, so while Solr starts a single Lucene instance (I am open to correction here), ES by default starts 5 per index (known as primary shards).

    Fifth, Solr 3.1 is based on Lucene 3.1 which has speed improvements over 3.0. ES 0.15.2 is based on 3.0, and master (0.16) on 3.1.

    Sixth, ES returns a lot of data when you bulk index, while Solr returns a short XML document. In your test, you are letting curl dump this response to the terminal, which is very slow. I’d suggest redirecting it to /dev/null instead.

    So, if you want to compare like with like, I’d suggest creating your index first:

    curl -XPUT 'http://127.0.0.1:9200/test/?pretty=1'  -d '
    {
       "mappings" : {
          "type1" : {
             "properties" : {
                "field1" : {
                   "type" : "string"
                }
             }
          }
       },
       "settings" : {
          "index" : {
             "number_of_replicas" : 1,
             "number_of_shards" : 1,
             "refresh_interval" : -1
          }
       }
    }
    '

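    Then split the data into smaller files, index them, and call a manual refresh afterwards:

    # 100,000 lines per file = 50,000 docs (two lines per doc)
    split -l 100000 es.json
    time {
      for f in x??; do
          # -s hides the progress meter; the response body is discarded
          curl -s -H 'Expect:' -XPUT 'http://127.0.0.1:9200/_bulk/' --data-binary "@$f" > /dev/null
      done;
      curl -XPOST 'http://127.0.0.1:9200/test/_refresh';
    }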

    I tried running this locally with default settings, and there were some large pauses while indexing, caused by garbage collection. However, reducing the number of docs per file from 500,000 to 50,000 solved this problem, and greatly increased indexing speed.
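
    If you experiment with different file sizes, it may help to think in docs rather than lines. A small sketch (split_docs is a hypothetical name) that always splits on an even number of lines, so that a metadata line is never separated from its data line:

```shell
# Hypothetical helper: split a bulk file into chunks of N documents.
# Each document takes two lines, so the line count per chunk is 2 * N,
# which keeps every metadata line together with its data line.
split_docs() {
  docs=$1
  file=$2
  prefix=${3:-x}
  split -l $((docs * 2)) "$file" "$prefix"
}

# Usage: 50,000 docs per file, written as xaa, xab, ...
#   split_docs 50000 es.json
```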

    (Also, in my local tests on 0.15.2, it seems that setting the refresh_interval to -1 is NOT actually stopping the automatic refresh, which sounds like a bug.)

    Clint