@clintongormley
Created April 17, 2011 18:18
Clinton Gormley says:
April 17, 2011 at 7:37 am
Hiya
I don’t know much about Solr, but there are a few issues with the ElasticSearch side of your test:
First, your bulk data format is incorrect – you weren’t indexing what you thought you were indexing. The format is:
{ metadata_1 }\n
{ data_1 }\n
{ metadata_2 }\n
{ data_2 }\n
So your examples above should look like:
{"index" : {"_index" : "test", "_id" : "1582039702", "_type" : "type1"}}
{"field1" : "1184645701"}
{"index" : {"_index" : "test", "_id" : "937868144", "_type" : "type1"}}
{"field1" : "410491235"}
{"index" : {"_index" : "test", "_id" : "1754417430", "_type" : "type1"}}
{"field1" : "763134804"}
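As a rough sketch of the format above: assuming the raw data is available as whitespace-separated "id value" pairs, one per line (that input layout and the helper name to_bulk are hypothetical, not taken from the original test), a small shell function could emit the metadata line and the data line for each document:

```shell
# Hypothetical helper: reads "id value" pairs on stdin and writes one
# metadata line plus one data line per document, for index "test" and
# type "type1" as in the examples above.
# Assumption: ids and values contain no characters that need JSON escaping.
to_bulk() {
  while read -r id value; do
    printf '{"index" : {"_index" : "test", "_id" : "%s", "_type" : "type1"}}\n' "$id"
    printf '{"field1" : "%s"}\n' "$value"
  done
}
```

For example, `to_bulk < pairs.txt > es.json` (with pairs.txt as a placeholder name) would produce two lines per input pair.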
By default ES has a 100MB post limit (this is configurable), which is why you got your error when you tried to post all 10,000,000 records at once.
Second, in Solr, you need to commit your changes for them to become visible, which is only happening at the end of your file load.
In ElasticSearch, changes are refreshed every second (by default). This refresh interval is configurable. In 0.15.2 you have to configure it either in the config file when you start ES (index.refresh_interval: -1) or when you create the index.
However, in master (and the soon-to-be-released 0.16) there is an API for configuring refresh interval on the fly (eg before doing bulk updates). See Update settings
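A minimal sketch of what changing the interval on the fly might look like, assuming the Update settings API accepts an index.refresh_interval value in a PUT to the index's _settings endpoint. The function names, the ES_URL variable and the "1s" value used to restore the default are my own assumptions for illustration:

```shell
# Base URL of the node; the default here is an assumption.
ES_URL="${ES_URL:-http://127.0.0.1:9200}"

# Build the JSON body for a refresh_interval change.
refresh_body() {
  printf '{"index" : {"refresh_interval" : "%s"}}' "$1"
}

# Hypothetical wrapper: $1 is the index name, $2 the new interval.
# Nothing is sent until this function is actually called.
set_refresh_interval() {
  curl -s -XPUT "$ES_URL/$1/_settings" -d "$(refresh_body "$2")"
}
```

The idea would be to call `set_refresh_interval test -1` before the bulk load and `set_refresh_interval test 1s` once it has finished.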
Third, you state that in Solr you are indexing your field as a string. However, your data looks like a number, and in ES it will thus be mapped as a number. Numbers are not analyzed as text, just stored as a single term.
Fourth, ES is built to scale automatically, so while Solr starts a single Lucene instance (again, I stand under correction), ES by default starts 5 per index (known as primary shards).
Fifth, Solr 3.1 is based on Lucene 3.1 which has speed improvements over 3.0. ES 0.15.2 is based on 3.0, and master (0.16) on 3.1.
Sixth, ES returns a lot of data when you bulk index. Solr returns a short XML document. In your test, you are letting curl dump this to the terminal, which is very slow. I’d suggest redirecting it to /dev/null instead (> /dev/null).
So, if you want to compare like with like, I’d suggest creating your index first:
curl -XPUT 'http://127.0.0.1:9200/test/?pretty=1' -d '
{
"mappings" : {
"type1" : {
"properties" : {
"field1" : {
"type" : "string"
}
}
}
},
"settings" : {
"index" : {
"number_of_replicas" : 1,
"number_of_shards" : 1,
"refresh_interval" : -1
}
}
}
'
Then split the file into chunks, bulk index each chunk, and call a manual refresh after indexing:
split es.json -l 100000
time {
for f in x??; do
curl -H 'Expect:' -XPUT 'http://localhost:9200/_bulk/' --data-binary @$f > /dev/null
done;
curl -XPOST 'http://127.0.0.1:9200/test/_refresh';
}
I tried running this locally with default settings, and there were some large pauses while indexing, caused by garbage collection. However, reducing the number of docs per file from 500,000 to 50,000 solved this problem, and greatly increased indexing speed.
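Since each document takes two lines in the bulk format, the line count passed to split has to be even, otherwise a metadata line and its data line end up in different files. A small sketch of how that could be checked (both helper names are hypothetical):

```shell
# Number of lines that hold the given number of docs (two lines per doc).
lines_for_docs() {
  echo $(( $1 * 2 ))
}

# Print the name of every file given as an argument whose line count is
# odd, i.e. a chunk that ends in the middle of a metadata/data pair.
odd_chunks() {
  for f in "$@"; do
    lines=$(wc -l < "$f")
    if [ $(( lines % 2 )) -ne 0 ]; then
      printf '%s\n' "$f"
    fi
  done
}
```

With these, `split -l "$(lines_for_docs 50000)" es.json` gives files of 50,000 docs each, and `odd_chunks x??` should print nothing if every chunk is made of complete pairs.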
(Also, in my local tests on 0.15.2, it seems that setting the refresh_interval to -1 is NOT actually stopping the automatic refresh, which sounds like a bug.)
Clint
@kimchy
kimchy commented Apr 18, 2011

Mapping wise, to really compare, it should be set to this:

curl -XPUT 'http://127.0.0.1:9200/test/?pretty=1'  -d '
    {
       "mappings" : {
          "type1" : {
             "_source" : {"enabled" : false},
             "_all" : {"enabled" : false},
             "_id" : {"index" : "no"},
             "_type" : {"index" : "no"},
             "properties" : {
                "field1" : {
                   "type" : "string",
                   "index" : "not_analyzed",
                   "omit_norms" : true
                }
             }
          }
       },
       "settings" : {
          "index" : {
             "number_of_replicas" : 1,
             "number_of_shards" : 1,
             "refresh_interval" : -1
          }
       }
    }
    '
