I need to rank results according to custom criteria.
Before tackling my actual use case, I tested several approaches, namely:
- Using built-in queries
- Using a custom script in MVEL (the fastest scripting language according to the ES documentation)
- Using a native script plugin written in Java
I present here the results of this comparative performance test.
Test machine:
4 cores with HyperThreading (seen as 8)
cat /proc/cpuinfo
vendor_id : GenuineIntel
cpu family : 6
model : 26
model name : Intel® Xeon® CPU E5520 @ 2.27GHz
stepping : 5
cpu MHz : 2260.734
cache size : 8192 KB
physical id : 0
siblings : 8
core id : 3
cpu cores : 4
apicid : 7
initial apicid : 7
fpu : yes
fpu_exception : yes
cpuid level : 11
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_good xtopology nonstop_tsc aperfmperf pni dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm dca sse4_1 sse4_2 popcnt lahf_lm ida tpr_shadow vnmi flexpriority ept vpid
bogomips : 4521.92
clflush size : 64
cache_alignment : 64
address sizes : 40 bits physical, 48 bits virtual
power management:
6 GB of RAM
ElasticSearch: v0.16.1
Index size: 53430000 docs, 7.6 GB on disk
Queries were run with pretty printing enabled (this adds very little overhead)
Mapping:
curl -XGET 'localhost:9200/_mapping?pretty=on'
{
  "md5hash_spl" : {
    "md5" : {
      "properties" : {
        "md5" : {
          "omit_term_freq_and_positions" : true,
          "include_in_all" : false,
          "omit_norms" : true,
          "store" : "yes",
          "analyzer" : "md5_hashsplitter",
          "type" : "string"
        }
      },
      "_all" : {
        "enabled" : false
      }
    }
  }
}
The configuration:
# Cluster Settings
cluster:
  name: es-test-hashsplitting
network:
  publish_host: localhost
# Gateway Settings
gateway:
  recover_after_nodes: 1
  recover_after_time: 5s
  expected_nodes: 1
index:
  analysis:
    analyzer:
      md5_hashsplitter:
        type: custom
        tokenizer: md5_hashsplitter_tokenizer
    tokenizer:
      md5_hashsplitter_tokenizer:
        type: hash_splitter
        chunk_length: 4
        prefixes: ABCDEFGH
  number_of_shards: 1
  number_of_replicas: 0
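To make the analyzer settings concrete, here is a minimal Python sketch of what a tokenizer configured with chunk_length 4 and prefixes ABCDEFGH presumably emits for a 32-character MD5 hex digest. This is an assumption about the behaviour of the hash_splitter tokenizer, not code taken from the plugin; the function name split_hash is hypothetical.

```python
def split_hash(hex_digest, chunk_length=4, prefixes="ABCDEFGH"):
    """Split a hex digest into fixed-size chunks, each tagged with a position prefix.

    Assumed behaviour only: the real hash_splitter tokenizer may differ.
    """
    # Cut the digest into consecutive chunks of chunk_length characters.
    chunks = [hex_digest[i:i + chunk_length]
              for i in range(0, len(hex_digest), chunk_length)]
    if len(chunks) > len(prefixes):
        raise ValueError("not enough prefixes for the number of chunks")
    # Tag the n-th chunk with the n-th prefix letter.
    return [prefix + chunk for prefix, chunk in zip(prefixes, chunks)]


if __name__ == "__main__":
    # Arbitrary 32-character hex string standing in for an MD5 digest.
    for token in split_hash("0123456789abcdef0123456789abcdef"):
        print(token)
```

Under this assumption each document yields 8 tokens of 5 characters, one per prefix letter.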
You can disregard the special analyzer, as it is not used for querying.
Each document consists of a single field holding 8 values of 5 characters each (one prefix letter plus 4 hex characters), giving up to 8*16^4 = 524288 distinct terms, uniformly distributed.
By the central limit theorem, the number of documents sharing a given term follows an approximately normal distribution, which is verified in practice.
But again, this should not affect the tests.
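The term-collision figures can be estimated with a few lines of Python. This is a rough sketch under two assumptions that are not stated explicitly in the post: every document carries exactly one chunk per prefix letter, and each chunk is drawn independently and uniformly from the 16^4 possible values. The number of documents containing a given term is then binomial, which the normal distribution approximates well at this scale.

```python
import math

NUM_DOCS = 53430000       # documents in the index (figure from the post)
PREFIXES = 8              # prefix letters A..H
CHUNK_VALUES = 16 ** 4    # 4 hex characters per chunk


def term_statistics(num_docs=NUM_DOCS, prefixes=PREFIXES, chunk_values=CHUNK_VALUES):
    """Return (distinct terms, mean docs per term, standard deviation).

    Assumes one uniformly distributed, independent chunk per prefix per document.
    """
    total_terms = prefixes * chunk_values
    p = 1.0 / chunk_values                    # chance that a document holds a given term
    mean = num_docs * p                       # binomial mean
    std = math.sqrt(num_docs * p * (1 - p))   # binomial standard deviation
    return total_terms, mean, std


if __name__ == "__main__":
    total_terms, mean, std = term_statistics()
    print("distinct terms:", total_terms)
    print("docs per term: mean %.1f, std %.1f" % (mean, std))
```

With the index described here, this gives on the order of 800 documents per term, with a spread of a few percent around that mean.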
Reference:
{"query":{"match_all":{}}}
took 1775 ms
Built-in query:
{"query":{"constant_score":{"filter":{"match_all":{}},"boost":1000}}}
took 1780 ms
Performance drop: about 0.3% slower
Simple MVEL script:
{"query":{"custom_score":{"query":{"match_all":{}},"lang":"mvel","script":1000}}}
took 110253 ms
Performance drop: about 62 times slower
Native Java plugin script simply returning a constant:
{"query":{"custom_score":{"query":{"match_all":{}},"params":{"constant":1000},"lang":"native","script":"constant"}}}
took 3230 ms
Performance drop: about 82% slower
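The slowdown figures can be recomputed from the raw timings with the short Python sketch below. The timings are the ones reported above; the dictionary labels and the function name slowdown are only illustrative.

```python
# "took" values in milliseconds, as reported for each query.
TIMINGS_MS = {
    "match_all (reference)": 1775,
    "constant_score": 1780,
    "custom_score, mvel": 110253,
    "custom_score, native": 3230,
}


def slowdown(took_ms, reference_ms):
    """Return (ratio, overhead in percent) of a timing relative to the reference."""
    ratio = float(took_ms) / reference_ms
    return ratio, (ratio - 1.0) * 100.0


if __name__ == "__main__":
    reference = TIMINGS_MS["match_all (reference)"]
    for name, took in TIMINGS_MS.items():
        ratio, overhead = slowdown(took, reference)
        print("%-24s %7d ms  x%.2f  (+%.1f%%)" % (name, took, ratio, overhead))
```

Keep in mind that these are single measurements, so small differences such as the one between match_all and constant_score may be within run-to-run noise.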
Node stats:
curl -XGET 'localhost:9200/_cluster/nodes/stats?pretty=on'
{
  "cluster_name" : "es-test-hashsplitting",
  "nodes" : {
    "c_E1SbYpQ6iJIIsZJPBGZQ" : {
      "name" : "Tempus",
      "indices" : {
        "size" : "7.5gb",
        "size_in_bytes" : 8095676001,
        "docs" : {
          "num_docs" : 53430000
        },
        "cache" : {
          "field_evictions" : 0,
          "field_size" : "0b",
          "field_size_in_bytes" : 0,
          "filter_count" : 17,
          "filter_evictions" : 0,
          "filter_mem_evictions" : 0,
          "filter_size" : "6.3mb",
          "filter_size_in_bytes" : 6679216
        },
        "merges" : {
          "current" : 0,
          "total" : 0,
          "total_time" : "0s",
          "total_time_in_millis" : 0
        }
      },
      "os" : {
        "timestamp" : 1306745999897,
        "uptime" : "1 hour, 24 minutes and 40 seconds",
        "uptime_in_millis" : 5080000,
        "load_average" : [ 0.22, 0.22, 0.3 ],
        "cpu" : {
          "sys" : 0,
          "user" : 0,
          "idle" : 98
        },
        "mem" : {
          "free" : "3.6gb",
          "free_in_bytes" : 3881447424,
          "used" : "2.2gb",
          "used_in_bytes" : 2384564224,
          "free_percent" : 72,
          "used_percent" : 27,
          "actual_free" : "4.2gb",
          "actual_free_in_bytes" : 4516585472,
          "actual_used" : "1.6gb",
          "actual_used_in_bytes" : 1749426176
        },
        "swap" : {
          "used" : "0b",
          "used_in_bytes" : 0,
          "free" : "3.8gb",
          "free_in_bytes" : 4094681088
        }
      },
      "process" : {
        "timestamp" : 1306745999898,
        "cpu" : {
          "percent" : 0,
          "sys" : "3 seconds and 800 milliseconds",
          "sys_in_millis" : 3800,
          "user" : "6 minutes, 29 seconds and 540 milliseconds",
          "user_in_millis" : 389540,
          "total" : "-1 milliseconds",
          "total_in_millis" : -1
        },
        "mem" : {
          "resident" : "222.6mb",
          "resident_in_bytes" : 233443328,
          "share" : "10.7mb",
          "share_in_bytes" : 11251712,
          "total_virtual" : "1.4gb",
          "total_virtual_in_bytes" : 1510789120
        },
        "fd" : {
          "total" : 229
        }
      },
      "jvm" : {
        "timestamp" : 1306745999899,
        "uptime" : "26 minutes, 4 seconds and 228 milliseconds",
        "uptime_in_millis" : 1564228,
        "mem" : {
          "heap_used" : "102.8mb",
          "heap_used_in_bytes" : 107807944,
          "heap_committed" : "265.5mb",
          "heap_committed_in_bytes" : 278462464,
          "non_heap_used" : "33.4mb",
          "non_heap_used_in_bytes" : 35027136,
          "non_heap_committed" : "37mb",
          "non_heap_committed_in_bytes" : 38797312
        },
        "threads" : {
          "count" : 40,
          "peak_count" : 41
        },
        "gc" : {
          "collection_count" : 11536,
          "collection_time" : "19 seconds and 508 milliseconds",
          "collection_time_in_millis" : 19508,
          "collectors" : {
            "ParNew" : {
              "collection_count" : 11534,
              "collection_time" : "19 seconds and 508 milliseconds",
              "collection_time_in_millis" : 19508
            },
            "ConcurrentMarkSweep" : {
              "collection_count" : 2,
              "collection_time" : "0 milliseconds",
              "collection_time_in_millis" : 0
            }
          }
        }
      },
      "network" : {
        "tcp" : {
          "active_opens" : 606,
          "passive_opens" : 100,
          "curr_estab" : 39,
          "in_segs" : 19219,
          "out_segs" : 16906,
          "retrans_segs" : 0,
          "estab_resets" : 41,
          "attempt_fails" : 4,
          "in_errs" : 0,
          "out_rsts" : 84
        }
      },
      "transport" : {
        "rx_count" : 0,
        "rx_size" : "0b",
        "rx_size_in_bytes" : 0,
        "tx_count" : 0,
        "tx_size" : "0b",
        "tx_size_in_bytes" : 0
      }
    }
  }
}
Node info:
curl -XGET 'localhost:9200/_cluster/nodes?pretty=on'
{
  "cluster_name" : "es-test-hashsplitting",
  "nodes" : {
    "c_E1SbYpQ6iJIIsZJPBGZQ" : {
      "name" : "Tempus",
      "transport_address" : "inet[localhost/127.0.0.1:9300]",
      "attributes" : { },
      "http_address" : "inet[localhost/127.0.0.1:9200]",
      "os" : {
        "refresh_interval" : 5000,
        "cpu" : {
          "vendor" : "Intel",
          "model" : "Xeon",
          "mhz" : 2260,
          "total_cores" : 8,
          "total_sockets" : 8,
          "cores_per_socket" : 16,
          "cache_size" : "8kb",
          "cache_size_in_bytes" : 8192
        },
        "mem" : {
          "total" : "5.8gb",
          "total_in_bytes" : 6266011648
        },
        "swap" : {
          "total" : "3.8gb",
          "total_in_bytes" : 4094681088
        }
      },
      "process" : {
        "refresh_interval" : 5000,
        "id" : 4785
      },
      "jvm" : {
        "pid" : 4785,
        "version" : "1.6.0_24",
        "vm_name" : "Java HotSpot(TM) 64-Bit Server VM",
        "vm_version" : "19.1-b02",
        "vm_vendor" : "Sun Microsystems Inc.",
        "start_time" : 1306744435671,
        "mem" : {
          "heap_init" : "256mb",
          "heap_init_in_bytes" : 268435456,
          "heap_max" : "1011.2mb",
          "heap_max_in_bytes" : 1060372480,
          "non_heap_init" : "23.1mb",
          "non_heap_init_in_bytes" : 24313856,
          "non_heap_max" : "130mb",
          "non_heap_max_in_bytes" : 136314880
        }
      },
      "network" : {
        "refresh_interval" : 5000,
        "primary_interface" : {
          "address" : "192.168.1.142",
          "name" : "eth0",
          "mac_address" : "00:25:64:BE:0E:59"
        }
      },
      "transport" : {
        "bound_address" : "inet[/0:0:0:0:0:0:0:0:9300]",
        "publish_address" : "inet[localhost/127.0.0.1:9300]"
      }
    }
  }
}