@polyfractal
Created December 9, 2014 20:39
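import json

import requests


def search_es(q, lics):
    # Build one filter clause per entry in lics. Each entry is indexed as
    # (source, max level): match that source with level between 1 and max.
    fq = []
    for lic in lics:
        fq.append(
            {
                "bool": {
                    "must": [
                        {"term": {"source": lic[0]}},
                        {"range": {"level": {"gte": 1, "lte": lic[1]}}}
                    ]
                }
            })
    body = {
        "size": 10,
        "query": {
            "filtered": {
                # A document passes the filter if any one clause matches.
                "filter": {"bool": {"should": fq}},
                "query": {
                    # Single match over the space-joined query terms.
                    "match": {"text": ' '.join(q)}
                }
            }
        }
    }
    # args is assumed to be the module-level parsed command-line options
    # (args.es is the --es search URL); it is not defined in this snippet.
    if args.fac:
        body["aggs"] = {
            "levels": {"terms": {"field": "level"}},
            "sources": {"terms": {"field": "source"}}
        }
    resp = requests.post(args.es, json.dumps(body))
    return resp.json()['hits']['total']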

Ran on my MacBook Air against half a million docs. Single node, 5 primary shards, 0 replicas. The node was restarted between runs to make sure all caches were cleared.

Existing benchmark

$ python loadtester.py --es "http://localhost:9200/speedtest/_search" -i ../data/stoicism.txt -o test1.txt --ns 10000 --nt 3 --nf 10
0 26004 1.36110687256
1000 5561 0.0182199478149
2000 10516 0.0134048461914
3000 42137 0.0833399295807
4000 34922 0.0168430805206
5000 5408 0.00911998748779
6000 45315 0.0210130214691
7000 42732 0.0193800926208
8000 5393 0.0104150772095
9000 8031 0.015035867691

$ python analyser.py test1.txt
180868227 results in 10000 searches (mean 18086)
0.02s mean query time, 1.36s max, 0.01s min
50%   of qtimes <= 0.01s
90%   of qtimes <= 0.02s
99%   of qtimes <= 0.05s
99.9% of qtimes <= 0.16s

More optimized query

  • Replaces and/or/not with bool. The query is equivalent, but bool is optimized to handle bitset-based filters (such as range/term).
  • Replaces numeric_range with range. First off, numeric_range is deprecated in 0.90.8 (replaced with a fielddata mode in the range filter). Secondly, it operates on fielddata instead of Lucene-based range filtering, so it's comparing apples to oranges in this benchmark. Also, I find it tends to be slower.
  • Replaces multiple should clauses in the query with a single match + multiple terms. A single match with multiple terms translates into multiple Lucene terms OR'd together, so you don't need an extra bool to wrap them.
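
To make the three changes concrete, here is a rough sketch of the query shapes before and after. The "before" builders are a reconstruction from the bullet points above, not the original benchmark code, and all function names are made up for illustration. Nothing here talks to Elasticsearch; the functions only build the request fragments as Python dicts.

```python
def and_or_filter(lics):
    """Hypothetical 'before' filter: and/or wrappers plus numeric_range."""
    return {"or": [
        {"and": [
            {"term": {"source": source}},
            {"numeric_range": {"level": {"gte": 1, "lte": max_level}}},
        ]}
        for source, max_level in lics
    ]}


def bool_filter(lics):
    """'After' filter: the same logic expressed with bool and range."""
    return {"bool": {"should": [
        {"bool": {"must": [
            {"term": {"source": source}},
            {"range": {"level": {"gte": 1, "lte": max_level}}},
        ]}}
        for source, max_level in lics
    ]}}


def should_of_matches(terms, field="text"):
    """Hypothetical 'before' query: one match clause per term in a bool."""
    return {"bool": {"should": [{"match": {field: t}} for t in terms]}}


def single_match(terms, field="text"):
    """'After' query: one match; its terms are OR'd by default."""
    return {"match": {field: " ".join(terms)}}


if __name__ == "__main__":
    import json
    # Example (source, max level) pairs, invented for this sketch.
    lics = [("a", 2), ("b", 5)]
    print(json.dumps(bool_filter(lics), indent=2))
    print(json.dumps(single_match(["virtue", "reason"]), indent=2))
```

The `bool_filter` and `single_match` outputs correspond to the `filter` and `query` parts of the `filtered` query in `search_es` above.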
$ python loadtester.py --es "http://localhost:9200/speedtest/_search" -i ../data/stoicism.txt -o test2.txt --ns 10000 --nt 3 --nf 10
0 17234 0.469129085541
1000 10241 0.0103521347046
2000 18599 0.0117888450623
3000 9496 0.00943398475647
4000 7503 0.00943303108215
5000 47209 0.0126769542694
6000 50272 0.0118138790131
7000 6506 0.0116741657257
8000 43656 0.0117161273956
9000 44132 0.012815952301


$ python analyser.py test2.txt
196173610 results in 10000 searches (mean 19617)
0.01s mean query time, 0.47s max, 0.01s min
50%   of qtimes <= 0.01s
90%   of qtimes <= 0.01s
99%   of qtimes <= 0.02s
99.9% of qtimes <= 0.05s