polyfractal/Results.md

## gistfile1.py
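import json

import requests


def search_es(q, lics):
    # Each entry of lics is a (source, max level) pair. Build one filter per
    # entry: source matches AND 1 <= level <= max level.
    fq = []
    for lic in lics:
        fq.append(
            {
                "bool": {
                    "must": [
                         {"term": {"source": lic[0]}},
                         {"range": {"level": {"gte": 1, "lte": lic[1]}}}
                    ]
                }
            })

    # OR the per-entry filters together and apply them to a single match
    # query over all the search terms.
    body = {
        "size": 10,
        "query": {
            "filtered": {
                "filter": {"bool": {"should": fq} },
                "query": {
                    "match": {"text": ' '.join(q) }
                }
            }
        }
    }

    # args is expected to exist at module level (the parsed command-line
    # options): args.es is the _search URL, args.fac toggles aggregations.
    if args.fac:
        body["aggs"] = {
            "levels": {"terms": {"field": "level"}},
            "sources": {"terms": {"field": "source"}}
        }

    resp = requests.post(args.es, json.dumps(body))
    return resp.json()['hits']['total']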


## Results.md

      
Ran on my MacBook Air against half a million docs. Single node, 5 primary shards, 0 replicas. The node was restarted between runs to make sure all caches were cleared.

### Existing benchmark

$ python loadtester.py --es "http://localhost:9200/speedtest/_search" -i ../data/stoicism.txt -o test1.txt --ns 10000 --nt 3 --nf 10
0 26004 1.36110687256
1000 5561 0.0182199478149
2000 10516 0.0134048461914
3000 42137 0.0833399295807
4000 34922 0.0168430805206
5000 5408 0.00911998748779
6000 45315 0.0210130214691
7000 42732 0.0193800926208
8000 5393 0.0104150772095
9000 8031 0.015035867691

$ python analyser.py test1.txt
180868227 results in 10000 searches (mean 18086)
0.02s mean query time, 1.36s max, 0.01s min
50%   of qtimes <= 0.01s
90%   of qtimes <= 0.02s
99%   of qtimes <= 0.05s
99.9% of qtimes <= 0.16s
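
The summary above can be reproduced from the raw timings with a few lines of Python. This is only a sketch, not the actual `analyser.py` (which isn't shown here): it assumes the output file holds one `index hits seconds` line per search, in the same shape as the console output, and it uses nearest-rank percentiles. The `summarize` function and its dictionary keys are names made up for illustration.

```python
import math


def summarize(lines):
    """Summarize 'index hits seconds' lines (assumed format) into basic stats."""
    hits, times = [], []
    for line in lines:
        parts = line.split()
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        hits.append(int(parts[1]))
        times.append(float(parts[2]))
    if not times:
        raise ValueError("no timing lines found")
    times.sort()

    def percentile(p):
        # Nearest-rank: smallest sample with at least p% of samples <= it.
        rank = max(1, math.ceil(p / 100.0 * len(times)))
        return times[rank - 1]

    return {
        "searches": len(times),
        "total_hits": sum(hits),
        "mean_hits": sum(hits) // len(hits),
        "mean_time": sum(times) / len(times),
        "max_time": times[-1],
        "min_time": times[0],
        "percentiles": {p: percentile(p) for p in (50, 90, 99, 99.9)},
    }


# Three lines copied from the console output above, just to exercise the code.
sample = [
    "0 26004 1.36110687256",
    "1000 5561 0.0182199478149",
    "2000 10516 0.0134048461914",
]
summary = summarize(sample)
print(summary["searches"], summary["total_hits"], summary["max_time"])
```

Note how much the first search dominates the max: it is the cold query right after the node restart, which is why the percentiles say more than the max does.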

### More optimized query


- Replaces and/or/not with bool. The query is equivalent, but bool is optimized to handle bitset-based filters (such as range/term).
- Replaces numeric_range with range. First, numeric_range is deprecated as of 0.90.8 (replaced by a fielddata mode in the range filter). Second, it operates on fielddata instead of Lucene-based range filtering, so it's comparing apples to oranges in this benchmark. I also find it tends to be slower.
- Replaces multiple should clauses in the query with a single match containing multiple terms. A single match with multiple terms translates into multiple Lucene terms OR'd together, so you don't need an extra bool to wrap them.
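
To make the three changes concrete, here is a rough sketch of the before and after request shapes. The "before" versions are guesses at what the existing benchmark sent, since that query isn't shown here; the function names, the sample sources `"a"`/`"b"`, and the sample words are all hypothetical. The "after" versions match the filter and query built in `search_es`.

```python
import json


def old_style_filter(lics):
    """Hypothetical 'before' shape: or/and filters wrapping numeric_range."""
    return {
        "or": [
            {
                "and": [
                    {"term": {"source": source}},
                    {"numeric_range": {"level": {"gte": 1, "lte": max_level}}},
                ]
            }
            for source, max_level in lics
        ]
    }


def bool_filter(lics):
    """'After' shape: same logic as bool/should of bool/must, using range."""
    return {
        "bool": {
            "should": [
                {
                    "bool": {
                        "must": [
                            {"term": {"source": source}},
                            {"range": {"level": {"gte": 1, "lte": max_level}}},
                        ]
                    }
                }
                for source, max_level in lics
            ]
        }
    }


def old_style_query(words):
    """Hypothetical 'before' shape: one should clause per search word."""
    return {"bool": {"should": [{"match": {"text": w}} for w in words]}}


def single_match_query(words):
    """'After' shape: one match whose terms are OR'd together."""
    return {"match": {"text": " ".join(words)}}


lics = [("a", 3), ("b", 1)]    # made-up (source, max level) pairs
words = ["virtue", "reason"]   # made-up search words
print(json.dumps(bool_filter(lics), indent=2))
print(json.dumps(single_match_query(words)))
```

Both filter builders express the same condition (any source with a level between 1 and its max), so the comparison is about how the filter is executed, not about which documents it selects.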

$ python loadtester.py --es "http://localhost:9200/speedtest/_search" -i ../data/stoicism.txt -o test2.txt --ns 10000 --nt 3 --nf 10
0 17234 0.469129085541
1000 10241 0.0103521347046
2000 18599 0.0117888450623
3000 9496 0.00943398475647
4000 7503 0.00943303108215
5000 47209 0.0126769542694
6000 50272 0.0118138790131
7000 6506 0.0116741657257
8000 43656 0.0117161273956
9000 44132 0.012815952301


$ python analyser.py test2.txt
196173610 results in 10000 searches (mean 19617)
0.01s mean query time, 0.47s max, 0.01s min
50%   of qtimes <= 0.01s
90%   of qtimes <= 0.01s
99%   of qtimes <= 0.02s
99.9% of qtimes <= 0.05s