This document covers benchmarking of the ANN search implementation provided by Lucene 9.1, analysis of the benchmark results, and how it compares with the ANN implementation in the k-NN plugin, which is based on nmslib HNSW.
For this benchmark we use an experimental branch of OpenSearch, based on OpenSearch 2.0, with Lucene 9.1 integrated. Some basic code from the k-NN plugin has been ported to the POC branch with minor adjustments that make it compatible with OpenSearch.
The final deployed artifact is at POC level and supports only straightforward scenarios.
For benchmarking we use two datasets with the following parameters:
Dataset | Train size | Dimensions | Distance | Format |
---|---|---|---|---|
SIFT | 1,000,000 | 128 | L2 | hdf5 |
GloVe | 1,183,514 | 100 | Angular | hdf5 |
Every cluster has multiple dedicated leader nodes and one data node. Leader nodes use the c5.xlarge EC2 instance type; the data node uses the r5.8xlarge EC2 instance type.
Memory allocation for data nodes is set to 32 GB. While the k-NN plugin uses a platform-native implementation that stores its data structures off heap, the HNSW algorithm from Lucene 9.1 is implemented in Java and uses the heap to store the generated graph.
The following sets of parameters were used for the experiments.

Lucene 9.1:

max connections | beam width | shard count | replica count |
---|---|---|---|
4-96 | 512 | 1 | 0 |

k-NN plugin (nmslib):

m | ef_construction | ef_search | shard count | replica count |
---|---|---|---|---|
4-96 | 512 | 512 | 1 | 0 |
For each implementation we conducted a series of experiments with `m` (or `max_connections`) values from 4 to 96. The values of `ef_construction` and `beam_width` were fixed at 512.
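As an illustrative sketch of how the k-NN plugin side of such an experiment can be configured (the index name, field name, and host URL are assumptions, and the exact mappings used in the POC branch for the Lucene 9.1 engine may differ), an index for the SIFT dataset with `m = 16` and the fixed `ef_construction`/`ef_search` values could be created as follows:

```python
import requests

# Hypothetical index body for the k-NN (nmslib) engine; parameter names
# follow the public k-NN plugin mapping, values match the experiment setup.
index_body = {
    "settings": {
        "index": {
            "knn": True,
            "knn.algo_param.ef_search": 512,  # fixed for all experiments
            "number_of_shards": 1,
            "number_of_replicas": 0,
        }
    },
    "mappings": {
        "properties": {
            "vector": {                        # assumed field name
                "type": "knn_vector",
                "dimension": 128,              # SIFT dimensionality
                "method": {
                    "name": "hnsw",
                    "space_type": "l2",        # L2 distance for SIFT
                    "engine": "nmslib",
                    "parameters": {"m": 16, "ef_construction": 512},
                },
            }
        }
    },
}

requests.put("http://localhost:9200/sift-knn", json=index_body)
```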
For indexing, the following metrics were produced:
- indexing throughput (docs/second) — The number of documents per second that were made searchable. To calculate this metric, the following formula was used (see the sketch after this list):
  `data_set_size / (total_index_time_s + total_refresh_time_s)`
- index size after refresh (GB) — Size of the index in GB after the refresh step finishes.
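A minimal sketch of the throughput calculation above; the timing values in the example are illustrative, not measured results:

```python
def indexing_throughput(data_set_size: int,
                        total_index_time_s: float,
                        total_refresh_time_s: float) -> float:
    # Documents made searchable per second; refresh time counts because
    # documents only become searchable after the refresh completes.
    return data_set_size / (total_index_time_s + total_refresh_time_s)

# Example with illustrative timings for the SIFT dataset:
print(indexing_throughput(1_000_000, 900.0, 30.0))  # ~1075.3 docs/second
```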
For querying, the following metrics were produced:
- p50, p90, p99 query latency (ms) — percentile metrics for the time it took to process a single query, excluding network latency. For the k-NN plugin we did not perform an additional warm-up before the search query step.
- recall@k (fraction between 0.0 and 1.0) — the fraction of ground-truth results found in the results returned by the plugin (see the sketch after this list).
- memory (GB) — Amount of memory used during search. For the k-NN plugin this is the memory taken by the native data structures, retrieved via the plugin-specific stats API. For the Lucene 9.x solution we used the OpenSearch nodes stats API and summed the heap memory used across all nodes.
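A minimal sketch of how these query-phase metrics can be computed; the `recall_at_k` helper and the host URL are assumptions, while the two endpoints are the public k-NN stats and OpenSearch nodes stats APIs mentioned above:

```python
import requests

def recall_at_k(retrieved_ids, ground_truth_ids, k: int) -> float:
    # Fraction of the k ground-truth neighbors present in the top-k results.
    return len(set(retrieved_ids[:k]) & set(ground_truth_ids[:k])) / k

# k-NN plugin: graph_memory_usage is reported per node by the plugin stats API.
knn_stats = requests.get("http://localhost:9200/_plugins/_knn/stats").json()
native_memory = sum(
    node.get("graph_memory_usage", 0) for node in knn_stats["nodes"].values())

# Lucene 9.x solution: sum JVM heap usage across all nodes.
jvm_stats = requests.get("http://localhost:9200/_nodes/stats/jvm").json()
heap_used_bytes = sum(
    node["jvm"]["mem"]["heap_used_in_bytes"]
    for node in jvm_stats["nodes"].values())
```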
For the load test we ran the query workload against the cluster uninterruptedly for 168 hours (7 days). The intention was to check whether performance and/or cluster stability change over time under constant load. Four different datasets and clusters with different configurations were used. There was no noticeable degradation in performance or cluster stability.
Based on the collected metrics, the existing k-NN plugin is capable of reaching higher recall values. When recall values for both solutions are equal, the solution based on Lucene 9.1 consumes less memory (for both index storage and queries) and shows better latencies.
The solution based on Lucene 9.1 proved stable: during 168 hours (7 days) of constant query load there was no degradation of performance or cluster stability.
The overall conclusion for the Lucene 9.1 solution is as follows: it is stable and shows better query latencies with lower memory consumption. The downsides are lower recall and much lower indexing throughput.
I opened opensearch-project/documentation-website#1933 to document these and other differences.