This document covers benchmarking of the ANN search implementation provided by Lucene 9.1, analysis of the benchmark results, and how it compares with the ANN implementation in the k-NN plugin, which is based on nmslib HNSW.
For this benchmark we use an experimental branch of OpenSearch, based on OpenSearch 2.0, with Lucene 9.1 integrated. Some basic code from the k-NN plugin has been ported to the POC branch with minor adjustments that make it compatible with OpenSearch.
The final deployed artifact is at POC level and supports only straightforward scenarios.
For benchmarking we use two datasets with the following parameters:
Dataset | Train size | Dimensions | Distance | Format |
---|---|---|---|---|
SIFT | 1,000,000 | 128 | L2 | hdf5 |
GloVe | 1,183,514 | 100 | Angular | hdf5 |
Every cluster has multiple dedicated leader nodes and one data node. Leader nodes use the c5.xlarge EC2 instance type; the data node uses the r5.8xlarge EC2 instance type.
Memory allocation for data nodes is set to 32 GB. While the k-NN plugin uses a platform-native implementation that stores its data structures off heap, the HNSW algorithm from Lucene 9.1 is implemented in Java and uses the heap to store the generated graph.
The following sets of parameters were used for the experiments.

Lucene 9.1:

max connections | beam width | shard count | replica count |
---|---|---|---|
4-96 | 512 | 1 | 0 |

k-NN plugin (nmslib):

m | ef_construction | ef_search | shard count | replica count |
---|---|---|---|---|
4-96 | 512 | 512 | 1 | 0 |
For each implementation we conducted a series of experiments with `m` (or `max_connections`) values from 4 to 96. The values of `ef_construction` and `beam_width` were fixed at 512.
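As an illustrative sketch of how the k-NN plugin side of such an experiment can be configured (the index name, field name, and host URL are assumptions, and the exact mappings used in the POC branch for the Lucene 9.1 engine may differ), an index for the SIFT dataset with `m = 16` and the fixed `ef_construction`/`ef_search` values could be created as follows:

```python
import requests

# Hypothetical index body for the k-NN (nmslib) engine; parameter names
# follow the public k-NN plugin mapping, values match the experiment setup.
index_body = {
    "settings": {
        "index": {
            "knn": True,
            "knn.algo_param.ef_search": 512,  # fixed for all experiments
            "number_of_shards": 1,
            "number_of_replicas": 0,
        }
    },
    "mappings": {
        "properties": {
            "vector": {                        # assumed field name
                "type": "knn_vector",
                "dimension": 128,              # SIFT dimensionality
                "method": {
                    "name": "hnsw",
                    "space_type": "l2",        # L2 distance for SIFT
                    "engine": "nmslib",
                    "parameters": {"m": 16, "ef_construction": 512},
                },
            }
        }
    },
}

requests.put("http://localhost:9200/sift-knn", json=index_body)
```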
For indexing, the following metrics were produced:
- indexing throughput (docs/second) — The number of documents per second that were made searchable. To calculate this metric, the following formula was used (see the sketch after this list):
  `data_set_size / (total_index_time_s + total_refresh_time_s)`
- index size after refresh (GB) — Size of the index in GB after the refresh step finishes.
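A minimal sketch of the throughput calculation above; the timing values in the example are illustrative, not measured results:

```python
def indexing_throughput(data_set_size: int,
                        total_index_time_s: float,
                        total_refresh_time_s: float) -> float:
    # Documents made searchable per second; refresh time counts because
    # documents only become searchable after the refresh completes.
    return data_set_size / (total_index_time_s + total_refresh_time_s)

# Example with illustrative timings for the SIFT dataset:
print(indexing_throughput(1_000_000, 900.0, 30.0))  # ~1075.3 docs/second
```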
For querying, the following metrics were produced:
- p50, p90, p99 query latency (ms) — percentile metrics for the time it took to process a single query, excluding network latency. For the k-NN plugin we did not perform an additional warm-up before the search query step.
- recall@k (fraction between 0.0 and 1.0) — the fraction of ground-truth results found in the results returned by the plugin (see the sketch after this list).
- memory (GB) — Amount of memory used during search. For the k-NN plugin this is the memory taken by the native data structures, retrieved via the plugin-specific stats API. For the Lucene 9.x solution we used the OpenSearch nodes stats API and summed the heap memory used across all nodes.
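A minimal sketch of how these query-phase metrics can be computed; the `recall_at_k` helper and the host URL are assumptions, while the two endpoints are the public k-NN stats and OpenSearch nodes stats APIs mentioned above:

```python
import requests

def recall_at_k(retrieved_ids, ground_truth_ids, k: int) -> float:
    # Fraction of the k ground-truth neighbors present in the top-k results.
    return len(set(retrieved_ids[:k]) & set(ground_truth_ids[:k])) / k

# k-NN plugin: graph_memory_usage is reported per node by the plugin stats API.
knn_stats = requests.get("http://localhost:9200/_plugins/_knn/stats").json()
native_memory = sum(
    node.get("graph_memory_usage", 0) for node in knn_stats["nodes"].values())

# Lucene 9.x solution: sum JVM heap usage across all nodes.
jvm_stats = requests.get("http://localhost:9200/_nodes/stats/jvm").json()
heap_used_bytes = sum(
    node["jvm"]["mem"]["heap_used_in_bytes"]
    for node in jvm_stats["nodes"].values())
```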
For the load test we ran the query workload against the cluster uninterruptedly for 168 hours (7 days). The intention was to check whether performance and/or cluster stability change over time under constant load. Four different datasets and clusters with different configurations were used. There was no noticeable degradation in performance or cluster stability.
Based on the collected metrics, the existing k-NN plugin is capable of reaching higher recall values. When recall values for both solutions are equal, the solution based on Lucene 9.1 consumes less memory (for both index storage and queries) and shows better latencies.
The solution based on Lucene 9.1 proved stable: during 168 hours (7 days) of constant query load there was no degradation of performance or cluster stability.
The overall conclusion for the Lucene 9.1 solution is as follows: it is stable and shows better query latencies with lower memory consumption. The downsides are lower recall and much lower indexing throughput.
I opened opensearch-project/documentation-website#1933 to document these and other differences.