@martin-gaievski
Last active November 16, 2022 20:16
Benchmarks for ANN based on Lucene 9.1 HNSW implementation

Summary

This document covers benchmarking and analysis of benchmark results for the ANN search implementation provided by Lucene 9.1, and how it compares with the ANN implementation from the k-NN plugin that is based on nmslib HNSW.

Search algorithm

For this benchmark we use an experimental branch of OpenSearch with Lucene 9.1 integrated, based on OpenSearch 2.0. Some basic code from the k-NN plugin has been integrated into the POC branch with minor adjustments that make the code compatible with OpenSearch.

The final deployed artifact is at POC level and supports only straightforward scenarios.

Environment

Data set

For benchmarking we use two datasets with the following parameters:

| Dataset | Train size | Dimensions | Distance | Format |
|---------|------------|------------|----------|--------|
| SIFT    | 1,000,000  | 128        | L2       | hdf5   |
| GloVe   | 1,183,514  | 100        | Angular  | hdf5   |

Cluster Configuration

Every cluster has multiple dedicated leader nodes and one data node. Leader nodes use the c5.xlarge EC2 instance type; the data node uses the r5.8xlarge EC2 instance type.

Memory allocation for data nodes is set to 32 GB. While the k-NN plugin uses a platform-native implementation that stores data off-heap, the HNSW algorithm from Lucene 9.1 is pure Java and uses the heap to store the generated graph.

Algorithm parameters

The following parameters were used for the experiments:

Lucene hnsw

| max connections | beam width | shard count | replica count |
|-----------------|------------|-------------|---------------|
| 4-96            | 512        | 1           | 0             |

KNN hnsw

| m    | ef_construction | ef_search | shard count | replica count |
|------|-----------------|-----------|-------------|---------------|
| 4-96 | 512             | 512       | 1           | 0             |
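As an illustration, the k-NN parameters above map onto an OpenSearch index definition roughly as follows. This is a sketch using the standard k-NN plugin mapping syntax; the index/field names and the concrete `m` value of 16 are illustrative, not taken from the experiments.

```python
# Sketch of an OpenSearch k-NN index body for the nmslib HNSW configuration above.
# Field name "my_vector" and m=16 are illustrative; other values match the tables.
index_body = {
    "settings": {
        "index": {
            "knn": True,
            "knn.algo_param.ef_search": 512,  # ef_search from the table above
            "number_of_shards": 1,
            "number_of_replicas": 0,
        }
    },
    "mappings": {
        "properties": {
            "my_vector": {
                "type": "knn_vector",
                "dimension": 128,  # SIFT dimensionality
                "method": {
                    "name": "hnsw",
                    "space_type": "l2",
                    "engine": "nmslib",
                    "parameters": {"m": 16, "ef_construction": 512},
                },
            }
        }
    },
}
```

A body like this would be sent with `PUT /<index-name>`; each experiment in the sweep varies only `parameters.m`.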

For each implementation we conducted a series of experiments with m or max_connections values from 4 to 96. The values of ef_construction and beam_width are fixed at 512.

For the k-NN plugin we did not do an additional warm-up before the search query step.

Index Test

Metrics

For indexing, the following metrics were produced:

  1. indexing throughput (docs/second) — the number of documents per second that were made searchable. To calculate this metric, the following formula was used:

data_set_size / (total_index_time_s + total_refresh_time_s)

  2. index size after refresh (GB) — size of the index in GB after the refresh step finishes.
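The throughput formula above can be sketched as follows (the timing numbers are hypothetical, not benchmark results):

```python
def indexing_throughput(data_set_size, total_index_time_s, total_refresh_time_s):
    """Documents per second made searchable: dataset size over index + refresh time."""
    return data_set_size / (total_index_time_s + total_refresh_time_s)

# Hypothetical run: 1M SIFT vectors, 1800 s of indexing, 200 s of refresh.
print(indexing_throughput(1_000_000, 1800, 200))  # 500.0 docs/second
```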

Results

Index throughput - SIFT

(figure)

Index throughput - GloVe

(figure)

Index size - SIFT

(figure)

Index size - GloVe

(figure)

Query Tests

Metrics

For querying, the following metrics were produced:

  1. p50, p90, p99 query latency (ms) — percentile metrics for the time it took to process a single query, excluding network latency. For the k-NN plugin, no additional warm-up was done before the search query step.
  2. recall@k (fraction between 0.0 and 1.0) — fraction of the ground truth results found in the results returned by the plugin.
  3. memory (GB) — amount of memory used during search. For the k-NN plugin this is the memory taken by native data structures, retrieved via the plugin-specific stats API. For the Lucene 9.x solution we used the OpenSearch stats API and combined the heap memory used across all nodes.
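recall@k reduces to a set overlap between the true nearest neighbors and the returned results; a minimal sketch (the document IDs are made up):

```python
def recall_at_k(ground_truth_ids, returned_ids, k):
    """Fraction of the true top-k neighbors present in the returned top-k results."""
    truth = set(ground_truth_ids[:k])
    hits = sum(1 for doc_id in returned_ids[:k] if doc_id in truth)
    return hits / k

# Made-up example: 4 of the 5 ground truth neighbors appear in the response.
print(recall_at_k([11, 7, 42, 3, 19], [11, 7, 42, 3, 99], 5))  # 0.8
```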

Results

p50 - SIFT

(figure)

p50 - GloVe

(figure)

p90 - SIFT

(figure)

p90 - GloVe

(figure)

p99 - SIFT

(figure)

p99 - GloVe

(figure)

Query memory - SIFT

(figure)

Query memory - GloVe

(figure)

Load Tests

For the load test we ran the query test on the cluster uninterrupted for 168 hours (7 days). The intention was to check whether performance and/or cluster stability change over time under constant load. Four different datasets and clusters with different configurations were used. There was no noticeable degradation in performance or cluster stability.

Conclusion

Based on the collected metrics, the existing k-NN plugin is capable of reaching higher recall values. When recall values for both solutions are equal, the solution based on Lucene 9.1 consumes less memory (both index storage and query) and shows better latencies.

The solution based on Lucene 9.1 proved stable: during 168 hours (7 days) of constant query load there was no degradation of performance or cluster stability.

The overall conclusion for the Lucene 9.1 solution is the following: it is stable and shows better query latencies with lower memory consumption. The downsides are lower recall and much lower indexing throughput.

dblock commented Nov 16, 2022

I opened opensearch-project/documentation-website#1933 to document these and other differences.
