@alexklibisz
Last active October 15, 2021 16:06
Elastiknn / Big-ann-benchmarks Setup

This document describes how to run Elastiknn on the big-ann-benchmarks challenge. It's admittedly a little late in the game for this benchmarking challenge. IIRC the deadline is October 22, 2021, and I'm writing this on October 15. But hey, the neighbors aren't gonna find themselves. We can still use this as an opportunity to improve Elastiknn.

The setup is currently pretty experimental, so bring your elbow grease.

Part 1: Set up the Elastiknn project

  1. Clone the alexklibisz/elastiknn repo and check out the elastiknn-278-lucene-benchmarks branch. That's where I've been working on the big-ann-benchmarks integration and improvements.
$ git clone git@github.com:alexklibisz/elastiknn.git
$ cd elastiknn
$ git fetch --all
$ git checkout elastiknn-278-lucene-benchmarks
  2. Make sure you can produce a JAR from the project. It might help to refer to the developer guide.
$ ./gradlew shadowJar
...
BUILD SUCCESSFUL ...
$ find . -name 'ann-benchmarks-*.jar'
./elastiknn-ann-benchmarks/build/libs/ann-benchmarks-7.14.1.1-all.jar
  3. To be extra sure things work, you can try running the test suite:
$ task cluster:run
... docker containers booting up ...
$ task jvm:test

Part 2: Set up the big-ann-benchmarks project

  1. Clone the harsha-simhadri/big-ann-benchmarks repo and check out my elastiknn branch. _Make sure that this is in a directory adjacent to the elastiknn project, e.g., ~/elastiknn and ~/big-ann-benchmarks._
$ git clone git@github.com:harsha-simhadri/big-ann-benchmarks.git
$ cd big-ann-benchmarks
$ git remote add alexklibisz git@github.com:alexklibisz/big-ann-benchmarks.git
$ git fetch --all
$ git checkout alexklibisz/elastiknn
  2. Set up your Python environment according to the READMEs in the repo. I just used virtualenv.
  3. Make sure you can run the unit tests. Here they are as standalone commands, copied from the big-ann-benchmarks CI workflow:
$ export LIBRARY=httpann_example
$ export ALGORITHM=httpann_example
$ export DATASET=random-xs
$ python install.py
$ python create_dataset.py --dataset $DATASET
$ python run.py --algorithm $ALGORITHM --max-n-algorithms 2 --dataset $DATASET --timeout 600
$ sudo chmod -R 777 results/
$ python plot.py --dataset $DATASET --output plot.png

The last command should produce output like this:

Computing knn metrics
  0: http-ann-example-euclidean-1.0        1.000     3790.862
Found cached result
  1: http-ann-example-euclidean-0.2        0.390     3144.295
Computing knn metrics
  2: http-ann-example-euclidean-0.8        0.925     3792.054
  4. If that didn't work, don't proceed. Nothing else will work.
  5. Run the test.sh script. This will go over to the elastiknn directory, build the JAR, come back to the big-ann-benchmarks directory, build the docker container, download a dataset, and start running ANN search on that dataset. I recommend setting the DATASET variable in test.sh to DATASET=msturing-1M. That dataset has only 1M vectors, so it's fast enough to see if things work. The problem with this dataset is that it has no ground-truth, so you can't actually compute recall.
  6. Then set DATASET=deep-10M to run on a 10x larger dataset which does have ground-truth.
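For reference, the recall number in the plot output above boils down to roughly this computation. This is a simplified sketch of my own; the actual big-ann-benchmarks implementation handles distance ties and other edge cases:

```python
def recall_at_k(approx_neighbors, true_neighbors, k):
    """Fraction of the true top-k neighbors that the algorithm returned.

    approx_neighbors, true_neighbors: lists of neighbor-id lists, one per query.
    """
    hits = 0
    for approx, true in zip(approx_neighbors, true_neighbors):
        # Count how many of the true top-k this query actually found.
        hits += len(set(approx[:k]) & set(true[:k]))
    return hits / (k * len(true_neighbors))

# Two queries, k=2: first query finds both true neighbors, second finds one.
print(recall_at_k([[1, 2], [5, 9]], [[1, 2], [5, 6]], k=2))  # 0.75
```

This is why the msturing-1M run in step 5 can't report recall: without a ground-truth file there are no `true_neighbors` to compare against.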

Part 3: Solve ANN

Some tips:

  • All of the elastiknn code for big-ann-benchmarks is in elastiknn-ann-benchmarks/src/main/scala/com/elastiknn/annb. The main entrypoint is Server.scala, which is an akka-http server that accepts requests from the Python Elastiknn model.
  • The elastiknn model in big-ann-benchmarks lives in benchmarks/algorithms/elastiknn.py. It's based on the HttpANN algorithm. Read this PR to understand how that works.
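To make the moving parts concrete, here's a minimal sketch of an HTTP-based ANN client in the spirit of HttpANN. The endpoint path (`/query`) and payload fields are my own illustrative assumptions, not the actual HttpANN protocol; read the PR above for the real contract.

```python
import json
import urllib.request


class HttpAnnClient:
    """Sketch of a client that delegates ANN queries to an HTTP server,
    the way the Python elastiknn model delegates to the akka-http server.

    Endpoint and payload shape are hypothetical, for illustration only.
    """

    def __init__(self, base_url: str):
        self.base_url = base_url

    def _post(self, path: str, payload: dict) -> dict:
        # POST a JSON payload and decode the JSON response.
        req = urllib.request.Request(
            self.base_url + path,
            data=json.dumps(payload).encode(),
            headers={"Content-Type": "application/json"},
        )
        with urllib.request.urlopen(req) as resp:
            return json.loads(resp.read())

    def query(self, vector: list, k: int) -> list:
        # Hypothetical endpoint: POST /query with the vector and k,
        # expecting {"neighbors": [...]} back.
        return self._post("/query", {"vector": vector, "k": k})["neighbors"]
```

The point of this indirection is that the Python side stays a thin shim: all the actual indexing and search logic lives behind the HTTP boundary, in the JVM.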
  • The runner.py is modified to expose JVM/JMX metrics on port 9091. This means we can use VisualVM to connect to this port and profile the JVM.
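If you ever need to reproduce that outside the modified runner.py, these are the standard JVM flags for exposing unauthenticated remote JMX on a port (the exact values here are illustrative; check runner.py for what's actually set):

```shell
# Standard JVM flags to expose JMX for VisualVM; adjust port/hostname as needed.
java \
  -Dcom.sun.management.jmxremote \
  -Dcom.sun.management.jmxremote.port=9091 \
  -Dcom.sun.management.jmxremote.rmi.port=9091 \
  -Dcom.sun.management.jmxremote.authenticate=false \
  -Dcom.sun.management.jmxremote.ssl=false \
  -Djava.rmi.server.hostname=localhost \
  -jar ann-benchmarks-7.14.1.1-all.jar
```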
  • Here are some hyperparameter settings and results I've been able to get on the deep-10M dataset. These are running on my Dell XPS i7 w/ 6 cores, 12 threads, and 12 Lucene segments. They are abysmally slow. We definitely need some algorithmic improvements to get to billion scale:
| L   | k | w | candidates | probes | recall | QPS   |
|-----|---|---|------------|--------|--------|-------|
| 100 | 3 | 1 | 100        | 3      | 0.832  | 2.641 |
| 100 | 3 | 1 | 1000       | 1      | 0.896  | 1.441 |
| 100 | 3 | 1 | 100        | 0      | 0.523  | 7.001 |
| 100 | 3 | 1 | 100        | 1      | 0.700  | 4.348 |
| 100 | 3 | 1 | 1000       | 0      | 0.756  | 1.840 |
| 100 | 3 | 1 | 100        | 6      | 0.893  | 1.756 |
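To eyeball the trade-off, here's a small sketch that filters the table above down to its Pareto frontier, i.e., the settings where you can't improve recall without giving up QPS:

```python
# Each tuple: (candidates, probes, recall, qps), taken from the table above.
results = [
    (100, 3, 0.832, 2.641),
    (1000, 1, 0.896, 1.441),
    (100, 0, 0.523, 7.001),
    (100, 1, 0.700, 4.348),
    (1000, 0, 0.756, 1.840),
    (100, 6, 0.893, 1.756),
]

def pareto(points):
    """Keep points not weakly dominated on both recall and QPS by another point."""
    return [p for p in points
            if not any(q[2] >= p[2] and q[3] >= p[3] and q != p for q in points)]

for cand, probes, recall, qps in sorted(pareto(results), key=lambda p: p[2]):
    print(f"candidates={cand} probes={probes} recall={recall} qps={qps}")
```

Only the candidates=1000/probes=0 row gets dropped: candidates=100/probes=3 beats it on both recall and QPS, so there's no reason to run that setting.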