@alexklibisz
Last active October 15, 2021 16:06
Elastiknn / Big-ann-benchmarks Setup

This document describes how to run Elastiknn on the big-ann-benchmarks challenge. It's admittedly a little late in the game for this benchmarking challenge. IIRC the deadline is October 22, 2021, and I'm writing this on October 15. But hey, the neighbors aren't gonna find themselves. We can still use this as an opportunity to improve Elastiknn.

The setup is currently pretty experimental, so bring your elbow grease.

Part 1: Set up the Elastiknn project

  1. Clone the alexklibisz/elastiknn repo and check out the elastiknn-278-lucene-benchmarks branch. That's where I've been working on the big-ann-benchmarks integration and improvements.
$ git clone git@github.com:alexklibisz/elastiknn.git
$ cd elastiknn
$ git fetch --all
$ git checkout elastiknn-278-lucene-benchmarks
  2. Make sure you can produce a JAR from the project. It might help to refer to the developer guide.
$ ./gradlew shadowJar
...
BUILD SUCCESSFUL ...
$ find . -name 'ann-benchmarks-*.jar'
./elastiknn-ann-benchmarks/build/libs/ann-benchmarks-7.14.1.1-all.jar
  3. To be extra sure things work, you can try running the test suite:
$ task cluster:run
... docker containers booting up ...
$ task jvm:test

Part 2: Set up the big-ann-benchmarks project

  1. Clone the harsha-simhadri/big-ann-benchmarks repo and check out my elastiknn branch. _Make sure that this is in a directory adjacent to the elastiknn project, e.g., ~/elastiknn and ~/big-ann-benchmarks._
$ git clone git@github.com:harsha-simhadri/big-ann-benchmarks.git
$ cd big-ann-benchmarks
$ git remote add alexklibisz git@github.com:alexklibisz/big-ann-benchmarks.git
$ git fetch --all
$ git checkout alexklibisz/elastiknn
  2. Set up your Python environment according to the READMEs in the repo. I just used virtualenv.
  3. Make sure you can run the unit tests. Here they are as standalone commands, copied from the big-ann-benchmarks CI workflow:
$ export LIBRARY=httpann_example
$ export ALGORITHM=httpann_example
$ export DATASET=random-xs
$ python install.py
$ python create_dataset.py --dataset $DATASET
$ python run.py --algorithm $ALGORITHM --max-n-algorithms 2 --dataset $DATASET --timeout 600
$ sudo chmod -R 777 results/
$ python plot.py --dataset $DATASET --output plot.png

The last command should produce output like this:

Computing knn metrics
  0: http-ann-example-euclidean-1.0        1.000     3790.862
Found cached result
  1: http-ann-example-euclidean-0.2        0.390     3144.295
Computing knn metrics
  2: http-ann-example-euclidean-0.8        0.925     3792.054
  4. If that didn't work, don't proceed. Nothing else will work.
  5. Run the test.sh script. This will go over to the elastiknn directory, build the JAR, come back to the big-ann-benchmarks directory, build the docker container, download a dataset, and start running ANN search on that dataset. I recommend setting the DATASET variable in test.sh to DATASET=msturing-1M. That dataset has only 1M vectors, so it's fast enough to see if things work. The problem with this dataset is that it has no ground-truth, so you can't actually compute recall.
  6. Then set DATASET=deep-10M to run on a 10x larger dataset which does have ground-truth.
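For reference, the recall number in the plot output above boils down to roughly this computation. This is a simplified sketch of my own; the actual big-ann-benchmarks implementation handles distance ties and other edge cases:

```python
def recall_at_k(approx_neighbors, true_neighbors, k):
    """Fraction of the true top-k neighbors that the algorithm returned.

    approx_neighbors, true_neighbors: lists of neighbor-id lists, one per query.
    """
    hits = 0
    for approx, true in zip(approx_neighbors, true_neighbors):
        # Count how many of the true top-k this query actually found.
        hits += len(set(approx[:k]) & set(true[:k]))
    return hits / (k * len(true_neighbors))

# Two queries, k=2: first query finds both true neighbors, second finds one.
print(recall_at_k([[1, 2], [5, 9]], [[1, 2], [5, 6]], k=2))  # 0.75
```

This is why the msturing-1M run in step 5 can't report recall: without a ground-truth file there are no `true_neighbors` to compare against.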

Part 3: Solve ANN

Some tips:

  • All of the elastiknn code for big-ann-benchmarks is in elastiknn-ann-benchmarks/src/main/scala/com/elastiknn/annb. The main entrypoint is Server.scala, which is an akka-http server that accepts requests from the Python Elastiknn model.
  • The elastiknn model in big-ann-benchmarks lives in benchmarks/algorithms/elastiknn.py. It's based on the HttpANN algorithm. Read this PR to understand how that works.
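To make the moving parts concrete, here's a minimal sketch of an HTTP-based ANN client in the spirit of HttpANN. The endpoint path (`/query`) and payload fields are my own illustrative assumptions, not the actual HttpANN protocol; read the PR above for the real contract.

```python
import json
import urllib.request


class HttpAnnClient:
    """Sketch of a client that delegates ANN queries to an HTTP server,
    the way the Python elastiknn model delegates to the akka-http server.

    Endpoint and payload shape are hypothetical, for illustration only.
    """

    def __init__(self, base_url: str):
        self.base_url = base_url

    def _post(self, path: str, payload: dict) -> dict:
        # POST a JSON payload and decode the JSON response.
        req = urllib.request.Request(
            self.base_url + path,
            data=json.dumps(payload).encode(),
            headers={"Content-Type": "application/json"},
        )
        with urllib.request.urlopen(req) as resp:
            return json.loads(resp.read())

    def query(self, vector: list, k: int) -> list:
        # Hypothetical endpoint: POST /query with the vector and k,
        # expecting {"neighbors": [...]} back.
        return self._post("/query", {"vector": vector, "k": k})["neighbors"]
```

The point of this indirection is that the Python side stays a thin shim: all the actual indexing and search logic lives behind the HTTP boundary, in the JVM.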
  • The runner.py is modified to expose JVM/JMX metrics on port 9091. This means we can use VisualVM to connect to this port and profile the JVM.
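If you ever need to reproduce that outside the modified runner.py, these are the standard JVM flags for exposing unauthenticated remote JMX on a port (the exact values here are illustrative; check runner.py for what's actually set):

```shell
# Standard JVM flags to expose JMX for VisualVM; adjust port/hostname as needed.
java \
  -Dcom.sun.management.jmxremote \
  -Dcom.sun.management.jmxremote.port=9091 \
  -Dcom.sun.management.jmxremote.rmi.port=9091 \
  -Dcom.sun.management.jmxremote.authenticate=false \
  -Dcom.sun.management.jmxremote.ssl=false \
  -Djava.rmi.server.hostname=localhost \
  -jar ann-benchmarks-7.14.1.1-all.jar
```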
  • Here are some hyperparameter settings and results I've been able to get on the deep-10M dataset. These are running on my Dell XPS i7 w/ 6 cores, 12 threads, and 12 Lucene segments. They are abysmally slow. We definitely need some algorithmic improvements to get to billion scale:
| L   | k | w | candidates | probes | recall | QPS   |
|-----|---|---|------------|--------|--------|-------|
| 100 | 3 | 1 | 100        | 3      | 0.832  | 2.641 |
| 100 | 3 | 1 | 1000       | 1      | 0.896  | 1.441 |
| 100 | 3 | 1 | 100        | 0      | 0.523  | 7.001 |
| 100 | 3 | 1 | 100        | 1      | 0.700  | 4.348 |
| 100 | 3 | 1 | 1000       | 0      | 0.756  | 1.840 |
| 100 | 3 | 1 | 100        | 6      | 0.893  | 1.756 |
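To eyeball the trade-off, here's a small sketch that filters the table above down to its Pareto frontier, i.e., the settings where you can't improve recall without giving up QPS:

```python
# Each tuple: (candidates, probes, recall, qps), taken from the table above.
results = [
    (100, 3, 0.832, 2.641),
    (1000, 1, 0.896, 1.441),
    (100, 0, 0.523, 7.001),
    (100, 1, 0.700, 4.348),
    (1000, 0, 0.756, 1.840),
    (100, 6, 0.893, 1.756),
]

def pareto(points):
    """Keep points not weakly dominated on both recall and QPS by another point."""
    return [p for p in points
            if not any(q[2] >= p[2] and q[3] >= p[3] and q != p for q in points)]

for cand, probes, recall, qps in sorted(pareto(results), key=lambda p: p[2]):
    print(f"candidates={cand} probes={probes} recall={recall} qps={qps}")
```

Only the candidates=1000/probes=0 row gets dropped: candidates=100/probes=3 beats it on both recall and QPS, so there's no reason to run that setting.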