[Medium] How to Choose the Best Nearest Neighbors Algorithm

All the code snippets for the Medium post "How to Choose the Best Nearest Neighbors Algorithm".

conda create -n ann python=3.6 jupyterlab -y
conda activate ann
git clone https://github.com/erikbern/ann-benchmarks.git
cd ann-benchmarks/
pip install -r requirements.txt
python install.py --proc=8
pip install --upgrade pandas scipy
mkdir data
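# Python (e.g. in JupyterLab): save the DataFrame `df`, which holds one embedding per row in its 'emb' column.
# The path is relative to the directory that contains the ann-benchmarks clone.
df.to_pickle('ann-benchmarks/data/custom-euclidean.pkl')
df.head()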
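The snippets never show how `df` is built, only that it has an `emb` column holding one vector per row. A minimal sketch of a frame with that shape follows, assuming random vectors stand in for real embeddings; the function name `make_embedding_frame` and its parameters are hypothetical and not part of the original post.

```python
import numpy as np
import pandas as pd


def make_embedding_frame(n_rows=1000, dim=64, seed=0):
    """Build a DataFrame with a single 'emb' column, one vector per row.

    Hypothetical helper: random vectors stand in for real embeddings.
    """
    rng = np.random.default_rng(seed)
    vectors = rng.normal(size=(n_rows, dim)).astype(np.float32)
    # list(vectors) yields one 1-D array per row, so each cell holds a vector
    return pd.DataFrame({"emb": list(vectors)})
```

A frame built this way could then be saved with the same `df.to_pickle(...)` call used above.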
# Paste this code at the end of ann-benchmarks/ann_benchmarks/datasets.py
def custom_dataset(out_fn, test_ratio, distance):
    # Function to handle our custom dataset
    import pandas as pd

    # Read the DataFrame
    # out_fn is of the form 'data/<dataset-name>.hdf5'
    df = pd.read_pickle(out_fn.split('.')[0] + '.pkl')

    # Convert the single embedding column to a 2-D NumPy array
    X = pd.DataFrame(df['emb'].tolist()).to_numpy()

    # Split into train and test sets (train_test_split is a helper in datasets.py)
    X_train, X_test = train_test_split(X, test_size=test_ratio)

    # Write the HDF5 output (write_output is a helper in datasets.py)
    write_output(X_train, X_test, out_fn, distance)


# Register the new function in the DATASETS dictionary
# 20% of rows are used as the test set
# Euclidean distance is the measure used for finding neighbors
DATASETS['custom-euclidean'] = lambda out_fn: custom_dataset(out_fn, test_ratio=0.2, distance='euclidean')
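To check the column-to-matrix conversion and the 80/20 split outside the benchmark repository, the two steps can be sketched on their own. This is an illustration only: `frame_to_split` is a hypothetical name, and the NumPy permutation split is a stand-in for the `train_test_split` helper in `datasets.py`, not a copy of it.

```python
import numpy as np
import pandas as pd


def frame_to_split(df, test_ratio=0.2, seed=1):
    """Turn the 'emb' column into a 2-D array and split its rows into train/test.

    Hypothetical stand-in for the steps inside custom_dataset.
    """
    # Same conversion as in custom_dataset: one row per vector, one column per dimension
    X = pd.DataFrame(df["emb"].tolist()).to_numpy()

    # Shuffle row indices, then take the first test_ratio share as the test set
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(X))
    n_test = int(round(len(X) * test_ratio))
    test_idx, train_idx = order[:n_test], order[n_test:]
    return X[train_idx], X[test_idx]
```

Calling `frame_to_split(df)` on the pickled frame and inspecting the `.shape` of both arrays is a quick way to confirm that every embedding has the same length before starting a long benchmark run.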
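# Shell, from inside ann-benchmarks/: run the benchmarks on the custom dataset
python run.py --dataset='custom-euclidean' --parallelism=14
# Plot the results with a log-scale y-axis
sudo /opt/conda/envs/ann/bin/python plot.py --dataset=custom-euclidean --y-log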