stephenleo/01 How to Choose the Best Nearest Neighbors Algorithm.md

## 01 How to Choose the Best Nearest Neighbors Algorithm.md

      
How to Choose the Best Nearest Neighbors Algorithm

All the code snippets for the Medium post How to Choose the Best Nearest Neighbors Algorithm.

  
## 02_create_environment.sh
conda create -n ann python=3.6 jupyterlab -y
conda activate ann
git clone https://github.com/erikbern/ann-benchmarks.git
cd ann-benchmarks/
pip install -r requirements.txt
python install.py --proc=8
pip install --upgrade pandas scipy
mkdir data

## 03_data.py
# df is a pandas DataFrame whose 'emb' column holds one embedding vector per row
df.to_pickle('ann-benchmarks/data/custom-euclidean.pkl')
df.head()
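
The snippet above assumes a DataFrame `df` already exists. As a minimal sketch of the shape `custom_dataset` expects, the block below builds one from random vectors. The `id` column, the 64 dimensions, and the random data are all hypothetical stand-ins; only the `emb` column name comes from the snippets here.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in data: 1,000 random 64-dimensional embeddings
rng = np.random.default_rng(seed=0)
embeddings = rng.normal(size=(1000, 64)).astype(np.float32)

# One row per item, with the whole vector stored as a list in a single 'emb' column
df = pd.DataFrame({
    'id': np.arange(len(embeddings)),  # illustrative only, not required by custom_dataset
    'emb': embeddings.tolist(),
})

# Same conversion that custom_dataset applies: one column of lists -> 2-D array
X = pd.DataFrame(df['emb'].tolist()).to_numpy()
print(X.shape)
```

If the round trip gives back an array with one row per item and one column per dimension, the DataFrame is in a form the `df.to_pickle(...)` call above can save for the benchmark.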

## 04_custom_updates.py
# Paste this code at the end of ann-benchmarks/ann_benchmarks/datasets.py,
# where train_test_split and write_output are already defined
def custom_dataset(out_fn, test_ratio, distance):
    # Function to handle our custom dataset

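    import os
    import pandas as pd

    # Read the DataFrame saved next to the output file
    # out_fn is of the form 'data/<dataset-name>.hdf5'
    # splitext swaps only the final extension, so dots elsewhere in the path are safe
    df = pd.read_pickle(os.path.splitext(out_fn)[0] + '.pkl')

    # Convert the single embedding column to a 2-D NumPy array (rows x dimensions)
    X = pd.DataFrame(df['emb'].tolist()).to_numpy()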

    # Split Train and Test
    X_train, X_test = train_test_split(X, test_size=test_ratio)

    # Write HDF5 Output
    write_output(X_train, X_test, out_fn, distance)

# Create a new dictionary element to call our new function
# 20% of rows used as Test Set
# Euclidean distance used as measure for finding neighbors
DATASETS['custom-euclidean'] = lambda out_fn: custom_dataset(out_fn, test_ratio=0.2, distance='euclidean')
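
To make the comments above concrete, here is a rough, self-contained sketch of what holding out 20% of rows as queries and finding their exact Euclidean neighbors looks like. It is an independent NumPy illustration, not the code ann-benchmarks runs; `exact_neighbors`, the random data, and `k=10` are hypothetical choices made for the example.

```python
import numpy as np

def exact_neighbors(X_train, X_test, k=10):
    """Indices and Euclidean distances of the k nearest training rows per test row."""
    # Squared distances via ||a - b||^2 = ||a||^2 - 2 a.b + ||b||^2
    sq = (
        (X_test ** 2).sum(axis=1)[:, None]
        - 2.0 * X_test @ X_train.T
        + (X_train ** 2).sum(axis=1)[None, :]
    )
    sq = np.maximum(sq, 0.0)  # clip tiny negatives caused by round-off
    idx = np.argsort(sq, axis=1)[:, :k]
    dist = np.sqrt(np.take_along_axis(sq, idx, axis=1))
    return idx, dist

# Hypothetical data: 500 random 16-dimensional vectors
rng = np.random.default_rng(seed=0)
X = rng.normal(size=(500, 16))

# Shuffle, then hold out 20% of rows as queries (mirrors test_ratio=0.2)
perm = rng.permutation(len(X))
n_test = int(0.2 * len(X))
X_test, X_train = X[perm[:n_test]], X[perm[n_test:]]

neighbors, distances = exact_neighbors(X_train, X_test, k=10)
print(neighbors.shape, distances.shape)
```

Exact neighbors like these are the reference an approximate algorithm's answers are compared against, which is why the test rows are kept out of the indexed set.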

## 05_run.sh
python run.py --dataset='custom-euclidean' --parallelism=14

## 06_plot.sh
sudo /opt/conda/envs/ann/bin/python plot.py --dataset=custom-euclidean --y-log