@fscottfoti
Last active September 25, 2015 19:31
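import sys

import numpy as np
import pandas as pd
from sklearn.neighbors import KDTree


def nearest_neighbor(df1, df2):
    # Build a KD-tree on the haystack (df1), then look up the single
    # nearest haystack row for every needle row in df2.
    kdt = KDTree(df1.values)
    distances, indexes = kdt.query(df2.values, k=1, return_distance=True)
    # Index each distance by the GEOM_ID of the df1 row that was matched.
    return pd.Series(distances.flatten(), index=df1.index.values[indexes.flatten()])


args = sys.argv[1:]
df1 = pd.read_csv(args[0], index_col="GEOM_ID")
# Replace area with its square root, which is a length like the other columns.
df1["area"] = np.sqrt(df1["area"])
df2 = pd.read_csv(args[1], index_col="GEOM_ID")
df2["area"] = np.sqrt(df2["area"])
# Sort ascending so that tail() shows the worst (largest-distance) matches.
s = nearest_neighbor(df1, df2).sort_values()
print(s.describe())
print(s.tail())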
GEOM_ID,minx,miny,maxx,maxy,centroidx,centroidy,length,area
9720406908141,610557.748247,4201444.69009,610580.952298,4201468.10975,610562.999503,4201462.7687,82.9033338922,115.961904597
9720406908140,610557.748242,4201444.69005,610580.977218,4201468.10993,610562.999506,4201462.76869,82.9035472875,115.962321392
@fscottfoti
Author

This is another try at creating geom ids, but with a different strategy. Here I'm just trying to match one list of geometries against another. I do this by building a dataframe for each list with one row per geometry, holding centroidx, centroidy, length, area, and the four bounding-box coordinates (minx, miny, maxx, maxy). There is one dataframe for the "haystack" and one for the "needles," and each needle is matched to its nearest neighbor in the haystack.
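To make the idea concrete, here is a minimal sketch of the matching on made-up data. Everything in it is hypothetical: the three toy "haystack" rows, the noise added to the needles, and the `nearest` helper, which is a brute-force scan standing in for the KDTree lookup in the script. It only illustrates that a slightly perturbed feature vector still lands on the row it came from.

```python
import math

# Hypothetical toy feature vectors, in the column order used by the CSV:
# minx, miny, maxx, maxy, centroidx, centroidy, length, sqrt(area)
haystack = {
    "a": (0.0, 0.0, 10.0, 10.0, 5.0, 5.0, 40.0, 10.0),
    "b": (20.0, 0.0, 30.0, 5.0, 25.0, 2.5, 30.0, math.sqrt(50.0)),
    "c": (0.0, 20.0, 4.0, 24.0, 2.0, 22.0, 16.0, 4.0),
}

# Needles are copies of haystack rows with a small amount of noise added.
needles = {
    "needle_b": tuple(v + 1e-4 for v in haystack["b"]),
    "needle_c": tuple(v - 2e-4 for v in haystack["c"]),
}


def nearest(needle, rows):
    """Brute-force scan: return (row id, Euclidean distance) of the closest row."""
    candidates = ((rid, math.dist(needle, vec)) for rid, vec in rows.items())
    return min(candidates, key=lambda pair: pair[1])


matches = {nid: nearest(vec, haystack) for nid, vec in needles.items()}
for nid, (rid, dist) in matches.items():
    print(nid, "->", rid, dist)
```

A brute-force scan is fine for three rows; for millions of parcels the tree-based lookup in the script is the practical choice.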

In my test, I'm pretty sure all the needles are present in the haystack, but several operations have been applied along the way that introduce some "noise." For my example (the 2M parcels in the Bay Area), describe() on the nearest-neighbor distances looks like this:

count    1951744.000000
mean           0.000098
std            0.000056
min            0.000004
25%            0.000065
50%            0.000085
75%            0.000117
max            0.024922

I added a CSV with the two rows for the pair of geometries that were furthest apart (that .0249222 number); they still look pretty close.
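As a rough sanity check, the distance between those two rows can be recomputed by hand. The sketch below copies the values from the CSV and assumes the area column holds the raw area, so it takes the square root the same way the script does. The `features` helper and variable names are just for illustration. Because the CSV values are printed with limited precision, the result should come out close to the reported maximum rather than identical to it.

```python
import math

COLUMNS = ("minx", "miny", "maxx", "maxy", "centroidx", "centroidy", "length", "area")

# Values copied from the two CSV rows (GEOM_ID ...141 and ...140).
row_a = (610557.748247, 4201444.69009, 610580.952298, 4201468.10975,
         610562.999503, 4201462.7687, 82.9033338922, 115.961904597)
row_b = (610557.748242, 4201444.69005, 610580.977218, 4201468.10993,
         610562.999506, 4201462.76869, 82.9035472875, 115.962321392)


def features(row):
    """Mirror the script: keep every column, but use sqrt(area) in place of area."""
    return tuple(row[:-1]) + (math.sqrt(row[-1]),)


vec_a, vec_b = features(row_a), features(row_b)

# Per-column absolute differences, to see which column drives the distance.
diffs = {name: abs(a - b) for name, a, b in zip(COLUMNS, vec_a, vec_b)}
largest = max(diffs, key=diffs.get)
distance = math.dist(vec_a, vec_b)

print("largest difference:", largest, diffs[largest])
print("euclidean distance:", distance)
```

Nearly all of the distance comes from maxx, which differs by about 0.025 between the two rows; every other column agrees to within a small fraction of that.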
