@fscottfoti
Last active September 25, 2015 16:59
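def make_key(df):
    # Coarsen each attribute (divide, then round) so that tiny floating-point
    # differences do not change the key. Bracket access avoids clashing with
    # same-named attributes such as GeoDataFrame.area and GeoDataFrame.length.
    return "cx=" + (df["centroidx"]/10).round(0).astype('str') + \
        ", cy=" + (df["centroidy"]/10).round(0).astype('str') + \
        ", ar=" + (df["area"]/25).round(0).astype('str') + \
        ", l=" + (df["length"]/10).round(0).astype('str') + \
        ", bbminx=" + (df["minx"]/2).round(0).astype('str') + \
        ", bbmaxy=" + (df["maxy"]/2).round(0).astype('str')

# df is a DataFrame holding the six numeric columns used above; the key is
# built for all rows at once rather than row by row with df.apply.
df["new_geom_id"] = make_key(df)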
@fscottfoti
Author

This snippet is an attempt to map a geometry to the most stable key possible. We are trying to maintain stable parcel identifiers, but we don't control the source of the parcels (the counties do), and we must periodically update the parcels as they change. The identifiers change, but if a parcel's geometry does not change we want its identifier to stay the same, so we're trying to hash the geometry into a key.

The first approach is simply to call ST_AsText and hash the result. This works well, but perhaps not well enough. If any operation is applied to the data - e.g. a reprojection, or opening the file in ArcGIS or on a different operating system - the least significant digit of a coordinate sometimes changes, which throws off the entire hash.
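To illustrate that fragility, here is a minimal sketch using only Python's standard library. The two WKT strings are made-up examples, `hash_wkt` is a hypothetical helper, and SHA-1 is an assumed choice of hash, not necessarily what was used for the parcels.

```python
import hashlib

def hash_wkt(wkt):
    """Hash the exact text form of a geometry (hypothetical helper)."""
    return hashlib.sha1(wkt.encode("utf-8")).hexdigest()

# Made-up WKT for the same square; the second string differs only in the
# last digit of one coordinate, as might happen after a re-save.
original = "POLYGON((0 0,10.000000001 0,10 10,0 10,0 0))"
resaved = "POLYGON((0 0,10.000000002 0,10 10,0 10,0 0))"

# The hashes share nothing useful, so the two parcels look unrelated.
print(hash_wkt(original) == hash_wkt(resaved))  # False
```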

The function here is intended to reduce the precision of the geometry - specifically of its centroid, area, length, and bounding box - and to combine the rounded values into a string. Only after this reduction in precision do we hash the resulting string. This also works well, but not perfectly.
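The "reduce precision, then hash" step could look roughly like the following sketch for a single parcel. It uses plain Python rather than pandas; `coarse_key` and `key_hash` are hypothetical names, the divisors mirror the ones in `make_key`, the sample numbers are invented, and SHA-1 is again an assumed hash.

```python
import hashlib

def coarse_key(centroidx, centroidy, area, length, minx, maxy):
    """Build a reduced-precision key string for one parcel (hypothetical helper)."""
    parts = [
        ("cx", centroidx / 10),
        ("cy", centroidy / 10),
        ("ar", area / 25),
        ("l", length / 10),
        ("bbminx", minx / 2),
        ("bbmaxy", maxy / 2),
    ]
    # Round each coarsened value to a whole number before formatting it.
    return ", ".join(f"{name}={float(round(value))}" for name, value in parts)

def key_hash(key):
    """Hash the key string only after precision has been reduced."""
    return hashlib.sha1(key.encode("utf-8")).hexdigest()

# Invented attribute values, and the same values with tiny perturbations.
a = coarse_key(6051234.12345678, 2101234.5678, 5012.3, 301.2, 6051200.44, 2101270.91)
b = coarse_key(6051234.12345688, 2101234.5679, 5012.3000001, 301.2, 6051200.44, 2101270.91)

print(a)
print(key_hash(a) == key_hash(b))  # True: the perturbation is rounded away
```

A value that sits very close to a rounding boundary can still land on the other side after a small perturbation, which is one reason a scheme like this cannot be perfect.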

Reducing precision into a key in this manner can actually create a small handful of collisions ("overlaps"). Out of the 2M parcels in the Bay Area database, 100 of them end up mapped to a key that another parcel also has. @mkreilly has looked at these and they are indeed similar shapes which by random chance have the same attributes to the level of precision described above. In the other direction, I reprojected the data and opened and saved it in ArcGIS, and that changed the geometry enough that 155 parcels would now have different keys. (I don't have the numbers for the previous method, just hashing the ST_AsText output, but I believe it was only a couple hundred more.)
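Both checks described above (keys shared by more than one parcel, and keys that change after a round trip) can be counted with a few lines of standard-library Python. This is only a sketch on toy data: `colliding_keys` and `changed_rows` are hypothetical helpers, and the short strings stand in for real keys.

```python
from collections import Counter

def colliding_keys(keys):
    """Return each key that is shared by more than one parcel, with its count."""
    counts = Counter(keys)
    return {key: n for key, n in counts.items() if n > 1}

def changed_rows(before, after):
    """Return positions whose key differs between two runs over the same parcels.

    Assumes both lists are in the same parcel order.
    """
    return [i for i, (old, new) in enumerate(zip(before, after)) if old != new]

# Toy keys: parcels 0 and 2 collide; parcel 1 changes after the round trip.
keys_before = ["k1", "k2", "k1", "k3"]
keys_after = ["k1", "k9", "k1", "k3"]

print(colliding_keys(keys_before))            # {'k1': 2}
print(changed_rows(keys_before, keys_after))  # [1]
```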

The conclusion is that parcel geometry is a messy world and no method is likely to be perfect.

NOTE that this method requires centroidx, centroidy, area, length, and the bounding box minx and maxy to be computed somehow. I have code to do this with geopandas, but nothing requires these attributes to be computed with geopandas.
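As one illustration of computing these attributes without geopandas, the sketch below derives them from a polygon's exterior ring using the shoelace formula. This is not the geopandas code mentioned above: `ring_attributes` is a hypothetical helper, and it assumes a simple polygon with non-zero area and ignores holes.

```python
import math

def ring_attributes(coords):
    """Compute key inputs from a list of (x, y) tuples for an exterior ring.

    Assumes a simple polygon with non-zero area; holes are ignored.
    """
    if coords[0] != coords[-1]:
        coords = coords + [coords[0]]  # close the ring if needed

    twice_area = 0.0  # twice the signed area (shoelace sum)
    cx = cy = 0.0
    length = 0.0
    for (x0, y0), (x1, y1) in zip(coords, coords[1:]):
        cross = x0 * y1 - x1 * y0
        twice_area += cross
        cx += (x0 + x1) * cross
        cy += (y0 + y1) * cross
        length += math.hypot(x1 - x0, y1 - y0)

    return {
        "centroidx": cx / (3 * twice_area),
        "centroidy": cy / (3 * twice_area),
        "area": abs(twice_area) / 2,
        "length": length,
        "minx": min(x for x, _ in coords),
        "maxy": max(y for _, y in coords),
    }

# A 10 x 10 square: area 100, perimeter 40, centroid at (5, 5).
square = ring_attributes([(0, 0), (10, 0), (10, 10), (0, 10)])
print(square)
```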
