Last active
September 25, 2015 16:59
Gist: fscottfoti/3e227ba516f06c0bdccb
def make_key(row):
    return "cx=" + (row.centroidx/10).round(0).astype('str') + \
        ", cy=" + (row.centroidy/10).round(0).astype('str') + \
        ", ar=" + (row.area/25).round(0).astype('str') + \
        ", l=" + (row.length/10).round(0).astype('str') + \
        ", bbminx=" + (row.minx/2).round(0).astype('str') + \
        ", bbmaxy=" + (row.maxy/2).round(0).astype('str')

df["new_geom_id"] = df.apply(make_key, axis=1)
This snippet is an attempt to map a geometry to the most stable key possible. We are trying to maintain stable parcel identifiers, but we don't control the source of the parcels (the counties do), and we must periodically update the parcels as they change. The identifiers change, but if a parcel's geometry does not change we want to keep its identifier the same, so we're trying to hash the geometry into a key.
The first approach is simply to call ST_asText and hash the result. This works well, but perhaps not well enough. If any operation is applied to the data (for example a reprojection, or opening the file in ArcGIS or on a different operating system), the least significant digit of a coordinate sometimes changes, which throws off the entire hash.
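To illustrate that fragility, here is a minimal sketch using only Python's `hashlib`. The WKT string is a made-up example rather than a real parcel, `wkt_hash` is a hypothetical helper name, and MD5 is an arbitrary choice of hash function.

```python
import hashlib

def wkt_hash(wkt):
    # Hypothetical helper: hash the WKT text exactly as given.
    return hashlib.md5(wkt.encode("utf-8")).hexdigest()

# Made-up coordinates for illustration only.
original = ("POLYGON((6010000.123456 2110000.654321,"
            "6010050.123456 2110000.654321,"
            "6010050.123456 2110040.654321,"
            "6010000.123456 2110000.654321))")

# Simulate a round trip that nudges the last digit of one coordinate.
resaved = original.replace("6010050.123456 2110040.654321",
                           "6010050.123457 2110040.654321")

print(wkt_hash(original) == wkt_hash(resaved))  # False
```

A change of one millionth of a coordinate unit is enough to produce an unrelated hash, even though the parcel is the same for any practical purpose.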
The function here is intended to reduce the precision of the geometry. It takes the centroid, area, length, and bounding box, rounds each one to a coarser precision, and combines them into a string; only after that reduction in precision do we hash the resulting string. This also works well, but not perfectly.
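As a rough sketch of the full reduce-then-hash step for a single parcel, using only the standard library: the step sizes mirror the divisors in `make_key`, while `reduced_key`, `geom_hash`, the dict layout, the sample values, and the use of MD5 are assumptions for illustration. Python's built-in `round` stands in for the pandas/NumPy rounding, so the key text is formatted with integers and will not match the output of `make_key` character for character.

```python
import hashlib

# (label in key, attribute name, step size); steps taken from make_key.
STEPS = [("cx", "centroidx", 10), ("cy", "centroidy", 10),
         ("ar", "area", 25), ("l", "length", 10),
         ("bbminx", "minx", 2), ("bbmaxy", "maxy", 2)]

def reduced_key(attrs):
    # Divide each attribute by its step and round to the nearest integer.
    parts = ["%s=%d" % (label, round(attrs[name] / step))
             for label, name, step in STEPS]
    return ", ".join(parts)

def geom_hash(attrs):
    # Hash only after the precision has been reduced.
    return hashlib.md5(reduced_key(attrs).encode("utf-8")).hexdigest()

# Made-up parcel attributes, plus a copy with tiny numerical noise.
parcel = {"centroidx": 6010025.2, "centroidy": 2110020.7, "area": 2000.4,
          "length": 180.3, "minx": 6010000.1, "maxy": 2110040.6}
jittered = dict(parcel, centroidx=6010025.2000001, area=2000.4000003)

print(reduced_key(parcel))
print(geom_hash(parcel) == geom_hash(jittered))  # True
```

The noise disappears here because none of the values sits near a rounding boundary. A value that does sit near one can still flip to the neighbouring bucket, which is one way the method can fall short of perfect.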
Reducing precision into a key in this manner can create a small handful of "overlaps." Out of the 2M parcels in the Bay Area database, 100 of them end up mapped to the same key as another parcel using this method. @mkreilly has looked at these, and they are indeed similar shapes that by random chance have the same attributes at the level of precision described above. In the other direction, I reprojected the data and opened and saved it using ArcGIS, and this changed the geometry enough that 155 parcels would now have different keys. (I don't have the numbers for the previous method, just hashing the ST_asText output, but I believe it was only a couple hundred more.)
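Both kinds of count can be computed with a few lines of standard-library Python. The sketch below uses made-up parcel ids and keys; `count_overlaps` and `count_changed` are hypothetical helper names, and it assumes each version of the data is available as a dict mapping some row identifier to its key.

```python
from collections import Counter

def count_overlaps(keys_by_parcel):
    # Number of parcels whose key is shared with at least one other parcel.
    counts = Counter(keys_by_parcel.values())
    return sum(n for n in counts.values() if n > 1)

def count_changed(before, after):
    # Parcels present in both versions whose key differs between them.
    return sum(1 for pid in before
               if pid in after and before[pid] != after[pid])

# Made-up keys: p1 and p2 collide; p4 changes after a simulated round trip.
before = {"p1": "cx=1, cy=2", "p2": "cx=1, cy=2",
          "p3": "cx=5, cy=6", "p4": "cx=7, cy=8"}
after = dict(before, p4="cx=7, cy=9")

print(count_overlaps(before))        # 2
print(count_changed(before, after))  # 1
```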
The conclusion is that parcel geometry is a messy world and no method is likely to be perfect.
NOTE that this method requires centroidx, centroidy, area, length, and the bounding box minx and maxy to be computed somehow. I have code to do this with geopandas, but these attributes do not have to be computed with geopandas.
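Since geopandas is not required, here is a minimal pure-Python sketch that computes the same six attributes for a simple polygon ring using the shoelace formula. It is not the geopandas code mentioned above: `ring_attributes` is a hypothetical helper, and it assumes a single ring with no holes, no multipolygons, and non-zero area.

```python
import math

def ring_attributes(ring):
    # ring: list of (x, y) vertices of a simple polygon, with the first
    # vertex NOT repeated at the end. Assumes non-zero area.
    n = len(ring)
    a2 = cx = cy = length = 0.0
    for i in range(n):
        x0, y0 = ring[i]
        x1, y1 = ring[(i + 1) % n]
        cross = x0 * y1 - x1 * y0
        a2 += cross                      # accumulates twice the signed area
        cx += (x0 + x1) * cross
        cy += (y0 + y1) * cross
        length += math.hypot(x1 - x0, y1 - y0)
    return {
        # centroid = sum / (6 * signed area) = sum / (3 * a2)
        "centroidx": cx / (3.0 * a2),
        "centroidy": cy / (3.0 * a2),
        "area": abs(a2) / 2.0,
        "length": length,                # perimeter of the ring
        "minx": min(x for x, _ in ring),
        "maxy": max(y for _, y in ring),
    }

# A 40 x 30 rectangle as a made-up example.
attrs = ring_attributes([(0, 0), (40, 0), (40, 30), (0, 30)])
print(attrs)
```

The resulting dict uses the same attribute names that `make_key` reads, so one such record per parcel could be collected into the rows of a table.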