Created
August 15, 2020 20:37
-
-
Save martinfleis/abc7cdbf9f9266bf9ed369080eec7cea to your computer and use it in GitHub Desktop.
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
{ | |
"metadata": { | |
"language_info": { | |
"codemirror_mode": { | |
"name": "ipython", | |
"version": 3 | |
}, | |
"file_extension": ".py", | |
"mimetype": "text/x-python", | |
"name": "python", | |
"nbconvert_exporter": "python", | |
"pygments_lexer": "ipython3", | |
"version": "3.7.6-final" | |
}, | |
"orig_nbformat": 2, | |
"kernelspec": { | |
"name": "geo_dev", | |
"display_name": "geo_dev" | |
} | |
}, | |
"nbformat": 4, | |
"nbformat_minor": 2, | |
"cells": [ | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"# Binary predicates in GeoPandas 1.0 - proposal\n", | |
"\n", | |
"Current behaviour of binary predicates in GeoPandas (e.g. `series1.intersects(sereis2)`) is not expected by many users (#972, #1316, #1422) and the expected one is not available. This notebook illustrates a proposal for a new behaviour, which should be implemented in 0.9 release and become default in 1.0." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 10, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"from random import randint\n", | |
"\n", | |
"import numpy as np\n", | |
"import pandas as pd\n", | |
"import geopandas as gpd\n", | |
"\n", | |
"from shapely.geometry import Point" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 20, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"pts = [Point(randint(0, 100), randint(0, 100)) for _ in range(100)]\n", | |
"polys = [Point(randint(0, 100), randint(0, 100)).buffer(5, cap_style=3) for _ in range(100)]" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 21, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"pt = gpd.array.from_shapely(pts)\n", | |
"pl = gpd.array.from_shapely(polys)" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"## Current behaviour\n", | |
"\n", | |
"Current behaviour is pair-wise. GeoSeries gets aligned and binary operation is done for each row." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 22, | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"output_type": "execute_result", | |
"data": { | |
"text/plain": "array([False, False, False, False, False, False, False, False, False,\n False, False, False, False, False, False, False, False, False,\n False, False, False, False, False, False, False, False, False,\n False, False, False, False, False, False, False, False, False,\n False, False, False, False, True, False, False, False, False,\n False, False, False, False, False, False, False, False, False,\n False, False, False, False, False, False, False, False, False,\n False, False, False, False, False, False, False, False, False,\n False, False, False, False, False, False, False, False, False,\n False, False, False, False, False, False, False, False, False,\n False, False, False, False, False, False, False, False, False,\n False])" | |
}, | |
"metadata": {}, | |
"execution_count": 22 | |
} | |
], | |
"source": [ | |
"pl.intersects(pt)" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"Users expect two kinds of behaviour:\n", | |
"\n", | |
"1. Does each object in `left` intersect _any_ object in `right`?\n", | |
"2. Which objects from `right` intersect each object from `left`? - note that this is behaviour of `sf` and PostGIS (I think).\n", | |
"\n", | |
"## Proposal\n", | |
"\n", | |
"New kewyords, controlling the behaviour switching between current behaviour, boolean check against all geometries and list of indices. Not sure about the ideal way to handle it, see [#972](https://github.com/geopandas/geopandas/issues/972) for ongoing discussion. Users could then choose between:\n", | |
"\n", | |
"1. (Current) paired check\n", | |
"2. Boolean check (intersects any?)\n", | |
"3. List of indices\n", | |
"\n", | |
"Other option is to ignore second option and point to `~result_indices.isna()` for `any`.\n", | |
"\n", | |
"### Implementation\n", | |
"\n", | |
"Implementation could be done on `GeometryArray` level using `query_bulk` as following:" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 23, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"# temporary wrap in GeoSeries, assuming #1444 will be merged and sindex will be on array level\n", | |
"inp, res = gpd.GeoSeries(pt).sindex.query_bulk(pl, predicate='intersects')" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"Boolean check:" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 34, | |
"metadata": { | |
"tags": [] | |
}, | |
"outputs": [ | |
{ | |
"output_type": "execute_result", | |
"data": { | |
"text/plain": "array([False, True, True, True, True, True, True, True, True,\n True, True, True, False, False, False, True, False, True,\n True, True, True, True, False, True, True, True, True,\n False, True, True, True, True, True, True, True, True,\n True, True, True, True, True, True, False, False, True,\n True, False, False, False, False, True, True, True, True,\n True, True, True, True, True, False, False, True, False,\n True, True, True, False, True, False, True, True, True,\n True, True, False, False, False, False, True, False, True,\n True, False, True, False, False, True, False, True, True,\n True, True, True, True, True, True, False, True, True,\n True])" | |
}, | |
"metadata": {}, | |
"execution_count": 34 | |
} | |
], | |
"source": [ | |
"# get boolean (does it intersect any?)\n", | |
"result = np.isin(np.arange(0, len(pl)), inp)\n", | |
"result" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"List of indices:" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 25, | |
"metadata": { | |
"tags": [] | |
}, | |
"outputs": [ | |
{ | |
"output_type": "execute_result", | |
"data": { | |
"text/plain": "array([None, array([14]), array([29, 37, 21]), array([63, 46, 35]),\n array([11]), array([22]), array([97]), array([54]), array([45]),\n array([29, 37, 41]), array([11, 33, 15]), array([64, 32, 90, 51]),\n None, None, None, array([9, 4]), None, array([35]), array([77]),\n array([91]), array([34]), array([51]), None, array([72]),\n array([89, 77]), array([96, 53]), array([34]), None,\n array([12, 43]), array([38, 95]), array([63, 35]), array([68, 75]),\n array([68]), array([81]), array([71]), array([25, 57]),\n array([89, 77]), array([9, 4]), array([42]), array([30]),\n array([17, 40]), array([52]), None, None,\n array([90, 24, 38, 95, 43]), array([51]), None, None, None, None,\n array([47]), array([59]), array([14, 79]), array([11, 15]),\n array([87, 49]), array([61, 30]), array([13, 86]), array([36]),\n array([76, 88, 25]), None, None, array([90, 24, 38, 95]), None,\n array([80]), array([97]), array([96, 53]), None, array([73]), None,\n array([13, 86]), array([14, 93]), array([57, 6, 50]),\n array([59, 20]), array([ 8, 69]), None, None, None, None,\n array([57, 6, 50]), None, array([21, 17]), array([94, 76]), None,\n array([52]), None, None, array([21, 17]), None, array([48, 78]),\n array([52]), array([56, 79]), array([16]), array([82, 0, 7, 81]),\n array([97]), array([5]), array([66, 29, 62]), None,\n array([29, 62, 41]), array([83]), array([13, 86])], dtype=object)" | |
}, | |
"metadata": {}, | |
"execution_count": 25 | |
} | |
], | |
"source": [ | |
"# get indices\n", | |
"unique, counts = np.unique(inp, return_counts=True)\n", | |
"values = np.split(res, np.cumsum(counts)[:-1])\n", | |
"indices = np.empty(len(pl), dtype='object')\n", | |
"indices[unique] = values\n", | |
"indices" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"## Benchmark\n", | |
"\n", | |
"Excluding `query_bulk` to compare two options.\n", | |
"\n", | |
"100 000 points and 10 000 polygons." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 46, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"pts = [Point(randint(0, 100), randint(0, 100)) for _ in range(100000)]\n", | |
"polys = [Point(randint(0, 100), randint(0, 100)).buffer(5, cap_style=3) for _ in range(10000)]\n", | |
"\n", | |
"pt = gpd.array.from_shapely(pts)\n", | |
"pl = gpd.array.from_shapely(polys)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 47, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"inp, res = gpd.GeoSeries(pt).sindex.query_bulk(pl, predicate='intersects')" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 48, | |
"metadata": { | |
"tags": [] | |
}, | |
"outputs": [ | |
{ | |
"output_type": "stream", | |
"name": "stdout", | |
"text": "170 ms ± 7.64 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)\n" | |
} | |
], | |
"source": [ | |
"%%timeit\n", | |
"# get boolean (does it intersect any?)\n", | |
"result = np.isin(np.arange(0, len(pl)), inp)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 49, | |
"metadata": { | |
"tags": [] | |
}, | |
"outputs": [ | |
{ | |
"output_type": "stream", | |
"name": "stdout", | |
"text": "191 ms ± 5.35 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)\n" | |
} | |
], | |
"source": [ | |
"%%timeit\n", | |
"# get indices\n", | |
"unique, counts = np.unique(inp, return_counts=True)\n", | |
"values = np.split(res, np.cumsum(counts)[:-1])\n", | |
"indices = np.empty(len(pl), dtype='object')\n", | |
"indices[unique] = values" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"Including `query_bulk`." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 50, | |
"metadata": { | |
"tags": [] | |
}, | |
"outputs": [ | |
{ | |
"output_type": "stream", | |
"name": "stdout", | |
"text": "10.6 s ± 159 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)\n" | |
} | |
], | |
"source": [ | |
"%%timeit\n", | |
"inp, res = gpd.GeoSeries(pt).sindex.query_bulk(pl, predicate='intersects')\n", | |
"# get boolean (does it intersect any?)\n", | |
"result = np.isin(np.arange(0, len(pl)), inp)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": 51, | |
"metadata": { | |
"tags": [] | |
}, | |
"outputs": [ | |
{ | |
"output_type": "stream", | |
"name": "stdout", | |
"text": "10.5 s ± 156 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)\n" | |
} | |
], | |
"source": [ | |
"%%timeit\n", | |
"inp, res = gpd.GeoSeries(pt).sindex.query_bulk(pl, predicate='intersects')\n", | |
"# get indices\n", | |
"unique, counts = np.unique(inp, return_counts=True)\n", | |
"values = np.split(res, np.cumsum(counts)[:-1])\n", | |
"indices = np.empty(len(pl), dtype='object')\n", | |
"indices[unique] = values" | |
] | |
} | |
] | |
} |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment