Skip to content

Instantly share code, notes, and snippets.

@martinfleis
Created August 15, 2020 20:37
Show Gist options
  • Star 0 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
  • Save martinfleis/abc7cdbf9f9266bf9ed369080eec7cea to your computer and use it in GitHub Desktop.
Save martinfleis/abc7cdbf9f9266bf9ed369080eec7cea to your computer and use it in GitHub Desktop.
Display the source blob
Display the rendered blob
Raw
{
"metadata": {
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.6-final"
},
"orig_nbformat": 2,
"kernelspec": {
"name": "geo_dev",
"display_name": "geo_dev"
}
},
"nbformat": 4,
"nbformat_minor": 2,
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Binary predicates in GeoPandas 1.0 - proposal\n",
"\n",
"Current behaviour of binary predicates in GeoPandas (e.g. `series1.intersects(sereis2)`) is not expected by many users (#972, #1316, #1422) and the expected one is not available. This notebook illustrates a proposal for a new behaviour, which should be implemented in 0.9 release and become default in 1.0."
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [],
"source": [
"from random import randint\n",
"\n",
"import numpy as np\n",
"import pandas as pd\n",
"import geopandas as gpd\n",
"\n",
"from shapely.geometry import Point"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [],
"source": [
"pts = [Point(randint(0, 100), randint(0, 100)) for _ in range(100)]\n",
"polys = [Point(randint(0, 100), randint(0, 100)).buffer(5, cap_style=3) for _ in range(100)]"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [],
"source": [
"pt = gpd.array.from_shapely(pts)\n",
"pl = gpd.array.from_shapely(polys)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Current behaviour\n",
"\n",
"Current behaviour is pair-wise. GeoSeries gets aligned and binary operation is done for each row."
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {},
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": "array([False, False, False, False, False, False, False, False, False,\n False, False, False, False, False, False, False, False, False,\n False, False, False, False, False, False, False, False, False,\n False, False, False, False, False, False, False, False, False,\n False, False, False, False, True, False, False, False, False,\n False, False, False, False, False, False, False, False, False,\n False, False, False, False, False, False, False, False, False,\n False, False, False, False, False, False, False, False, False,\n False, False, False, False, False, False, False, False, False,\n False, False, False, False, False, False, False, False, False,\n False, False, False, False, False, False, False, False, False,\n False])"
},
"metadata": {},
"execution_count": 22
}
],
"source": [
"pl.intersects(pt)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Users expect two kinds of behaviour:\n",
"\n",
"1. Does each object in `left` intersect _any_ object in `right`?\n",
"2. Which objects from `right` intersect each object from `left`? - note that this is behaviour of `sf` and PostGIS (I think).\n",
"\n",
"## Proposal\n",
"\n",
"New kewyords, controlling the behaviour switching between current behaviour, boolean check against all geometries and list of indices. Not sure about the ideal way to handle it, see [#972](https://github.com/geopandas/geopandas/issues/972) for ongoing discussion. Users could then choose between:\n",
"\n",
"1. (Current) paired check\n",
"2. Boolean check (intersects any?)\n",
"3. List of indices\n",
"\n",
"Other option is to ignore second option and point to `~result_indices.isna()` for `any`.\n",
"\n",
"### Implementation\n",
"\n",
"Implementation could be done on `GeometryArray` level using `query_bulk` as following:"
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {},
"outputs": [],
"source": [
"# temporary wrap in GeoSeries, assuming #1444 will be merged and sindex will be on array level\n",
"inp, res = gpd.GeoSeries(pt).sindex.query_bulk(pl, predicate='intersects')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Boolean check:"
]
},
{
"cell_type": "code",
"execution_count": 34,
"metadata": {
"tags": []
},
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": "array([False, True, True, True, True, True, True, True, True,\n True, True, True, False, False, False, True, False, True,\n True, True, True, True, False, True, True, True, True,\n False, True, True, True, True, True, True, True, True,\n True, True, True, True, True, True, False, False, True,\n True, False, False, False, False, True, True, True, True,\n True, True, True, True, True, False, False, True, False,\n True, True, True, False, True, False, True, True, True,\n True, True, False, False, False, False, True, False, True,\n True, False, True, False, False, True, False, True, True,\n True, True, True, True, True, True, False, True, True,\n True])"
},
"metadata": {},
"execution_count": 34
}
],
"source": [
"# get boolean (does it intersect any?)\n",
"result = np.isin(np.arange(0, len(pl)), inp)\n",
"result"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"List of indices:"
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {
"tags": []
},
"outputs": [
{
"output_type": "execute_result",
"data": {
"text/plain": "array([None, array([14]), array([29, 37, 21]), array([63, 46, 35]),\n array([11]), array([22]), array([97]), array([54]), array([45]),\n array([29, 37, 41]), array([11, 33, 15]), array([64, 32, 90, 51]),\n None, None, None, array([9, 4]), None, array([35]), array([77]),\n array([91]), array([34]), array([51]), None, array([72]),\n array([89, 77]), array([96, 53]), array([34]), None,\n array([12, 43]), array([38, 95]), array([63, 35]), array([68, 75]),\n array([68]), array([81]), array([71]), array([25, 57]),\n array([89, 77]), array([9, 4]), array([42]), array([30]),\n array([17, 40]), array([52]), None, None,\n array([90, 24, 38, 95, 43]), array([51]), None, None, None, None,\n array([47]), array([59]), array([14, 79]), array([11, 15]),\n array([87, 49]), array([61, 30]), array([13, 86]), array([36]),\n array([76, 88, 25]), None, None, array([90, 24, 38, 95]), None,\n array([80]), array([97]), array([96, 53]), None, array([73]), None,\n array([13, 86]), array([14, 93]), array([57, 6, 50]),\n array([59, 20]), array([ 8, 69]), None, None, None, None,\n array([57, 6, 50]), None, array([21, 17]), array([94, 76]), None,\n array([52]), None, None, array([21, 17]), None, array([48, 78]),\n array([52]), array([56, 79]), array([16]), array([82, 0, 7, 81]),\n array([97]), array([5]), array([66, 29, 62]), None,\n array([29, 62, 41]), array([83]), array([13, 86])], dtype=object)"
},
"metadata": {},
"execution_count": 25
}
],
"source": [
"# get indices\n",
"unique, counts = np.unique(inp, return_counts=True)\n",
"values = np.split(res, np.cumsum(counts)[:-1])\n",
"indices = np.empty(len(pl), dtype='object')\n",
"indices[unique] = values\n",
"indices"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Benchmark\n",
"\n",
"Excluding `query_bulk` to compare two options.\n",
"\n",
"100 000 points and 10 000 polygons."
]
},
{
"cell_type": "code",
"execution_count": 46,
"metadata": {},
"outputs": [],
"source": [
"pts = [Point(randint(0, 100), randint(0, 100)) for _ in range(100000)]\n",
"polys = [Point(randint(0, 100), randint(0, 100)).buffer(5, cap_style=3) for _ in range(10000)]\n",
"\n",
"pt = gpd.array.from_shapely(pts)\n",
"pl = gpd.array.from_shapely(polys)"
]
},
{
"cell_type": "code",
"execution_count": 47,
"metadata": {},
"outputs": [],
"source": [
"inp, res = gpd.GeoSeries(pt).sindex.query_bulk(pl, predicate='intersects')"
]
},
{
"cell_type": "code",
"execution_count": 48,
"metadata": {
"tags": []
},
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": "170 ms ± 7.64 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)\n"
}
],
"source": [
"%%timeit\n",
"# get boolean (does it intersect any?)\n",
"result = np.isin(np.arange(0, len(pl)), inp)"
]
},
{
"cell_type": "code",
"execution_count": 49,
"metadata": {
"tags": []
},
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": "191 ms ± 5.35 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)\n"
}
],
"source": [
"%%timeit\n",
"# get indices\n",
"unique, counts = np.unique(inp, return_counts=True)\n",
"values = np.split(res, np.cumsum(counts)[:-1])\n",
"indices = np.empty(len(pl), dtype='object')\n",
"indices[unique] = values"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Including `query_bulk`."
]
},
{
"cell_type": "code",
"execution_count": 50,
"metadata": {
"tags": []
},
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": "10.6 s ± 159 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)\n"
}
],
"source": [
"%%timeit\n",
"inp, res = gpd.GeoSeries(pt).sindex.query_bulk(pl, predicate='intersects')\n",
"# get boolean (does it intersect any?)\n",
"result = np.isin(np.arange(0, len(pl)), inp)"
]
},
{
"cell_type": "code",
"execution_count": 51,
"metadata": {
"tags": []
},
"outputs": [
{
"output_type": "stream",
"name": "stdout",
"text": "10.5 s ± 156 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)\n"
}
],
"source": [
"%%timeit\n",
"inp, res = gpd.GeoSeries(pt).sindex.query_bulk(pl, predicate='intersects')\n",
"# get indices\n",
"unique, counts = np.unique(inp, return_counts=True)\n",
"values = np.split(res, np.cumsum(counts)[:-1])\n",
"indices = np.empty(len(pl), dtype='object')\n",
"indices[unique] = values"
]
}
]
}
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment