martinfleis/predicates.ipynb

## predicates.ipynb
{
 "metadata": {
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.6-final"
  },
  "orig_nbformat": 2,
  "kernelspec": {
   "name": "geo_dev",
   "display_name": "geo_dev"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2,
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Binary predicates in GeoPandas 1.0 - proposal\n",
    "\n",
    "Current behaviour of binary predicates in GeoPandas (e.g. `series1.intersects(sereis2)`) is not expected by many users (#972, #1316, #1422) and the expected one is not available. This notebook illustrates a proposal for a new behaviour, which should be implemented in 0.9 release and become default in 1.0."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [],
   "source": [
    "from random import randint\n",
    "\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "import geopandas as gpd\n",
    "\n",
    "from shapely.geometry import Point"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [],
   "source": [
    "pts = [Point(randint(0, 100), randint(0, 100)) for _ in range(100)]\n",
    "polys = [Point(randint(0, 100), randint(0, 100)).buffer(5, cap_style=3) for _ in range(100)]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {},
   "outputs": [],
   "source": [
    "pt = gpd.array.from_shapely(pts)\n",
    "pl = gpd.array.from_shapely(polys)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Current behaviour\n",
    "\n",
    "Current behaviour is pair-wise. GeoSeries gets aligned and binary operation is done for each row."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {},
   "outputs": [
    {
     "output_type": "execute_result",
     "data": {
      "text/plain": "array([False, False, False, False, False, False, False, False, False,\n       False, False, False, False, False, False, False, False, False,\n       False, False, False, False, False, False, False, False, False,\n       False, False, False, False, False, False, False, False, False,\n       False, False, False, False,  True, False, False, False, False,\n       False, False, False, False, False, False, False, False, False,\n       False, False, False, False, False, False, False, False, False,\n       False, False, False, False, False, False, False, False, False,\n       False, False, False, False, False, False, False, False, False,\n       False, False, False, False, False, False, False, False, False,\n       False, False, False, False, False, False, False, False, False,\n       False])"
     },
     "metadata": {},
     "execution_count": 22
    }
   ],
   "source": [
    "pl.intersects(pt)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Users expect two kinds of behaviour:\n",
    "\n",
    "1. Does each object in `left` intersect _any_ object in `right`?\n",
    "2. Which objects from `right` intersect each object from `left`? - note that this is behaviour of `sf` and PostGIS (I think).\n",
    "\n",
    "## Proposal\n",
    "\n",
    "New kewyords, controlling the behaviour switching between current behaviour, boolean check against all geometries and list of indices. Not sure about the ideal way to handle it, see [#972](https://github.com/geopandas/geopandas/issues/972) for ongoing discussion. Users could then choose between:\n",
    "\n",
    "1. (Current) paired check\n",
    "2. Boolean check (intersects any?)\n",
    "3. List of indices\n",
    "\n",
    "Other option is to ignore second option and point to `~result_indices.isna()` for `any`.\n",
    "\n",
    "### Implementation\n",
    "\n",
    "Implementation could be done on `GeometryArray` level using `query_bulk` as following:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {},
   "outputs": [],
   "source": [
    "# temporary wrap in GeoSeries, assuming #1444 will be merged and sindex will be on array level\n",
    "inp, res = gpd.GeoSeries(pt).sindex.query_bulk(pl, predicate='intersects')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Boolean check:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 34,
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "output_type": "execute_result",
     "data": {
      "text/plain": "array([False,  True,  True,  True,  True,  True,  True,  True,  True,\n        True,  True,  True, False, False, False,  True, False,  True,\n        True,  True,  True,  True, False,  True,  True,  True,  True,\n       False,  True,  True,  True,  True,  True,  True,  True,  True,\n        True,  True,  True,  True,  True,  True, False, False,  True,\n        True, False, False, False, False,  True,  True,  True,  True,\n        True,  True,  True,  True,  True, False, False,  True, False,\n        True,  True,  True, False,  True, False,  True,  True,  True,\n        True,  True, False, False, False, False,  True, False,  True,\n        True, False,  True, False, False,  True, False,  True,  True,\n        True,  True,  True,  True,  True,  True, False,  True,  True,\n        True])"
     },
     "metadata": {},
     "execution_count": 34
    }
   ],
   "source": [
    "# get boolean (does it intersect any?)\n",
    "result = np.isin(np.arange(0, len(pl)), inp)\n",
    "result"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "List of indices:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "output_type": "execute_result",
     "data": {
      "text/plain": "array([None, array([14]), array([29, 37, 21]), array([63, 46, 35]),\n       array([11]), array([22]), array([97]), array([54]), array([45]),\n       array([29, 37, 41]), array([11, 33, 15]), array([64, 32, 90, 51]),\n       None, None, None, array([9, 4]), None, array([35]), array([77]),\n       array([91]), array([34]), array([51]), None, array([72]),\n       array([89, 77]), array([96, 53]), array([34]), None,\n       array([12, 43]), array([38, 95]), array([63, 35]), array([68, 75]),\n       array([68]), array([81]), array([71]), array([25, 57]),\n       array([89, 77]), array([9, 4]), array([42]), array([30]),\n       array([17, 40]), array([52]), None, None,\n       array([90, 24, 38, 95, 43]), array([51]), None, None, None, None,\n       array([47]), array([59]), array([14, 79]), array([11, 15]),\n       array([87, 49]), array([61, 30]), array([13, 86]), array([36]),\n       array([76, 88, 25]), None, None, array([90, 24, 38, 95]), None,\n       array([80]), array([97]), array([96, 53]), None, array([73]), None,\n       array([13, 86]), array([14, 93]), array([57,  6, 50]),\n       array([59, 20]), array([ 8, 69]), None, None, None, None,\n       array([57,  6, 50]), None, array([21, 17]), array([94, 76]), None,\n       array([52]), None, None, array([21, 17]), None, array([48, 78]),\n       array([52]), array([56, 79]), array([16]), array([82,  0,  7, 81]),\n       array([97]), array([5]), array([66, 29, 62]), None,\n       array([29, 62, 41]), array([83]), array([13, 86])], dtype=object)"
     },
     "metadata": {},
     "execution_count": 25
    }
   ],
   "source": [
    "# get indices\n",
    "unique, counts = np.unique(inp, return_counts=True)\n",
    "values = np.split(res, np.cumsum(counts)[:-1])\n",
    "indices = np.empty(len(pl), dtype='object')\n",
    "indices[unique] = values\n",
    "indices"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Benchmark\n",
    "\n",
    "Excluding `query_bulk` to compare two options.\n",
    "\n",
    "100 000 points and 10 000 polygons."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 46,
   "metadata": {},
   "outputs": [],
   "source": [
    "pts = [Point(randint(0, 100), randint(0, 100)) for _ in range(100000)]\n",
    "polys = [Point(randint(0, 100), randint(0, 100)).buffer(5, cap_style=3) for _ in range(10000)]\n",
    "\n",
    "pt = gpd.array.from_shapely(pts)\n",
    "pl = gpd.array.from_shapely(polys)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 47,
   "metadata": {},
   "outputs": [],
   "source": [
    "inp, res = gpd.GeoSeries(pt).sindex.query_bulk(pl, predicate='intersects')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 48,
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "output_type": "stream",
     "name": "stdout",
     "text": "170 ms ± 7.64 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)\n"
    }
   ],
   "source": [
    "%%timeit\n",
    "# get boolean (does it intersect any?)\n",
    "result = np.isin(np.arange(0, len(pl)), inp)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 49,
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "output_type": "stream",
     "name": "stdout",
     "text": "191 ms ± 5.35 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)\n"
    }
   ],
   "source": [
    "%%timeit\n",
    "# get indices\n",
    "unique, counts = np.unique(inp, return_counts=True)\n",
    "values = np.split(res, np.cumsum(counts)[:-1])\n",
    "indices = np.empty(len(pl), dtype='object')\n",
    "indices[unique] = values"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Including `query_bulk`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 50,
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "output_type": "stream",
     "name": "stdout",
     "text": "10.6 s ± 159 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)\n"
    }
   ],
   "source": [
    "%%timeit\n",
    "inp, res = gpd.GeoSeries(pt).sindex.query_bulk(pl, predicate='intersects')\n",
    "# get boolean (does it intersect any?)\n",
    "result = np.isin(np.arange(0, len(pl)), inp)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 51,
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "output_type": "stream",
     "name": "stdout",
     "text": "10.5 s ± 156 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)\n"
    }
   ],
   "source": [
    "%%timeit\n",
    "inp, res = gpd.GeoSeries(pt).sindex.query_bulk(pl, predicate='intersects')\n",
    "# get indices\n",
    "unique, counts = np.unique(inp, return_counts=True)\n",
    "values = np.split(res, np.cumsum(counts)[:-1])\n",
    "indices = np.empty(len(pl), dtype='object')\n",
    "indices[unique] = values"
   ]
  }
 ]
}
	{
	"metadata": {
	"language_info": {
	"codemirror_mode": {
	"name": "ipython",
	"version": 3
	},
	"file_extension": ".py",
	"mimetype": "text/x-python",
	"name": "python",
	"nbconvert_exporter": "python",
	"pygments_lexer": "ipython3",
	"version": "3.7.6-final"
	},
	"orig_nbformat": 2,
	"kernelspec": {
	"name": "geo_dev",
	"display_name": "geo_dev"
	}
	},
	"nbformat": 4,
	"nbformat_minor": 2,
	"cells": [
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"# Binary predicates in GeoPandas 1.0 - proposal\n",
	"\n",
	"Current behaviour of binary predicates in GeoPandas (e.g. `series1.intersects(sereis2)`) is not expected by many users (#972, #1316, #1422) and the expected one is not available. This notebook illustrates a proposal for a new behaviour, which should be implemented in 0.9 release and become default in 1.0."
	]
	},
	{
	"cell_type": "code",
	"execution_count": 10,
	"metadata": {},
	"outputs": [],
	"source": [
	"from random import randint\n",
	"\n",
	"import numpy as np\n",
	"import pandas as pd\n",
	"import geopandas as gpd\n",
	"\n",
	"from shapely.geometry import Point"
	]
	},
	{
	"cell_type": "code",
	"execution_count": 20,
	"metadata": {},
	"outputs": [],
	"source": [
	"pts = [Point(randint(0, 100), randint(0, 100)) for _ in range(100)]\n",
	"polys = [Point(randint(0, 100), randint(0, 100)).buffer(5, cap_style=3) for _ in range(100)]"
	]
	},
	{
	"cell_type": "code",
	"execution_count": 21,
	"metadata": {},
	"outputs": [],
	"source": [
	"pt = gpd.array.from_shapely(pts)\n",
	"pl = gpd.array.from_shapely(polys)"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"## Current behaviour\n",
	"\n",
	"Current behaviour is pair-wise. GeoSeries gets aligned and binary operation is done for each row."
	]
	},
	{
	"cell_type": "code",
	"execution_count": 22,
	"metadata": {},
	"outputs": [
	{
	"output_type": "execute_result",
	"data": {
	"text/plain": "array([False, False, False, False, False, False, False, False, False,\n False, False, False, False, False, False, False, False, False,\n False, False, False, False, False, False, False, False, False,\n False, False, False, False, False, False, False, False, False,\n False, False, False, False, True, False, False, False, False,\n False, False, False, False, False, False, False, False, False,\n False, False, False, False, False, False, False, False, False,\n False, False, False, False, False, False, False, False, False,\n False, False, False, False, False, False, False, False, False,\n False, False, False, False, False, False, False, False, False,\n False, False, False, False, False, False, False, False, False,\n False])"
	},
	"metadata": {},
	"execution_count": 22
	}
	],
	"source": [
	"pl.intersects(pt)"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"Users expect two kinds of behaviour:\n",
	"\n",
	"1. Does each object in `left` intersect _any_ object in `right`?\n",
	"2. Which objects from `right` intersect each object from `left`? - note that this is behaviour of `sf` and PostGIS (I think).\n",
	"\n",
	"## Proposal\n",
	"\n",
	"New kewyords, controlling the behaviour switching between current behaviour, boolean check against all geometries and list of indices. Not sure about the ideal way to handle it, see [#972](https://github.com/geopandas/geopandas/issues/972) for ongoing discussion. Users could then choose between:\n",
	"\n",
	"1. (Current) paired check\n",
	"2. Boolean check (intersects any?)\n",
	"3. List of indices\n",
	"\n",
	"Other option is to ignore second option and point to `~result_indices.isna()` for `any`.\n",
	"\n",
	"### Implementation\n",
	"\n",
	"Implementation could be done on `GeometryArray` level using `query_bulk` as following:"
	]
	},
	{
	"cell_type": "code",
	"execution_count": 23,
	"metadata": {},
	"outputs": [],
	"source": [
	"# temporary wrap in GeoSeries, assuming #1444 will be merged and sindex will be on array level\n",
	"inp, res = gpd.GeoSeries(pt).sindex.query_bulk(pl, predicate='intersects')"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"Boolean check:"
	]
	},
	{
	"cell_type": "code",
	"execution_count": 34,
	"metadata": {
	"tags": []
	},
	"outputs": [
	{
	"output_type": "execute_result",
	"data": {
	"text/plain": "array([False, True, True, True, True, True, True, True, True,\n True, True, True, False, False, False, True, False, True,\n True, True, True, True, False, True, True, True, True,\n False, True, True, True, True, True, True, True, True,\n True, True, True, True, True, True, False, False, True,\n True, False, False, False, False, True, True, True, True,\n True, True, True, True, True, False, False, True, False,\n True, True, True, False, True, False, True, True, True,\n True, True, False, False, False, False, True, False, True,\n True, False, True, False, False, True, False, True, True,\n True, True, True, True, True, True, False, True, True,\n True])"
	},
	"metadata": {},
	"execution_count": 34
	}
	],
	"source": [
	"# get boolean (does it intersect any?)\n",
	"result = np.isin(np.arange(0, len(pl)), inp)\n",
	"result"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"List of indices:"
	]
	},
	{
	"cell_type": "code",
	"execution_count": 25,
	"metadata": {
	"tags": []
	},
	"outputs": [
	{
	"output_type": "execute_result",
	"data": {
	"text/plain": "array([None, array([14]), array([29, 37, 21]), array([63, 46, 35]),\n array([11]), array([22]), array([97]), array([54]), array([45]),\n array([29, 37, 41]), array([11, 33, 15]), array([64, 32, 90, 51]),\n None, None, None, array([9, 4]), None, array([35]), array([77]),\n array([91]), array([34]), array([51]), None, array([72]),\n array([89, 77]), array([96, 53]), array([34]), None,\n array([12, 43]), array([38, 95]), array([63, 35]), array([68, 75]),\n array([68]), array([81]), array([71]), array([25, 57]),\n array([89, 77]), array([9, 4]), array([42]), array([30]),\n array([17, 40]), array([52]), None, None,\n array([90, 24, 38, 95, 43]), array([51]), None, None, None, None,\n array([47]), array([59]), array([14, 79]), array([11, 15]),\n array([87, 49]), array([61, 30]), array([13, 86]), array([36]),\n array([76, 88, 25]), None, None, array([90, 24, 38, 95]), None,\n array([80]), array([97]), array([96, 53]), None, array([73]), None,\n array([13, 86]), array([14, 93]), array([57, 6, 50]),\n array([59, 20]), array([ 8, 69]), None, None, None, None,\n array([57, 6, 50]), None, array([21, 17]), array([94, 76]), None,\n array([52]), None, None, array([21, 17]), None, array([48, 78]),\n array([52]), array([56, 79]), array([16]), array([82, 0, 7, 81]),\n array([97]), array([5]), array([66, 29, 62]), None,\n array([29, 62, 41]), array([83]), array([13, 86])], dtype=object)"
	},
	"metadata": {},
	"execution_count": 25
	}
	],
	"source": [
	"# get indices\n",
	"unique, counts = np.unique(inp, return_counts=True)\n",
	"values = np.split(res, np.cumsum(counts)[:-1])\n",
	"indices = np.empty(len(pl), dtype='object')\n",
	"indices[unique] = values\n",
	"indices"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"## Benchmark\n",
	"\n",
	"Excluding `query_bulk` to compare two options.\n",
	"\n",
	"100 000 points and 10 000 polygons."
	]
	},
	{
	"cell_type": "code",
	"execution_count": 46,
	"metadata": {},
	"outputs": [],
	"source": [
	"pts = [Point(randint(0, 100), randint(0, 100)) for _ in range(100000)]\n",
	"polys = [Point(randint(0, 100), randint(0, 100)).buffer(5, cap_style=3) for _ in range(10000)]\n",
	"\n",
	"pt = gpd.array.from_shapely(pts)\n",
	"pl = gpd.array.from_shapely(polys)"
	]
	},
	{
	"cell_type": "code",
	"execution_count": 47,
	"metadata": {},
	"outputs": [],
	"source": [
	"inp, res = gpd.GeoSeries(pt).sindex.query_bulk(pl, predicate='intersects')"
	]
	},
	{
	"cell_type": "code",
	"execution_count": 48,
	"metadata": {
	"tags": []
	},
	"outputs": [
	{
	"output_type": "stream",
	"name": "stdout",
	"text": "170 ms ± 7.64 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)\n"
	}
	],
	"source": [
	"%%timeit\n",
	"# get boolean (does it intersect any?)\n",
	"result = np.isin(np.arange(0, len(pl)), inp)"
	]
	},
	{
	"cell_type": "code",
	"execution_count": 49,
	"metadata": {
	"tags": []
	},
	"outputs": [
	{
	"output_type": "stream",
	"name": "stdout",
	"text": "191 ms ± 5.35 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)\n"
	}
	],
	"source": [
	"%%timeit\n",
	"# get indices\n",
	"unique, counts = np.unique(inp, return_counts=True)\n",
	"values = np.split(res, np.cumsum(counts)[:-1])\n",
	"indices = np.empty(len(pl), dtype='object')\n",
	"indices[unique] = values"
	]
	},
	{
	"cell_type": "markdown",
	"metadata": {},
	"source": [
	"Including `query_bulk`."
	]
	},
	{
	"cell_type": "code",
	"execution_count": 50,
	"metadata": {
	"tags": []
	},
	"outputs": [
	{
	"output_type": "stream",
	"name": "stdout",
	"text": "10.6 s ± 159 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)\n"
	}
	],
	"source": [
	"%%timeit\n",
	"inp, res = gpd.GeoSeries(pt).sindex.query_bulk(pl, predicate='intersects')\n",
	"# get boolean (does it intersect any?)\n",
	"result = np.isin(np.arange(0, len(pl)), inp)"
	]
	},
	{
	"cell_type": "code",
	"execution_count": 51,
	"metadata": {
	"tags": []
	},
	"outputs": [
	{
	"output_type": "stream",
	"name": "stdout",
	"text": "10.5 s ± 156 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)\n"
	}
	],
	"source": [
	"%%timeit\n",
	"inp, res = gpd.GeoSeries(pt).sindex.query_bulk(pl, predicate='intersects')\n",
	"# get indices\n",
	"unique, counts = np.unique(inp, return_counts=True)\n",
	"values = np.split(res, np.cumsum(counts)[:-1])\n",
	"indices = np.empty(len(pl), dtype='object')\n",
	"indices[unique] = values"
	]
	}
	]
	}