Skip to content

Instantly share code, notes, and snippets.

@ThomasG77
Forked from martinfleis/philly_sjoin.ipynb
Created March 11, 2022 22:06
Show Gist options
  • Star 0 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
  • Save ThomasG77/8be3539eabe0a1c2e07e749fb3ee6372 to your computer and use it in GitHub Desktop.
Save ThomasG77/8be3539eabe0a1c2e07e749fb3ee6372 to your computer and use it in GitHub Desktop.
Display the source blob
Display the rendered blob
Raw
{
"cells": [
{
"cell_type": "markdown",
"id": "07bcf8b0-616e-46de-9a9e-a9b2a6170b14",
"metadata": {},
"source": [
"# Dask-GeoPandas vs PostGIS vs GPU: Performance and Spatial Joins\n",
"\n",
"Paul Ramsey saw a spatial join done [using a GPU](https://medium.com/swlh/how-to-perform-fast-and-powerful-geospatial-data-analysis-with-gpu-48f16a168b10) and tried to [do the same with PostGIS](https://blog.crunchydata.com/blog/performance-and-spatial-joins), checking how fast that is compared to the GPU-based RAPIDS.AI solution. Since Paul used parallelisation in PostGIS, I got curious how fast [Dask-GeoPandas](https://github.com/geopandas/dask-geopandas) is on the same task.\n",
"\n",
"So, I gave it a go."
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "fbb29933-26d4-4d0d-b42e-d58490b773e0",
"metadata": {},
"outputs": [],
"source": [
"import download\n",
"import geopandas\n",
"import dask_geopandas\n",
"import dask.dataframe\n",
"from dask.distributed import Client, LocalCluster"
]
},
{
"cell_type": "markdown",
"id": "7a6d94d9-7514-4403-8861-c41c59a0549c",
"metadata": {},
"source": [
"Let's download the data using Paul's query, to ensure we work with the same CSV."
]
},
{
"cell_type": "raw",
"id": "e8192004-07ea-45f9-be2a-ab9defe52f75",
"metadata": {},
"source": [
"curl \"https://phl.carto.com/api/v2/sql?filename=parking_violations&format=csv&skipfields=cartodb_id,the_geom,the_geom_webmercator&q=SELECT%20*%20FROM%20parking_violations%20WHERE%20issue_datetime%20%3E=%20%272012-01-01%27%20AND%20issue_datetime%20%3C%20%272017-12-31%27\" > phl_parking.csv\n"
]
},
{
"cell_type": "markdown",
"id": "7387f030-6598-4d10-a5da-c99f775bb2a1",
"metadata": {},
"source": [
"And then download and unzip the neighbourhoods shapefile."
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "a3db1005-63b5-4af4-9770-521eb15059fe",
"metadata": {},
"outputs": [],
"source": [
"download.download(\n",
" \"https://github.com/azavea/geo-data/raw/master/Neighborhoods_Philadelphia/Neighborhoods_Philadelphia.zip\",\n",
" \"Neighborhoods_Philadelphia\", \n",
" kind=\"zip\"\n",
")"
]
},
{
"cell_type": "markdown",
"id": "effec323-54de-4a42-9434-a43bad28eea9",
"metadata": {},
"source": [
"Paul used a machine with 8 cores. Since I am on one with 16, I'll create a local cluster limited to 8 workers. That should be as close to Paul's machine as I can get without using some virtual one. Keep in mind that this distorts the benchmark as we use different processors with different performance. But the point here is to get a sense how fast can Dask-based solution be compared to PostGIS and the original GPU code."
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "268d9dd2-6aba-44aa-8993-864a1c707249",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
" <div style=\"width: 24px; height: 24px; background-color: #e1e1e1; border: 3px solid #9D9D9D; border-radius: 5px; position: absolute;\"> </div>\n",
" <div style=\"margin-left: 48px;\">\n",
" <h3 style=\"margin-bottom: 0px;\">Client</h3>\n",
" <p style=\"color: #9D9D9D; margin-bottom: 0px;\">Client-65779fdb-a0b9-11ec-9804-6b224c1cce8a</p>\n",
" <table style=\"width: 100%; text-align: left;\">\n",
"\n",
" <tr>\n",
" \n",
" <td style=\"text-align: left;\"><strong>Connection method:</strong> Cluster object</td>\n",
" <td style=\"text-align: left;\"><strong>Cluster type:</strong> distributed.LocalCluster</td>\n",
" \n",
" </tr>\n",
"\n",
" \n",
" <tr>\n",
" <td style=\"text-align: left;\">\n",
" <strong>Dashboard: </strong> <a href=\"http://127.0.0.1:8787/status\" target=\"_blank\">http://127.0.0.1:8787/status</a>\n",
" </td>\n",
" <td style=\"text-align: left;\"></td>\n",
" </tr>\n",
" \n",
"\n",
" </table>\n",
"\n",
" \n",
" <details>\n",
" <summary style=\"margin-bottom: 20px;\"><h3 style=\"display: inline;\">Cluster Info</h3></summary>\n",
" <div class=\"jp-RenderedHTMLCommon jp-RenderedHTML jp-mod-trusted jp-OutputArea-output\">\n",
" <div style=\"width: 24px; height: 24px; background-color: #e1e1e1; border: 3px solid #9D9D9D; border-radius: 5px; position: absolute;\">\n",
" </div>\n",
" <div style=\"margin-left: 48px;\">\n",
" <h3 style=\"margin-bottom: 0px; margin-top: 0px;\">LocalCluster</h3>\n",
" <p style=\"color: #9D9D9D; margin-bottom: 0px;\">826ebc2f</p>\n",
" <table style=\"width: 100%; text-align: left;\">\n",
" <tr>\n",
" <td style=\"text-align: left;\">\n",
" <strong>Dashboard:</strong> <a href=\"http://127.0.0.1:8787/status\" target=\"_blank\">http://127.0.0.1:8787/status</a>\n",
" </td>\n",
" <td style=\"text-align: left;\">\n",
" <strong>Workers:</strong> 8\n",
" </td>\n",
" </tr>\n",
" <tr>\n",
" <td style=\"text-align: left;\">\n",
" <strong>Total threads:</strong> 8\n",
" </td>\n",
" <td style=\"text-align: left;\">\n",
" <strong>Total memory:</strong> 125.54 GiB\n",
" </td>\n",
" </tr>\n",
" \n",
" <tr>\n",
" <td style=\"text-align: left;\"><strong>Status:</strong> running</td>\n",
" <td style=\"text-align: left;\"><strong>Using processes:</strong> True</td>\n",
"</tr>\n",
"\n",
" \n",
" </table>\n",
"\n",
" <details>\n",
" <summary style=\"margin-bottom: 20px;\">\n",
" <h3 style=\"display: inline;\">Scheduler Info</h3>\n",
" </summary>\n",
"\n",
" <div style=\"\">\n",
" <div>\n",
" <div style=\"width: 24px; height: 24px; background-color: #FFF7E5; border: 3px solid #FF6132; border-radius: 5px; position: absolute;\"> </div>\n",
" <div style=\"margin-left: 48px;\">\n",
" <h3 style=\"margin-bottom: 0px;\">Scheduler</h3>\n",
" <p style=\"color: #9D9D9D; margin-bottom: 0px;\">Scheduler-d7fc5983-16a7-4833-af77-ca78f7e1696c</p>\n",
" <table style=\"width: 100%; text-align: left;\">\n",
" <tr>\n",
" <td style=\"text-align: left;\">\n",
" <strong>Comm:</strong> tcp://127.0.0.1:35287\n",
" </td>\n",
" <td style=\"text-align: left;\">\n",
" <strong>Workers:</strong> 8\n",
" </td>\n",
" </tr>\n",
" <tr>\n",
" <td style=\"text-align: left;\">\n",
" <strong>Dashboard:</strong> <a href=\"http://127.0.0.1:8787/status\" target=\"_blank\">http://127.0.0.1:8787/status</a>\n",
" </td>\n",
" <td style=\"text-align: left;\">\n",
" <strong>Total threads:</strong> 8\n",
" </td>\n",
" </tr>\n",
" <tr>\n",
" <td style=\"text-align: left;\">\n",
" <strong>Started:</strong> Just now\n",
" </td>\n",
" <td style=\"text-align: left;\">\n",
" <strong>Total memory:</strong> 125.54 GiB\n",
" </td>\n",
" </tr>\n",
" </table>\n",
" </div>\n",
" </div>\n",
"\n",
" <details style=\"margin-left: 48px;\">\n",
" <summary style=\"margin-bottom: 20px;\">\n",
" <h3 style=\"display: inline;\">Workers</h3>\n",
" </summary>\n",
"\n",
" \n",
" <div style=\"margin-bottom: 20px;\">\n",
" <div style=\"width: 24px; height: 24px; background-color: #DBF5FF; border: 3px solid #4CC9FF; border-radius: 5px; position: absolute;\"> </div>\n",
" <div style=\"margin-left: 48px;\">\n",
" <details>\n",
" <summary>\n",
" <h4 style=\"margin-bottom: 0px; display: inline;\">Worker: 0</h4>\n",
" </summary>\n",
" <table style=\"width: 100%; text-align: left;\">\n",
" <tr>\n",
" <td style=\"text-align: left;\">\n",
" <strong>Comm: </strong> tcp://127.0.0.1:44711\n",
" </td>\n",
" <td style=\"text-align: left;\">\n",
" <strong>Total threads: </strong> 1\n",
" </td>\n",
" </tr>\n",
" <tr>\n",
" <td style=\"text-align: left;\">\n",
" <strong>Dashboard: </strong> <a href=\"http://127.0.0.1:44519/status\" target=\"_blank\">http://127.0.0.1:44519/status</a>\n",
" </td>\n",
" <td style=\"text-align: left;\">\n",
" <strong>Memory: </strong> 15.69 GiB\n",
" </td>\n",
" </tr>\n",
" <tr>\n",
" <td style=\"text-align: left;\">\n",
" <strong>Nanny: </strong> tcp://127.0.0.1:38075\n",
" </td>\n",
" <td style=\"text-align: left;\"></td>\n",
" </tr>\n",
" <tr>\n",
" <td colspan=\"2\" style=\"text-align: left;\">\n",
" <strong>Local directory: </strong> /home/jovyan/work/dask-worker-space/worker-y2nlzsdc\n",
" </td>\n",
" </tr>\n",
"\n",
" \n",
"\n",
" \n",
"\n",
" </table>\n",
" </details>\n",
" </div>\n",
" </div>\n",
" \n",
" <div style=\"margin-bottom: 20px;\">\n",
" <div style=\"width: 24px; height: 24px; background-color: #DBF5FF; border: 3px solid #4CC9FF; border-radius: 5px; position: absolute;\"> </div>\n",
" <div style=\"margin-left: 48px;\">\n",
" <details>\n",
" <summary>\n",
" <h4 style=\"margin-bottom: 0px; display: inline;\">Worker: 1</h4>\n",
" </summary>\n",
" <table style=\"width: 100%; text-align: left;\">\n",
" <tr>\n",
" <td style=\"text-align: left;\">\n",
" <strong>Comm: </strong> tcp://127.0.0.1:39219\n",
" </td>\n",
" <td style=\"text-align: left;\">\n",
" <strong>Total threads: </strong> 1\n",
" </td>\n",
" </tr>\n",
" <tr>\n",
" <td style=\"text-align: left;\">\n",
" <strong>Dashboard: </strong> <a href=\"http://127.0.0.1:37481/status\" target=\"_blank\">http://127.0.0.1:37481/status</a>\n",
" </td>\n",
" <td style=\"text-align: left;\">\n",
" <strong>Memory: </strong> 15.69 GiB\n",
" </td>\n",
" </tr>\n",
" <tr>\n",
" <td style=\"text-align: left;\">\n",
" <strong>Nanny: </strong> tcp://127.0.0.1:36679\n",
" </td>\n",
" <td style=\"text-align: left;\"></td>\n",
" </tr>\n",
" <tr>\n",
" <td colspan=\"2\" style=\"text-align: left;\">\n",
" <strong>Local directory: </strong> /home/jovyan/work/dask-worker-space/worker-olpznos9\n",
" </td>\n",
" </tr>\n",
"\n",
" \n",
"\n",
" \n",
"\n",
" </table>\n",
" </details>\n",
" </div>\n",
" </div>\n",
" \n",
" <div style=\"margin-bottom: 20px;\">\n",
" <div style=\"width: 24px; height: 24px; background-color: #DBF5FF; border: 3px solid #4CC9FF; border-radius: 5px; position: absolute;\"> </div>\n",
" <div style=\"margin-left: 48px;\">\n",
" <details>\n",
" <summary>\n",
" <h4 style=\"margin-bottom: 0px; display: inline;\">Worker: 2</h4>\n",
" </summary>\n",
" <table style=\"width: 100%; text-align: left;\">\n",
" <tr>\n",
" <td style=\"text-align: left;\">\n",
" <strong>Comm: </strong> tcp://127.0.0.1:41497\n",
" </td>\n",
" <td style=\"text-align: left;\">\n",
" <strong>Total threads: </strong> 1\n",
" </td>\n",
" </tr>\n",
" <tr>\n",
" <td style=\"text-align: left;\">\n",
" <strong>Dashboard: </strong> <a href=\"http://127.0.0.1:44973/status\" target=\"_blank\">http://127.0.0.1:44973/status</a>\n",
" </td>\n",
" <td style=\"text-align: left;\">\n",
" <strong>Memory: </strong> 15.69 GiB\n",
" </td>\n",
" </tr>\n",
" <tr>\n",
" <td style=\"text-align: left;\">\n",
" <strong>Nanny: </strong> tcp://127.0.0.1:43059\n",
" </td>\n",
" <td style=\"text-align: left;\"></td>\n",
" </tr>\n",
" <tr>\n",
" <td colspan=\"2\" style=\"text-align: left;\">\n",
" <strong>Local directory: </strong> /home/jovyan/work/dask-worker-space/worker-_4ta030_\n",
" </td>\n",
" </tr>\n",
"\n",
" \n",
"\n",
" \n",
"\n",
" </table>\n",
" </details>\n",
" </div>\n",
" </div>\n",
" \n",
" <div style=\"margin-bottom: 20px;\">\n",
" <div style=\"width: 24px; height: 24px; background-color: #DBF5FF; border: 3px solid #4CC9FF; border-radius: 5px; position: absolute;\"> </div>\n",
" <div style=\"margin-left: 48px;\">\n",
" <details>\n",
" <summary>\n",
" <h4 style=\"margin-bottom: 0px; display: inline;\">Worker: 3</h4>\n",
" </summary>\n",
" <table style=\"width: 100%; text-align: left;\">\n",
" <tr>\n",
" <td style=\"text-align: left;\">\n",
" <strong>Comm: </strong> tcp://127.0.0.1:37651\n",
" </td>\n",
" <td style=\"text-align: left;\">\n",
" <strong>Total threads: </strong> 1\n",
" </td>\n",
" </tr>\n",
" <tr>\n",
" <td style=\"text-align: left;\">\n",
" <strong>Dashboard: </strong> <a href=\"http://127.0.0.1:41377/status\" target=\"_blank\">http://127.0.0.1:41377/status</a>\n",
" </td>\n",
" <td style=\"text-align: left;\">\n",
" <strong>Memory: </strong> 15.69 GiB\n",
" </td>\n",
" </tr>\n",
" <tr>\n",
" <td style=\"text-align: left;\">\n",
" <strong>Nanny: </strong> tcp://127.0.0.1:39611\n",
" </td>\n",
" <td style=\"text-align: left;\"></td>\n",
" </tr>\n",
" <tr>\n",
" <td colspan=\"2\" style=\"text-align: left;\">\n",
" <strong>Local directory: </strong> /home/jovyan/work/dask-worker-space/worker-2djgim9g\n",
" </td>\n",
" </tr>\n",
"\n",
" \n",
"\n",
" \n",
"\n",
" </table>\n",
" </details>\n",
" </div>\n",
" </div>\n",
" \n",
" <div style=\"margin-bottom: 20px;\">\n",
" <div style=\"width: 24px; height: 24px; background-color: #DBF5FF; border: 3px solid #4CC9FF; border-radius: 5px; position: absolute;\"> </div>\n",
" <div style=\"margin-left: 48px;\">\n",
" <details>\n",
" <summary>\n",
" <h4 style=\"margin-bottom: 0px; display: inline;\">Worker: 4</h4>\n",
" </summary>\n",
" <table style=\"width: 100%; text-align: left;\">\n",
" <tr>\n",
" <td style=\"text-align: left;\">\n",
" <strong>Comm: </strong> tcp://127.0.0.1:34037\n",
" </td>\n",
" <td style=\"text-align: left;\">\n",
" <strong>Total threads: </strong> 1\n",
" </td>\n",
" </tr>\n",
" <tr>\n",
" <td style=\"text-align: left;\">\n",
" <strong>Dashboard: </strong> <a href=\"http://127.0.0.1:45125/status\" target=\"_blank\">http://127.0.0.1:45125/status</a>\n",
" </td>\n",
" <td style=\"text-align: left;\">\n",
" <strong>Memory: </strong> 15.69 GiB\n",
" </td>\n",
" </tr>\n",
" <tr>\n",
" <td style=\"text-align: left;\">\n",
" <strong>Nanny: </strong> tcp://127.0.0.1:46679\n",
" </td>\n",
" <td style=\"text-align: left;\"></td>\n",
" </tr>\n",
" <tr>\n",
" <td colspan=\"2\" style=\"text-align: left;\">\n",
" <strong>Local directory: </strong> /home/jovyan/work/dask-worker-space/worker-byyo7q01\n",
" </td>\n",
" </tr>\n",
"\n",
" \n",
"\n",
" \n",
"\n",
" </table>\n",
" </details>\n",
" </div>\n",
" </div>\n",
" \n",
" <div style=\"margin-bottom: 20px;\">\n",
" <div style=\"width: 24px; height: 24px; background-color: #DBF5FF; border: 3px solid #4CC9FF; border-radius: 5px; position: absolute;\"> </div>\n",
" <div style=\"margin-left: 48px;\">\n",
" <details>\n",
" <summary>\n",
" <h4 style=\"margin-bottom: 0px; display: inline;\">Worker: 5</h4>\n",
" </summary>\n",
" <table style=\"width: 100%; text-align: left;\">\n",
" <tr>\n",
" <td style=\"text-align: left;\">\n",
" <strong>Comm: </strong> tcp://127.0.0.1:39841\n",
" </td>\n",
" <td style=\"text-align: left;\">\n",
" <strong>Total threads: </strong> 1\n",
" </td>\n",
" </tr>\n",
" <tr>\n",
" <td style=\"text-align: left;\">\n",
" <strong>Dashboard: </strong> <a href=\"http://127.0.0.1:39121/status\" target=\"_blank\">http://127.0.0.1:39121/status</a>\n",
" </td>\n",
" <td style=\"text-align: left;\">\n",
" <strong>Memory: </strong> 15.69 GiB\n",
" </td>\n",
" </tr>\n",
" <tr>\n",
" <td style=\"text-align: left;\">\n",
" <strong>Nanny: </strong> tcp://127.0.0.1:45363\n",
" </td>\n",
" <td style=\"text-align: left;\"></td>\n",
" </tr>\n",
" <tr>\n",
" <td colspan=\"2\" style=\"text-align: left;\">\n",
" <strong>Local directory: </strong> /home/jovyan/work/dask-worker-space/worker-38_s7hpi\n",
" </td>\n",
" </tr>\n",
"\n",
" \n",
"\n",
" \n",
"\n",
" </table>\n",
" </details>\n",
" </div>\n",
" </div>\n",
" \n",
" <div style=\"margin-bottom: 20px;\">\n",
" <div style=\"width: 24px; height: 24px; background-color: #DBF5FF; border: 3px solid #4CC9FF; border-radius: 5px; position: absolute;\"> </div>\n",
" <div style=\"margin-left: 48px;\">\n",
" <details>\n",
" <summary>\n",
" <h4 style=\"margin-bottom: 0px; display: inline;\">Worker: 6</h4>\n",
" </summary>\n",
" <table style=\"width: 100%; text-align: left;\">\n",
" <tr>\n",
" <td style=\"text-align: left;\">\n",
" <strong>Comm: </strong> tcp://127.0.0.1:42725\n",
" </td>\n",
" <td style=\"text-align: left;\">\n",
" <strong>Total threads: </strong> 1\n",
" </td>\n",
" </tr>\n",
" <tr>\n",
" <td style=\"text-align: left;\">\n",
" <strong>Dashboard: </strong> <a href=\"http://127.0.0.1:35537/status\" target=\"_blank\">http://127.0.0.1:35537/status</a>\n",
" </td>\n",
" <td style=\"text-align: left;\">\n",
" <strong>Memory: </strong> 15.69 GiB\n",
" </td>\n",
" </tr>\n",
" <tr>\n",
" <td style=\"text-align: left;\">\n",
" <strong>Nanny: </strong> tcp://127.0.0.1:44941\n",
" </td>\n",
" <td style=\"text-align: left;\"></td>\n",
" </tr>\n",
" <tr>\n",
" <td colspan=\"2\" style=\"text-align: left;\">\n",
" <strong>Local directory: </strong> /home/jovyan/work/dask-worker-space/worker-k4x3hwuw\n",
" </td>\n",
" </tr>\n",
"\n",
" \n",
"\n",
" \n",
"\n",
" </table>\n",
" </details>\n",
" </div>\n",
" </div>\n",
" \n",
" <div style=\"margin-bottom: 20px;\">\n",
" <div style=\"width: 24px; height: 24px; background-color: #DBF5FF; border: 3px solid #4CC9FF; border-radius: 5px; position: absolute;\"> </div>\n",
" <div style=\"margin-left: 48px;\">\n",
" <details>\n",
" <summary>\n",
" <h4 style=\"margin-bottom: 0px; display: inline;\">Worker: 7</h4>\n",
" </summary>\n",
" <table style=\"width: 100%; text-align: left;\">\n",
" <tr>\n",
" <td style=\"text-align: left;\">\n",
" <strong>Comm: </strong> tcp://127.0.0.1:39301\n",
" </td>\n",
" <td style=\"text-align: left;\">\n",
" <strong>Total threads: </strong> 1\n",
" </td>\n",
" </tr>\n",
" <tr>\n",
" <td style=\"text-align: left;\">\n",
" <strong>Dashboard: </strong> <a href=\"http://127.0.0.1:40701/status\" target=\"_blank\">http://127.0.0.1:40701/status</a>\n",
" </td>\n",
" <td style=\"text-align: left;\">\n",
" <strong>Memory: </strong> 15.69 GiB\n",
" </td>\n",
" </tr>\n",
" <tr>\n",
" <td style=\"text-align: left;\">\n",
" <strong>Nanny: </strong> tcp://127.0.0.1:38189\n",
" </td>\n",
" <td style=\"text-align: left;\"></td>\n",
" </tr>\n",
" <tr>\n",
" <td colspan=\"2\" style=\"text-align: left;\">\n",
" <strong>Local directory: </strong> /home/jovyan/work/dask-worker-space/worker-d1_qblb5\n",
" </td>\n",
" </tr>\n",
"\n",
" \n",
"\n",
" \n",
"\n",
" </table>\n",
" </details>\n",
" </div>\n",
" </div>\n",
" \n",
"\n",
" </details>\n",
"</div>\n",
"\n",
" </details>\n",
" </div>\n",
"</div>\n",
" </details>\n",
" \n",
"\n",
" </div>\n",
"</div>"
],
"text/plain": [
"<Client: 'tcp://127.0.0.1:35287' processes=8 threads=8, memory=125.54 GiB>"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"client = Client(LocalCluster(n_workers=8, threads_per_worker=1))\n",
"client"
]
},
{
"cell_type": "markdown",
"id": "1e572437-2644-41cc-8c1c-27f585d72662",
"metadata": {},
"source": [
"With Dask, we create the whole pipeline to create a task graph and then run it all, so we won't have the timings for individual steps, just the total one.\n",
"\n",
"Read parking data CSV into a partitioned data frame (25mb per partition). "
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "f69be134-c9a8-4b9b-b92c-20115b816bbf",
"metadata": {},
"outputs": [],
"source": [
"ddf = dask.dataframe.read_csv(\"phl_parking.csv\", blocksize=25e6, assume_missing=True)"
]
},
{
"cell_type": "markdown",
"id": "5f17a9b8-58a4-4999-ac20-04061434baf1",
"metadata": {},
"source": [
"Create point geometry and assign it to the data frame, creating `dask_geopandas.GeoDataFrame`."
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "9ed3c741-0c0d-47ab-b3ef-05c4a5e6737b",
"metadata": {},
"outputs": [],
"source": [
"ddf = ddf.set_geometry(dask_geopandas.points_from_xy(ddf, x=\"lon\", y=\"lat\", crs=4326))"
]
},
{
"cell_type": "markdown",
"id": "9026027f-2d15-47b4-a6fb-e1c16fc7cc17",
"metadata": {},
"source": [
"Read neighbourhood polygons and reproject to EPSG:4326 (same as parking data)."
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "4d9e4350-7b9b-43e1-86a5-5cf746d24704",
"metadata": {},
"outputs": [],
"source": [
"neigh = geopandas.read_file(\"Neighborhoods_Philadelphia\").to_crs(4326)"
]
},
{
"cell_type": "markdown",
"id": "110dd7c1-f78f-46fa-933a-63cbd673bb0c",
"metadata": {},
"source": [
"Create the spatial join."
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "e70a9d45-8f3b-400b-9c46-43484efe6205",
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"/opt/conda/lib/python3.9/site-packages/dask_geopandas/sjoin.py:60: UserWarning: CRS mismatch between the CRS of left geometries and the CRS of right geometries.\n",
"Use `to_crs()` to reproject one of the input geometries to match the CRS of the other.\n",
"\n",
"Left CRS: None\n",
"Right CRS: EPSG:4326\n",
"\n",
" meta = geopandas.sjoin(left._meta, right._meta, how=how, predicate=predicate)\n"
]
}
],
"source": [
"joined = dask_geopandas.sjoin(ddf, neigh, predicate=\"within\")"
]
},
{
"cell_type": "markdown",
"id": "145535a7-e189-46a7-931a-89dc77d24fe5",
"metadata": {},
"source": [
"Finally, let's compute the result."
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "7ef84e4c-af87-41f4-a6ca-e47cbc159fe3",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"CPU times: user 9.25 s, sys: 2 s, total: 11.2 s\n",
"Wall time: 21.2 s\n"
]
}
],
"source": [
"%%time\n",
"r = joined.compute()"
]
},
{
"cell_type": "markdown",
"id": "60403e92-2ff8-4a04-a748-06946d08c02d",
"metadata": {},
"source": [
"Time on a local cluster with 8 workers and 1 thread per worker to pretend it is 8-core CPU:\n",
"\n",
" CPU times: user 9.34 s, sys: 2.09 s, total: 11.4 s\n",
" Wall time: 21.3 s\n",
" \n",
"The complete pipeline took 21.3 seconds including sending all data to a single process in the end to create a single partition joined GeoDataFrame. Usually, that is not necessary as you work with the data directly in Dask. It does take a few seconds guessing from the Dask Dashboard. \n",
"\n",
"Let's compare it to the PostGIS solution:\n",
"\n",
"- Reading in the 9M records from CSV takes about **29 seconds**\n",
"- Making a second copy while creating a geometry column takes about **24 seconds**\n",
"- The final query running with 4 workers takes **24 seconds**\n",
"\n",
"That gives us a total of **77 seconds**, compared to **21 seconds** using Dask-GeoPandas. Still slower than **13 seconds** using RAPIDS.AI (although that covers only the join itself, not reading and creating geometry, so my sense is that it will be almost equal). One aspect that makes the difference between Dask and PostGIS is that our pipeline is parallelised at every step - reading the CSV, creating points, generating spatial index (that is done under the hood in `sjoin`), the actual join. \n",
"\n",
"While Paul was using 8-core machine, PostGIS actually utilised only 4 cores (I am not sure why). Let's try to run our code limited to 4 workers as well.\n",
"\n",
" CPU times: user 9.53 s, sys: 2 s, total: 11.5 s\n",
" Wall time: 28.4 s\n",
"\n",
"**28 seconds** is a bit slower than before, but still quite fast!\n",
"\n",
"When comparing PostGIS and GPU solutions, Paul says\n",
"\n",
"> Basically, it is very hard to beat a bespoke performance solution with a general purpose tool. Yet, PostgreSQL/PostGIS comes within \"good enough\" range of a high end GPU solution, so that counts as a \"win\" to me.\n",
"\n",
"At the moment, Dask-GeoPandas is somewhere between PostGIS and bespoke solution. It does not offer as many functions as PostGIS, but it is designed as a general purpose tool. So I would say that we are all winners here.\n"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.7"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment