@jorisvandenbossche
Last active October 20, 2021 07:44
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# pyogrio - performance of reading with array of FIDs"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import geopandas\n",
"import geopandas.testing\n",
"import numpy as np\n",
"\n",
"import pyogrio"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Creating data\n",
"\n",
"We create a deliberately simple file (point geometries and a single integer attribute field), so that little time goes into parsing geometries and fields and the timings mostly reflect the overhead of reading by FID compared with reading everything:"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"N = 100_000"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"gdf = geopandas.GeoDataFrame({\"col\": range(N), \"geometry\": geopandas.points_from_xy(np.random.randn(N), np.random.randn(N))}, crs=\"EPSG:4326\")"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"pyogrio.write_dataframe(gdf, \"benchmark-data/test_points.gpkg\", driver=\"GPKG\")"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [],
"source": [
"pyogrio.write_dataframe(gdf, \"benchmark-data/test_points.shp\", driver=\"ESRI Shapefile\")"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [],
"source": [
"pyogrio.write_dataframe(gdf, \"benchmark-data/test_points.geojson\", driver=\"GeoJSON\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Capabilities of the different file formats:"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{'crs': 'EPSG:4326',\n",
" 'encoding': 'UTF-8',\n",
" 'fields': array(['col'], dtype=object),\n",
" 'geometry_type': 'Point',\n",
" 'features': 100000,\n",
" 'capabilities': {'random_read': 1, 'fast_set_next_by_index': 1}}"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pyogrio.read_info(\"benchmark-data/test_points.shp\")"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{'crs': 'EPSG:4326',\n",
" 'encoding': 'UTF-8',\n",
" 'fields': array(['col'], dtype=object),\n",
" 'geometry_type': 'Point',\n",
" 'features': 100000,\n",
" 'capabilities': {'random_read': 1, 'fast_set_next_by_index': 0}}"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pyogrio.read_info(\"benchmark-data/test_points.gpkg\")"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{'crs': 'EPSG:4326',\n",
" 'encoding': 'UTF-8',\n",
" 'fields': array(['col'], dtype=object),\n",
" 'geometry_type': 'Point',\n",
" 'features': 100000,\n",
" 'capabilities': {'random_read': 1, 'fast_set_next_by_index': 1}}"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"pyogrio.read_info(\"benchmark-data/test_points.geojson\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Benchmark reading performance"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [],
"source": [
"indices = np.arange(N, dtype=\"int32\")\n",
"indices_shuffled = np.arange(N, dtype=\"int32\")\n",
"np.random.shuffle(indices_shuffled)"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [],
"source": [
"res1 = pyogrio.read_dataframe(\"benchmark-data/test_points.shp\")\n",
"res2 = pyogrio.read_dataframe(\"benchmark-data/test_points.shp\", fids=indices)\n",
"geopandas.testing.assert_geodataframe_equal(res1, res2)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Reading Shapefile:"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"142 ms ± 898 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)\n"
]
}
],
"source": [
"%timeit pyogrio.read_dataframe(\"benchmark-data/test_points.shp\")"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"142 ms ± 988 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)\n"
]
}
],
"source": [
"%timeit pyogrio.read_dataframe(\"benchmark-data/test_points.shp\", fids=indices)"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"276 ms ± 2.41 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)\n"
]
}
],
"source": [
"%timeit pyogrio.read_dataframe(\"benchmark-data/test_points.shp\", fids=indices_shuffled)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Reading GeoPackage:"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"149 ms ± 1.3 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)\n"
]
}
],
"source": [
"%timeit pyogrio.read_dataframe(\"benchmark-data/test_points.gpkg\")"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [],
"source": [
"# GPKG starts to count at 1\n",
"indices_gpkg = indices + 1\n",
"indices_shuffled_gpkg = indices_shuffled + 1"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"1.33 s ± 11.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)\n"
]
}
],
"source": [
"%timeit pyogrio.read_dataframe(\"benchmark-data/test_points.gpkg\", fids=indices_gpkg)"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"1.46 s ± 81.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)\n"
]
}
],
"source": [
"%timeit pyogrio.read_dataframe(\"benchmark-data/test_points.gpkg\", fids=indices_shuffled_gpkg)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Reading GeoJSON:"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"698 ms ± 10 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)\n"
]
}
],
"source": [
"%timeit pyogrio.read_dataframe(\"benchmark-data/test_points.geojson\")"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"1.3 s ± 12.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)\n"
]
}
],
"source": [
"%timeit pyogrio.read_dataframe(\"benchmark-data/test_points.geojson\", fids=indices)"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"1.53 s ± 45.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)\n"
]
}
],
"source": [
"%timeit pyogrio.read_dataframe(\"benchmark-data/test_points.geojson\", fids=indices_shuffled)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Using indices instead of FIDs**:"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [],
"source": [
"res3 = pyogrio.read_dataframe(\"benchmark-data/test_points.shp\", indices=indices)\n",
"geopandas.testing.assert_geodataframe_equal(res1, res3)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"For Shapefile, reading the whole dataset with `indices` performs the same as reading it with FIDs (see above):"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"142 ms ± 283 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)\n"
]
}
],
"source": [
"%timeit pyogrio.read_dataframe(\"benchmark-data/test_points.shp\", indices=indices)"
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"278 ms ± 9.21 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)\n"
]
}
],
"source": [
"%timeit pyogrio.read_dataframe(\"benchmark-data/test_points.shp\", indices=indices_shuffled)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"For the other file formats, reading the whole dataset with `indices` would take too long, so we only read a small subset of 100 features:"
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {},
"outputs": [],
"source": [
"indices_subset = np.sort(indices_shuffled[:100])"
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"7.06 ms ± 89.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)\n"
]
}
],
"source": [
"%timeit pyogrio.read_dataframe(\"benchmark-data/test_points.shp\", fids=indices_subset)"
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"7.42 ms ± 482 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)\n"
]
}
],
"source": [
"%timeit pyogrio.read_dataframe(\"benchmark-data/test_points.shp\", indices=indices_subset)"
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"10.2 ms ± 125 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)\n"
]
}
],
"source": [
"%timeit pyogrio.read_dataframe(\"benchmark-data/test_points.gpkg\", fids=indices_subset+1)"
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"1.77 s ± 190 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)\n"
]
}
],
"source": [
"%timeit pyogrio.read_dataframe(\"benchmark-data/test_points.gpkg\", indices=indices_subset)"
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"693 ms ± 12.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)\n"
]
}
],
"source": [
"%timeit pyogrio.read_dataframe(\"benchmark-data/test_points.geojson\", fids=indices_subset)"
]
},
{
"cell_type": "code",
"execution_count": 31,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"CPU times: user 13.6 s, sys: 126 ms, total: 13.7 s\n",
"Wall time: 13.7 s\n"
]
}
],
"source": [
"%time _ = pyogrio.read_dataframe(\"benchmark-data/test_points.geojson\", indices=indices_subset)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Reading with bbox vs FIDs"
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {},
"outputs": [],
"source": [
"gdf = pyogrio.read_dataframe(\"benchmark-data/test_points.shp\")"
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {},
"outputs": [],
"source": [
"from shapely.geometry import box"
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {},
"outputs": [],
"source": [
"subset = gdf[gdf.intersects(box(0, 0, 1, 1))].reset_index(drop=True)\n",
"indices_subset = np.asarray(subset[\"col\"], dtype=\"int32\")"
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0.11667"
]
},
"execution_count": 27,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"len(subset) / len(gdf)"
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {},
"outputs": [],
"source": [
"res = pyogrio.read_dataframe(\"benchmark-data/test_points.shp\", bbox=(0, 0, 1, 1))\n",
"geopandas.testing.assert_geodataframe_equal(res, subset)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Shapefile:"
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"88.2 ms ± 4.34 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)\n"
]
}
],
"source": [
"%timeit pyogrio.read_dataframe(\"benchmark-data/test_points.shp\", bbox=(0, 0, 1, 1))"
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"43.6 ms ± 1.33 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)\n"
]
}
],
"source": [
"%timeit pyogrio.read_dataframe(\"benchmark-data/test_points.shp\", fids=indices_subset)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"GeoPackage:"
]
},
{
"cell_type": "code",
"execution_count": 31,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"81.6 ms ± 2.8 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)\n"
]
}
],
"source": [
"%timeit pyogrio.read_dataframe(\"benchmark-data/test_points.gpkg\", bbox=(0, 0, 1, 1))"
]
},
{
"cell_type": "code",
"execution_count": 32,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"301 ms ± 6.43 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)\n"
]
}
],
"source": [
"# GPKG FIDs start at 1, so shift the 0-based values by one\n",
"%timeit pyogrio.read_dataframe(\"benchmark-data/test_points.gpkg\", fids=indices_subset + 1)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"GeoJSON:"
]
},
{
"cell_type": "code",
"execution_count": 33,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"1.99 s ± 28.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)\n"
]
}
],
"source": [
"%timeit pyogrio.read_dataframe(\"benchmark-data/test_points.geojson\", bbox=(0, 0, 1, 1))"
]
},
{
"cell_type": "code",
"execution_count": 34,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"1.24 s ± 7.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)\n"
]
}
],
"source": [
"%timeit pyogrio.read_dataframe(\"benchmark-data/test_points.geojson\", fids=indices_subset)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Benchmark with complex geometries"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The `tl_2019_us_zcta510` shapefile rewritten as GeoPackage: this dataset has complex polygon geometries and several attribute fields."
]
},
{
"cell_type": "code",
"execution_count": 35,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"239 ms ± 4.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)\n"
]
}
],
"source": [
"%timeit pyogrio.read_dataframe(\"benchmark-data/test_us_zcta.gpkg\", max_features=1000)"
]
},
{
"cell_type": "code",
"execution_count": 36,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"282 ms ± 4.18 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)\n"
]
}
],
"source": [
"%timeit pyogrio.read_dataframe(\"benchmark-data/test_us_zcta.gpkg\", fids=np.arange(1, 1001, dtype=\"int32\"))"
]
},
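{
"cell_type": "markdown",
"metadata": {},
"source": [
"So in this case the overhead of reading by FID is much less significant: 282 ms versus 239 ms (about 18% slower), compared with 1.33 s versus 149 ms (roughly 9x slower) for the simple point GeoPackage above."
]
}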
],
"metadata": {
"kernelspec": {
"display_name": "Python (geo-dev)",
"language": "python",
"name": "geo-dev"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.2"
},
"widgets": {
"application/vnd.jupyter.widget-state+json": {
"state": {},
"version_major": 2,
"version_minor": 0
}
}
},
"nbformat": 4,
"nbformat_minor": 4
}