@4rzael
Last active April 25, 2024 04:41
GIS with pySpark.
NOTE: Take a look at the comments below!

GIS with pySpark: A not-so-easy journey

Why would you do that?

Today, a lot of data is geolocated (meaning that it has a position in space). This is known as GIS data.

We often need to run operations on it, such as aggregations, and many optimisations exist for doing so.

The easiest way is to use either geopandas or a spatial database such as PostGIS, both of which support spatial joins, for example.
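
To make "spatial join" concrete, here is a minimal sketch in plain Python of what such a join computes: every point is paired with the polygon that contains it. It is not how geopandas or PostGIS implement it. The function names and the sample `zones` and `sensors` data are hypothetical, chosen only for illustration.

```python
from typing import Dict, List, Tuple

Point = Tuple[float, float]
Polygon = List[Point]  # vertices in order, first vertex not repeated


def point_in_polygon(point: Point, polygon: Polygon) -> bool:
    """Ray casting: count the edges crossed by a horizontal ray from the point."""
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does this edge straddle the horizontal line through the point?
        if (y1 > y) != (y2 > y):
            # x coordinate where the edge crosses that line
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def naive_spatial_join(
    points: Dict[str, Point], polygons: Dict[str, Polygon]
) -> List[Tuple[str, str]]:
    """Test every point against every polygon (all pairs)."""
    return [
        (point_id, polygon_id)
        for point_id, point in points.items()
        for polygon_id, polygon in polygons.items()
        if point_in_polygon(point, polygon)
    ]


# Hypothetical sample data: two square zones and three points.
zones = {
    "west": [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)],
    "east": [(2.0, 0.0), (3.0, 0.0), (3.0, 1.0), (2.0, 1.0)],
}
sensors = {"a": (0.5, 0.5), "b": (2.5, 0.25), "c": (5.0, 5.0)}

pairs = naive_spatial_join(sensors, zones)
print(pairs)
```

The number of containment tests here is the number of points times the number of polygons, which is fine for a handful of rows and painful for millions.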

The problem? We want to do it FAST. So we need a scalable way to do so, and here comes... SPAAAARK!

Spark [1] is a monster. It lets you store data and run computations on it in a distributed way.

But... how do we make it handle GIS data? And with Spark 2 and Python, if possible?

What tools should I use?

Magellan [2]

  • Compatible Spark2
  • Compatible pySpark
  • Efficient spatial joins
  • Correctly maintained

SpatialSpark [3]

  • Compatible Spark2
  • Compatible pySpark
  • Efficient spatial joins
  • Correctly maintained

pySpark + shapely (hacky way) [4]

  • Compatible Spark2
  • Compatible pySpark
  • Efficient spatial joins
  • Correctly maintained

pySpark + geopandas (hacky way) [5]

  • Compatible Spark2
  • Compatible pySpark
  • Efficient spatial joins
  • Correctly maintained

GeoSpark [6]

  • Compatible Spark2
  • Compatible pySpark
  • Efficient spatial joins
  • Correctly maintained

LocationSpark [7]

  • Compatible Spark2
  • Compatible pySpark
  • Efficient spatial joins
  • Correctly maintained
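
The "Efficient spatial joins" criterion in the lists above generally comes down to not testing every geometry against every other one. As a rough illustration only, the sketch below uses a uniform grid over polygon bounding boxes to prune candidates before the exact test. It is plain Python and is not the API or the index structure of any library listed here. `CELL_SIZE`, the function names, and the sample data are all assumptions.

```python
from collections import defaultdict
from math import floor
from typing import Dict, List, Tuple

Point = Tuple[float, float]
Polygon = List[Point]  # vertices in order, first vertex not repeated
Cell = Tuple[int, int]

CELL_SIZE = 1.0  # assumed grid resolution, in coordinate units


def contains(polygon: Polygon, point: Point) -> bool:
    """Exact point-in-polygon test (ray casting)."""
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        if (y1 > y) != (y2 > y):
            if x < x1 + (y - y1) * (x2 - x1) / (y2 - y1):
                inside = not inside
    return inside


def cell_of(x: float, y: float) -> Cell:
    """Grid cell that a coordinate falls into."""
    return (floor(x / CELL_SIZE), floor(y / CELL_SIZE))


def build_grid_index(polygons: Dict[str, Polygon]) -> Dict[Cell, List[str]]:
    """Map each grid cell to the polygons whose bounding box overlaps it."""
    index: Dict[Cell, List[str]] = defaultdict(list)
    for polygon_id, polygon in polygons.items():
        xs = [x for x, _ in polygon]
        ys = [y for _, y in polygon]
        min_cx, min_cy = cell_of(min(xs), min(ys))
        max_cx, max_cy = cell_of(max(xs), max(ys))
        for cx in range(min_cx, max_cx + 1):
            for cy in range(min_cy, max_cy + 1):
                index[(cx, cy)].append(polygon_id)
    return index


def indexed_join(
    points: Dict[str, Point], polygons: Dict[str, Polygon]
) -> Tuple[List[Tuple[str, str]], int]:
    """Join points to polygons, running the exact test only on grid candidates."""
    index = build_grid_index(polygons)
    pairs: List[Tuple[str, str]] = []
    exact_tests = 0
    for point_id, point in points.items():
        for polygon_id in index.get(cell_of(*point), []):
            exact_tests += 1
            if contains(polygons[polygon_id], point):
                pairs.append((point_id, polygon_id))
    return pairs, exact_tests


# Hypothetical sample data: two square zones and three points.
zones = {
    "west": [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)],
    "east": [(2.0, 0.0), (3.0, 0.0), (3.0, 1.0), (2.0, 1.0)],
}
sensors = {"a": (0.5, 0.5), "b": (2.5, 0.25), "c": (5.0, 5.0)}

joined, exact_tests = indexed_join(sensors, zones)
print(joined, exact_tests)
```

On this toy data the index runs two exact tests, where an all-pairs join would run six (three points times two polygons). In a distributed setting the same pruning idea matters even more, because it also limits how much data has to be compared across partitions.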

UPDATE 20/04/2018:

Several modules have now switched to Spark 2. However, no great alternative has been found. If you have a solution, please contact me so I can add it here.

References

  1. Spark
  2. Magellan
  3. SpatialSpark
  4. pySpark + Shapely
  5. pySpark + geopandas
  6. GeoSpark
  7. LocationSpark
@bflammers

Hi @4rzael,

GeoPySpark allows processing large amounts of raster data using PySpark. Unfortunately, operations like spatial joins on geometries are currently not supported. Please see this issue.

@harryprince

harryprince commented Apr 7, 2019

@4rzael,
GIS with Sparklyr is pretty easy:
See geospark and sf R packages.
They make it easier for traditional GIS users to handle geospatial big data, and there is a cheatsheet.

@heaven00

heaven00 commented Jul 8, 2019

Almost all codebases other than GeoSpark seem not to have been updated for quite a while.

@gchamon

gchamon commented Nov 12, 2021

RasterFrames seems to be an interesting contender

@mooseberrypi

Spark now has an official GIS extension called mosaic
