Set up a Dask Gateway cluster, scale it to eight workers, and print the dashboard URL:

import dask_gateway

cluster = dask_gateway.GatewayCluster()
cluster.scale(8)
print(cluster.dashboard_link)
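
Aside: instead of a fixed worker count, Dask Gateway clusters can also scale adaptively; a minimal sketch using the same cluster object (the bounds here are illustrative):

# Let the gateway add or remove workers based on load, within these bounds
cluster.adapt(minimum=2, maximum=8)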

Get a client:

client = cluster.get_client()
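
Once created, the client registers itself as the default scheduler, so subsequent dask operations run on the gateway cluster; printing it shows the scheduler connection and current worker count:

print(client)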

As in the previous code example, import the Dask / STAC resources and open the Planetary Computer STAC catalog to get a reference to the GBIF collection:

import pystac_client
import geopandas
import dask_geopandas
import contextily as ctx
import planetary_computer
import adlfs
import pyarrow.fs
import dask.dataframe as dd
from adlfs import AzureBlobFileSystem

catalog = pystac_client.Client.open(
    "https://planetarycomputer.microsoft.com/api/stac/v1/"
)
gbif = catalog.get_collection("gbif")
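
To see which dated snapshots are available before picking one, the collection's items can be listed (a quick exploratory sketch):

# Each item in the GBIF collection is a dated snapshot of the occurrence data
for item in gbif.get_items():
    print(item.id)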

With the GBIF collection reference, get an "item" (a particular dated snapshot) and read all of its constituent parquet files into a dask dataframe. These nested files do not have the .parquet file extension, so the require_extension option is set to None. Here we read a subset of the available columns and use filters (the filter format is explained at https://arrow.apache.org/docs/python/generated/pyarrow.parquet.ParquetDataset.html) to keep only the records for vascular plant specimens with HOLOTYPE / ISOTYPE status.

import dask.dataframe as dd

occ = planetary_computer.sign(gbif.get_item(id="gbif-2022-03-01"))
asset = occ.assets["data"]

ddf = dd.read_parquet(
    asset.href,
    storage_options=asset.extra_fields["table:storage_options"],
    dataset={"require_extension": None},
    columns=[
        "gbifid",
        "phylum",
        "scientificname",
        "locality",
        "typestatus",
        "decimallatitude",
        "decimallongitude",
        "year",
        "eventdate",
        "recordedby",
        "recordnumber",
        "basisofrecord",
        "taxonkey",
    ],
    engine="pyarrow",
    # Predicates in a flat list are ANDed together
    filters=[
        ("typestatus", "in", ("HOLOTYPE", "ISOTYPE")),
        ("basisofrecord", "=", "PRESERVED_SPECIMEN"),
        ("phylum", "=", "Tracheophyta"),
    ],
    # Alternative: all non-fossil vascular plant records
    # filters=[("basisofrecord", "!=", "FOSSIL_SPECIMEN"), ("phylum", "=", "Tracheophyta")],
)
ddf
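
Note that a flat list of filter tuples is ANDed together. To express OR, pyarrow accepts disjunctive normal form: a list of lists, where predicates within each inner list are ANDed and the inner lists themselves are ORed. An illustrative (not run here) spelling of the typestatus condition in that form:

# Each inner list is ANDed; the outer lists are ORed:
# typestatus == HOLOTYPE OR typestatus == ISOTYPE
filters = [
    [("typestatus", "=", "HOLOTYPE")],
    [("typestatus", "=", "ISOTYPE")],
]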

Call compute to actually run the read and filter, materialising the result as an in-memory pandas dataframe; watch read progress in the dashboard:

df = ddf.compute()
print(type(df))
df
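
The geopandas import above suggests a mapping step; as a minimal sketch (assuming df from the previous cell), the computed records can be converted to a GeoDataFrame using their coordinates:

# Build point geometries from the occurrence coordinates (WGS84)
gdf = geopandas.GeoDataFrame(
    df,
    geometry=geopandas.points_from_xy(df.decimallongitude, df.decimallatitude),
    crs="EPSG:4326",
)
gdf.head()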