Set up a Dask Gateway cluster, scale it to eight workers, and print the dashboard URL
import dask_gateway
cluster = dask_gateway.GatewayCluster()
client = cluster.get_client()
cluster.scale(8)
print(cluster.dashboard_link)
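Instead of a fixed worker count, the gateway can also scale adaptively; a minimal sketch (the minimum/maximum bounds here are arbitrary choices, not from the original):
# Optional alternative to cluster.scale(8): let the gateway
# add and remove workers based on load (bounds are arbitrary).
cluster.adapt(minimum=2, maximum=8)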
As in the previous code example, import the Dask, STAC, and geospatial resources:
import pystac_client
import geopandas
import dask_geopandas
import contextily as ctx
import planetary_computer
import adlfs
import pyarrow.fs
import dask.dataframe as dd
from adlfs import AzureBlobFileSystem
catalog = pystac_client.Client.open(
    "https://planetarycomputer.microsoft.com/api/stac/v1/"
)
gbif = catalog.get_collection("gbif")
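If you're not sure which dated snapshots exist, the collection's STAC items can be listed first; a small sketch (assuming you only want the item IDs):
# Each GBIF snapshot is a separate STAC item whose ID encodes the date,
# e.g. "gbif-2022-03-01". List them to pick a snapshot.
for item in gbif.get_items():
    print(item.id)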
With the GBIF collection reference, get an "item" (a particular dated snapshot) and read all of its constituent parquet files. These nested files do not have the .parquet file extension, so the require_extension option is set to None.
Here we're reading a subset of the available columns and using filters (see the filter format explained at https://arrow.apache.org/docs/python/generated/pyarrow.parquet.ParquetDataset.html) to load only the records for vascular-plant specimens with HOLOTYPE or ISOTYPE status into a Dask DataFrame.
occ = planetary_computer.sign(gbif.get_item(id="gbif-2022-03-01"))
asset = occ.assets["data"]

ddf = dd.read_parquet(
    asset.href,
    storage_options=asset.extra_fields["table:storage_options"],
    dataset={"require_extension": None},
    columns=[
        "gbifid", "phylum", "scientificname", "locality", "typestatus",
        "decimallatitude", "decimallongitude", "year", "eventdate",
        "recordedby", "recordnumber", "basisofrecord", "taxonkey",
    ],
    engine="pyarrow",
    filters=[
        ("typestatus", "in", ("HOLOTYPE", "ISOTYPE")),
        ("basisofrecord", "=", "PRESERVED_SPECIMEN"),
        ("phylum", "=", "Tracheophyta"),
    ],
    # Broader alternative: every non-fossil vascular-plant record
    # filters=[("basisofrecord", "!=", "FOSSIL_SPECIMEN"), ("phylum", "=", "Tracheophyta")],
)
ddf
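Nothing has been read yet: ddf is a lazy Dask DataFrame, so only the metadata is available. A quick sanity check before computing:
# Cheap metadata-only checks; no parquet data is read here.
print(ddf.npartitions)
print(list(ddf.columns))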
Call compute to actually read and filter the data; watch the read progress in the Dask dashboard:
df = ddf.compute()
print(type(df))
df
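The geopandas and contextily imports above point at a mapping step; here is a hedged sketch for plotting the specimen locations on a basemap (the WGS84 CRS for GBIF coordinates and the marker/figure choices are assumptions, not from the original):
# Drop records without coordinates, then build point geometry.
pts = df.dropna(subset=["decimallatitude", "decimallongitude"])
gdf = geopandas.GeoDataFrame(
    pts,
    geometry=geopandas.points_from_xy(pts.decimallongitude, pts.decimallatitude),
    crs="EPSG:4326",  # assumed: GBIF decimal coordinates are WGS84
)

# Reproject to Web Mercator so the contextily basemap tiles align.
ax = gdf.to_crs(epsg=3857).plot(markersize=2, figsize=(10, 6))
ctx.add_basemap(ax, source=ctx.providers.OpenStreetMap.Mapnik)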