@jhnwllr
Last active April 29, 2021 13:53
// Run in spark-shell or a notebook where `spark` is the active SparkSession.
import org.apache.spark.sql.functions._

// GBIF occurrence snapshot (2021-04-13) stored as Parquet on Azure Blob Storage
val wasbs_path = "wasbs://gbif@ai4edataeuwest.blob.core.windows.net/occurrence/20210413/occurrence.parquet/*"
val df = spark.read.parquet(wasbs_path)

// Total number of distinct species
df.select("specieskey").distinct().count()
// Number of distinct species per kingdom, largest first
df.select("kingdom","specieskey").distinct().groupBy("kingdom").count().orderBy(desc("count")).show()
// Total number of occurrence records
df.count()
// Number of occurrence records per kingdom, largest first
df.groupBy("kingdom").count().orderBy(desc("count")).show()
// Number of distinct datasets
df.select("datasetkey").distinct().count()
// Number of distinct publishers
df.select("publishingorgkey").distinct().count()
// Number of records per basis of record (observations, specimens, etc.), largest first
df.groupBy("basisofrecord").count().orderBy(desc("count")).show()
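// Possible variant, not in the original gist: distinct().count() counts a null
// specieskey as one extra value, whereas countDistinct skips nulls. Assuming some
// records lack a specieskey, this count would be lower by one.
df.agg(countDistinct("specieskey")).show()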
@timrobertson100

Thanks John
