@jhpoelen
jhpoelen / find-openalex-bumbus-pubs.sh
Last active November 2, 2023 13:28
Which Bombus works (publications) are available in a versioned copy of https://openalex.org ?
#!/bin/bash
#
# Find publications (aka "works") that mention "Bombus" (bumblebees) in a versioned copy of OpenAlex.
#
# Requirements:
# preston - https://github.com/bio-guoda/preston
# grep (comes with linux distro)
# gunzip (comes with linux distro)
# jq - https://jqlang.github.io/jq/
# mlr - https://miller.readthedocs.io/
#!/bin/bash
#
#
curl "https://depot.globalbioticinteractions.org/snapshot/target/data/tsv/datasets.tsv"\
| awk '{ print "https://depot.globalbioticinteractions.org/reviews/" $1 "/indexed-interactions.tsv" }'\
| xargs -L1 curl\
| xargs -L1 DiscretePowerLawfitter.sh\
> angel-review-of-globi-datasets.tsv
#!/bin/bash
#
# related to https://github.com/Big-Bee-Network/UCSB-IZC00012194
#
preston ls\
| grep hasVersion\
| preston grep 'UCSB-IZC00012194.*body\ssize' --log tsv\
| grep value\
| cut -f1
jhpoelen / streaming-query.sh
Last active March 24, 2023 22:09
streaming query to extract records with collectionCode CASTYPE
#!/bin/bash
#
# prerequisites
# * preston https://github.com/bio-guoda/preston
# * pv pipeviewer https://linux.die.net/man/1/pv
# * mlr https://miller.readthedocs.io/en/6.7.0/
#
# executed/tested on 22.04.1-Ubuntu
#
jhpoelen / compile-data.sh
Last active October 16, 2020 01:05
bash script to get pollination records
#!/bin/bash
#
# 2020-10-15
#
# This script selects pollination and flower-visit records from
# one of the data products provided via https://globalbioticinteractions.org/data .
#
# This particular example uses a July 2020 data publication.
#
# For more recent data, see https://globalbioticinteractions.org/data .
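
The selection described in the comments above could look roughly like the following. This is a sketch, not the gist's actual script: the function name `select_pollination_records` is made up for illustration, and the column name `interactionTypeName` and the terms `pollinates` and `visitsFlowersOf` are assumptions about the layout of the interactions TSV.

```shell
#!/bin/bash
# Sketch only: keep the header plus every row whose interaction type is a
# pollination or flower-visit term. Column and term names are assumptions.
select_pollination_records() {
  awk -F'\t' '
    NR == 1 {
      # locate the (assumed) interactionTypeName column by its header
      for (i = 1; i <= NF; i++) if ($i == "interactionTypeName") col = i
      print
      next
    }
    # print a row only if the column was found and the term matches
    col && ($col == "pollinates" || $col == "visitsFlowersOf")
  '
}

# usage (hypothetical file names):
#   gunzip -c interactions.tsv.gz | select_pollination_records > pollination.tsv
```

Looking the column up by header name rather than by position keeps the filter working if a later data publication reorders its columns.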
jhpoelen / README.md
Last active September 25, 2020 18:43

Script used for counting records / taxa

The attached big-bee-globi-stats.log was generated on 2020-09-25 using the latest interactions.tsv.gz, with

$ sha256sum interactions.tsv.gz 
436f1249dc71bc948483bac0d6f13c667e9d69456ef727037637516468e9d29d
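
A local copy can be checked against the hash above before any counts are run. The following is a minimal sketch assuming GNU coreutils `sha256sum`; the helper name `verify_sha256` is hypothetical and not part of the original script.

```shell
#!/bin/bash
# Sketch only: succeed (exit 0) when the file's SHA-256 matches the expected hash.
verify_sha256() {
  local file="$1" expected="$2"
  # sha256sum -c reads "<hash>  <filename>" lines; "-" means read them from stdin,
  # and --status suppresses output so only the exit code reports the result
  echo "${expected}  ${file}" | sha256sum -c --status -
}

# usage:
#   verify_sha256 interactions.tsv.gz 436f1249dc71bc948483bac0d6f13c667e9d69456ef727037637516468e9d29d \
#     && echo "hash matches"
```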

To reproduce:

jhpoelen / count_geese.R
Last active May 27, 2020 23:50
reliable data use in R
prepare_ebird_2018_id <- function() {
ebird_data_location <- "http://ebirddata.ornith.cornell.edu/downloads/gbiff/dwca-1.0.zip"
# unfortunately, the eBird URL no longer works, but,
# using a time machine, we went back in time and republished the 2018 data via Zenodo
ebird_data_location <- "https://zenodo.org/record/3858251/files/dwca-1.0.zip"
ebird_data_id <- contentid::register(ebird_data_location)
ebird_data_id
}
jhpoelen / create-checklist-cluster.sh
Last active August 5, 2019 20:01
idigbio-spark scripts
#!/bin/bash
#
#
WKT_STRING="POLYGON ((-72.77293810620904 -33.196074154826235, -72.77293810620904 6.59516197881252, -28.12450060620904 6.59516197881252, -28.12450060620904 -33.196074154826235, -72.77293810620904 -33.196074154826235))"
spark-submit \
--master mesos://zk://mesos01:2181,mesos02:2181,mesos03:2181/mesos \
--driver-memory 4G \
--conf spark.sql.caseSensitive=true \
jhpoelen / downloadIds.txt
Last active May 15, 2019 23:24
Retrieve Occurrence Downloads Cited in Literature
0000036-150827100048397
0000037-150827100048397
0000039-150827100048397
0000040-150827100048397
0000048-150306150734599
0000061-150827100048397
0000062-150827100048397
0000067-150827100048397
0000068-150827100048397
0000069-150827100048397
jhpoelen / calculateKingdomToKingdomInteractions.scala
Last active April 5, 2019 20:29
IVMOOC 2019 GloBI Kingdom To Kingdom Interactions
val taxa = spark.read.option("delimiter","""\t""").option("header","true").csv("taxonCache.tsv.bz2")
taxa.printSchema
import spark.implicits._
val taxonCache = spark.read.option("delimiter","""\t""").option("header","true").csv("taxonCache.tsv.bz2")
// keep only rows with a non-null id, pathNames (rank labels) and path (taxon names)
val taxonIdsPaths = taxonCache.select("id", "pathNames", "path").as[(String, String, String)].filter(_._2 != null).filter( _._3 != null).filter(_._1 != null)
// pair each rank label with its taxon name, keep the kingdom name, and retain only GBIF, ITIS, WORMS and INAT_TAXON ids with a resolved kingdom
val taxaIdToKingdom = taxonIdsPaths.map( r=> (r._1, r._2.split("\\|").map(_.trim), r._3.split("\\|").map(_.trim))).map(r => (r._1, r._2.zip(r._3))).map(r => (r._1, r._2.filter(_._1 == "kingdom").map(_._2).mkString)).filter(_._2.nonEmpty).filter(r => List("GBIF", "ITIS","WORMS", "INAT_TAXON").contains(r._1.split(":").head)).filter(_._2 != "incertae sedis")
taxaIdToKingdom.write.option("delimiter","""\t""").csv("taxaIdToKingdom.tsv")