@jhpoelen
Last active March 24, 2023 22:09
streaming query to extract records with collectionCode CASTYPE
#!/bin/bash
#
# prerequisites
# * preston https://github.com/bio-guoda/preston
# * pv pipeviewer https://linux.die.net/man/1/pv
# * mlr https://miller.readthedocs.io/en/6.7.0/
#
# executed/tested on 22.04.1-Ubuntu
#
# track the ~260GB GBIF occurrence download
preston track https://api.gbif.org/v1/occurrence/download/request/0015281-230224095556074.zip
# at/around 2023-03-18T04:24:10.713Z,
# the tracked content had identifier hash://sha256/c8bac8acb28c8524c53589b3a40e322dbbbdadf5689fef2e20266fbf6ddf6b97
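#
# stream the CSV from inside the tracked zip, count lines with pv,
# and keep only rows where collectionCode equals CASTYPE (mlr parses the input as TSV)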
preston cat 'zip:hash://sha256/c8bac8acb28c8524c53589b3a40e322dbbbdadf5689fef2e20266fbf6ddf6b97!/0015281-230224095556074.csv'\
| pv -l\
| mlr --tsvlite filter '$collectionCode == "CASTYPE"'
# produced:
# 2.07G 5:16:05 [ 109k/s] [
# [and no matched records]
#
# meaning 2.07 billion lines were counted in 5 hours and 16 minutes,
# an average of about 109 thousand lines per second.
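
Since the run above matched no records, it can help to confirm on a tiny input that an exact, case-sensitive match on collectionCode behaves as expected. The following is a minimal sketch that swaps mlr for awk as an independent cross-check; it is not part of the original script. The three-row sample and the function name `filter_castype` are made up for illustration.

```shell
# Hypothetical three-row sample standing in for the GBIF export
# (tab-separated, header line first). Not real GBIF data.
sample=$(printf 'gbifID\tcollectionCode\n1\tCAS\n2\tCASTYPE\n3\tcastype\n')

# Find the collectionCode column by its header name, print the header,
# then print every row whose value in that column is exactly "CASTYPE".
filter_castype() {
  awk -F'\t' '
    NR == 1 { for (i = 1; i <= NF; i++) if ($i == "collectionCode") col = i; print; next }
    col && $col == "CASTYPE"
  '
}

matches=$(printf '%s\n' "$sample" | filter_castype)
printf '%s\n' "$matches"
```

On the sample, only the row with gbifID 2 follows the header; the lowercase `castype` row is not matched. Unlike the mlr stage, this sketch prints the header even when nothing matches. In principle `filter_castype` could stand in for the mlr stage at the end of the `preston cat ... | pv -l` pipeline, though that has not been tested against the full download.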