David Shorthouse dshorthouse

dshorthouse / bloodhound.md
Last active February 20, 2020 16:40 — forked from timrobertson100/bloodhound.md
A quick test to explore a bloodhound process

This is a quick test of a modified version of the Bloodhound Spark script, to check that it runs on the GBIF Cloudera cluster (CDH 5.16.2).

From the gateway, copy the download archive out of HDFS (faster than fetching it over HTTP), unzip it (15-20 minutes), and upload the extracted files back to HDFS, removing any earlier copies first:

hdfs dfs -getmerge /occurrence-download/prod-downloads/0002504-181003121212138.zip /mnt/auto/misc/bloodhound/data.zip
unzip /mnt/auto/misc/bloodhound/data.zip -d /mnt/auto/misc/bloodhound/data
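
Since the unzip takes a while and the next commands delete the existing copies on HDFS, it may be worth confirming that the extraction produced the expected files first. The following is a minimal sketch, not part of the original process: `check_extracted` is a hypothetical helper, and it assumes that `verbatim.txt` and `occurrence.txt` land at the top level of the extraction directory.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the original script).
# Returns 0 only if both expected files exist and are non-empty.
# Assumption: the archive extracts verbatim.txt and occurrence.txt
# directly into the directory passed as the first argument.
check_extracted() {
  local dir="$1" missing=0 f
  for f in verbatim.txt occurrence.txt; do
    if [ ! -s "$dir/$f" ]; then
      echo "missing or empty: $dir/$f" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

Running `check_extracted /mnt/auto/misc/bloodhound/data` before touching HDFS would then stop the process early if the unzip was incomplete.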

hdfs dfs -rm -f /tmp/verbatim.txt
hdfs dfs -rm -f /tmp/occurrence.txt