Tim Robertson timrobertson100

----
-- Compress data (2.8 million record result set)
-- Runtime: 37 secs
----
SET hive.exec.compress.output=true;
SET io.seqfile.compression.type=BLOCK;
SET mapred.output.compression.codec=org.gbif.hadoop.compress.d2.D2Codec;
CREATE TABLE tim.occurrence_tab_def2
1) get tab file (runtime 2 mins):
$ hadoop dfs -getmerge /user/hive/warehouse/tim.db/occurrence_tab occurrence.txt
-> problem #1: a 5.4GB file just pulled off Hadoop
2) zip the file on the local filesystem (runtime 90 secs)
$ zip local.zip occurrence.txt
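
For reference, a minimal sketch of how the compressed CTAS above might continue; the preview cuts the statement off, so the delimiter and source table below are assumptions rather than the original statement:

-- hedged sketch only: tim.occurrence_src is a placeholder for the real source table
CREATE TABLE tim.occurrence_tab_def2
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
AS SELECT * FROM tim.occurrence_src;

With hive.exec.compress.output=true and the D2 codec set, the files written by this statement are compressed on the cluster, avoiding the uncompressed 5.4GB getmerge shown above.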
# Detailed steps to distribute a new codec for Hadoop, for use with Hive tab-delimited files.
##
# 1: Copy up the compress jar around the cluster
##
$ scp hadoop-compress-1.0-SNAPSHOT.jar root@c2n1.gbif.org:/usr/local/lib
$ scp hadoop-compress-1.0-SNAPSHOT.jar root@c2n2.gbif.org:/usr/local/lib
$ scp hadoop-compress-1.0-SNAPSHOT.jar root@c2n3.gbif.org:/usr/local/lib
##

Summary of the GBIF dev environment for the codec

This is what I did... Now we need to work out what was unnecessary!

Copy the jar around the slaves

$ scp hadoop-compress-1.0-SNAPSHOT.jar root@c2n1.gbif.org:/usr/local/lib
$ scp hadoop-compress-1.0-SNAPSHOT.jar root@c2n2.gbif.org:/usr/local/lib
$ scp hadoop-compress-1.0-SNAPSHOT.jar root@c2n3.gbif.org:/usr/local/lib

This explains the dip we are seeing for Plantae on http://oliver.gbif.org/global/

SELECT
  occ1.k, occ1.cnt, occ2.cnt, occ2.cnt - occ1.cnt as increase
FROM
  (SELECT COALESCE(kingdom, 'UNKNOWN') AS k, count(*) AS cnt 
   FROM occurrence_20140908 GROUP BY kingdom) occ1 
JOIN
  (SELECT COALESCE(kingdom, 'UNKNOWN') AS k, count(*) AS cnt 
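
The preview cuts the query off here; a hedged completion for illustration, where the second snapshot table name is a placeholder rather than the actual table used:

SELECT
  occ1.k, occ1.cnt, occ2.cnt, occ2.cnt - occ1.cnt AS increase
FROM
  (SELECT COALESCE(kingdom, 'UNKNOWN') AS k, count(*) AS cnt
   FROM occurrence_20140908 GROUP BY kingdom) occ1
JOIN
  (SELECT COALESCE(kingdom, 'UNKNOWN') AS k, count(*) AS cnt
   FROM occurrence_20141008 GROUP BY kingdom) occ2  -- placeholder for the later snapshot
ON occ1.k = occ2.k;

A negative value in the increase column for Plantae would correspond to the dip seen on the dashboard.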
@timrobertson100
timrobertson100 / gist:1f0d68c8339e88b7c7de
Last active August 29, 2015 14:06
Reducing occurrence download widths to match content

Optimizing the downloads for users

GBIF.org delivers very wide tables, which are unmanageable for many users and slow to work with. By returning only the columns that actually hold values in the data matched by a query, users get narrower tables that are easier to manage.

Currently we have 441 fields in occurrence_hdfs. Across all records, only 347 of these are populated in one or more records.

We could consider:

  1. creating occurrence_hdfs only as wide as it needs to be, e.g. skipping terms that are never populated (speeding up the download MR jobs)
  2. running the same check before each download query, which would likely reduce the width further depending on the biases in the returned data (see the sketch below)
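
A minimal sketch of the idea in point 2, assuming occurrence_hdfs exposes terms as columns; the column names and filter are illustrative only:

-- count(col) counts only non-NULL values, so a zero means the term is never
-- populated in this result set and the column can be dropped from the download
SELECT
  count(*)        AS total_records,
  count(kingdom)  AS kingdom_populated,
  count(locality) AS locality_populated
FROM occurrence_hdfs
WHERE countrycode = 'DK';  -- placeholder for the user's download filter

Run without the WHERE clause this identifies the terms that are never populated (option 1); run per download request it gives the narrower, query-specific width (option 2).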
@timrobertson100
timrobertson100 / gist:cbbd1175bcfce8132746
Created October 13, 2014 15:16
Complete download log
Task Logs: 'attempt_201410110944_1171_m_000000_0'
stdout logs
Oozie Launcher starts
Heart beat
Starting the execution of prepare actions
@timrobertson100
timrobertson100 / gist:f198cf16a14b347e9261
Last active August 29, 2015 14:07
Impala run script
#!/bin/bash
# Runs the templated script using impala-shell.
# Impala-shell does not support parameterized scripts the way Hive does, so sed is used to rewrite the Hive template
# before passing it to impala-shell.
# Parameters passed to the script
# TODO: consider using getopts and named arguments
export DB_NAME=$1
export TABLE_VERBATIM=$2
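
For context, a hedged sketch of what a templated SQL file driven by this script might look like; the statement and column names are illustrative, and only the ${DB_NAME} and ${TABLE_VERBATIM} placeholders come from the parameters above:

-- hypothetical template: sed substitutes ${DB_NAME} and ${TABLE_VERBATIM}
-- with the positional arguments before the rewritten file is run by impala-shell
CREATE TABLE IF NOT EXISTS ${DB_NAME}.occurrence_interpreted AS
SELECT id, kingdom, phylum
FROM ${DB_NAME}.${TABLE_VERBATIM};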
@timrobertson100
timrobertson100 / snapshotSchema.sql
Created October 22, 2014 06:50
Schema for analytics snapshots for LR
CREATE TABLE snapshot.occurrence_${date}(
id int,
dataset_id ${datasetType}, -- Note: this is STRING (UUID) in newer versions (Post Sept. 2013) and INT in older
publisher_id int,
kingdom string,
phylum string,
class_rank string,
order_rank string,
family string,
genus string,
@timrobertson100
timrobertson100 / gist:01818f7252e04cbf6c04
Last active August 29, 2015 14:08
Templated IDs in the DwC-A evolution

Ideas around W3C CSV Metadata for DwC-A evolution

Please note, this will become a rambling dump of my thoughts, so please do not copy or repeat any of this, as it may be inaccurate...

URI Templates to make links from content

RFC 6570 looks highly interesting for the DwC-A evolution.

It looks likely to allow us to separate identifiers from resolution, using a new standard for templated URIs. In the metadata we should be able to have something like:
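
For illustration only, a hypothetical template of the form http://example.org/taxon/{taxonID} could be declared in the metadata and expanded against the taxonID column of the data file, so the data carries only the bare identifier while resolvable links are produced when the archive is read.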