Tim Robertson timrobertson100

----
-- Compress data (2.8 million record result set)
-- Runtime: 37 secs
----
SET hive.exec.compress.output=true;
SET io.seqfile.compression.type=BLOCK;
SET mapred.output.compression.codec=org.gbif.hadoop.compress.d2.D2Codec;
CREATE TABLE tim.occurrence_tab_def2
1) get tab file (runtime 2 mins):
$ hadoop dfs -getmerge /user/hive/warehouse/tim.db/occurrence_tab occurrence.txt
-> problem #1: a 5.4GB file just pulled off Hadoop
2) zip the file on the local filesystem (runtime 90 secs)
$ zip local.zip occurrence.txt
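
For reference, a minimal sketch of how the compressed CTAS above might continue; the preview cuts the statement off, so the delimiter and source table below are assumptions rather than the original statement:

-- hedged sketch only: tim.occurrence_src is a placeholder for the real source table
CREATE TABLE tim.occurrence_tab_def2
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
AS SELECT * FROM tim.occurrence_src;

With hive.exec.compress.output=true and the D2 codec set, the files written by this statement are compressed on the cluster, avoiding the uncompressed 5.4GB getmerge shown above.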
# Detailed steps to distribute a new codec for Hadoop, for use with Hive tab-delimited files.
##
# 1: Copy up the compress jar around the cluster
##
$ scp hadoop-compress-1.0-SNAPSHOT.jar root@c2n1.gbif.org:/usr/local/lib
$ scp hadoop-compress-1.0-SNAPSHOT.jar root@c2n2.gbif.org:/usr/local/lib
$ scp hadoop-compress-1.0-SNAPSHOT.jar root@c2n3.gbif.org:/usr/local/lib
##

Summary of the GBIF dev environment for the codec

This is what I did... Now we need to work out what was unnecessary!

Copy the jar around the slaves

$ scp hadoop-compress-1.0-SNAPSHOT.jar root@c2n1.gbif.org:/usr/local/lib
$ scp hadoop-compress-1.0-SNAPSHOT.jar root@c2n2.gbif.org:/usr/local/lib
$ scp hadoop-compress-1.0-SNAPSHOT.jar root@c2n3.gbif.org:/usr/local/lib

This explains the dip we are seeing for Plantae on http://oliver.gbif.org/global/

SELECT
  occ1.k, occ1.cnt, occ2.cnt, occ2.cnt - occ1.cnt as increase
FROM
  (SELECT COALESCE(kingdom, 'UNKNOWN') AS k, count(*) AS cnt 
   FROM occurrence_20140908 GROUP BY kingdom) occ1 
JOIN
  (SELECT COALESCE(kingdom, 'UNKNOWN') AS k, count(*) AS cnt 
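
The preview cuts the query off here; a hedged completion for illustration, where the second snapshot table name is a placeholder rather than the actual table used:

SELECT
  occ1.k, occ1.cnt, occ2.cnt, occ2.cnt - occ1.cnt AS increase
FROM
  (SELECT COALESCE(kingdom, 'UNKNOWN') AS k, count(*) AS cnt
   FROM occurrence_20140908 GROUP BY kingdom) occ1
JOIN
  (SELECT COALESCE(kingdom, 'UNKNOWN') AS k, count(*) AS cnt
   FROM occurrence_20141008 GROUP BY kingdom) occ2  -- placeholder for the later snapshot
ON occ1.k = occ2.k;

A negative value in the increase column for Plantae would correspond to the dip seen on the dashboard.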
@timrobertson100
timrobertson100 / gist:1f0d68c8339e88b7c7de
Last active August 29, 2015 14:06
Reducing occurrence download widths to match content

Optimizing the downloads for users

GBIF.org delivers very wide tables, which are unmanageable for many users and slow to work with. By returning only the columns that actually hold values in the data matched by a query, users get narrower tables that are easier to manage.

Currently we have 441 fields in occurrence_hdfs. Across all records, only 347 of these are populated in one or more records.

We could consider:

  1. creating occurrence_hdfs only as wide as it needs to be, e.g. skipping terms that are never populated (speeding up the download MR jobs)
  2. running the same check before each download query, which would likely reduce the width further depending on the biases in the returned data (see the sketch below)
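
A minimal sketch of the idea in point 2, assuming occurrence_hdfs exposes terms as columns; the column names and filter are illustrative only:

-- count(col) counts only non-NULL values, so a zero means the term is never
-- populated in this result set and the column can be dropped from the download
SELECT
  count(*)        AS total_records,
  count(kingdom)  AS kingdom_populated,
  count(locality) AS locality_populated
FROM occurrence_hdfs
WHERE countrycode = 'DK';  -- placeholder for the user's download filter

Run without the WHERE clause this identifies the terms that are never populated (option 1); run per download request it gives the narrower, query-specific width (option 2).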
@timrobertson100
timrobertson100 / gist:cbbd1175bcfce8132746
Created October 13, 2014 15:16
Complete download log
Task Logs: 'attempt_201410110944_1171_m_000000_0'
stdout logs
Oozie Launcher starts
Heart beat
Starting the execution of prepare actions
@timrobertson100
timrobertson100 / gist:f198cf16a14b347e9261
Last active August 29, 2015 14:07
Impala run script
#!/bin/bash
# Runs the templated script using impala-shell.
# Impala-shell does not support parameterized scripts the way Hive does, so sed is used to rewrite the Hive template
# before passing it to impala-shell.
# Parameters passed to the script
# TODO: consider using getopts and named arguments
export DB_NAME=$1
export TABLE_VERBATIM=$2
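
For context, a hedged sketch of what a templated SQL file driven by this script might look like; the statement and column names are illustrative, and only the ${DB_NAME} and ${TABLE_VERBATIM} placeholders come from the parameters above:

-- hypothetical template: sed substitutes ${DB_NAME} and ${TABLE_VERBATIM}
-- with the positional arguments before the rewritten file is run by impala-shell
CREATE TABLE IF NOT EXISTS ${DB_NAME}.occurrence_interpreted AS
SELECT id, kingdom, phylum
FROM ${DB_NAME}.${TABLE_VERBATIM};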
@timrobertson100
timrobertson100 / snapshotSchema.sql
Created October 22, 2014 06:50
Schema for analytics snapshots for LR
CREATE TABLE snapshot.occurrence_${date}(
id int,
dataset_id ${datasetType}, -- Note: this is STRING (UUID) in newer versions (Post Sept. 2013) and INT in older
publisher_id int,
kingdom string,
phylum string,
class_rank string,
order_rank string,
family string,
genus string,
@timrobertson100
timrobertson100 / gist:01818f7252e04cbf6c04
Last active August 29, 2015 14:08
Templated IDs in the DwC-A evolution

Ideas around W3C CSV Metadata for DwC-A evolution

Please note, this will become a rambling dump of my thoughts, so please do not copy or repeat any of this, as it may be inaccurate...

URI Templates to make links from content

RFC 6570 looks highly interesting for the DwC-A evolution.

It looks likely to allow us to separate identifiers from resolution, using a new standard for templated URIs. In the metadata we should be able to have something like:
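
For illustration only, a hypothetical template of the form http://example.org/taxon/{taxonID} could be declared in the metadata and expanded against the taxonID column of the data file, so the data carries only the bare identifier while resolvable links are produced when the archive is read.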