Tim Robertson (timrobertson100)
package org.gbif.hadoop.compress;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.ByteBuffer;
import java.util.zip.CRC32;
import java.util.zip.Checksum;
import java.util.zip.Deflater;
import java.util.zip.DeflaterOutputStream;
/**
* Writes the custom fixed length footer to the stream.
*/
@Override
public void finish() throws IOException {
flush(); // make sure deflater flushes, and counts are accurate
// Push the custom footer to the output stream
ByteBuffer footer = ByteBuffer.allocate(26);
footer.put(FOOTER_CLOSE_DEFLATE); // 2 bytes: signals that the deflate stream can be read in isolation
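Only the first footer field appears above. For orientation, a minimal sketch of how a 26-byte fixed footer could be assembled with ByteBuffer follows; the marker value and the remaining field order (compressed length, uncompressed length, CRC-32) are assumptions for illustration, not necessarily the layout the gist actually uses.

import java.io.IOException;
import java.io.OutputStream;
import java.nio.ByteBuffer;
import java.util.zip.CRC32;

/** Hypothetical 26-byte footer: a 2-byte marker followed by three longs (2 + 8 + 8 + 8 = 26). */
final class FooterSketch {

  // assumed 2-byte marker; the real value lives in the gist's constants
  private static final byte[] FOOTER_CLOSE_DEFLATE = {(byte) 0xD2, 0x02};

  static void writeFooter(OutputStream out, long compressedBytes, long uncompressedBytes, CRC32 crc)
      throws IOException {
    ByteBuffer footer = ByteBuffer.allocate(26);
    footer.put(FOOTER_CLOSE_DEFLATE);   // 2 bytes: deflate stream can be read in isolation
    footer.putLong(compressedBytes);    // 8 bytes: assumed count of bytes written after compression
    footer.putLong(uncompressedBytes);  // 8 bytes: assumed count of bytes before compression
    footer.putLong(crc.getValue());     // 8 bytes: CRC-32 of the uncompressed data
    out.write(footer.array());          // push the fixed-length footer to the underlying stream
  }
}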
/**
* An end to end test that writes some random files and ensures that when deflated separately, merged and inflated
* they represent the same byte sequence as a concatenation of the original files.
*/
@Test
public void testParallelCompress() throws IOException {
// generate the uncompressed files and create a merged version
List<File> parts = Lists.newArrayList();
for (int i = 0; i < NUMBER_PARTS; i++) {
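The rest of the test is not shown above. The sketch below illustrates the idea it exercises using only java.util.zip rather than the gist's D2 classes: each part is raw-deflated with a SYNC_FLUSH so it ends without a final block, the compressed parts are concatenated, a closing empty final block is appended, and the merged stream inflates back to the concatenation of the original parts. Part counts and buffer sizes are arbitrary choices for the example.

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.Arrays;
import java.util.Random;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

/** Sketch of the deflate-separately, merge, inflate-once idea using only java.util.zip. */
public class ParallelDeflateSketch {

  public static void main(String[] args) throws IOException, DataFormatException {
    Random random = new Random(42);
    ByteArrayOutputStream original = new ByteArrayOutputStream(); // concatenation of the raw parts
    ByteArrayOutputStream merged = new ByteArrayOutputStream();   // concatenation of the deflated parts

    for (int i = 0; i < 3; i++) {
      byte[] part = new byte[64 * 1024];
      random.nextBytes(part);
      original.write(part);
      merged.write(deflateWithoutFinalBlock(part));
    }
    merged.write(finalEmptyBlock()); // terminate the combined stream with a single final block

    // inflate the merged stream and compare with the concatenated originals
    Inflater inflater = new Inflater(true); // raw deflate (nowrap)
    byte[] input = Arrays.copyOf(merged.toByteArray(), merged.size() + 1); // extra dummy byte required for nowrap
    inflater.setInput(input);
    ByteArrayOutputStream restored = new ByteArrayOutputStream();
    byte[] buf = new byte[8192];
    while (!inflater.finished()) {
      int n = inflater.inflate(buf);
      if (n == 0 && inflater.needsInput()) break;
      restored.write(buf, 0, n);
    }
    System.out.println("round trip ok: " + Arrays.equals(original.toByteArray(), restored.toByteArray()));
  }

  /** Deflates one part with SYNC_FLUSH so it ends on a byte boundary without a final block. */
  private static byte[] deflateWithoutFinalBlock(byte[] data) {
    Deflater deflater = new Deflater(Deflater.DEFAULT_COMPRESSION, true); // raw deflate
    deflater.setInput(data);
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    byte[] buf = new byte[8192];
    int n;
    do {
      n = deflater.deflate(buf, 0, buf.length, Deflater.SYNC_FLUSH);
      out.write(buf, 0, n);
    } while (n == buf.length); // per the javadoc, call again if the buffer was filled
    deflater.end();
    return out.toByteArray();
  }

  /** Produces the few closing bytes: an empty deflate block with the final bit set. */
  private static byte[] finalEmptyBlock() {
    Deflater deflater = new Deflater(Deflater.DEFAULT_COMPRESSION, true);
    deflater.finish(); // no input: only the final empty block is emitted
    byte[] buf = new byte[64];
    int n = deflater.deflate(buf);
    deflater.end();
    return Arrays.copyOf(buf, n);
  }
}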
----
-- Compress data (2.8 million record result set)
-- Runtime: 37 secs
----
SET hive.exec.compress.output=true;
SET io.seqfile.compression.type=BLOCK;
SET mapred.output.compression.codec=org.gbif.hadoop.compress.d2.D2Codec;
CREATE TABLE tim.occurrence_tab_def2
1) get tab file (runtime 2 mins):
$ hadoop dfs -getmerge /user/hive/warehouse/tim.db/occurrence_tab occurrence.txt
-> problem #1: a 5.4GB file just pulled off Hadoop
2) zip the file on the local filesystem (runtime 90 secs)
$ zip local.zip occurrence.txt
# Detailed steps to distribute a new codec for Hadoop for use with Hive tab-delimited files.
##
# 1: Copy up the compress jar around the cluster
##
$ scp hadoop-compress-1.0-SNAPSHOT.jar root@c2n1.gbif.org:/usr/local/lib
$ scp hadoop-compress-1.0-SNAPSHOT.jar root@c2n2.gbif.org:/usr/local/lib
$ scp hadoop-compress-1.0-SNAPSHOT.jar root@c2n3.gbif.org:/usr/local/lib
##

Summary of the GBIF dev environment for the codec

This is what I did... Now we need to work out what was unnecessary! (A quick check that the codec is visible is sketched below.)

Copy Jar around the slaves

$ scp hadoop-compress-1.0-SNAPSHOT.jar root@c2n1.gbif.org:/usr/local/lib
$ scp hadoop-compress-1.0-SNAPSHOT.jar root@c2n2.gbif.org:/usr/local/lib
$ scp hadoop-compress-1.0-SNAPSHOT.jar root@c2n3.gbif.org:/usr/local/lib
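To help work out which steps matter, a small driver like the sketch below can confirm whether the codec actually resolves on a node. It assumes the standard Hadoop CompressionCodecFactory API and that io.compression.codecs is extended to list D2Codec; everything beyond the D2Codec class name is illustrative rather than taken from the GBIF setup.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionCodecFactory;

/** Sketch: check that D2Codec resolves on a node once the jar is on the classpath. */
public class CodecCheck {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // illustrative only: normally io.compression.codecs is set in the cluster configuration
    conf.set("io.compression.codecs",
        "org.apache.hadoop.io.compress.DefaultCodec,"
            + "org.apache.hadoop.io.compress.GzipCodec,"
            + "org.gbif.hadoop.compress.d2.D2Codec");
    try {
      CompressionCodecFactory factory = new CompressionCodecFactory(conf);
      CompressionCodec codec = factory.getCodecByClassName("org.gbif.hadoop.compress.d2.D2Codec");
      System.out.println(codec == null
          ? "D2Codec is not registered"
          : "D2Codec loaded, default extension: " + codec.getDefaultExtension());
    } catch (IllegalArgumentException e) {
      // thrown when a listed codec class cannot be loaded, i.e. the jar is not on the classpath
      System.out.println("D2Codec could not be loaded: " + e.getMessage());
    }
  }
}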

This explains the dip we are seeing on Plantae on http://oliver.gbif.org/global/

SELECT
  occ1.k, occ1.cnt, occ2.cnt, occ2.cnt - occ1.cnt AS increase
FROM
  (SELECT COALESCE(kingdom, 'UNKNOWN') AS k, count(*) AS cnt 
   FROM occurrence_20140908 GROUP BY kingdom) occ1 
JOIN
  (SELECT COALESCE(kingdom, 'UNKNOWN') AS k, count(*) AS cnt 
timrobertson100 / gist:1f0d68c8339e88b7c7de
Last active August 29, 2015 14:06
Reducing occurrence download widths to match content

Optimizing the downloads for users

GBIF.org delivers really wide tables, which are unmanageable for many users and slow to work with. By returning only the columns that carry actual values for any given query, users get narrower tables that are easier to manage.

Currently we have 441 fields in occurrence_hdfs. Across all records, only 347 of them are populated in at least one record.

We could consider

  1. creating occurrence_hdfs only as wide as it needs to be, e.g. skipping terms that are never populated (speeding up download MR jobs)
  2. running the same populated-column check before each download query, which will likely reduce the width further depending on the biases in the data returned (a sketch of such a check follows this list)
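As a rough illustration of the populated-column check behind both options, the sketch below scans a tab-delimited export and reports which columns carry a value in at least one record. The assumed file layout (a header row, tab separators, and an empty string or \N meaning "no value") is hypothetical and not taken from the actual download format.

import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

/** Sketch: report which columns of a tab-delimited export hold a value in at least one record. */
public class PopulatedColumns {
  public static void main(String[] args) throws IOException {
    try (BufferedReader reader = Files.newBufferedReader(Paths.get(args[0]), StandardCharsets.UTF_8)) {
      String headerLine = reader.readLine();
      if (headerLine == null) {
        System.out.println("Empty file");
        return;
      }
      String[] header = headerLine.split("\t", -1);
      boolean[] populated = new boolean[header.length];
      String line;
      while ((line = reader.readLine()) != null) {
        String[] fields = line.split("\t", -1); // -1 keeps trailing empty columns
        for (int i = 0; i < fields.length && i < populated.length; i++) {
          if (!fields[i].isEmpty() && !"\\N".equals(fields[i])) {
            populated[i] = true; // column i carries a value in at least one record
          }
        }
      }
      List<String> keep = new ArrayList<>();
      for (int i = 0; i < header.length; i++) {
        if (populated[i]) {
          keep.add(header[i]);
        }
      }
      System.out.println(keep.size() + " of " + header.length + " columns are populated: " + keep);
    }
  }
}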
timrobertson100 / gist:cbbd1175bcfce8132746
Created October 13, 2014 15:16
Complete download log
Task Logs: 'attempt_201410110944_1171_m_000000_0'
stdout logs
Oozie Launcher starts
Heart beat
Starting the execution of prepare actions