Tim Robertson timrobertson100

<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE mapper PUBLIC "-//mybatis.org//DTD Mapper 3.0//EN" "http://mybatis.org/dtd/mybatis-3-mapper.dtd" >
<mapper namespace="org.gbif.registry.persistence.mapper.DatasetProcessStatusMapper">
<resultMap id="CRAWL_JOB_MAP" type="CrawlJob">
<constructor>
<idArg column="dataset_key" javaType="java.util.UUID" jdbcType="OTHER"/>
<idArg column="attempt" javaType="int"/>
<arg column="endpoint_type" javaType="org.gbif.api.vocabulary.EndpointType" jdbcType="OTHER"/>
<arg column="target_url" javaType="java.net.URI"/>
timrobertson100 / lineage.txt
Last active October 31, 2017 10:55
Lineage example
interpretedOccurrence: {
id: 123,
decimalLatitude: 12.3445,
lineage: [
{
field: decimalLatitude,
source: rawOccurrence,
fields: [verbatimLatitude, verbatimLongitude, decimalLatitude, decimalLongitude, geodeticDatum, country, stateProvince],
services: [
{
timrobertson100 / NubSpeciesMatch.java
Created October 12, 2017 13:27
Retrofit example
package org.gbif.pipelines.core.functions;
import org.gbif.dwca.record.StarRecord;
import org.gbif.pipelines.core.functions.ws.gbif.SpeciesMatchClient;
import org.gbif.pipelines.io.avro.ExtendedRecord;
import org.gbif.pipelines.io.avro.TypedOccurrence;
import org.gbif.pipelines.io.avro.UntypedOccurrence;
import java.io.File;
import java.io.IOException;
timrobertson100 / term-freq.sql
Created September 29, 2017 20:34
Verbatim DwC Term Frequency (Occurrences, verbatim fields)
FROM prod_d.occurrence_hdfs
SELECT
sum(CASE WHEN v_type IS NULL THEN 0 ELSE 1 END) AS type ,
sum(CASE WHEN v_modified IS NULL THEN 0 ELSE 1 END) AS modified ,
sum(CASE WHEN v_language IS NULL THEN 0 ELSE 1 END) AS language ,
sum(CASE WHEN v_license IS NULL THEN 0 ELSE 1 END) AS license ,
sum(CASE WHEN v_rightsHolder IS NULL THEN 0 ELSE 1 END) AS rightsHolder ,
sum(CASE WHEN v_accessRights IS NULL THEN 0 ELSE 1 END) AS accessRights ,
sum(CASE WHEN v_bibliographicCitation IS NULL THEN 0 ELSE 1 END) AS bibliographicCitation ,
sum(CASE WHEN v_references IS NULL THEN 0 ELSE 1 END) AS references ,
timrobertson100 / summary.md
Last active August 21, 2017 14:57
Bug report for Oozie hive shared lib

Oozie shared lib for hive has joda-time inconsistencies

Reported through private contact (timrobertson100@gmail.com)

On CDH 5.12.0, Oozie Hive tasks running on the MapReduce engine can fail intermittently when array field types are present. The stack trace of a failing task is below.

I believe the Oozie Hive shared lib provides a nondeterministic classpath. joda-time-2-1.jar is present explicitly, but hive-exec.jar is a fat jar that also bundles joda-time without relocating it to a different package. I believe the bundled version is 1.6, pulled in through the hive-common transitive dependency.

Our workaround has been to duplicate the standard Oozie shared lib installed by Cloudera Manager and remove joda-time-2-1.jar from the copy. I am unsure whether this will affect execution on Spark.
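
One way to check for the suspected duplication is to scan a local copy of the shared lib directory and list every jar that bundles the same class. The following is a minimal sketch, not part of the original report: the class name `ClasspathDuplicates`, the method `jarsContaining`, and the directory passed on the command line are all hypothetical, and it only uses the standard `java.util.jar` and `java.nio.file` APIs.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.jar.JarFile;

/** Hypothetical helper: lists the jars in a directory that bundle a given class. */
public class ClasspathDuplicates {

  /** Returns the sorted file names of all *.jar files in libDir that contain classEntry. */
  public static List<String> jarsContaining(Path libDir, String classEntry) throws IOException {
    List<String> hits = new ArrayList<>();
    try (DirectoryStream<Path> jars = Files.newDirectoryStream(libDir, "*.jar")) {
      for (Path jar : jars) {
        try (JarFile jarFile = new JarFile(jar.toFile())) {
          // getEntry returns null when the jar does not contain the entry
          if (jarFile.getEntry(classEntry) != null) {
            hits.add(jar.getFileName().toString());
          }
        }
      }
    }
    Collections.sort(hits);
    return hits;
  }

  public static void main(String[] args) throws IOException {
    // args[0]: directory to scan, e.g. a local copy of the shared lib (assumed path)
    Path libDir = Paths.get(args[0]);
    List<String> hits = jarsContaining(libDir, "org/joda/time/DateTime.class");
    hits.forEach(System.out::println);
    if (hits.size() > 1) {
      System.out.println("More than one jar provides org.joda.time.DateTime.");
    }
  }
}
```

If more than one jar is listed, the version of the class that is actually loaded depends on classpath ordering, which would be consistent with the intermittent failures described above.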

timrobertson100 / Results.tsv
Created August 14, 2017 19:08
Export SQL and results for the GBIF redirects
/newsroom/uses/stanton-2014 /data-use/82531
/newsroom/uses/brown-et-al-2015 /data-use/82532
/newsroom/uses/2015-alter /data-use/82553
/newsroom/uses/2015-alter-et-al /data-use/82555
/newsroom/uses/2015-garcia-rosello-et-al /data-use/82563
/newsroom/uses/2015-escobar-et-al /data-use/82565
/newsroom/uses/2015-silva-rocha-et-al /data-use/82600
/newsroom/uses/2015-adhikari-et-al /data-use/82867
/newsroom/uses/2015-aguiar-et-al /data-use/82869
/newsroom/uses/2015-clavero-et-al /data-use/82870
timrobertson100 / documents.txt
Created August 14, 2017 15:51
Document redirects
+-----------------+-----------------+
| source          | target          |
+-----------------+-----------------+
| /resource/80496 | /document/80496 |
| /resource/80497 | /document/80497 |
| /resource/80498 | /document/80498 |
| /resource/80499 | /document/80499 |
| /resource/80500 | /document/80500 |
| /resource/80501 | /document/80501 |
| /resource/80502 | /document/80502 |
+-----------+-------------+
| source    | target      |
+-----------+-------------+
| /page/29  | /news/82292 |
| /page/30  | /news/82293 |
| /page/57  | /news/82294 |
| /page/62  | /news/82295 |
| /page/63  | /news/82296 |
| /page/64  | /news/82297 |
| /page/149 | /news/82298 |
+-----------------------------------------+-----------------+
| source                                  | target          |
+-----------------------------------------+-----------------+
| /newsroom/uses/2015-adhikari-et-al      | /data-use/82867 |
| /newsroom/uses/2015-aguiar-et-al        | /data-use/82869 |
| /newsroom/uses/2015-alimi-et-al         | /data-use/82916 |
| /newsroom/uses/2015-alter               | /data-use/82553 |
| /newsroom/uses/2015-alter-et-al         | /data-use/82555 |
| /newsroom/uses/2015-antonelli-et-al     | /data-use/82874 |
| /newsroom/uses/2015-baltensperger-et-al | /data-use/82912 |
timrobertson100 / GBIF-Densities.md
Created August 9, 2017 18:29
SQL for Exports for Tom A.

Please review SQL before using the data:

One degree

CREATE TABLE tim.one_deg ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' AS
SELECT 
  floor(decimalLatitude) AS lat,
  floor(decimalLongitude) AS lng,
  count(*) AS total
FROM