Matthew Hayes matthayes

  • Databricks
  • San Francisco Bay Area
@matthayes
matthayes / quickstart1_1.sh
Last active December 24, 2015 07:59
DataFu's Hourglass: Quick Start
git clone git://git.apache.org/incubator-datafu.git datafu
cd datafu/contrib/hourglass
matthayes / example2_1.java
Last active December 24, 2015 07:59
DataFu's Hourglass: Example 2
Mapper<GenericRecord,GenericRecord,GenericRecord> mapper =
new Mapper<GenericRecord,GenericRecord,GenericRecord>() {
private transient Schema kSchema;
private transient Schema vSchema;
@Override
public void map(
GenericRecord input,
KeyValueCollector<GenericRecord, GenericRecord> collector)
throws IOException, InterruptedException
matthayes / example1_1.json
Last active December 24, 2015 07:58
DataFu's Hourglass: Example 1
{
"type" : "record", "name" : "ExampleEvent",
"namespace" : "datafu.hourglass.test",
"fields" : [ {
"name" : "id",
"type" : "long",
"doc" : "ID"
} ]
}
matthayes / gist:6189181
Last active December 20, 2015 20:18
When joining more than two relations, a better option than chaining multiple left joins is to use a COGROUP. Here we use the EmptyBagToNullFields UDF from DataFu to keep the code concise. This code relies on the fact that the input1 bag will be empty when there is no match, and flattening an empty bag removes the record. If the input2 or input3 bags are …
DEFINE EmptyBagToNullFields datafu.pig.bags.EmptyBagToNullFields();
input1 = LOAD 'input1' using PigStorage(',') AS (val1:INT,val2:INT);
input2 = LOAD 'input2' using PigStorage(',') AS (val1:INT,val2:INT);
input3 = LOAD 'input3' using PigStorage(',') AS (val1:INT,val2:INT);
data1 = COGROUP input1 BY val1, input2 BY val1, input3 BY val1;
data2 = FOREACH data1 GENERATE
FLATTEN(input1),
FLATTEN(EmptyBagToNullFields(input2)),
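To make the cogroup-and-flatten behaviour described above concrete, here is a minimal Python sketch of the same idea. It is a model of the semantics, not Pig or DataFu code: `cogroup`, `empty_bag_to_null_fields`, and `left_join_via_cogroup` are hypothetical helper names, rows are plain tuples keyed on their first field, and null join keys are assumed not to occur.

```python
from collections import defaultdict


def cogroup(*relations):
    """Group every relation's rows by their first field (val1).

    Returns {key: [bag_for_relation_0, bag_for_relation_1, ...]}.
    """
    groups = defaultdict(lambda: [[] for _ in relations])
    for i, relation in enumerate(relations):
        for row in relation:
            groups[row[0]][i].append(row)
    return groups


def empty_bag_to_null_fields(bag, width=2):
    """Hypothetical stand-in: an empty bag becomes one all-null tuple."""
    return bag if bag else [(None,) * width]


def left_join_via_cogroup(input1, input2, input3):
    """Left join input1 against input2 and input3 in a single grouping pass."""
    result = []
    for _key, (bag1, bag2, bag3) in sorted(cogroup(input1, input2, input3).items()):
        # Flattening several bags yields their cross product, so an empty
        # bag1 produces no rows at all (the key is dropped), while empty
        # bag2/bag3 are padded with nulls so the input1 rows survive.
        for r1 in bag1:
            for r2 in empty_bag_to_null_fields(bag2):
                for r3 in empty_bag_to_null_fields(bag3):
                    result.append(r1 + r2 + r3)
    return result


rows = left_join_via_cogroup(
    [(1, 10), (2, 20)],    # input1
    [(1, 100), (3, 300)],  # input2: key 3 has no match in input1
    [(2, 200)],            # input3
)
```

The point of the design is that all three relations are grouped once, so the result is produced in one pass instead of one pass per join.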
matthayes / gist:6189135
Created August 8, 2013 21:55
When joining more than two relations, a better option than chaining multiple left joins is to use a COGROUP. However, the code gets pretty ugly. This code relies on the fact that the input1 bag will be empty when there is no match, and flattening an empty bag removes the record. If the input2 or input3 bags are empty, we don't want flattening them to remove…
input1 = LOAD 'input1' using PigStorage(',') AS (val1:INT,val2:INT);
input2 = LOAD 'input2' using PigStorage(',') AS (val1:INT,val2:INT);
input3 = LOAD 'input3' using PigStorage(',') AS (val1:INT,val2:INT);
data1 = COGROUP input1 BY val1, input2 BY val1, input3 BY val1;
data2 = FOREACH data1 GENERATE
FLATTEN(input1), -- left join on this
FLATTEN((IsEmpty(input2) ? TOBAG(TOTUPLE((int)null,(int)null)) : input2))
as (input2::val1,input2::val2),
FLATTEN((IsEmpty(input3) ? TOBAG(TOTUPLE((int)null,(int)null)) : input3))
matthayes / left_joins.pig
Created August 8, 2013 20:46
When joining more than two relations, one option is to use multiple joins. However, this means multiple MapReduce jobs, which is inefficient.
input1 = LOAD 'input1' using PigStorage(',') AS (val1:INT,val2:INT);
input2 = LOAD 'input2' using PigStorage(',') AS (val1:INT,val2:INT);
input3 = LOAD 'input3' using PigStorage(',') AS (val1:INT,val2:INT);
data1 = JOIN input1 BY val1 LEFT, input2 BY val1;
data1 = FILTER data1 BY input1::val1 IS NOT NULL;
data2 = JOIN data1 BY input1::val1 LEFT, input3 BY val1;
data2 = FILTER data2 BY input1::val1 IS NOT NULL;
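The chained approach can be modelled in a few lines of Python to show why it costs one pass per join. This is only an illustrative sketch: `left_join` is a hypothetical helper, rows are tuples keyed on their first field, and the null-key filter mirrors the `FILTER ... IS NOT NULL` steps above.

```python
def left_join(left, right, right_width=2):
    """Left outer join on the first field of each row."""
    index = {}
    for row in right:
        if row[0] is not None:  # null keys never match in a join
            index.setdefault(row[0], []).append(row)
    joined = []
    for row in left:
        # Unmatched left rows are kept and padded with nulls.
        matches = index.get(row[0]) or [(None,) * right_width]
        joined.extend(row + match for match in matches)
    return joined


def chained_left_joins(input1, input2, input3):
    # First pass: input1 LEFT JOIN input2, then drop rows with a null key.
    data1 = [r for r in left_join(input1, input2) if r[0] is not None]
    # Second pass: the intermediate result LEFT JOIN input3.
    data2 = [r for r in left_join(data1, input3) if r[0] is not None]
    return data2
```

Each call to `left_join` scans its full input, and the second call cannot start until the first has finished, which is the analogue of running two separate jobs.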
matthayes / gist:6128024
Last active March 8, 2018 13:32
An example in Pig using the In UDF from DataFu to filter based on a field belonging to a set of values. Here the field 'adj' is tested against the set {red,blue}. This is much simpler than a list of conditions joined by OR.
DEFINE In datafu.pig.util.In();
data = LOAD 'input' using PigStorage(',') AS (what:chararray, adj:chararray);
dump data;
-- (roses,red)
-- (violets,blue)
-- (sugar,sweet)
data2 = FILTER data BY In(adj, 'red','blue');
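The filter above amounts to a set-membership test. A minimal Python sketch of that idea follows; `in_udf` is a hypothetical stand-in rather than the DataFu implementation, and treating a null value as "no match" is an assumption of this sketch.

```python
def in_udf(value, *candidates):
    """Return True if value equals any of the candidates."""
    if value is None:  # assumption: a null field never matches
        return False
    return value in candidates


data = [("roses", "red"), ("violets", "blue"), ("sugar", "sweet")]

# Membership test: adding another accepted value only adds one argument.
data2 = [row for row in data if in_udf(row[1], "red", "blue")]

# Equivalent filter written as conditions joined by `or`.
data2_or = [row for row in data if row[1] == "red" or row[1] == "blue"]
```

Both filters keep the same rows; the membership form simply scales better as the list of accepted values grows.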
matthayes / filtering_with_or.pig
Last active December 20, 2015 12:00
An example in Pig of filtering data using conditional logic. Here a tuple is accepted if 'adj' equals either 'red' or 'blue'. As the number of conditions grows, this becomes tedious to write.
data = LOAD 'input' using PigStorage(',') AS (what:chararray, adj:chararray);
dump data;
-- (roses,red)
-- (violets,blue)
-- (sugar,sweet)
data2 = FILTER data BY adj == 'red' OR adj == 'blue';
dump data2;
matthayes / coalesce_null.pig
Created August 1, 2013 00:37
Example using DataFu's COALESCE to replace a value with zero if it is null.
define COALESCE datafu.pig.util.Coalesce();
data = LOAD 'input' using PigStorage(',') AS (val:INT);
dump data;
-- (1)
-- ()
data2 = FOREACH data GENERATE COALESCE(val,0) as result;
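The coalesce pattern used here can be sketched in Python as "return the first non-null argument". The function below is a hypothetical illustration of that behaviour, not the DataFu source, and `None` stands in for a Pig null.

```python
def coalesce(*values):
    """Return the first argument that is not None, or None if all are."""
    for value in values:
        if value is not None:
            return value
    return None


data = [1, None]

# Replace nulls with zero, as in COALESCE(val, 0).
data2 = [coalesce(val, 0) for val in data]

# The same replacement written as a conditional expression.
data2_ternary = [val if val is not None else 0 for val in data]
```

With a single fallback the two forms are equivalent; the function form stays readable when there are several fallbacks to try in order.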
matthayes / ternary_null.pig
Created August 1, 2013 00:36
Example using a ternary operator to check if a value is null and replace it with zero if it is.
data = LOAD 'input' using PigStorage(',') AS (val:INT);
dump data;
-- (1)
-- ()
data2 = FOREACH data GENERATE (val IS NOT NULL ? val : 0) as result;
dump data2;
-- (1)