Russell Jurney rjurney
@rjurney
rjurney / MongoStorage.java
Created January 1, 2012 03:46
MongoStorage.java that does complex types
/*
* Copyright 2011 10gen Inc.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
@rjurney
rjurney / pig-fails-when-ordered
Created January 21, 2012 04:56
Ordered version that fails
REGISTER /me/pig/build/ivy/lib/Pig/avro-1.5.3.jar
REGISTER /me/pig/build/ivy/lib/Pig/json-simple-1.1.jar
REGISTER /me/pig/contrib/piggybank/java/piggybank.jar
REGISTER /me/pig/build/ivy/lib/Pig/jackson-core-asl-1.7.3.jar
REGISTER /me/pig/build/ivy/lib/Pig/jackson-mapper-asl-1.7.3.jar
REGISTER /me/mongo-hadoop/mongo-2.3.jar
REGISTER /me/mongo-hadoop/core/target/mongo-hadoop-core-1.0-SNAPSHOT.jar
REGISTER /me/mongo-hadoop/pig/target/mongo-pig-1.0-SNAPSHOT.jar
DEFINE AvroStorage org.apache.pig.piggybank.storage.avro.AvroStorage();
@rjurney
rjurney / gist:1651347
Created January 21, 2012 04:56
order error message for pig
2012-01-20 20:55:19,089 [Thread-71] INFO org.apache.hadoop.metrics.jvm.JvmMetrics - Cannot initialize JVM Metrics with processName=JobTracker, sessionId= - already initialized
2012-01-20 20:55:19,094 [Thread-71] WARN org.apache.hadoop.mapred.LocalJobRunner - job_local_0004
java.lang.RuntimeException: org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does not exist: file:/Users/rjurney/Collecting-Data/pigsample_1901945818_1327121718252
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.partitioners.WeightedRangePartitioner.setConf(WeightedRangePartitioner.java:156)
at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:62)
at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:117)
at org.apache.hadoop.mapred.MapTask$NewOutputCollector.<init>(MapTask.java:527)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:613)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
at org.apache.hadoop.mapred.LocalJobRunner$Job.ru
@rjurney
rjurney / pairs.pig
Created February 2, 2012 06:34
A macro to generate pairs of people linked by email communications, given an email that links them
/* Filter emails by the existence of header pairs: from and one of [to, cc, bcc].
   Project the pairs (there may be more than one to/cc/bcc), then emit them, lowercased. */
DEFINE header_pairs(email, col1, col2) RETURNS pairs {
  filtered = FILTER $email BY ($col1 IS NOT NULL) AND ($col2 IS NOT NULL);
  flat = FOREACH filtered GENERATE FLATTEN($col1) AS $col1, FLATTEN($col2) AS $col2;
  $pairs = FOREACH flat GENERATE LOWER($col1), LOWER($col2);
}
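/* The macro takes three arguments; the column names below are assumed from the comment above. */
pairs = header_pairs(email, from, to);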
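
For readers less familiar with Pig, the macro's filter, flatten, and lowercase steps can be sketched in plain Python. This is only an illustration, not the author's code: it assumes each email is a dict whose header fields hold lists of address strings (standing in for Pig bags), and the sample addresses are made up.

```python
from itertools import product


def header_pairs(emails, col1, col2):
    """Yield lowercased (col1, col2) address pairs for each email."""
    for email in emails:
        left, right = email.get(col1), email.get(col2)
        # Mirrors the FILTER: skip emails missing either header.
        if left is None or right is None:
            continue
        # Mirrors the two FLATTENs in one GENERATE: a cross product.
        for a, b in product(left, right):
            # Mirrors the final LOWER projection.
            yield a.lower(), b.lower()


# Hypothetical sample data, for illustration only.
emails = [
    {"from": ["Alice@Example.com"], "to": ["Bob@Example.com", "Carol@Example.com"]},
    {"from": ["Dave@Example.com"], "to": None},
]
pairs = list(header_pairs(emails, "from", "to"))
```

The second email is dropped because its `to` header is missing, and the first yields one pair per recipient.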
@rjurney
rjurney / total_count.macro
Created February 7, 2012 01:35
Pig macro to count records in a relation
/* Get a count of records. */
DEFINE total_count(relation) RETURNS total {
  $total = FOREACH (group $relation all) generate $relation as label, COUNT_STAR($relation) as total;
}
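
As a rough Python analogue of the counting step only (the `label` column is left out), the sketch below counts every record, including null ones, which is what distinguishes `COUNT_STAR` from `COUNT`. The function name and sample data are illustrative assumptions, and the sketch does not try to mirror how Pig treats an empty relation.

```python
def total_count(records):
    """Count every record in an iterable, null (None) records included."""
    # Like COUNT_STAR, this does not skip None values.
    return sum(1 for _ in records)


# Hypothetical sample relation with one null record.
sample = ["a@example.com", None, "c@example.com"]
total = total_count(sample)
```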
@rjurney
rjurney / ashamed.rb
Created February 9, 2012 03:17
Ugly as hell Ruby
# Instance variables are not valid block parameters in Ruby 1.9+, so assign @email explicitly.
get '/sent_distributions/:email' do |email|
  @email = email
  raw_data = mongo['sentdist'].find_one({:email => @email})['sent_dist']
  puts JSON raw_data
  @data = (0..23).map do |hour|
    key = to_key(hour)
    puts "Key: |#{key}|"
    value = Integer
    index = raw_data.find_index{ |record| record['sent_hour'] == key }
    if index
      value = raw_data[index]['total']
@rjurney
rjurney / bootstrap.erb
Created February 9, 2012 05:20
Bootstrap won't change colors of buttons to the ones in the docs :(
<div class="nav nav-pills">
  <% @data.each do |d| -%>
    <a style="margin: 3px;" class="btn btn-primary btn-large" href="/sent_distributions/<%= d['to'] -%>"><%= d['to'] -%></a>
  <% end -%>
</div>
Result: grey buttons, but they should be blue :(
@rjurney
rjurney / test_bag.pig
Created February 16, 2012 20:00
Returning a bag from a Jython Pig UDF
register /me/pig/build/ivy/lib/Pig/avro-1.5.3.jar
register /me/pig/build/ivy/lib/Pig/json-simple-1.1.jar
register /me/pig/contrib/piggybank/java/piggybank.jar
register /me/pig/build/ivy/lib/Pig/jackson-core-asl-1.7.3.jar
register /me/pig/build/ivy/lib/Pig/jackson-mapper-asl-1.7.3.jar
define AvroStorage org.apache.pig.piggybank.storage.avro.AvroStorage();
register 'udfs.py' using jython as myfuncs;
rmf /tmp/jython_test.txt
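
The contents of `udfs.py` are not shown here, so the following is only a guess at the general shape of a Jython UDF that returns a bag: a Python list of tuples, with the bag schema declared through Pig's `outputSchema` decorator. The function name `split_words` and its schema are hypothetical. Pig injects `outputSchema` when it loads the script, so the fallback definition exists only to let the file run outside Pig.

```python
try:
    outputSchema  # provided by Pig when the script runs under Jython
except NameError:
    # Stand-in so the module also imports outside Pig (assumption for testing).
    def outputSchema(schema):
        def decorate(func):
            func.outputSchema = schema
            return func
        return decorate


@outputSchema("words:bag{t:tuple(word:chararray)}")
def split_words(line):
    """Return a bag (list of one-field tuples) with one tuple per word."""
    if line is None:
        return None
    return [(word,) for word in line.split()]
```

From the Pig side, a function registered this way would be called as `myfuncs.split_words(some_field)`, given the `as myfuncs` namespace in the register statement above.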
@rjurney
rjurney / reproduce_avro_565.py
Created April 14, 2012 02:46
Reproducing AVRO-565
from avro import schema, datafile, io
# Simplified to include only the offending characters from Brazil, with no charset in the email header.
email_hash = {'body': "Verit\xc3\xa1\r\nEstat\xc3\xadstica\r\n"}
out_filename = '565.avro'
schema_string = """
{
"namespace": "agile.data.avro",
"name": "Email",
@rjurney
rjurney / mypig.pig
Created May 11, 2012 18:14
Pasting multiple globs -> One command
/* Piggybank */
register /me/pig/contrib/piggybank/java/piggybank.jar
/* Load Avro jars and define shortcut */
register /me/pig/build/ivy/lib/Pig/avro-1.5.3.jar
register /me/pig/build/ivy/lib/Pig/json-simple-1.1.jar
register /me/pig/build/ivy/lib/Pig/jackson-core-asl-1.7.3.jar
register /me/pig/build/ivy/lib/Pig/jackson-mapper-asl-1.7.3.jar
register /me/pig/build/ivy/lib/Pig/joda-time-1.6.jar