@sandys · June 17, 2013
Pig script with a JRuby UDF that reads Apache access-log data and loads it into HBase.
-- Jars required by piggybank's log loader and by HBaseStorage.
register ./contrib/piggybank/java/piggybank.jar;
register /home/user/Code/hadoop/hbase-0.94.4/lib/zookeeper-3.4.5.jar;
register /home/user/Code/hadoop/hbase-0.94.4/hbase-0.94.4.jar;
register /home/user/Code/hadoop/hbase-0.94.4/lib/guava-11.0.2.jar;
register /home/user/Code/hadoop/hbase-0.94.4/lib/protobuf-java-2.4.0a.jar;

register ./udf.rb using jruby as uuid_udf;
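-- Registering udf.rb through Pig's JRuby engine exposes each method of the
-- PigUdf subclass (defined in udf.rb below) as uuid_udf.<method>, which is
-- how uuid_udf.uuid() is resolved later in this script.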
/* Earlier attempts at shipping the log file through the distributed cache,
   kept for reference. A bare DEFINE ... CACHE(...) never parses, because
   CACHE is only legal as part of a streaming DEFINE:

DEFINE access_log_link CACHE('hdfs://localhost:54310/user/user/access_log#access_log_link');
set mapred.cache.localFiles 'hdfs://localhost:54310/user/user/access_log#access_log.1';
set mapred.create.symlink yes;
*/

-- CommonLogLoader parses Common Log Format lines into nine fields.
DEFINE LogLoader org.apache.pig.piggybank.storage.apachelog.CommonLogLoader();
/* The eleven-field variant matches piggybank's CombinedLogLoader, which
   also yields referer and userAgent:
log = LOAD 'hdfs://localhost:54310/user/user/access_log' USING LogLoader as (remoteAddr, remoteLogname, user, time, method, uri, proto, status, bytes, referer, userAgent);
*/
log = LOAD 'hdfs://localhost:54310/user/user/access_log' USING LogLoader as (remoteAddr, remoteLogname, user, time, method, uri, proto, status, bytes);

-- Prepend a generated UUID to every record; HBaseStorage will use it as the row key.
log_id = FOREACH log GENERATE uuid_udf.uuid(), remoteAddr, remoteLogname, user, time, method, uri, proto, status, bytes;

/* Alternative: derive the row key from existing fields instead of a UUID
   (referer is only available with the combined-format loader above):
log_id = FOREACH log GENERATE CONCAT(time, CONCAT(remoteAddr, referer)), remoteAddr, remoteLogname, user, time, method, uri, proto, status, bytes;
*/

/* dump log_id; */
store log_id into 'hbase://access'
    USING org.apache.pig.backend.hadoop.hbase.HBaseStorage(
        'log:remoteAddr log:remoteLogname log:user log:time log:method log:uri log:proto log:status log:bytes');
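HBaseStorage treats the first field of each tuple as the HBase row key and maps the remaining fields, in order, onto the columns listed in its constructor argument; that is why the UUID sits in the first position of log_id. The udf.rb registered at the top of the script: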
require 'pigudf'
require 'java' # enables JRuby/Java integration

# JRuby UDFs for the Pig script above. Each outputSchema call declares the
# Pig schema of the method that follows it.
class JRubyUdf < PigUdf
  outputSchema "uuid:chararray"
  def uuid()
    # Random UUID, used by the script as the HBase row key.
    java.util.UUID.randomUUID().toString()
  end

  outputSchema "type:chararray"
  def type(audit_record)
    # An audit record looks like "key=value key=value ...: data";
    # everything before the first ':' is the prefix.
    record_prefix, _record_data = audit_record.split(':', 2)
    # Build a hash of the prefix's key=value fields (node, type, msg, ...)
    prefix_fields = Hash[*record_prefix.split(' ').map { |t| t.split('=') }.flatten]
    prefix_fields['type'] # return just the type
  end
end
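As a quick sanity check outside Pig, the parsing in type() can be exercised in plain Ruby. The record below is an invented audit-style line for illustration, not data from this gist:

# Hypothetical audit-style record, made up for this example.
record = 'node=host1 type=SYSCALL msg=audit(1371463200.123:42): arch=c000003e'

prefix, _rest = record.split(':', 2)
fields = Hash[*prefix.split(' ').map { |t| t.split('=') }.flatten]
puts fields['type'] # prints "SYSCALL"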