@bsmedberg
Created August 10, 2012 18:04
How do I use INDEXOF to filter in pig?
REGISTER 'socorro-toolbox-0.1-SNAPSHOT.jar'
REGISTER 'lib/akela-0.4-SNAPSHOT.jar'
SET pig.logfile improveskiplist.log;
SET default_parallel 2;
SET pig.tmpfilecompression true;
SET pig.tmpfilecompression.codec lzo;
DEFINE JsonMap com.mozilla.pig.eval.json.JsonMap();
DEFINE LookupFirstSourceFrame com.mozilla.socorro.pig.eval.LookupFirstSourceFrame();
raw = LOAD 'hbase://crash_reports' USING com.mozilla.pig.load.HBaseMultiScanLoader('$start_date', '$end_date',
                                                                                   'yyMMdd',
                                                                                   'meta_data:json,processed_data:json',
                                                                                   'true') AS
      (k:bytearray, meta_json:chararray, processed_json:chararray);
genmap = FOREACH raw GENERATE k, JsonMap(processed_json) AS processed_json_map:map[];
tr = FOREACH genmap GENERATE
       k,
       processed_json_map#'signature' AS signature,
       LookupFirstSourceFrame(processed_json_map#'dump',
                              (int) processed_json_map#'crashedThread') AS betterSignature;
-- INDEXOF returns -1 (not null) when the search string is not found, so compare the index itself.
flt = FILTER tr BY INDEXOF(betterSignature, signature) >= 0;
grouped = GROUP flt BY (signature, betterSignature);
summary = FOREACH grouped GENERATE FLATTEN(group), COUNT(flt);
STORE summary INTO 'improveskiplist-$start_date-$end_date' USING PigStorage();