@abishekk92
Created February 12, 2013 09:10
Pig Script for keys with empty UPC/MPN
SET job.name 'General Data dump for $out, Job date: $date';
SET default_parallel 4;
REGISTER 'lib/datalight-0.5.jar'; -- register jar
DEFINE attrib com.indix.datalight.pig.udf.AttributeParserUdf(); -- define shortname
raw = LOAD 'webpage' USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('f:bas a:a d:*','-caching 100 -loadKey true -gte $startKey -lte $endKey') as (key:chararray, baseUrl:chararray, attributes:chararray, parsedData:map[]);
data = FOREACH raw GENERATE key, baseUrl, attrib(attributes, parsedData) AS productModel:map[];
-- Keep rows where both UPC and MPN are missing; Pig tests for null with IS NULL, not == NULL
rows_without_mpn = FILTER data BY (productModel#'upc' IS NULL) AND (productModel#'mpn' IS NULL);
final = FOREACH rows_without_mpn GENERATE key, baseUrl;
STORE final INTO 'pigdata/datadumps/general/$out/$date.gz/' USING PigStorage('\t');