@calvincorreli
Created November 21, 2010 15:25
Parse CloudFront logs for a non-streaming distribution and count bandwidth per object whose path matches /uploads/assets/file/...
register file:/home/hadoop/lib/pig/piggybank.jar
raw_logs =
LOAD '$INPUT'
USING PigStorage('\t')
AS (
date: chararray, time: chararray, x_edge_location: chararray, sc_bytes: int,
c_ip: chararray, cs_method: chararray, cs_host: chararray, cs_uri_stem: chararray,
sc_status: chararray, cs_referer: chararray, cs_user_agent:chararray, cs_uri_query: chararray
);
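-- Drop the log header lines (#Version, #Fields), which begin with '#'.
raw_logs_without_comments = FILTER raw_logs BY (date matches '^[^#].*');
-- Keep only the two fields needed for the bandwidth count.
all_logs = FOREACH raw_logs_without_comments GENERATE cs_uri_stem, sc_bytes;
-- Keep only requests whose path is under /uploads/asset/file/.
logs = FILTER all_logs BY (cs_uri_stem matches '^/uploads/asset/file/.*');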
by_uri_stem_bytes =
FOREACH
(GROUP logs BY cs_uri_stem)
GENERATE
$0,
SUM($1.sc_bytes) AS num_bytes
;
STORE by_uri_stem_bytes INTO '$OUTPUT';
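-- Usage sketch, not part of the original script. It assumes the script is
-- saved as cloudfront_bandwidth.pig (a hypothetical file name) and that the
-- two paths below are placeholders for your own log and output locations.
-- $INPUT and $OUTPUT are filled in by Pig's parameter substitution:
--
--   pig -param INPUT=/path/to/cloudfront/logs -param OUTPUT=/path/to/output cloudfront_bandwidth.pig
--
-- Each output record should then hold a cs_uri_stem value and the total
-- sc_bytes served for it, separated by a tab (PigStorage's default delimiter).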