Gist by @sideb0ard, created April 16, 2014
Pig Nginx log parser
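The regular expression below expects ApiAxle access logs in Nginx combined log format, with the API key passed as an api_key query parameter on the request. For illustration only (this sample line is an assumption, not taken from the actual logs), a record the regex would match looks like:

10.0.0.1 - - [16/Apr/2014:03:23:01 +0000] "GET /v1/animals?format=json&api_key=abc123&limit=10 HTTP/1.1" 200 512 "-" "curl/7.30.0"

The script extracts the api_key and the request path from each such line, then counts how often each (api_key, request) pair occurs.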
-- Register the Piggybank jar so its UDFs are available, and alias the
-- regex-extraction UDF as EXTRACT.
register file:/home/hadoop/lib/pig/piggybank.jar
DEFINE EXTRACT org.apache.pig.piggybank.evaluation.string.EXTRACT();

-- Load the raw ApiAxle/Nginx access logs from S3, one line per record.
RAW_LOGS = LOAD 's3://apiaxle-logs/*' USING TextLoader AS (line:chararray);

-- Split each log line into its fields with a combined-log-format regex.
-- The request group is split around the &api_key= query parameter so the
-- key ends up in its own field, separate from the request path.
LOGS_BASE = FOREACH RAW_LOGS GENERATE
    FLATTEN(
        EXTRACT(line, '^(\\S+) (\\S+) (\\S+) \\[([\\w:/]+\\s[+\\-]\\d{4})\\] "(.+?) (.+)&api_key=(.+?)(&.+)? (.+?)" (\\S+) (\\S+) "([^"]*)" "([^"]*)"')
    )
    AS (
        remoteAddr:    chararray,
        remoteLogname: chararray,
        user:          chararray,
        time:          chararray,
        method:        chararray,
        request:       chararray,
        api_key:       chararray,
        options:       chararray,
        httpversion:   chararray,
        status:        int,
        bytes_string:  chararray,
        referrer:      chararray,
        browser:       chararray
    );

-- Keep only the fields we need, then count requests per (api_key, request) pair.
A = FOREACH LOGS_BASE GENERATE api_key, request;
B = GROUP A BY (api_key, request);
C = FOREACH B GENERATE FLATTEN(group) AS (api_key, request), COUNT(A) AS count;

-- Sort by key with the most-requested paths first, and write the result back to S3.
D = ORDER C BY api_key, count DESC;
STORE D INTO 's3://hadoop-axelog-output/testrun';
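The STORE statement uses Pig's default PigStorage, so the output written to s3://hadoop-axelog-output/testrun is tab-separated api_key, request, count triples. To sanity-check the regex on a handful of records before launching the full job, a small preview step can be added after LOGS_BASE; this is a hypothetical addition, not part of the original gist:

-- Hypothetical check: dump the first few parsed records to the console.
PREVIEW = LIMIT LOGS_BASE 10;
DUMP PREVIEW;

The script itself would typically be run on the Hadoop/EMR cluster with something like pig -f nginx_parse.pig (the filename is assumed here).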