@dsc
Last active December 15, 2015 07:29
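-- Count mobile webrequest log records per (x_cs, hostname) pair for February and March 2013.
SET default_parallel 2;
-- Default value for the month parameter used by the commented-out single-month variants below.
%default month 02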
IMPORT 'hdfs:///libs/kraken/pig/include/load_webrequest.pig';
log_fields = LOAD_WEBREQUEST('hdfs:///wmf/raw/webrequest/webrequest-wikipedia-mobile/2013-02-*,hdfs:///wmf/raw/webrequest/webrequest-wikipedia-mobile/2013-03-*');
-- Single-month alternative to the load above:
-- log_fields = LOAD_WEBREQUEST('hdfs:///wmf/raw/webrequest/webrequest-wikipedia-mobile/2013-$month-*');
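-- Project only the fields of interest.
xcs_cachehost = FOREACH log_fields GENERATE timestamp, hostname, x_cs PARALLEL 2;
-- Drop records whose x_cs is null, empty, or a lone dash.
xcs_cachehost = FILTER xcs_cachehost BY ( NOT ((x_cs is null) OR (x_cs == '') OR (x_cs == '-')) ) PARALLEL 2;
-- Group by (x_cs, hostname) and count the records in each group.
xcs_cachehost = FOREACH (GROUP xcs_cachehost BY (x_cs, hostname)) GENERATE FLATTEN($0), COUNT($1) as num:int PARALLEL 2;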
-- Single-month alternative to the store below:
-- STORE xcs_cachehost INTO '/user/dsc/zero/xcs_cachehost/2013/$month';
STORE xcs_cachehost INTO '/user/dsc/zero/xcs_cachehost/2013/feb-march';