@danosipov
Last active May 6, 2017 18:35
Apache Pig demo script
-- NOTES:
-- double dash denotes comments
-- $ denotes shell command
-- everything else is Pig Latin, executed in Grunt
-- Data set downloaded from http://www.ncdc.noaa.gov
-- Load data into Hadoop
$ hadoop fs -put ./input.txt input.txt
$ hadoop fs -ls
$ hadoop fs -tail hdfs://localhost.localdomain:8020/user/cloudera/input.txt
$ pig
-- Load the comma-separated records into a relation with a typed schema
rawData = LOAD '*.txt' USING PigStorage(',') AS (
    station:int, wban:int, date:chararray,
    temp:double, temp_count:int,
    dewp:double, dewp_count:int,
    slp:double, slp_count:int,
    stp:double, stp_count:int,
    visibility:double, visibility_count:int,
    wind:double, wind_count:int,
    wind_max:double, wind_gust:int,
    temp_max:chararray, temp_min:chararray,
    precipitation:chararray, snow:double, frshtt:chararray);
rawDataSample = LIMIT rawData 10;
DUMP rawDataSample;
-- You should now see 10 sample records with the correct schema.
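-- Optional check (a small sketch, not part of the original walkthrough):
-- DESCRIBE prints the schema declared in the LOAD statement above without
-- launching a job, so it is a cheap way to confirm the field names and types.
DESCRIBE rawData;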
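-- Let's do some manipulations.
-- Snow: keep rows with a real snow reading (999.9 appears to be the
-- missing-value placeholder in this data set), then list the 10 largest.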
snowDays = FILTER rawData BY snow < 999.9;
snowDaysOrdered = ORDER snowDays BY snow DESC;
snowDaysOrderedLimited = LIMIT snowDaysOrdered 10;
snowSummary = FOREACH snowDaysOrderedLimited GENERATE station, REGEX_EXTRACT(date, '(\\d{4})', 1) as year, snow;
DUMP snowSummary;
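-- Temperature: drop readings at or above 999.9 (apparently missing-value
-- placeholders), then compute the average temperature per year.
temp = FILTER rawData BY temp < 999.9;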
tempWithYear = FOREACH temp GENERATE REGEX_EXTRACT(date, '(\\d{4})', 1) as year, temp;
tempByYear = GROUP tempWithYear BY year;
DESCRIBE tempByYear;
avgByYear = FOREACH tempByYear GENERATE group, AVG(tempWithYear.temp) AS averageTemp;
avgByYearOrdered = ORDER avgByYear BY averageTemp;
EXPLAIN avgByYearOrdered;
DUMP avgByYearOrdered;
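-- A possible next step (a sketch, not part of the original demo):
-- DUMP only prints to the console, so STORE can write the result to HDFS.
-- The output directory name 'avg_temp_by_year' is a made-up example;
-- STORE fails if that directory already exists.
STORE avgByYearOrdered INTO 'avg_temp_by_year' USING PigStorage(',');
-- Grunt's fs command can then list the part files that were written.
fs -ls avg_temp_by_year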