Last active
May 6, 2017 18:35
Apache Pig demo script
-- NOTES:
-- double dash denotes comments
-- $ denotes shell command
-- everything else is Pig Latin, executed in Grunt
-- Data set downloaded from http://www.ncdc.noaa.gov
-- Load data into Hadoop
$ hadoop fs -put ./input.txt input.txt
$ hadoop fs -ls
$ hadoop fs -tail hdfs://localhost.localdomain:8020/user/cloudera/input.txt
$ pig
-- Load
rawData = LOAD '*.txt' USING PigStorage(',') AS (
    station: int, wban: int, date: chararray,
    temp: double, temp_count: int,
    dewp: double, dewp_count: int,
    slp: double, slp_count: int,
    stp: double, stp_count: int,
    visibility: double, visibility_count: int,
    wind: double, wind_count: int,
    wind_max: double, wind_gust: int,
    temp_max: chararray, temp_min: chararray,
    precipitation: chararray, snow: double, frshtt: chararray
);
rawDataSample = LIMIT rawData 10;
DUMP rawDataSample;
-- You should now see 10 sample records with the correct schema.
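-- Optional check (not part of the original demo): print the schema Pig
-- assigned to rawData to confirm the field names and types declared in LOAD.
DESCRIBE rawData;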
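-- Let's do some manipulations
-- Top 10 snow readings: keep rows with snow below 999.9 (a value this data
-- set appears to use as a missing-value placeholder), then sort descending.
snowDays = FILTER rawData BY snow < 999.9;
snowDaysOrdered = ORDER snowDays BY snow DESC;
snowDaysOrderedLimited = LIMIT snowDaysOrdered 10;
-- The first four digits of date are the year
snowSummary = FOREACH snowDaysOrderedLimited GENERATE station, REGEX_EXTRACT(date, '(\\d{4})', 1) AS year, snow;
DUMP snowSummary;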
-- Average temperature per year, lowest first
-- (alias renamed from temp so it no longer shadows the temp field)
validTemps = FILTER rawData BY temp < 999.9;
tempWithYear = FOREACH validTemps GENERATE REGEX_EXTRACT(date, '(\\d{4})', 1) AS year, temp;
tempByYear = GROUP tempWithYear BY year;
-- Each record is now (group, bag of tempWithYear records)
DESCRIBE tempByYear;
avgByYear = FOREACH tempByYear GENERATE group, AVG(tempWithYear.temp) AS averageTemp;
avgByYearOrdered = ORDER avgByYear BY averageTemp;
-- Show the execution plan, then run it
EXPLAIN avgByYearOrdered;
DUMP avgByYearOrdered;
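-- Optional follow-up (not part of the original demo): a sketch of how the
-- result could be written back to HDFS instead of only dumped to the console.
-- 'avg_by_year' is a hypothetical output directory; it must not already exist.
STORE avgByYearOrdered INTO 'avg_by_year' USING PigStorage(',');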