-- Gist: alexanderdean/d8371cebdf00064591ae (last active August 29, 2015 14:02)
-- Load the input as one chararray per line; $INPUT is supplied as a script parameter
input_lines = LOAD '$INPUT' AS (line:chararray);
-- Extract the words from each line into a Pig bag,
-- then flatten the bag to get one word per row
words = FOREACH input_lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
-- Keep only tokens made up entirely of word characters (letters, digits, underscore);
-- this drops blank tokens as well as tokens that contain punctuation
filtered_words = FILTER words BY word MATCHES '\\w+';
-- Create a group for each distinct word
word_groups = GROUP filtered_words BY word;
-- Count the entries in each group
word_count = FOREACH word_groups GENERATE COUNT(filtered_words) AS count, group AS word;
-- Order the records by count, most frequent word first
ordered_word_count = ORDER word_count BY count DESC;
-- Write the (count, word) records to the location given by $OUTPUT
STORE ordered_word_count INTO '$OUTPUT';
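-- Example invocation, as a sketch only: the script name wordcount.pig and both
-- paths below are hypothetical placeholders, not taken from the original gist.
-- The -param flag fills in the $INPUT and $OUTPUT placeholders used above, and
-- -x local runs against the local filesystem rather than a Hadoop cluster.
--   pig -x local -param INPUT=input.txt -param OUTPUT=wordcount_out wordcount.pig
-- Each stored row would then hold a count followed by its word.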