@kendravant
Created March 4, 2013 00:43
Pig script that counts the words in short text comments stored in one column of a pipe-delimited flat file. Counts are totalled per customer number, visit type, and review date.
-- Load the pipe-delimited flat file; the free-text comment is in the NOTE column.
unstructuredText = load '<file name>' using PigStorage('|')
    as
    (
        CUSTOMER_NUMBER:chararray,
        VISIT_TYPE:chararray,
        REVIEW_DATE:chararray,
        NOTE:chararray
    );
-- Split each NOTE into tokens and emit one row per token, keeping the key columns.
tokenized = foreach unstructuredText
    generate
        CUSTOMER_NUMBER,
        VISIT_TYPE,
        REVIEW_DATE,
        FLATTEN(TOKENIZE(NOTE)) as word;
-- Collect the token rows that share the same customer, visit type, and review date.
grouped = group tokenized by
    (CUSTOMER_NUMBER,
     VISIT_TYPE,
     REVIEW_DATE);
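-- Count the token rows in each group; 'group' is the (CUSTOMER_NUMBER, VISIT_TYPE, REVIEW_DATE) key tuple.
-- COUNT skips rows whose first field (CUSTOMER_NUMBER) is null.
counts = foreach grouped
    generate
        group, COUNT(tokenized) as wc;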
-- Write the key tuple and word count as tab-separated text.
store counts into 'output' using PigStorage('\t');
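
The script stores `group` as a single tuple field, so each output line holds the parenthesised key followed by the count. If separate key columns are easier to consume downstream, one possible variant is sketched below. It reuses the `grouped` and `tokenized` relations from the script; the relation name `flatCounts` and the output path `output_flat` are illustrative assumptions, not part of the original gist.

```pig
-- Sketch only: same aggregation as 'counts', with the key tuple flattened into columns.
-- 'flatCounts' and 'output_flat' are hypothetical names chosen for this example.
flatCounts = foreach grouped
    generate
        FLATTEN(group) as (CUSTOMER_NUMBER, VISIT_TYPE, REVIEW_DATE),
        COUNT(tokenized) as wc;

-- Each output line then has four tab-separated fields: the three key columns and wc.
store flatCounts into 'output_flat' using PigStorage('\t');
```

If rows with a null `CUSTOMER_NUMBER` should also be counted, `COUNT_STAR(tokenized)` could be substituted for `COUNT(tokenized)`.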