@alces
Created November 30, 2018 06:08
Count words' occurrences in a text file
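-- Expects a 'file' parameter, e.g. pig -param file=<path> <this-script>
-- Load each line of the input as a single text field.
data = load '$file' using TextLoader();
-- Split lines into tokens, one token per row.
tokens = foreach data generate FLATTEN(TOKENIZE($0));
-- Keep purely alphabetic tokens and normalize case.
words = filter tokens by $0 MATCHES '[A-Za-z]+';
lowers = foreach words generate LOWER($0);
-- Count occurrences of each distinct word.
groups = group lowers by $0;
counts = foreach groups generate group, COUNT(lowers.$0);
-- Write the counts twice: most frequent first, then alphabetically.
by_count = order counts by $1 DESC;
store by_count into '$file-by-count';
by_word = order counts by $0 ASC;
store by_word into '$file-by-word';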
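
To sanity-check the script's output on a small sample without a Pig installation, the same steps can be mirrored in plain Python. This is only a rough sketch and not part of the original gist: the function names (count_words, by_count, by_word) are made up for illustration, and the delimiter set is assumed to match Pig's default TOKENIZE behaviour (space, double quote, comma, parentheses, asterisk). Ordering among words with equal counts is not guaranteed to match Pig.

```python
import re
from collections import Counter

# Assumption: mirrors the default delimiters of Pig's TOKENIZE.
_DELIMS = re.compile(r'[ ",()*]+')
# Same pattern as the MATCHES filter; Pig's MATCHES tests the whole string.
_WORD = re.compile(r'[A-Za-z]+')


def count_words(lines):
    """Count lowercased, purely alphabetic tokens across all lines."""
    counts = Counter()
    for line in lines:
        for token in _DELIMS.split(line):
            if _WORD.fullmatch(token):
                counts[token.lower()] += 1
    return counts


def by_count(counts):
    """(word, count) pairs, most frequent first."""
    return sorted(counts.items(), key=lambda kv: kv[1], reverse=True)


def by_word(counts):
    """(word, count) pairs in alphabetical order."""
    return sorted(counts.items())
```

Note that a token such as "cat." is dropped rather than counted as "cat", because the period is not a delimiter and the filter requires the whole token to be alphabetic.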