@ddaniels
Created September 26, 2013 14:50
NLTK Pig script using a streaming Python UDF
REGISTER '<python_file>' USING streaming_python AS nltk_udfs;
tweets = LOAD 's3n://twitter-gardenhose-mortar/tweets'
    USING org.apache.pig.piggybank.storage.JsonLoader(
        'text: chararray, place:tuple(name:chararray)');
-- Group the tweets by place name and use a CPython UDF to find the top 5 bigrams
-- for each of these places.
bigrams_by_place = FOREACH (GROUP tweets BY place.name) GENERATE
    group AS place:chararray,
    nltk_udfs.top_5_bigrams(tweets.text),
    COUNT(tweets) AS sample_size;
top_100_places = LIMIT (ORDER bigrams_by_place BY sample_size DESC) 100;
STORE top_100_places INTO '<your_output_path>';
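
The script above calls `nltk_udfs.top_5_bigrams` but the gist does not include the Python file it registers. A minimal sketch of what such a file might contain follows. It is an assumption, not the author's code: the output schema string is hypothetical, the bag is assumed to arrive as a list of one-field tuples, and plain frequency counting with `collections.Counter` stands in for NLTK's collocation scoring so the example runs without NLTK installed. The fallback decorator exists only so the file can be imported outside Pig.

```python
from collections import Counter

try:
    # Available when the file runs under Pig's streaming_python support.
    from pig_util import outputSchema
except ImportError:
    # Fallback no-op decorator so the module can be imported and tested locally.
    def outputSchema(schema):
        def decorator(func):
            return func
        return decorator


# Hypothetical schema: a bag of single-field tuples, one bigram per tuple.
@outputSchema("top_5_bigrams:bag{t:(bigram:chararray)}")
def top_5_bigrams(tweets):
    """Return the five most frequent word bigrams across a bag of tweets.

    `tweets` is assumed to be a list of one-field tuples, e.g. [("some text",), ...].
    """
    counts = Counter()
    for row in tweets or []:
        text = row[0]
        if not text:
            continue  # skip null or empty tweet text
        tokens = text.lower().split()  # naive whitespace tokenization
        counts.update(zip(tokens, tokens[1:]))  # adjacent word pairs
    # Emit each bigram as a one-field tuple to match the bag schema above.
    return [("%s %s" % pair,) for pair, _ in counts.most_common(5)]
```

To rank bigrams by a statistical association measure rather than raw frequency, the counting step could be swapped for NLTK's collocation finder; the input and output shapes would stay the same.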