@danielfrg
Last active December 26, 2015 22:08
Example of how to run a Jython UDF on AWS EMR. The example loads a list of URLs, queries each URL, and saves the output. Pig version: 0.11
Register utils.py using jython as utils;
urls = LOAD 'INPUT_FILE' USING PigStorage('\t') AS (url:chararray);
query = FOREACH urls GENERATE utils.query(url) AS everything;
file = FOREACH query GENERATE FLATTEN(everything);
STORE file INTO 's3n://OUTPUT_DIR' USING PigStorage('\t');
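# utils.py: Jython UDF registered by the Pig script above
import urllib2

# `outputSchema` is provided by Pig when the script is registered via Jython
@outputSchema('everything:chararray')
def query(url):
    try:
        response = urllib2.urlopen(url)
        ans = response.read()
        # Strip real and escaped newlines and tabs so each response stays on
        # one line of the tab-delimited output
        ans = ans.replace('\\n', '').replace('\n', '').replace('\\t', '').replace('\t', '')
        return ans
    except:
        # Fetch failed: return None so Pig stores a null for this URL
        return None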
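The cleaning step in `query` can be checked locally, without a cluster or network access. The sketch below is not part of the original gist: `clean` and `query_stub` are hypothetical helper names, and the no-op `outputSchema` fallback is an assumption that stands in for the decorator Pig supplies when it loads the script through Jython.

```python
# Local-test shim (assumption): Pig defines `outputSchema` when it registers
# the script via Jython, but a plain Python interpreter does not.
try:
    outputSchema
except NameError:
    def outputSchema(schema):
        # No-op stand-in that returns the decorated function unchanged.
        def decorator(func):
            return func
        return decorator


def clean(text):
    """Remove real and escaped newlines and tabs, mirroring the UDF."""
    for token in ('\\n', '\n', '\\t', '\t'):
        text = text.replace(token, '')
    return text


@outputSchema('everything:chararray')
def query_stub(url):
    # Hypothetical stand-in for utils.query: skips the network call and
    # only exercises the cleaning step on a fixed sample body.
    return clean('line one\nline two\tcol\\n')
```

Calling `clean` on a sample response body shows whether any tab or newline would survive and break a record in the tab-delimited output.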