Alex Hon misterhon

  • New York, New York
@elliottcordo
elliottcordo / yelp_pig_join.pig
Created October 28, 2014 04:00
yelp_pig_join
REGISTER 's3://caserta-bucket1/libs/elephant-bird-pig.jar'
REGISTER 's3://caserta-bucket1/libs/elephant-bird-core.jar'
REGISTER 's3://caserta-bucket1/libs/elephant-bird-hadoop-compat.jar'
REGISTER 's3://caserta-bucket1/libs/json-simple.jar'
business = LOAD 's3://caserta-bucket1/yelp-academic-dataset/yelp_academic_dataset_business.json'
    USING com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad');
business_cleaned = FOREACH business
elliottcordo / yelp_pyspark_example.py
Last active August 29, 2015 14:08
yelp pyspark example
# Run inside the PySpark shell (which provides `sc`), launched with:
# MASTER=yarn-client /home/hadoop/spark/bin/pyspark
from pyspark.sql import SQLContext, Row
sqlContext = SQLContext(sc)
# ------------------------------------------------
# Load some users
lines = sc.textFile("s3://caserta-bucket1/yelp/in/users/users.txt")
parts = lines.map(lambda l: l.split(","))
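
The `split(",")` step above treats every comma as a field delimiter. The layout of `users.txt` is not shown here, so the following is only a minimal sketch in plain Python (no Spark needed) of how that per-line parsing behaves. The sample lines, the three-column layout, and the helper name `parse_user_line` are all hypothetical. It also shows one possible alternative using the standard `csv` module, which keeps quoted fields intact.

```python
import csv


def parse_user_line(line):
    """Split one comma-separated line into stripped fields.

    csv.reader is used instead of str.split so that a quoted field
    containing a comma is kept as a single value.
    """
    return [field.strip() for field in next(csv.reader([line]))]


# Hypothetical sample rows; the real users.txt layout is an assumption.
sample_lines = [
    "u1,Alice,12",
    'u2,"Smith, Bob",3',
]

# csv-aware parsing: three fields per row.
parsed = [parse_user_line(line) for line in sample_lines]

# Naive parsing, as in the lambda above: the quoted name is split in two.
naive = [line.split(",") for line in sample_lines]
```

If the input did contain quoted fields, a function like this could be passed to `lines.map(...)` in place of the lambda. For simple unquoted data the original `split(",")` behaves the same.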