Skip to content

Instantly share code, notes, and snippets.

@elliottcordo
Created October 28, 2014 04:00
Show Gist options
  • Save elliottcordo/45bdd460e2ec7148b6b9 to your computer and use it in GitHub Desktop.
Save elliottcordo/45bdd460e2ec7148b6b9 to your computer and use it in GitHub Desktop.
yelp_pig_join
REGISTER 's3://caserta-bucket1/libs/elephant-bird-pig.jar'
REGISTER 's3://caserta-bucket1/libs/elephant-bird-core.jar'
REGISTER 's3://caserta-bucket1/libs/elephant-bird-hadoop-compat.jar'
REGISTER 's3://caserta-bucket1/libs/json-simple.jar'
business = LOAD 's3://caserta-bucket1/yelp-academic-dataset/yelp_academic_dataset_business.json'
USING com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad');
business_cleaned = FOREACH business
generate $0#'business_id' as business_id,
REPLACE($0#'name','\\|','-') as name,
$0#'city' as city,
$0#'state' as state,
$0#'longitude' as longitude,
$0#'latitude' as latitude,
$0#'categories' as categories;
review = LOAD 's3://caserta-bucket1/yelp/in/reviews/'
USING PigStorage('|')
AS (review_id:chararray,
business_id:chararray,
user_id:chararray,
rating:float,
review_date:chararray);
review_cleaned = FOREACH review
GENERATE
review_id,
REPLACE(review_date,'-','') as review_date_id,
business_id,
user_id,
rating,
(int)(rating >= 3 ? 1 : 0) as good_review_count,
(int)(rating < 3 ? 1 : 0) as bad_review_count;
joined = JOIN review_cleaned by business_id, business_cleaned by business_id parallel 10;
grouped = GROUP joined BY name;
results = FOREACH grouped GENERATE
group,
SUM(joined.review_cleaned::good_review_count),
SUM(joined.review_cleaned::bad_review_count),
COUNT(joined);
STORE results into 's3://caserta-bucket1/yelp/out/joined/'
USING PigStorage('|');
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment