@milindjagre
Created May 13, 2017 20:24
This Pig script performs a join between the customers and orders data.
--JOIN OPERATION IN APACHE PIG
--loading customers' data in customers relation
customers = LOAD '/hdpcd/input/post22/post22_customers.csv' USING PigStorage(',');
--loading orders' data in orders relation
orders = LOAD '/hdpcd/input/post22/post22_orders.csv' USING PigStorage(',');
--performing join operation based on customer ID
--customer ID is the first column in customers relation, therefore $0
--customer ID is the third column in orders relation, therefore $2
joined_data = JOIN customers BY $0, orders BY $2;
--generating output data with FOREACH...GENERATE command
--output contains customers' first name, last name, order ID, and payment status of the order
output_data = FOREACH joined_data GENERATE $1 AS fname, $2 AS lname, $8 AS orderid, $12 AS payment_status;
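--optional sanity check, added as a sketch and not part of the original script
--because no schema is declared, the positions $8 and $12 depend on the column counts of the input files
--previewing a few rows may help confirm the projection before storing; the alias name "preview" is hypothetical
--note that DUMP triggers its own job, so these two lines can be removed once the output looks right
preview = LIMIT output_data 5;
DUMP preview;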
--storing the final output in HDFS
STORE output_data INTO '/hdpcd/output/post22/';