@milindjagre
Created May 11, 2017 18:55
This Pig script launches multiple reducer tasks using the SET command
-- this pig script launches parallel reduce tasks
-- the SET command is used for this
-- the line below sets the default number of reduce tasks to 4 for operations that need a reduce phase
SET default_parallel 4;
-- data in post21.csv is stored in input_data pig relation using LOAD command
input_data = LOAD '/hdpcd/input/post21/post21.csv' USING PigStorage(',');
-- a sort is performed using the ORDER BY operator on the seventh field ($6, positions are zero-based), in descending order
-- output of this command is stored in sorted_data pig relation
sorted_data = ORDER input_data BY $6 DESC;
-- sorted_data pig relation is stored in HDFS using STORE command
-- since reduce tasks are 4, there should be 4 part files in /hdpcd/output/post21_1 directory
STORE sorted_data INTO '/hdpcd/output/post21_1' USING PigStorage(':');
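-- a hedged sketch, not part of the original script: the PARALLEL clause is an alternative to SET default_parallel
-- it sets the number of reduce tasks for a single operator instead of for the whole script
-- sorted_data_alt and /hdpcd/output/post21_2 are hypothetical names used only for illustration
sorted_data_alt = ORDER input_data BY $6 DESC PARALLEL 4;
-- assuming the same behaviour as above, this directory should also contain 4 part files
STORE sorted_data_alt INTO '/hdpcd/output/post21_2' USING PigStorage(':');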