This Pig script transforms the input data to match the Hive table schema.
--data transformation for matching schema with hive table
--LOAD reads the comma-delimited file post14.csv into the input_data relation
--three chararray columns are declared while loading: name, dob, and location
input_data = LOAD '/hdpcd/input/post14/post14.csv' USING PigStorage(',') AS (name:chararray, dob:chararray, location:chararray);
--the actual data transformation starts here
--INPUT DATA: Milind Jagre,04/23/1991,Hartford CT US
--EXPECTED OUTPUT DATA: Milind,Jagre,4,23,1991,Hartford,CT,US
--the name field is split into first name and last name
--the dob field is split into month, day, and year to keep the date parts clearly separated
--the location field is split into city, state, and country to keep the geographical parts clearly separated
--the ToDate pattern uses MM for month (lowercase mm means minutes)
--GetMonth, GetDay, and GetYear return integers, so the month is emitted as 4 rather than 04
hive_data = FOREACH input_data GENERATE
    SUBSTRING(name, 0, INDEXOF(name, ' ', 0)) AS fname,
    TRIM(SUBSTRING(name, INDEXOF(name, ' ', 0), 100)) AS lname,
    GetMonth(ToDate(dob, 'MM/dd/yyyy')) AS month,
    GetDay(ToDate(dob, 'MM/dd/yyyy')) AS day,
    GetYear(ToDate(dob, 'MM/dd/yyyy')) AS year,
    SUBSTRING(location, 0, INDEXOF(location, ' ', 0)) AS city,
    TRIM(SUBSTRING(location, INDEXOF(location, ' ', 0), INDEXOF(location, ' ', 0) + 3)) AS state,
    TRIM(SUBSTRING(location, INDEXOF(location, ' ', 0) + 3, INDEXOF(location, ' ', 0) + 6)) AS country;
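--alternative sketch (not part of the original script): the same split expressed with STRSPLIT
--hive_data_alt is a hypothetical relation name; it is never stored, so it does not affect the output
--assumptions: exactly one space in name, a one-word city, and dates written as MM/dd/yyyy
--month, day, and year stay strings here, so the month keeps its leading zero (04, not 4)
hive_data_alt = FOREACH input_data GENERATE
    FLATTEN(STRSPLIT(name, ' ', 2)) AS (fname:chararray, lname:chararray),
    FLATTEN(STRSPLIT(dob, '/', 3)) AS (month:chararray, day:chararray, year:chararray),
    FLATTEN(STRSPLIT(location, ' ', 3)) AS (city:chararray, state:chararray, country:chararray);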
--once the data matches the schema required by the hive table, it is stored in HDFS
--the output is tab-delimited and written to the directory /hdpcd/output/post14
STORE hive_data INTO '/hdpcd/output/post14' USING PigStorage('\t');
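--optional sanity check (a sketch, not part of the original script): reload the stored output and print a few rows
--stored_check and stored_sample are hypothetical relation names; remove this block for unattended batch runs
stored_check = LOAD '/hdpcd/output/post14' USING PigStorage('\t');
stored_sample = LIMIT stored_check 5;
DUMP stored_sample;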