Created March 31, 2017 18:26
This Pig file transforms the input data to match the Hive schema.
--data transformation for matching the schema with the hive table
--the LOAD command loads the data present in the post14.csv file into the input_data pig relation
--three columns are created while loading the data, i.e. name, dob, and location
input_data = LOAD '/hdpcd/input/post14/post14.csv' USING PigStorage(',') AS (name:chararray, dob:chararray, location:chararray);
--the actual data transformation starts now
--INPUT DATA: Milind Jagre,04/23/1991,Hartford CT US
--EXPECTED OUTPUT DATA: Milind,Jagre,4,23,1991,Hartford,CT,US
--the name field is split into first name and last name
--the dob field is split into month, day, and year to clearly separate the date fields
--the location field is split into city, state, and country to clearly separate the geographical fields
--dates are parsed with the 'MM/dd/yyyy' pattern ('MM' is the month; lowercase 'mm' means minutes)
--state and country are each assumed to be two-letter codes preceded by a single space
hive_data = FOREACH input_data GENERATE
    SUBSTRING(name, 0, INDEXOF(name, ' ', 0)) AS fname,
    TRIM(SUBSTRING(name, INDEXOF(name, ' ', 0), 100)) AS lname,
    GetMonth(ToDate(dob, 'MM/dd/yyyy')) AS month,
    GetDay(ToDate(dob, 'MM/dd/yyyy')) AS day,
    GetYear(ToDate(dob, 'MM/dd/yyyy')) AS year,
    SUBSTRING(location, 0, INDEXOF(location, ' ', 0)) AS city,
    TRIM(SUBSTRING(location, INDEXOF(location, ' ', 0), INDEXOF(location, ' ', 0) + 3)) AS state,
    TRIM(SUBSTRING(location, INDEXOF(location, ' ', 0) + 3, INDEXOF(location, ' ', 0) + 6)) AS country;
--once the schema required for the hive table is created, the data must be stored in HDFS
--the data is delimited by the tab character and the output directory is /hdpcd/output/post14
STORE hive_data INTO '/hdpcd/output/post14' USING PigStorage('\t');
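The script above pulls state and country out of the location field with fixed offsets (+3 and +6 after the first space), which only works when both are two-letter codes. As a hedged alternative, not part of the original gist, the name and location fields could be split on the space character with Pig's built-in STRSPLIT. The relation name split_data is hypothetical, and the sketch still assumes a single-word city and exactly two words in the name; rows that break those assumptions would need extra handling.

```pig
-- Sketch only: same input path and column names as the script above.
input_data = LOAD '/hdpcd/input/post14/post14.csv' USING PigStorage(',')
             AS (name:chararray, dob:chararray, location:chararray);

-- STRSPLIT returns a tuple of the pieces; FLATTEN promotes them to columns.
-- The third argument caps the number of pieces (2 for name, 3 for location).
-- split_data is a hypothetical relation name used only in this sketch.
split_data = FOREACH input_data GENERATE
    FLATTEN(STRSPLIT(name, ' ', 2)) AS (fname:chararray, lname:chararray),
    FLATTEN(STRSPLIT(location, ' ', 3)) AS (city:chararray, state:chararray, country:chararray);
```

For the sample row, this should yield Milind, Jagre, Hartford, CT, and US without relying on the length of the state or country code. The date columns can be produced with ToDate and the Get* functions exactly as in the main script.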