Import contoso to psql

EDW optimization and Single view lab

EDW optimization

  • Goal: demonstrate how you can bulk import data from EDW/RDBMS into Hive and then incrementally keep the Hive tables updated periodically

Pre-requisites

    1. Download contoso data set and MSSQL schema from here into /tmp on sandbox
cd /tmp
wget https://www.dropbox.com/s/r70i8j1ujx4h7j8/data.zip
unzip data.zip
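A quick sanity check that the archive extracted as expected (the file name is taken from the psql import step later in this lab, so the exact layout of data.zip is an assumption):
# confirm the load script landed in /tmp
ls -lh /tmp/contoso-psql.sql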
    2. As the postgres user, log in to psql and complete the below to set up psql access for the it1 user:
    • create contoso db
    • create it1 user
    • grant privileges on contoso to it1
    • check user got created
su postgres
psql
create database contoso;
CREATE USER it1 WITH PASSWORD 'it1';
GRANT ALL PRIVILEGES ON DATABASE contoso to it1;
\du
\q
exit
    3. As root, complete the below to finish setting up the it1 user (a quick verification sketch follows this step)
    • enable the it1 user to log in to psql over TCP by editing pg_hba.conf and adding a line with: host all all 127.0.0.1/32 md5
service ambari stop
service postgresql stop
echo "host all all 127.0.0.1/32 md5" >> /var/lib/pgsql/data/pg_hba.conf
service postgresql start
service ambari start
    • sudo as hdfs to create a home directory for it1 and set its ownership
sudo -u hdfs hdfs dfs -mkdir /user/it1
sudo -u hdfs hdfs dfs -chown it1 /user/it1
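To verify both changes took effect, a quick sketch (the 127.0.0.1 host and the it1 password match what was configured above):
# it1 should now be able to log in to Postgres over TCP using md5 auth
PGPASSWORD=it1 psql -U it1 -d contoso -h 127.0.0.1 -c '\conninfo'
# the HDFS home directory should exist and be owned by it1
sudo -u hdfs hdfs dfs -ls /user | grep it1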
    4. As root, set up Sqoop for Postgres by downloading the appropriate JDBC jar from here, e.g. "JDBC42 Postgresql Driver, Version 9.4-1207". Note: to confirm which version of Postgres you have, run the following via psql: SELECT version();
wget https://jdbc.postgresql.org/download/postgresql-9.4.1207.jar -P /usr/hdp/current/sqoop-client/lib
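Two quick checks around this step (a sketch; the paths are the sandbox defaults used above):
# the SELECT version(); from the note, run from the shell as root
sudo -u postgres psql -c "SELECT version();"
# confirm the JDBC jar landed where the Sqoop client will pick it up
ls /usr/hdp/current/sqoop-client/lib/postgresql-*.jar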

Bulk import of data into Hive from RDBMS

The next set of steps will be run as the it1 user.

    1. As the it1 user, connect to psql and create/import the downloaded sample data (this may take a few minutes)
su - it1
export PGPASSWORD=it1
# the sample data was unzipped into /tmp in the prerequisites
cd /tmp
psql -U it1 -d contoso -h localhost -f contoso-psql.sql
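Once the load finishes, a quick check that the tables made it into Postgres (a sketch; factsales is the table used throughout the rest of the lab):
# list the imported tables and count rows in factsales
psql -U it1 -d contoso -h localhost -c '\dt'
psql -U it1 -d contoso -h localhost -c 'select count(*) from factsales;'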

    2. Ensure Sqoop can access the tables in the contoso db as the it1 user
sqoop list-tables --connect jdbc:postgresql://localhost:5432/contoso --username it1 --password it1 -- --schema contoso
    3. Make sure the Hive service is up via the Ambari UI, then start the bulk load of all the Postgres tables into Hive (as text) using Sqoop. This will run for some time.
sqoop import-all-tables --username it1 --password it1 --connect jdbc:postgresql://localhost:5432/contoso  --hive-import  --direct
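When the import completes, you can sanity-check the Hive side from the shell (a sketch; it assumes HiveServer2 is listening on the sandbox's default port 10000):
# list the imported Hive tables and count rows in the factsales staging table
beeline -u jdbc:hive2://localhost:10000 -n it1 -e "show tables; select count(*) from factsales;"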
    4. In the Hive view, create the ORC-backed transactional final table and copy the bulk-imported data into it
CREATE TABLE `factsales_final` (
  `SalesKey` int,
  `DateKey` timestamp,
  `channelKey` int,
  `StoreKey` int,
  `ProductKey` int,
  `PromotionKey` int,
  `CurrencyKey` int,
  `UnitCost` float,
  `UnitPrice` float,
  `SalesQuantity` int,
  `ReturnQuantity` int,
  `ReturnAmount` float,
  `DiscountQuantity` int,
  `DiscountAmount` float,
  `TotalCost` float,
  `SalesAmount` float,
  `ETLLoadID` int,
  `LoadDate` timestamp,
  `UpdateDate` timestamp
)
clustered by (saleskey) into 7 buckets
stored as orc
TBLPROPERTIES ('transactional'='true');

insert into factsales_final select * from factsales;
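A simple way to confirm the copy (a sketch, run in the Hive view): the staging and final tables should report the same row count.
select count(*) from factsales;
select count(*) from factsales_final;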

Incremental import of data into Hive from RDBMS

  • Now that we have done the one-time bulk import, we will next set up an incremental Sqoop job

    1. Create a password file containing the it1 user's password in HDFS. This allows invocations of the job to be automated/scheduled without having to manually pass the password.
# use -n to ensure newline is not added
echo -n "it1" > .password
hadoop fs -put .password /user/it1/
rm .password
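Because a trailing newline would become part of the password, it is worth confirming the file is exactly 3 bytes (a sketch):
# should print 3 (the length of "it1" with no newline)
hadoop fs -cat /user/it1/.password | wc -c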
    2. Create an incremental-import Sqoop job for the factsales table and configure it as below:
    • --table: the table the job is for (i.e. factsales)
    • --password-file: the HDFS location of the password file
    • --incremental lastmodified: use last-modified logic to find the delta records
    • --check-column: the column used to determine which delta records get picked up (here updatedate)
    • --last-value: the starting point; records whose updatedate value is later than 2015-01-01 will be picked up
    • see the Sqoop documentation on incremental imports for more details
sqoop job --create factsales -- import --verbose --connect 'jdbc:postgresql://localhost:5432/contoso' --table factsales --username it1 --password-file hdfs://sandbox.hortonworks.com:8020/user/it1/.password --check-column updatedate --incremental lastmodified --last-value '2015-01-01' --hive-import --direct
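To confirm the job was saved and stored the expected parameters (a sketch using standard Sqoop job commands):
# list saved jobs and show the definition stored for factsales
sqoop job --list
sqoop job --show factsales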
    3. Update records in the factsales table in Postgres
psql -U it1 -d contoso -h localhost -c "update factsales set updatedate = '2016-01-01 00:00:00' where saleskey in (1,2);"
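To confirm the two rows now carry the newer updatedate before running the incremental job (a sketch):
psql -U it1 -d contoso -h localhost -c "select saleskey, updatedate from factsales where saleskey in (1,2);"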
    4. In Hive, truncate the staging table by running the below in the Hive view
truncate table factsales;
    5. Run the incremental Sqoop job for factsales to import the updated records from Postgres into the Hive staging table
sqoop job --exec factsales
    6. In Hive, check that only the records we changed were picked up in the Hive staging table
SELECT * FROM default.factsales;
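Since only saleskey 1 and 2 were touched in Postgres, the staging table should now hold just those two rows (a sketch, run in the Hive view):
select count(*) from default.factsales;
select saleskey, updatedate from default.factsales;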
    7. In Hive, move the data from the staging table to the final table (one statement at a time, using the Hive view)
    • first remove the records from final table that are also found in staging table
    • move data from staging table to final table
    • truncate staging table
delete from factsales_final where saleskey in (select saleskey from factsales);
insert into factsales_final select * from factsales;
truncate table factsales;
    8. In Hive, check the records updated in the Hive final table
select * from factsales_final where saleskey in (1,2);
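The password file from step 1 exists so this job can run unattended; a minimal cron sketch for periodic runs (the schedule, user, and log path are assumptions, and the staging-to-final merge above would still need to be scripted separately):
# it1's crontab: run the incremental import at the top of every hour
0 * * * * sqoop job --exec factsales >> /tmp/factsales-sqoop.log 2>&1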
  • At this point we have shown how you can bulk import data from EDW/RDBMS into Hive and then incrementally keep the Hive tables updated periodically