Import contoso to psql

EDW optimization and Single view lab

EDW optimization

  • Goal: demonstrate how you can bulk import data from EDW/RDBMS into Hive and then incrementally keep the Hive tables updated periodically

Pre-requisites

    1. Download contoso data set and MSSQL schema from here into /tmp on sandbox
cd /tmp
wget https://www.dropbox.com/s/r70i8j1ujx4h7j8/data.zip
unzip data.zip
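A quick sanity check that the archive extracted as expected (the file name is taken from the psql import step later in this lab, so the exact layout of data.zip is an assumption):
# confirm the load script landed in /tmp
ls -lh /tmp/contoso-psql.sql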
    2. As the postgres user, log in to psql and complete the below to set up psql access for the it1 user:
    • create contoso db
    • create it1 user
    • grant privileges on contoso to it1
    • check user got created
su postgres
psql
create database contoso;
CREATE USER it1 WITH PASSWORD 'it1';
GRANT ALL PRIVILEGES ON DATABASE contoso to it1;
\du
\q
exit
    3. As root, complete the below to finish setting up the it1 user (a quick verification sketch follows this step)
    • enable the it1 user to log in to psql over TCP by editing pg_hba.conf and adding a line with: host all all 127.0.0.1/32 md5
service ambari stop
service postgresql stop
echo "host all all 127.0.0.1/32 md5" >> /var/lib/pgsql/data/pg_hba.conf
service postgresql start
service ambari start
    • sudo as hdfs to create a home directory for it1 and set its ownership
sudo -u hdfs hdfs dfs -mkdir /user/it1
sudo -u hdfs hdfs dfs -chown it1 /user/it1
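To verify both changes took effect, a quick sketch (the 127.0.0.1 host and the it1 password match what was configured above):
# it1 should now be able to log in to Postgres over TCP using md5 auth
PGPASSWORD=it1 psql -U it1 -d contoso -h 127.0.0.1 -c '\conninfo'
# the HDFS home directory should exist and be owned by it1
sudo -u hdfs hdfs dfs -ls /user | grep it1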
    4. As root, set up Sqoop for Postgres by downloading the appropriate JDBC jar from here, e.g. "JDBC42 Postgresql Driver, Version 9.4-1207". Note: to confirm which version of Postgres you have, run the following via psql: SELECT version();
wget https://jdbc.postgresql.org/download/postgresql-9.4.1207.jar -P /usr/hdp/current/sqoop-client/lib
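Two quick checks around this step (a sketch; the paths are the sandbox defaults used above):
# the SELECT version(); from the note, run from the shell as root
sudo -u postgres psql -c "SELECT version();"
# confirm the JDBC jar landed where the Sqoop client will pick it up
ls /usr/hdp/current/sqoop-client/lib/postgresql-*.jar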

Bulk import of data into Hive from RDBMS

The next set of steps will be run as the it1 user.

    1. As the it1 user, connect to psql and create/import the downloaded sample data (this may take a few minutes)
su - it1
export PGPASSWORD=it1
# the sample data was unzipped into /tmp in the prerequisites
cd /tmp
psql -U it1 -d contoso -h localhost -f contoso-psql.sql
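Once the load finishes, a quick check that the tables made it into Postgres (a sketch; factsales is the table used throughout the rest of the lab):
# list the imported tables and count rows in factsales
psql -U it1 -d contoso -h localhost -c '\dt'
psql -U it1 -d contoso -h localhost -c 'select count(*) from factsales;'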

    2. Ensure Sqoop can access the tables in the contoso db as the it1 user
sqoop list-tables --connect jdbc:postgresql://localhost:5432/contoso --username it1 --password it1 -- --schema contoso
    3. Make sure the Hive service is up via the Ambari UI, then start the bulk load of all the Postgres tables into Hive (as text) using Sqoop. This will run for some time.
sqoop import-all-tables --username it1 --password it1 --connect jdbc:postgresql://localhost:5432/contoso  --hive-import  --direct
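When the import completes, you can sanity-check the Hive side from the shell (a sketch; it assumes HiveServer2 is listening on the sandbox's default port 10000):
# list the imported Hive tables and count rows in the factsales staging table
beeline -u jdbc:hive2://localhost:10000 -n it1 -e "show tables; select count(*) from factsales;"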
    4. In the Hive view, create the ORC-backed transactional final table and copy the bulk-imported data into it
CREATE TABLE `factsales_final` (
  `SalesKey` int,
  `DateKey` timestamp,
  `channelKey` int,
  `StoreKey` int,
  `ProductKey` int,
  `PromotionKey` int,
  `CurrencyKey` int,
  `UnitCost` float,
  `UnitPrice` float,
  `SalesQuantity` int,
  `ReturnQuantity` int,
  `ReturnAmount` float,
  `DiscountQuantity` int,
  `DiscountAmount` float,
  `TotalCost` float,
  `SalesAmount` float,
  `ETLLoadID` int,
  `LoadDate` timestamp,
  `UpdateDate` timestamp
)
clustered by (saleskey) into 7 buckets
stored as orc
TBLPROPERTIES ('transactional'='true');

insert into factsales_final select * from factsales;
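A simple way to confirm the copy (a sketch, run in the Hive view): the staging and final tables should report the same row count.
select count(*) from factsales;
select count(*) from factsales_final;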

Incremental import of data into Hive from RDBMS

  • Now that we have done the one-time bulk import, we will next set up an incremental Sqoop job

    1. Create a password file containing the it1 user's password in HDFS. This allows invocations of the job to be automated/scheduled without having to manually pass the password.
# use -n to ensure newline is not added
echo -n "it1" > .password
hadoop fs -put .password /user/it1/
rm .password
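Because a trailing newline would become part of the password, it is worth confirming the file is exactly 3 bytes (a sketch):
# should print 3 (the length of "it1" with no newline)
hadoop fs -cat /user/it1/.password | wc -c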
    2. Create an incremental-import Sqoop job for the factsales table and configure it as below:
    • --table: the table the job is for (i.e. factsales)
    • --password-file: the HDFS location of the password file
    • --incremental lastmodified: use last-modified logic to find the delta records
    • --check-column: the column used to determine which delta records get picked up (here updatedate)
    • --last-value: the starting point; records whose updatedate value is later than 2015-01-01 will be picked up
    • see the Sqoop documentation on incremental imports for more details
sqoop job --create factsales -- import --verbose --connect 'jdbc:postgresql://localhost:5432/contoso' --table factsales --username it1 --password-file hdfs://sandbox.hortonworks.com:8020/user/it1/.password --check-column updatedate --incremental lastmodified --last-value '2015-01-01' --hive-import --direct
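To confirm the job was saved and stored the expected parameters (a sketch using standard Sqoop job commands):
# list saved jobs and show the definition stored for factsales
sqoop job --list
sqoop job --show factsales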
    3. Update records in the factsales table in Postgres
psql -U it1 -d contoso -h localhost -c "update factsales set updatedate = '2016-01-01 00:00:00' where saleskey in (1,2);"
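To confirm the two rows now carry the newer updatedate before running the incremental job (a sketch):
psql -U it1 -d contoso -h localhost -c "select saleskey, updatedate from factsales where saleskey in (1,2);"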
    4. In Hive, truncate the staging table by running the below in the Hive view
truncate table factsales;
    5. Run the incremental Sqoop job for factsales to import the updated records from Postgres into the Hive staging table
sqoop job --exec factsales
    6. In Hive, check that only the records we changed were picked up in the Hive staging table
SELECT * FROM default.factsales;
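Since only saleskey 1 and 2 were touched in Postgres, the staging table should now hold just those two rows (a sketch, run in the Hive view):
select count(*) from default.factsales;
select saleskey, updatedate from default.factsales;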
    7. In Hive, move the data from the staging table to the final table (one statement at a time, using the Hive view)
    • first remove the records from final table that are also found in staging table
    • move data from staging table to final table
    • truncate staging table
delete from factsales_final where saleskey in (select saleskey from factsales);
insert into factsales_final select * from factsales;
truncate table factsales;
    8. In Hive, check the records updated in the Hive final table
select * from factsales_final where saleskey in (1,2);
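The password file from step 1 exists so this job can run unattended; a minimal cron sketch for periodic runs (the schedule, user, and log path are assumptions, and the staging-to-final merge above would still need to be scripted separately):
# it1's crontab: run the incremental import at the top of every hour
0 * * * * sqoop job --exec factsales >> /tmp/factsales-sqoop.log 2>&1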
  • At this point we have shown how you can bulk import data from EDW/RDBMS into Hive and then incrementally keep the Hive tables updated periodically