saptak/how-to-load-data-hdfs.md

## how-to-load-data-hdfs.md

      
    Raw
  

              how-to-load-data-hdfs.md
            
          
    Summary

This tutorial describes how to load data into the Hortonworks sandbox.
The Hortonworks sandbox is a fully contained Hortonworks Data Platform (HDP) environment. The sandbox includes the core Hadoop components (HDFS and MapReduce), as well as all the tools needed for data ingestion and processing. You can access and analyze sandbox data with many Business Intelligence (BI) applications.
In this tutorial, we will load and review data for a fictitious web retail store in what has become an established use case for Hadoop: deriving insights from large data sources such as web logs. By combining web logs with more traditional customer data, we can better understand our customers, and also understand how to optimize future promotions and advertising.
Prerequisites:


Hortonworks Sandbox 2.3 (installed and running)

Overview

To load data into the Hortonworks sandbox, you will:

Download sample data to your computer.
Upload the data files into the sandbox
View and refine the data in the sandbox.

Step 1: Download the Sample Data

You can download a set of sample data contained in a compressed (.zip) folder here:
RefineDemoData.zip
Save the sample data .zip file to your computer, then extract the files and unzip Omniture.0.tsv.gz, user.tsv.gz and products.tsv.gz.
Note: The extracted data files should have a .tsv file extension at the end.
Step 2: Upload the Data Files into the Sandbox

Select the HDFS Files view from the Off-canvas menu at the top. The HDFS Files view allows you to view the Hortonworks Data Platform(HDP) file store. The HDP file system is separate from the local file system.

We navigate to /tmp, create an admin folder

right click on admin and select Permissions:

Now we check the Write buttons and modify recursively and press save.

Verify that the permissions look now like this:

Now, we navigate to /tmp/admin, click on upload and browse the Omniture.0.tsv.
Repeat this procedure for users.tsv file and for products.tsv.

Step 3: Create Hive tables

Let's open the Hive View by clicking on the Hive button from the views menu.

and create the tables users, products and omniture.
    create table users (swid STRING, birth_dt STRING, gender_cd CHAR(1))
    ROW FORMAT DELIMITED
    FIELDS TERMINATED by '\t'
    stored as textfile 
    tblproperties ("skip.header.line.count"="1");

    create table products (url STRING, category STRING)
    ROW FORMAT DELIMITED
    FIELDS TERMINATED by '\t'
    stored as textfile 
    tblproperties ("skip.header.line.count"="1");

    create table omniturelogs (col_1 STRING,col_2 STRING,col_3 STRING,col_4 STRING,col_5 STRING,col_6 STRING,col_7 STRING,col_8 STRING,col_9 STRING,col_10 STRING,col_11 STRING,col_12 STRING,col_13 STRING,col_14 STRING,col_15 STRING,col_16 STRING,col_17 STRING,col_18 STRING,col_19 STRING,col_20 STRING,col_21 STRING,col_22 STRING,col_23 STRING,col_24 STRING,col_25 STRING,col_26 STRING,col_27 STRING,col_28 STRING,col_29 STRING,col_30 STRING,col_31 STRING,col_32 STRING,col_33 STRING,col_34 STRING,col_35 STRING,col_36 STRING,col_37 STRING,col_38 STRING,col_39 STRING,col_40 STRING,col_41 STRING,col_42 STRING,col_43 STRING,col_44 STRING,col_45 STRING,col_46 STRING,col_47 STRING,col_48 STRING,col_49 STRING,col_50 STRING,col_51 STRING,col_52 STRING,col_53 STRING)
    ROW FORMAT DELIMITED
    FIELDS TERMINATED by '\t'
    stored as textfile 
    tblproperties ("skip.header.line.count"="1");

Step 4: Load data into new tables

To load the data into the tables, we have to execute the following queries.
LOAD DATA INPATH '/tmp/admin/products.tsv' OVERWRITE INTO TABLE products;

LOAD DATA INPATH '/tmp/admin/users.tsv' OVERWRITE INTO TABLE users;

LOAD DATA INPATH '/tmp/admin/Omniture.0.tsv' OVERWRITE INTO TABLE omniturelogs;

To check if the data was loaded, click on the icon next to the table name. It executes a sample query.


Step 5: View and Refine the Data in the Sandbox

In the previous section, we created sandbox tables from uploaded data files. Now let’s take a closer look at that data.
Here’s a summary of the data we’re working with:
omniturelogs – website logs containing information such as URL, timestamp, IP address, geocoded IP, and session ID.


users – CRM user data listing SWIDs (Software User IDs) along with date of birth and gender.

products – CMS data that maps product categories to website URLs.

Now let’s use a Hive script to generate an “omniture” view that contains a subset of the data in the Omniture log table.
        CREATE VIEW omniture AS 
        SELECT col_2 ts, col_8 ip, col_13 url, col_14 swid, col_50 city, col_51 country, col_53 state 
        FROM omniturelogs 

Click Save as. On the “Saving item” pop-up, type “omniture” in the box, then click OK.

You can see your saved query now by clicking on the "Save Queries" button at the top.

Click Execute to run the script.
To view the data generated by the saved script, click on the icon next to the view's name at the Database Explorer.
The query results will appear, and you can see that the results include the data from the omniturelogs table that were specified in the query.

Finally, we’ll create a script that joins the omniture website log data to the CRM data (registered users) and CMS data (products). Click Query Editor, then paste the following text in the Query box:
    create table webloganalytics as 
    select to_date(o.ts) logdate, o.url, o.ip, o.city, upper(o.state) state, 
    o.country, p.category, CAST(datediff( from_unixtime( unix_timestamp() ), 
    from_unixtime( unix_timestamp(u.birth_dt, 'dd-MMM-yy'))) / 365  AS INT) age, u.gender_cd
    from omniture o 
    inner join products p     
    on o.url = p.url 
    left outer join users u 
    on o.swid = concat('{', u.swid , '}')
Save this script as “webloganalytics” and execute the script.

You can view the data generated by the script as described in the preceding steps.

Now that you have loaded data into the Hortonworks Platform, you can use Business Intelligence (BI) applications such as Microsoft Excel to access and analyze the data.