Hive is designed to enable easy data summarization and ad-hoc analysis of large volumes of data. It uses a SQL-like query language called HiveQL.
In this tutorial, we will explore the following:
- Load a data file into a Hive table
- Create a table using the RCFile format
- Query tables
- Managed tables vs. external tables
- The ORC file format
- Partitioning a table
- Bucketing a table
You need a working HDP cluster – the easiest way to get one is to download the Hortonworks Sandbox.
First of all, download the data file from here and name it Twitterdata.txt.
As the file is small, you can also simply open it, copy its contents, and create a local file in the sandbox manually.
We will use the Ambari UI here.
Open http://localhost:8080 in your browser.
Now click on the HDFS Files button in the off-canvas menu at the top, and you will see the following screen.
Navigate to the /tmp folder.
Now click on the Upload option and select the file Twitterdata.txt from your computer.
Here is a sample syntax to create a table and load a data file into it.
Let’s create this table and load the data. For this we can use either Ambari or the command line.
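If you prefer the command line, the upload step can be sketched as follows (a sketch, assuming you are logged in to the sandbox shell and Twitterdata.txt is in your current directory; the target path matches the Ambari steps above):

```shell
# Copy the local file into HDFS
hdfs dfs -put Twitterdata.txt /tmp/Twitterdata.txt

# Verify it arrived
hdfs dfs -ls /tmp
```

From there, the CREATE TABLE and LOAD DATA statements below can be run from the Hive CLI instead of the Hive view.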
Open the Hive view by clicking the Hive button in the views menu, and type the following query into the composition area:
CREATE TABLE TwitterExampletextexample(
tweetId BIGINT, username STRING,
txt STRING, CreatedAt STRING,
profileLocation STRING,
favc BIGINT, retweet STRING, retcount BIGINT, followerscount BIGINT)
COMMENT 'This is the Twitter streaming data'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;
LOAD DATA INPATH '/tmp/Twitterdata.txt' OVERWRITE INTO TABLE TwitterExampletextexample;
To query sample data from the table you just created, click the icon next to the table name; it executes a SELECT query.
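Clicking the icon is equivalent to running a simple SELECT yourself; a sketch (the row limit here is an assumption – the view may use a different one):

```
SELECT * FROM TwitterExampletextexample LIMIT 10;
```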
The Record Columnar File (RCFile) format determines how relational tables are stored on distributed clusters. With this format, you get the advantages of a column-oriented layout over a row-oriented layout of a record.
Here is a sample Create RC file format table syntax:
CREATE TABLE TwitterExampleRCtable(
tweetId BIGINT, username STRING,
txt STRING, CreatedAt STRING,
profileLocation STRING COMMENT 'Location of user',
favc BIGINT, retweet STRING, retcount BIGINT, followerscount BIGINT)
COMMENT 'This is the Twitter streaming data'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS RCFILE;
Here is how to load data into the RCFile table. Please execute it and see the results.
INSERT OVERWRITE TABLE TwitterExampleRCtable select * from TwitterExampletextexample;
Run the sample select query again.
Let’s find the top 10 locations with the most tweets, using TwitterExampleRCtable.
SELECT profileLocation, COUNT(txt) AS count1
FROM TwitterExampleRCtable
GROUP BY profileLocation
ORDER BY count1 DESC LIMIT 10;
Managed tables are what CREATE TABLE statements produce by default. External tables are used when you want a table to point to data files that stay in place; the location you point to must be a directory.
Here is the syntax for creating these tables.
Managed:
CREATE TABLE ManagedExample(
tweetId BIGINT, username STRING,
txt STRING, CreatedAt STRING,
profileLocation STRING,
favc BIGINT, retweet STRING, retcount BIGINT, followerscount BIGINT)
COMMENT 'This is the Twitter streaming data'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;
For the external table we need to create a new directory /tmp/admin and upload Twitterdata.txt there.
Before we create the table and load the data into it, we have to change the permissions on the /tmp/admin folder.
Open the HDFS view and navigate to /tmp.
Right-click on admin and select Permissions.
Now check the Write boxes, select Modify recursively, and press Save.
Verify that the permissions look now like this:
External:
CREATE EXTERNAL TABLE IF NOT EXISTS ExternalExample(
tweetId BIGINT, username STRING,
txt STRING, CreatedAt STRING,
profileLocation STRING,
favc BIGINT, retweet STRING, retcount BIGINT, followerscount BIGINT)
COMMENT 'This is the Twitter streaming data'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION '/tmp/admin';
Also, when you drop a managed table, Hive deletes both the metadata and the data.
When you drop an external table, Hive deletes only the metadata.
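A quick sketch of the difference, using the two tables created in this section (note that dropping them really removes them, so recreate them if you want to keep following along):

```
-- Removes the metadata AND deletes the data files under /apps/hive/warehouse
DROP TABLE ManagedExample;

-- Removes only the metadata; /tmp/admin/Twitterdata.txt stays in HDFS
DROP TABLE ExternalExample;
```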
When you create a managed table, the file you load is moved to /apps/hive/warehouse, which means the data is controlled by Hive.
The external table, by contrast, points to the /tmp/admin directory into which we put Twitterdata.txt. If you run the sample query, you should see the data from this file.
As a next step, you could describe the tables as below and compare the output for managed vs. external tables.
DESCRIBE FORMATTED ManagedExample;
DESCRIBE FORMATTED ExternalExample;
The Optimized Row Columnar (ORC) file format is used because it compresses data files further. This can cost a little write performance, but yields a large gain in read performance.
Let’s try it out. Note in the statement below that the table is stored as ORC.
CREATE TABLE ORCFileFormatExample(
tweetId BIGINT, username STRING,
txt STRING, CreatedAt STRING,
profileLocation STRING COMMENT 'Location of user',
favc INT, retweet STRING, retcount INT, followerscount INT)
COMMENT 'This is the Twitter streaming data'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS ORC TBLPROPERTIES ("orc.compress"="ZLIB");
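The statement above creates an empty table; to populate it you can copy rows from the text table, just as we did for the RCFile table (a sketch, assuming TwitterExampletextexample already holds the data):

```
INSERT OVERWRITE TABLE ORCFileFormatExample
SELECT * FROM TwitterExampletextexample;
```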
Partitions are horizontal slices of data that allow large data sets to be segmented into more manageable blocks. Here is the sample syntax to create a partitioned table and load data into a partition.
CREATE TABLE PARTITIONEDExample(
tweetId BIGINT, username STRING, txt STRING, favc BIGINT, retweet STRING,
retcount BIGINT, followerscount BIGINT)
COMMENT 'This is the Twitter streaming data'
PARTITIONED BY(CreatedAt STRING, profileLocation STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;
INSERT OVERWRITE TABLE PARTITIONEDExample
PARTITION (CreatedAt="26 04:50:56 UTC 2014",profileLocation="Chicago")
SELECT tweetId,username,txt,favc,retweet,retcount,followerscount
FROM twitterexampletextexample
where profileLocation='Chicago' limit 100;
Let's execute a sample query to check that the partition was loaded correctly.
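Two ways to check (a sketch; the partition values match the INSERT statement above):

```
-- List the partitions Hive created
SHOW PARTITIONS PARTITIONEDExample;

-- Read back only the partition we loaded
SELECT tweetId, username, txt
FROM PARTITIONEDExample
WHERE CreatedAt = '26 04:50:56 UTC 2014' AND profileLocation = 'Chicago'
LIMIT 10;
```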
Bucketing is a technique that allows you to cluster or segment large data sets to optimize query performance.
Here is an example for creating a table with buckets and load data into it.
CREATE TABLE BucketingExample(
tweetId BIGINT, username STRING,
txt STRING, CreatedAt STRING, favc BIGINT, retweet STRING, retcount BIGINT, followerscount BIGINT)
COMMENT 'This is the Twitter streaming data'
PARTITIONED BY( profileLocation STRING)
CLUSTERED BY(tweetId) INTO 2 BUCKETS
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;
set hive.enforce.bucketing = true;
INSERT OVERWRITE TABLE BucketingExample PARTITION (profileLocation="Chicago")
SELECT tweetId,username,txt,CreatedAt,favc,retweet,retcount,followerscount
FROM twitterexampletextexample
where profileLocation='Chicago' limit 100;
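One payoff of bucketing is efficient sampling: TABLESAMPLE can read a single bucket instead of scanning the whole table. A sketch against the table created above:

```
-- Read only the first of the two buckets
SELECT tweetId, username
FROM BucketingExample TABLESAMPLE(BUCKET 1 OUT OF 2 ON tweetId)
LIMIT 10;
```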
Run a sample query against the bucketed table to verify the results.
You can browse the HDFS folders to see the directory structure behind the Hive tables you have just created. This can help you design your tables and file distributions, which is very important when designing your warehouse.
We hope this was helpful and simple enough to give you a glimpse of the Hive world.