@saptak
Last active October 12, 2015 22:14

In this tutorial, we’ll focus on taking advantage of improvements to Apache Hive and Apache Tez through the work completed by the community as part of the Stinger initiative. 

In this tutorial, we are going to cover:

  • Performance improvements of Hive on Tez
  • Performance improvements of Vectorized Query
  • Cost-based Optimization Plans
  • Multi-tenancy with HiveServer2
  • SQL Compliance Improvements

Prerequisites

A working HDP cluster – the easiest way to get an HDP cluster is to download the HDP Sandbox.

Download data

The dataset that we will need for this tutorial is here. Please download and save the file in a folder on your local machine.

Once you unzip SensorFiles.zip, you will find two CSV files inside: HVAC.csv and building.csv. We will use these data files throughout the tutorial.

Let’s use these two CSV files to create two new tables. Navigate to http://127.0.0.1:8080 in your browser.

Load Data to HDFS

Go to the HDFS view, navigate to /user/admin, and upload HVAC.csv and building.csv.

Create Hive tables and load data

Open Hive View

and create the two tables. 

The following queries create the tables:

create table hvac (stage STRING, time STRING, targettemp BIGINT, actualtemp BIGINT, system BIGINT, systemage BIGINT, building_id BIGINT)
row format delimited 
fields terminated by ',' 
stored as textfile
tblproperties ("skip.header.line.count"="1");

create table building (building_id BIGINT, building_mgr STRING, building_age BIGINT, hvacproduct STRING, country STRING)
row format delimited 
fields terminated by ',' 
stored as textfile
tblproperties ("skip.header.line.count"="1");
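As a sanity check of the table layout, here is a small stand-in sketch using Python's sqlite3 rather than Hive; the sample rows and the header names are hypothetical, chosen to match the column list in the CREATE TABLE statement, and the header skip mirrors what skip.header.line.count=1 does in Hive:

```python
import csv
import io
import sqlite3

# Made-up sample data in the same comma-delimited, header-first layout
# that the Hive DDL above expects.
hvac_csv = """stage,time,targettemp,actualtemp,system,systemage,building_id
6/1/13,0:00:01,66,58,13,20,4
6/2/13,1:00:01,69,68,3,20,17
"""

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE hvac (
    stage TEXT, time TEXT, targettemp INTEGER, actualtemp INTEGER,
    system INTEGER, systemage INTEGER, building_id INTEGER)""")

reader = csv.reader(io.StringIO(hvac_csv))
next(reader)  # skip the header row, as skip.header.line.count=1 does in Hive
conn.executemany("INSERT INTO hvac VALUES (?,?,?,?,?,?,?)", reader)

print(conn.execute("SELECT count(*) FROM hvac").fetchone()[0])  # 2
```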

Now the database explorer should show the two tables, building and hvac. 

To load the data from the csv files into the tables, we execute the following queries.

LOAD DATA INPATH '/user/admin/building.csv' OVERWRITE INTO TABLE building;

LOAD DATA INPATH '/user/admin/HVAC.csv' OVERWRITE INTO TABLE hvac;

To test if the data was loaded correctly, click the icon next to the table name at the right of the database explorer. 

Speed Improvements

To take a look at the speed improvements of Hive on Tez, we can run some sample queries. For this we will use the above two tables – hvac and building.

Step 1 :

We will SSH into the VM and launch the Hive shell.

ssh root@127.0.0.1 -p 2222

The password is hadoop. Once logged in, start the Hive shell by typing hive.

First, we will run Hive without Tez.

Step 2:

Set the execution engine to MapReduce. You can confirm from the log output on your screen that Hive is running queries using the MapReduce framework.

set hive.execution.engine=mr;

Then, let’s execute the following HiveQL query:

select h.*, b.country, b.hvacproduct, b.building_age, b.building_mgr 
from building b join hvac h 
on b.building_id = h.building_id;
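To make the join's output shape concrete, here is a minimal stand-in using Python's sqlite3 instead of Hive, with made-up sample rows:

```python
import sqlite3

# Tiny in-memory stand-in (sqlite3, not Hive) for the join above;
# the inserted rows are hypothetical sample values.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE hvac (stage TEXT, time TEXT, targettemp INT, actualtemp INT,
                   system INT, systemage INT, building_id INT);
CREATE TABLE building (building_id INT, building_mgr TEXT, building_age INT,
                       hvacproduct TEXT, country TEXT);
INSERT INTO hvac VALUES ('6/1/13', '0:00:01', 66, 58, 13, 20, 4);
INSERT INTO building VALUES (4, 'M4', 9, 'AC1000', 'USA');
""")

# Same query shape as in the tutorial: every hvac column plus the
# matching building attributes.
rows = conn.execute("""
SELECT h.*, b.country, b.hvacproduct, b.building_age, b.building_mgr
FROM building b JOIN hvac h ON b.building_id = h.building_id
""").fetchall()
print(rows)
```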

This query was run using the MapReduce framework. Note the time it takes your query to execute. In the example above it took 29.28 seconds.

Step 3 :

Now we can enable Hive on Tez and take advantage of Directed Acyclic Graph (DAG) execution. Instead of expressing the query as multiple MapReduce stages, which involve considerable synchronization, barrier, and I/O overhead, Tez executes the whole plan as a single DAG and can keep intermediate data sets in memory instead of writing them to disk.
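The difference can be sketched with a toy pipeline (plain Python, not Hive internals): the MapReduce style materializes each stage's output to disk before the next stage reads it, while the DAG style passes the intermediate result along in memory:

```python
import json
import os
import tempfile

# Two toy "stages" of a query plan.
def stage_filter(rows):
    return [r for r in rows if r["temp"] > 65]

def stage_count(rows):
    return len(rows)

rows = [{"temp": 60}, {"temp": 70}, {"temp": 80}]

# MapReduce style: write the intermediate result to disk between stages.
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as f:
    json.dump(stage_filter(rows), f)
    path = f.name
with open(path) as f:
    mr_result = stage_count(json.load(f))
os.unlink(path)

# DAG style: the intermediate list never leaves memory.
dag_result = stage_count(stage_filter(rows))

print(mr_result, dag_result)  # both 2
```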

Use the following step to set the execution engine to Tez:

set hive.execution.engine=tez;

Step 4 :

Run the same query as in Step 2 to see whether the speed has improved.

select h.*, b.country, b.hvacproduct, b.building_age, b.building_mgr 
from building b join hvac h 
on b.building_id = h.building_id;

Check the output of this job. It shows the usage of the containers.

In this example, Hive on Tez was considerably faster than the MapReduce execution, taking 10.957 seconds compared to the earlier 29.28 seconds.

Congratulations! You have successfully run your Hive on Tez Job.

Step 5:

Now let’s run another query in the same session.

select a.building_id, b.building_mgr, max(a.targettemp-a.actualtemp)
from hvac a join building b
on a.building_id = b.building_id
group by a.building_id, b.building_mgr;
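As a stand-in for Hive, the same query can be tried against a tiny sqlite3 database with made-up rows to see what it computes: the maximum gap between target and actual temperature per building:

```python
import sqlite3

# In-memory stand-in (sqlite3, not Hive) with hypothetical sample rows.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE hvac (targettemp INT, actualtemp INT, building_id INT);
CREATE TABLE building (building_id INT, building_mgr TEXT);
INSERT INTO hvac VALUES (70, 58, 1), (70, 65, 1), (68, 60, 2);
INSERT INTO building VALUES (1, 'M1'), (2, 'M2');
""")

# Same aggregation shape as the tutorial query.
rows = conn.execute("""
SELECT a.building_id, b.building_mgr, max(a.targettemp - a.actualtemp)
FROM hvac a JOIN building b ON a.building_id = b.building_id
GROUP BY a.building_id, b.building_mgr
""").fetchall()
print(rows)
```

Building 1 has gaps of 12 and 5 degrees, so its maximum is 12; building 2 has a single gap of 8.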

Again, it should run faster because it reuses the hot containers started in Step 4, since you are executing in the same Hive client session.

This time the job took only 13.363 seconds, a considerable improvement.

To explore this further, upload your own dataset to the HDP Sandbox using the steps above and execute queries both with and without Tez.

Step 6:

You can also track your Hive on Tez jobs in the HDP Sandbox web UI. Go to http://127.0.0.1:8088/cluster to track your jobs while they are running, or afterwards, to see the details.

You can click on your job and see further details.

Query Vectorization

Now let’s check if the usage of Vectorization speeds this up further.

What is vectorization? With vectorization enabled, Hive processes a batch of 1024 rows at a time instead of one row at a time, so queries can run up to 3x faster with less CPU time, resulting in improved cluster utilization. Note that vectorized execution works only on Hive tables stored in the ORC file format.
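The idea can be sketched in plain Python (a conceptual toy, not Hive internals): process rows in fixed-size batches so per-row invocation overhead is paid once per batch instead of once per row:

```python
# Conceptual sketch of vectorized execution: the batch size of 1024
# matches Hive's default vectorized row batch size.
ROWS = list(range(10_000))
BATCH = 1024

def row_at_a_time(rows):
    total = 0
    for r in rows:  # one operator invocation per row
        total += r
    return total

def vectorized(rows):
    total = 0
    for i in range(0, len(rows), BATCH):
        total += sum(rows[i:i + BATCH])  # one invocation per 1024-row batch
    return total

# Both produce the same answer; the batched form does far fewer
# per-invocation bookkeeping steps.
print(row_at_a_time(ROWS) == vectorized(ROWS))  # True
```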

Step 1:

Let’s create a table with ORC file format as follows:

create table hvac_orc stored as orc as select * from hvac;

Step 2:

Run the following statement to enable Tez.

set hive.execution.engine=tez;

Step 3:

Run the following query.

select stage, count(building_id) from hvac group by stage;

Note down the time taken.

Step 4:

Now let’s run the following sql query:

select stage, count(building_id) from hvac_orc group by stage;

Note down the time taken and compare to step 3.

Step 5:

Now let’s run the following steps to enable vectorization:

set hive.vectorized.execution.enabled = true;

and then run the SQL query from the previous step:

select stage, count(building_id) from hvac_orc group by stage;

This time it runs with a vectorized query plan, which scales very well especially with large datasets.

Step 6:

Let’s look at the ‘explain’ plan to confirm that it is indeed using a vectorized query plan:

explain select stage, count(building_id) from hvac_orc group by stage;

Note that in the explain plan, the execution mode is “vectorized”. When this feature is switched off, this line does not appear in the plan.

Stats & Cost Based Optimization (CBO)

The Cost-Based Optimization (CBO) engine uses statistics about Hive tables to produce optimal query plans.

Benefits of CBO:

  1. Reduces the need for specialists to tune queries
  2. More efficient query plans lead to better cluster utilization

Types of Stats

There are two types of statistics that can be collected for the optimizer to use in its decision making:

  1. Table Stats
  2. Column Stats

The ‘explain’ plan feature can be used to see if the correct stats are being used.

Note : CBO requires column stats. 

Phases in which stats could be collected

  1. While data is inserted: hive.stats.autogather = [**true**, false]
  2. On existing data, table level: ANALYZE TABLE table [PARTITION(key)] COMPUTE STATISTICS;
  3. On existing data, column level: ANALYZE TABLE table [PARTITION(key)] COMPUTE STATISTICS FOR COLUMNS col1,col2,...;
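A toy sketch of the kind of information these commands gather — row counts at the table level, and min/max/distinct-value counts at the column level — using made-up rows in plain Python:

```python
# Hypothetical rows of (targettemp, actualtemp, building_id), standing in
# for the hvac table; not real data from the tutorial dataset.
rows = [(70, 58, 1), (70, 65, 1), (68, 60, 2)]

# Table-level statistics: what ANALYZE TABLE ... COMPUTE STATISTICS gathers.
table_stats = {"numRows": len(rows)}

# Column-level statistics: what ... COMPUTE STATISTICS FOR COLUMNS gathers.
cols = list(zip(*rows))
column_stats = {
    name: {"min": min(c), "max": max(c), "ndv": len(set(c))}
    for name, c in zip(("targettemp", "actualtemp", "building_id"), cols)
}
print(table_stats, column_stats["building_id"])
```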

Configuration to make CBO effective for your query

  1. hive.compute.query.using.stats = [true, **false**];
  2. hive.stats.fetch.column.stats = [true, **false**];
  3. hive.stats.fetch.partition.stats = [true, **false**];
  4. hive.cbo.enable = [true, **false**];
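A toy illustration of one cost-based decision (not Hive's actual planner): with row-count statistics available, the optimizer can choose to build the in-memory hash table of a map join from the smaller relation. The table sizes here are made up:

```python
# Hypothetical row-count statistics for the two tutorial tables.
stats = {"hvac": {"numRows": 8000}, "building": {"numRows": 20}}

def pick_build_side(left, right):
    # Hash the smaller relation; stream (probe with) the larger one.
    return left if stats[left]["numRows"] <= stats[right]["numRows"] else right

print(pick_build_side("hvac", "building"))  # building
```

Without statistics the optimizer has no basis for this choice, which is why CBO requires the ANALYZE TABLE steps below.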

Step 1:

Let’s do a simple exercise. Let’s run the following query and see how long it takes.

select building_id, max(targettemp-actualtemp) from hvac group by building_id;

Please note down the time taken.

Step 2:

Now, let’s run explain on the query from Step 1.

explain select building_id, max(targettemp-actualtemp) from hvac group by building_id;

Note that the CBO feature is not used yet: Basic stats shows None in the ‘explain’ plan.

Step 3:

Now, we will tune the same query so that it uses Cost Based Optimization (CBO). Let’s collect statistics on the table hvac.

analyze table hvac compute statistics;

Make sure you are still using the Tez execution engine.

Step 4:

Let’s collect statistics for a few columns of the hvac table. To use CBO, column-level statistics are required.

analyze table hvac compute statistics for columns targettemp, actualtemp, building_id;

Step 5:

Now let’s set the four settings in Hive as follows and run explain on the query:

set hive.compute.query.using.stats=true;
set hive.stats.fetch.partition.stats=false;
set hive.cbo.enable=true;
set hive.stats.fetch.column.stats=true;
explain select building_id, max(targettemp-actualtemp)
from hvac group by building_id;

Note that the plan now shows that statistics are being used.

Step 6:

Let’s rerun the query now and observe whether it runs faster. You will see a bigger gain with a larger dataset than the one we are working with.

select building_id, max(targettemp-actualtemp) from hvac group by building_id;

Please note down total time taken and compare to Step 1.

SQL Compliance

There are several SQL query enhancements in this version of Hive.

Query enhancements include support for:

  • Expanded join semantics – supports FROM table1, table2 WHERE table1.col1 = table2.col2
  • IN, NOT IN subqueries in WHERE Clause
  • EXISTS and NOT EXISTS
  • Correlated Subqueries with equality operation only
  • Common Table Expressions (CTE)
  • The CHAR datatype, with defined trailing white space semantics
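The shape of several of these constructs — a CTE, an IN subquery in the WHERE clause, and NOT EXISTS — can be tried out with sqlite3 standing in for Hive; the sample rows are made up:

```python
import sqlite3

# In-memory stand-in (sqlite3, not Hive) with hypothetical sample rows.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE hvac (building_id INT, actualtemp INT);
CREATE TABLE building (building_id INT, country TEXT);
INSERT INTO hvac VALUES (1, 70), (2, 58);
INSERT INTO building VALUES (1, 'USA'), (2, 'Finland'), (3, 'France');
""")

# CTE plus an IN subquery: countries of buildings with a hot reading.
with_readings = conn.execute("""
WITH hot AS (SELECT building_id FROM hvac WHERE actualtemp > 60)
SELECT country FROM building
WHERE building_id IN (SELECT building_id FROM hot)
""").fetchall()

# Correlated NOT EXISTS: buildings with no HVAC readings at all.
no_readings = conn.execute("""
SELECT country FROM building b
WHERE NOT EXISTS (SELECT 1 FROM hvac h WHERE h.building_id = b.building_id)
""").fetchall()

print(with_readings, no_readings)  # [('USA',)] [('France',)]
```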

Authorization System enhancements:

  • SQL authorization actions: GRANT/REVOKE of the CREATE, INSERT, SELECT, DROP, DELETE, and ALL privileges
  • Creating roles, and granting with the admin option
  • Using views to restrict data visibility

We will go into these in much more detail in a later tutorial.
