
@saptak
Created July 17, 2017 21:58
  1. Switch to the hdfs user to begin generating data, then change to the hdfs user's home directory:

    sudo -u hdfs -s
    cd /home/hdfs
  2. Download the testbench utilities from GitHub and unzip them:

    wget https://github.com/hortonworks/hive-testbench/archive/hive14.zip
    unzip hive14.zip
  3. Open the load-partitioned.sql file in an editor:

    vi hive-testbench-hive14/settings/load-partitioned.sql
  4. Correct the hive.tez.java.opts setting:

Comment out the line below by adding -- at the beginning of the line:

    -- set hive.tez.java.opts=-XX:+PrintGCDetails -verbose:gc -XX:+PrintGCTimeStamps -XX:+UseNUMA -XX:+UseG1GC -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp/;

Add the line below:

    set hive.tez.java.opts=-XX:+PrintGCDetails -verbose:gc -XX:+PrintGCTimeStamps -XX:+UseNUMA -XX:+UseParallelGC -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp/;

Save the file and exit.
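If you would rather script the edit above than do it in vi, a sed sketch like the one below performs the same swap. It works on a throwaway copy under /tmp so the example is self-contained; for real use, point it at hive-testbench-hive14/settings/load-partitioned.sql instead.

```shell
# Sketch: scripting the GC-flag swap instead of editing in vi.
# The /tmp/tb-demo copy is only for illustration.
mkdir -p /tmp/tb-demo
cat > /tmp/tb-demo/load-partitioned.sql <<'EOF'
set hive.tez.java.opts=-XX:+PrintGCDetails -verbose:gc -XX:+PrintGCTimeStamps -XX:+UseNUMA -XX:+UseG1GC -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp/;
EOF
# Comment out the G1GC line by prefixing it with --
sed -i 's/^set hive\.tez\.java\.opts=.*UseG1GC.*/-- &/' /tmp/tb-demo/load-partitioned.sql
# Append the ParallelGC replacement line
echo 'set hive.tez.java.opts=-XX:+PrintGCDetails -verbose:gc -XX:+PrintGCTimeStamps -XX:+UseNUMA -XX:+UseParallelGC -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp/;' >> /tmp/tb-demo/load-partitioned.sql
```

This assumes GNU sed (as shipped on the CentOS/RHEL hosts implied by the yum command later in this guide).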

  5. Generate 30 GB of test data:

    # In case gcc is not installed
    yum install gcc

    # If javac is not found
    export JAVA_HOME=/usr/jdk64/jdk1.8.0_77
    export PATH=$JAVA_HOME/bin:$PATH

    cd hive-testbench-hive14/
    sudo ./tpcds-build.sh
    ./tpcds-setup.sh 30
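Before kicking off the build, you can check for the two prerequisites mentioned above with a quick sketch like this:

```shell
# Quick prerequisite check (sketch) for the build step above.
command -v gcc >/dev/null 2>&1 && echo "gcc ok" || echo "gcc missing: yum install gcc"
command -v javac >/dev/null 2>&1 && echo "javac ok" || echo "javac missing: set JAVA_HOME and PATH as above"
```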
  6. A MapReduce job runs to create the data and load it into Hive. This will take some time to complete. The last line of the script's output is:

Data loaded into database tpcds_bin_partitioned_orc_30.

  7. Choose a query to run for benchmarking, for example query55.sql. Copy the query of your choice and make an explain version of it. The explain query will be helpful later on to see how Hive plans the query.

    cd sample-queries-tpcds
    cp query55.sql explainquery55.sql
    vi explainquery55.sql

Add the keyword explain before the query. For example, the first line of the explain version of query 55:

explain select i_brand_id brand_id, i_brand brand,

Save and quit out of the file.
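If you want explain versions of every benchmark query rather than just one, a loop like the sketch below prepends explain to each file. It runs against a throwaway demo directory (with placeholder query text) so the example is self-contained; run it inside sample-queries-tpcds for real use.

```shell
# Sketch: create explainqueryNN.sql copies of every query file.
# /tmp/q-demo and its contents are placeholders for illustration only.
mkdir -p /tmp/q-demo
printf 'select i_brand_id brand_id, i_brand brand\nfrom item;\n' > /tmp/q-demo/query55.sql
for q in /tmp/q-demo/query*.sql; do
  # Prepend "explain" on its own line, writing explain<name>.sql
  { echo 'explain'; cat "$q"; } > "$(dirname "$q")/explain$(basename "$q")"
done
```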

  8. You are now ready to issue a benchmark query. Start the Beeline hive2 CLI:

    beeline -i testbench.settings -u jdbc:hive2://localhost:10500/tpcds_bin_partitioned_orc_30
  9. To try a query without LLAP, set hive.llap.execution.mode=none and run a query. For example, the commands below run benchmark query 55:

    set hive.llap.execution.mode=none;
    !run query55.sql

Note the completion time reported at the end of the query: 18.984 seconds without LLAP.

  10. Now try the query with LLAP: set hive.llap.execution.mode=all and run the query again:

    set hive.llap.execution.mode=all;
    !run query55.sql

  11. Notice that the query with LLAP completes much more quickly. If you don’t see a significant speedup at first, try the same query again: as the LLAP cache fills with data, queries respond more quickly. In the next two runs of the same query with LLAP set to all, the second query returned in 8.455 seconds and a subsequent query in 2.745 seconds. If your cluster has been up and you have been running LLAP queries against this data, your performance may be in the 2-second range on the first try.
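Using the timings quoted above (18.984 s cold without LLAP, 2.745 s with a warm LLAP cache), the speedup works out as:

```shell
# Speedup computed from the timings reported above, in seconds.
awk 'BEGIN { printf "warm-cache speedup: %.1fx\n", 18.984 / 2.745 }'
# prints: warm-cache speedup: 6.9x
```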

  12. To see the difference between the query plans, use the explain query to show the plan with no LLAP. Take note of the vectorized keyword in the plan:

    set hive.llap.execution.mode=none;
    !run explainquery55.sql

  13. Try the explain again, with LLAP enabled:

    set hive.llap.execution.mode=all;
    !run explainquery55.sql
  14. Notice that in the explain plan for the LLAP query, LLAP is shown after the vectorized keyword.
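For reference, the lines to look for in the two plans typically differ like the sketch below (illustrative only; exact explain output varies by Hive version):

```
Execution mode: vectorized          <-- hive.llap.execution.mode=none
Execution mode: vectorized, llap    <-- hive.llap.execution.mode=all
```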
