
@saptak
Created July 17, 2017 21:58
  1. Switch to the hdfs user to begin generating data, then change to the hdfs user's home directory:

    sudo -u hdfs -s
    cd /home/hdfs
  2. Download the testbench utilities from GitHub and unzip them:

    wget https://github.com/hortonworks/hive-testbench/archive/hive14.zip
    unzip hive14.zip
  3. Open the load-partitioned.sql file in an editor:

    vi hive-testbench-hive14/settings/load-partitioned.sql
  4. Correct the hive.tez.java.opts setting:

Comment out the line below by adding -- at the beginning of the line:

    -- set hive.tez.java.opts=-XX:+PrintGCDetails -verbose:gc -XX:+PrintGCTimeStamps -XX:+UseNUMA -XX:+UseG1GC -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp/;

Add the line below:

    set hive.tez.java.opts=-XX:+PrintGCDetails -verbose:gc -XX:+PrintGCTimeStamps -XX:+UseNUMA -XX:+UseParallelGC -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp/;

Save the file and exit.
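If you would rather script the edit above than do it in vi, a sed sketch like the one below performs the same swap. It works on a throwaway copy under /tmp so the example is self-contained; for real use, point it at hive-testbench-hive14/settings/load-partitioned.sql instead.

```shell
# Sketch: scripting the GC-flag swap instead of editing in vi.
# The /tmp/tb-demo copy is only for illustration.
mkdir -p /tmp/tb-demo
cat > /tmp/tb-demo/load-partitioned.sql <<'EOF'
set hive.tez.java.opts=-XX:+PrintGCDetails -verbose:gc -XX:+PrintGCTimeStamps -XX:+UseNUMA -XX:+UseG1GC -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp/;
EOF
# Comment out the G1GC line by prefixing it with --
sed -i 's/^set hive\.tez\.java\.opts=.*UseG1GC.*/-- &/' /tmp/tb-demo/load-partitioned.sql
# Append the ParallelGC replacement line
echo 'set hive.tez.java.opts=-XX:+PrintGCDetails -verbose:gc -XX:+PrintGCTimeStamps -XX:+UseNUMA -XX:+UseParallelGC -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp/;' >> /tmp/tb-demo/load-partitioned.sql
```

This assumes GNU sed (as shipped on the CentOS/RHEL hosts implied by the yum command later in this guide).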

  5. Generate 30 GB of test data:

    # In case gcc is not installed
    yum install gcc

    # If javac is not found
    export JAVA_HOME=/usr/jdk64/jdk1.8.0_77
    export PATH=$JAVA_HOME/bin:$PATH

    cd hive-testbench-hive14/
    sudo ./tpcds-build.sh
    ./tpcds-setup.sh 30
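Before kicking off the build, you can check for the two prerequisites mentioned above with a quick sketch like this:

```shell
# Quick prerequisite check (sketch) for the build step above.
command -v gcc >/dev/null 2>&1 && echo "gcc ok" || echo "gcc missing: yum install gcc"
command -v javac >/dev/null 2>&1 && echo "javac ok" || echo "javac missing: set JAVA_HOME and PATH as above"
```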
  6. A MapReduce job runs to create the data and load it into Hive. This will take some time to complete. The last line of the script's output is:

Data loaded into database tpcds_bin_partitioned_orc_30.

  7. Choose a query to run for benchmarking, for example query55.sql. Copy the query of your choice and make an explain version of it. The explain query will be helpful later on to see how Hive plans the query.

    cd sample-queries-tpcds
    cp query55.sql explainquery55.sql
    vi explainquery55.sql

Add the keyword explain before the query. For example, the first line of the explain version of query 55:

explain select i_brand_id brand_id, i_brand brand,

Save and quit out of the file.
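If you want explain versions of every benchmark query rather than just one, a loop like the sketch below prepends explain to each file. It runs against a throwaway demo directory (with placeholder query text) so the example is self-contained; run it inside sample-queries-tpcds for real use.

```shell
# Sketch: create explainqueryNN.sql copies of every query file.
# /tmp/q-demo and its contents are placeholders for illustration only.
mkdir -p /tmp/q-demo
printf 'select i_brand_id brand_id, i_brand brand\nfrom item;\n' > /tmp/q-demo/query55.sql
for q in /tmp/q-demo/query*.sql; do
  # Prepend "explain" on its own line, writing explain<name>.sql
  { echo 'explain'; cat "$q"; } > "$(dirname "$q")/explain$(basename "$q")"
done
```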

  8. You are now ready to issue a benchmark query. Start the Beeline hive2 CLI:

    beeline -i testbench.settings -u jdbc:hive2://localhost:10500/tpcds_bin_partitioned_orc_30
  9. To try a query without LLAP, set hive.llap.execution.mode=none and run a query. For example, the commands below run benchmark query 55:

    set hive.llap.execution.mode=none;
    !run query55.sql

Note the completion time reported at the end of the query: 18.984 seconds without LLAP.

  10. Now try the query with LLAP: set hive.llap.execution.mode=all and run the query again:

    set hive.llap.execution.mode=all;
    !run query55.sql

  11. Notice that the query with LLAP completes much more quickly. If you don’t see a significant speedup at first, try the same query again: as the LLAP cache fills with data, queries respond more quickly. In the next two runs of the same query with LLAP set to all, the second query returned in 8.455 seconds and a subsequent query in 2.745 seconds. If your cluster has been up and you have been running LLAP queries against this data, your performance may be in the 2-second range on the first try.
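Using the timings quoted above (18.984 s cold without LLAP, 2.745 s with a warm LLAP cache), the speedup works out as:

```shell
# Speedup computed from the timings reported above, in seconds.
awk 'BEGIN { printf "warm-cache speedup: %.1fx\n", 18.984 / 2.745 }'
# prints: warm-cache speedup: 6.9x
```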

  12. To see the difference between the query plans, use the explain query to show the plan with no LLAP. Take note of the vectorized keyword in the plan:

    set hive.llap.execution.mode=none;
    !run explainquery55.sql

  13. Try the explain again, with LLAP enabled:

    set hive.llap.execution.mode=all;
    !run explainquery55.sql
  14. Notice that in the explain plan for the LLAP query, LLAP is shown after the vectorized keyword.
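For reference, the lines to look for in the two plans typically differ like the sketch below (illustrative only; exact explain output varies by Hive version):

```
Execution mode: vectorized          <-- hive.llap.execution.mode=none
Execution mode: vectorized, llap    <-- hive.llap.execution.mode=all
```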
