
Log in to the LLAP host node.
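
For example, from a machine with network access to the cluster (the user and host below are placeholders for your environment):

ssh <user>@<llap-host>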

A) Test with spark-shell

step 1:

cd /tmp
wget https://raw.githubusercontent.com/dbompart/hive_warehouse_connector/master/hwc_info_collect.sh
chmod +x hwc_info_collect.sh
./hwc_info_collect.sh

Read the instructions in the output of the above script, then, as the spark user, start spark-shell with the configurations it prints.
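
For reference, the per-job command printed by the script has roughly this shape (a sketch only: the metastore URI, jar and zip file names, JDBC URL, and ZooKeeper quorum below are placeholders; use the exact values from the script output):

spark-shell --master yarn \
  --conf spark.datasource.hive.warehouse.load.staging.dir=/tmp \
  --conf spark.datasource.hive.warehouse.metastoreUri=thrift://<metastore-host>:9083 \
  --conf spark.hadoop.hive.llap.daemon.service.hosts=@llap0 \
  --conf spark.jars=/usr/hdp/current/hive_warehouse_connector/<hwc-assembly>.jar \
  --conf spark.submit.pyFiles=/usr/hdp/current/hive_warehouse_connector/<pyspark-hwc>.zip \
  --conf spark.security.credentials.hiveserver2.enabled=false \
  --conf spark.sql.hive.hiveserver2.jdbc.url="<jdbc-url-from-script-output>" \
  --conf spark.sql.hive.zookeeper.quorum="<zookeeper-quorum-from-script-output>"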

step 2 : Run the statements below (replace <fill table name> with your table name):

import com.hortonworks.hwc.HiveWarehouseSession
import com.hortonworks.hwc.HiveWarehouseSession._
val hive = HiveWarehouseSession.session(spark).build()
hive.showDatabases().show()
hive.executeQuery("select * from <fill table name>").show()

step 2.1 : exit spark-shell

B) Test with pyspark

step 1 : Start pyspark with the same configurations as above, replacing spark-shell with pyspark in the launch command.
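
For example (same placeholder caveats as in section A; use the values printed by hwc_info_collect.sh):

pyspark --master yarn \
  --conf spark.datasource.hive.warehouse.load.staging.dir=/tmp \
  --conf spark.datasource.hive.warehouse.metastoreUri=thrift://<metastore-host>:9083 \
  --conf spark.hadoop.hive.llap.daemon.service.hosts=@llap0 \
  --conf spark.jars=/usr/hdp/current/hive_warehouse_connector/<hwc-assembly>.jar \
  --conf spark.submit.pyFiles=/usr/hdp/current/hive_warehouse_connector/<pyspark-hwc>.zip \
  --conf spark.security.credentials.hiveserver2.enabled=false \
  --conf spark.sql.hive.hiveserver2.jdbc.url="<jdbc-url-from-script-output>" \
  --conf spark.sql.hive.zookeeper.quorum="<zookeeper-quorum-from-script-output>"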

step 2 : Run the statements below (replace <fill table name> with your table name):

from pyspark.sql import SparkSession
from pyspark_llap.sql.session import HiveWarehouseSession
hive = HiveWarehouseSession.session(spark).build()
hive.executeQuery("select * from <fill table name>").show()

step 3 : exit pyspark

C) Test with spark-submit

step 1 : Create a python file ~/pyspark_HWCDemo.py with the following content:

import sys
from pyspark.sql import SparkSession
from pyspark_llap.sql.session import HiveWarehouseSession


def main(argv):

    # Initialize the Spark session with Hive support
    spark = SparkSession.builder.appName("pyspark_HWCDemo.py").enableHiveSupport().getOrCreate()

    hive = HiveWarehouseSession.session(spark).build()
    statement = argv[0]
    # Statements that return rows go through executeQuery(); everything else through execute()
    if statement.lower().startswith("select"):
        hive.executeQuery(statement).show()
    else:
        hive.execute(statement).show()


if __name__ == '__main__':
    main(sys.argv[1:])

step 2 : To run in YARN client mode, replace spark-shell with spark-submit and ensure the following configurations are included:

--conf spark.sql.hive.hiveserver2.jdbc.url.principal=hive/_HOST@CLUSTERD.COM   # Replace with your hive principal name

--conf spark.security.credentials.hiveserver2.enabled=false

Select from a table, e.g.:

spark-submit --master yarn --deploy-mode client --conf spark.datasource.hive.warehouse.load.staging.dir=/tmp --conf spark.datasource.hive.warehouse.metastoreUri=thrift://c4441-node3.coelab.cloudera.com:9083 --conf spark.hadoop.hive.llap.daemon.service.hosts=@llap0 --conf spark.jars=/usr/hdp/current/hive_warehouse_connector/hive-warehouse-connector-assembly-1.0.0.3.1.5.0-152.jar --conf spark.submit.pyFiles=/usr/hdp/current/hive_warehouse_connector/pyspark_hwc-1.0.0.3.1.5.0-152.zip  --conf spark.sql.hive.hiveserver2.jdbc.url="jdbc:hive2://c4441-node2.coelab.cloudera.com:2181,c4441-node3.coelab.cloudera.com:2181,c4441-node4.coelab.cloudera.com:2181/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2-interactive" --conf spark.sql.hive.zookeeper.quorum="c4441-node2.coelab.cloudera.com:2181,c4441-node3.coelab.cloudera.com:2181,c4441-node4.coelab.cloudera.com:2181" --conf spark.security.credentials.hiveserver2.enabled=false  --conf spark.sql.hive.hiveserver2.jdbc.url.principal=hive/_HOST@CLUSTERD.COM  ~/pyspark_HWCDemo.py "select * from hwc_db.crimes"

Show tables, e.g.:

spark-submit --master yarn --deploy-mode client --conf spark.datasource.hive.warehouse.load.staging.dir=/tmp --conf spark.datasource.hive.warehouse.metastoreUri=thrift://c4441-node3.coelab.cloudera.com:9083 --conf spark.hadoop.hive.llap.daemon.service.hosts=@llap0 --conf spark.jars=/usr/hdp/current/hive_warehouse_connector/hive-warehouse-connector-assembly-1.0.0.3.1.5.0-152.jar --conf spark.submit.pyFiles=/usr/hdp/current/hive_warehouse_connector/pyspark_hwc-1.0.0.3.1.5.0-152.zip  --conf spark.sql.hive.hiveserver2.jdbc.url="jdbc:hive2://c4441-node2.coelab.cloudera.com:2181,c4441-node3.coelab.cloudera.com:2181,c4441-node4.coelab.cloudera.com:2181/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2-interactive" --conf spark.sql.hive.zookeeper.quorum="c4441-node2.coelab.cloudera.com:2181,c4441-node3.coelab.cloudera.com:2181,c4441-node4.coelab.cloudera.com:2181" --conf spark.security.credentials.hiveserver2.enabled=false  --conf spark.sql.hive.hiveserver2.jdbc.url.principal=hive/_HOST@CLUSTERD.COM  ~/pyspark_HWCDemo.py "show tables"
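
If the HWC properties were instead applied cluster-wide through Ambari (Spark2 -> Configs -> Custom spark2-defaults, as suggested in the output of hwc_info_collect.sh), the per-job flags can be dropped and the same test reduces to a much shorter command, e.g. (a sketch, assuming that cluster-wide setup):

spark-submit --master yarn --deploy-mode client ~/pyspark_HWCDemo.py "show tables"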

step 3 : To run in YARN cluster mode, switch the deploy mode and enable HiveServer2 credentials:

--deploy-mode cluster --conf spark.security.credentials.hiveserver2.enabled=true

e.g.:

spark-submit --master yarn --deploy-mode cluster --conf spark.datasource.hive.warehouse.load.staging.dir=/tmp --conf spark.datasource.hive.warehouse.metastoreUri=thrift://c4441-node3.coelab.cloudera.com:9083 --conf spark.hadoop.hive.llap.daemon.service.hosts=@llap0 --conf spark.jars=/usr/hdp/current/hive_warehouse_connector/hive-warehouse-connector-assembly-1.0.0.3.1.5.0-152.jar --conf spark.submit.pyFiles=/usr/hdp/current/hive_warehouse_connector/pyspark_hwc-1.0.0.3.1.5.0-152.zip  --conf spark.sql.hive.hiveserver2.jdbc.url="jdbc:hive2://c4441-node2.coelab.cloudera.com:2181,c4441-node3.coelab.cloudera.com:2181,c4441-node4.coelab.cloudera.com:2181/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2-interactive" --conf spark.sql.hive.zookeeper.quorum="c4441-node2.coelab.cloudera.com:2181,c4441-node3.coelab.cloudera.com:2181,c4441-node4.coelab.cloudera.com:2181" --conf spark.security.credentials.hiveserver2.enabled=true  --conf spark.sql.hive.hiveserver2.jdbc.url.principal=hive/_HOST@CLUSTERD.COM  ~/pyspark_HWCDemo.py "select * from hwc_db.crimes"

Courtesy: David Bompart

Note : If the LLAP node has no direct internet access to https://raw.githubusercontent.com/dbompart/hive_warehouse_connector/master/hwc_info_collect.sh, copy the following code into 'hwc_info_collect.sh' instead:

#!/bin/bash
set -e
#### For version: HDP 3.1.x
#### This script has to be run on the LLAP Host.
#### It can be ported to retrieve information using the Ambari REST APIs, allowing this to work from any host. Using bash for speed.

#### Pre-requisites:
# 1) LLAP is installed on this host.
# 2) Spark Client is installed on this host.

# File holder for required information:
hive_site_llap=/etc/hive_llap/conf/hive-site.xml
beeline_site_llap=/etc/hive_llap/conf/beeline-site.xml

if [ -r "$hive_site_llap" ] && [ -r "$beeline_site_llap" ]; then

    hive_metastore_uris=$(grep -e "thrift.*9083" "$hive_site_llap" |awk -F"<|>" '{print $3}')
    hive_llap_daemon_service_hosts=$(grep "hive.llap.daemon.service.hosts" -A1 "$hive_site_llap" |awk 'NR==2' | awk -F"[<|>]" '{print $3}')
    hive_zookeeper_quorum=$(grep "hive.zookeeper.quorum" -A1 "$hive_site_llap" |awk 'NR==2' | awk -F"[<|>]" '{print $3}')
    hwc_jar=$(find /usr/hdp/current/hive_warehouse_connector/ -name "*assembly*.jar")
    hwc_pyfile=$(find /usr/hdp/current/hive_warehouse_connector/ -name "*hwc*.zip")
    hive_jdbc_url=$(grep "beeline.hs2.jdbc.url.llap" -A1 "$beeline_site_llap" |awk 'NR==2' | awk -F"[<|>]" '{print $3}')
    hive_jdbc_url_principal=$(grep "hive.server2.authentication.kerberos.principal" -A1 "$hive_site_llap" |awk 'NR==2' | awk -F"[<|>]" '{print $3}')

    echo -e "To apply this configuration cluster wide, copy and paste the following list of properties in Ambari UI -> Spark2 -> Configs -> Advanced -> Custom spark2-defaults (Bulk Property Add mode)\n"
    echo -e "spark.datasource.hive.warehouse.load.staging.dir=/tmp"
    echo -e "spark.datasource.hive.warehouse.metastoreUri="$hive_metastore_uris
    echo -e "spark.hadoop.hive.llap.daemon.service.hosts="$hive_llap_daemon_service_hosts
    echo -e "spark.jars="$hwc_jar
    echo -e "spark.submit.pyFiles="$hwc_pyfile
    echo -e "spark.security.credentials.hiveserver2.enabled=false"
    echo -e "spark.sql.hive.hiveserver2.jdbc.url="$hive_jdbc_url
    echo -e "spark.sql.hive.zookeeper.quorum="$hive_zookeeper_quorum
    #If Kerberized:
    [ -n "$hive_jdbc_url_principal" ] && echo -e "spark.sql.hive.hiveserver2.jdbc.url.principal="$hive_jdbc_url_principal

    echo -e "\n### Save and restart."
    echo -e "\nNote: In a kerberized environment the property spark.security.credentials.hiveserver2.enabled has to be set to TRUE for deploy-mode cluster, i.e.:\n spark-submit --conf spark.security.credentials.hiveserver2.enabled=true"

    echo -e "\nIf you'd like to test this per job instead of cluster wide, then use the following command as an example:\n

    spark-shell --master yarn --conf spark.datasource.hive.warehouse.load.staging.dir=/tmp --conf spark.datasource.hive.warehouse.metastoreUri=$hive_metastore_uris --conf spark.hadoop.hive.llap.daemon.service.hosts=$hive_llap_daemon_service_hosts --conf spark.jars=$hwc_jar --conf spark.submit.pyFiles=$hwc_pyfile --conf spark.security.credentials.hiveserver2.enabled=false --conf spark.sql.hive.hiveserver2.jdbc.url=\"$hive_jdbc_url\" --conf spark.sql.hive.zookeeper.quorum=\"$hive_zookeeper_quorum\" \n"

    echo -e "Once in the Scala REPL, run the following snippet example to test basic conectivity:\n"
    echo -e "scala> import com.hortonworks.hwc.HiveWarehouseSession"
    echo "scala> import com.hortonworks.hwc.HiveWarehouseSession._"
    echo "scala> val hive = HiveWarehouseSession.session(spark).build()"
    echo -e "scala> hive.showDatabases().show()\n"

else
     echo -e "$hive_site_llap and/or $beeline_site_llap doesn't exist on this host, or the current user $(whoami) doesn't have access to the files\n"
     echo "Try running this command as the root or hive user"
fi