Spark HWC integration - HDP 3 Secure cluster

Prerequisites:

  • Kerberized Cluster

  • Hive Interactive Server (LLAP) enabled in Hive

  • Get the following details from Hive for Spark, or try this HWC Quick Test Script:

spark.hadoop.hive.llap.daemon.service.hosts @llap0
spark.sql.hive.hiveserver2.jdbc.url  jdbc:hive2://c420-node2.squadron-labs.com:2181,c420-node3.squadron-labs.com:2181,c420-node4.squadron-labs.com:2181/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2-interactive
spark.datasource.hive.warehouse.metastoreUri thrift://c420-node3.squadron-labs.com:9083
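These values are visible in Ambari (Hive > Configs); the Hive-side properties can also be printed from an existing Beeline session. A minimal sketch (the property names are the standard Hive ones, values will differ per cluster):

set hive.llap.daemon.service.hosts;
set hive.metastore.uris;
set hive.zookeeper.quorum;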

Basic testing:

  1. Create a table employee in Hive and load some data.

Example: create the table

CREATE TABLE IF NOT EXISTS employee ( eid int, name String, salary String, destination String)
COMMENT 'Employee details'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
STORED AS TEXTFILE;

Load the data: create a data.txt file with the following rows, copy it to HDFS, and load it into the table.

1201,Vinod,45000,Technical manager
1202,Manisha,45000,Proof reader
1203,Masthanvali,40000,Technical writer
1204,Kiran,40000,Hr Admin
1205,Kranthi,30000,Op Admin
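
Copy data.txt to HDFS so the LOAD DATA statement below can pick it up; a minimal sketch (the /tmp path is only an example and must match the statement):

hdfs dfs -put data.txt /tmp/data.txt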
Then load it into the employee table:

LOAD DATA INPATH '/tmp/data.txt' OVERWRITE INTO TABLE employee;
  2. kinit as the spark user and run:
spark-shell --master yarn --conf "spark.security.credentials.hiveserver2.enabled=false" --conf "spark.sql.hive.hiveserver2.jdbc.url=jdbc:hive2://c420-node2.squadron-labs.com:2181,c420-node3.squadron-labs.com:2181,c420-node4.squadron-labs.com:2181/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2-interactive;principal=hive/_HOST@HWX.COM" --conf "spark.datasource.hive.warehouse.metastoreUri=thrift://c420-node3.squadron-labs.com:9083" --conf "spark.datasource.hive.warehouse.load.staging.dir=/tmp/" --conf "spark.hadoop.hive.llap.daemon.service.hosts=@llap0" --conf "spark.hadoop.hive.zookeeper.quorum=c420-node2.squadron-labs.com:2181,c420-node3.squadron-labs.com:2181,c420-node4.squadron-labs.com:2181" --jars /usr/hdp/current/hive_warehouse_connector/hive-warehouse-connector-assembly-1.0.0.3.0.1.0-187.jar

Note: spark.security.credentials.hiveserver2.enabled should be set to false for YARN client deploy mode and to true (the default) for YARN cluster deploy mode. This setting is required on a Kerberized cluster.
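
For YARN cluster deploy mode the interactive shell is not available, so the equivalent is a spark-submit with the flag set to true; a hedged sketch (the application jar and class are placeholders, the remaining options are the same as the spark-shell command above):

spark-submit --master yarn --deploy-mode cluster --conf "spark.security.credentials.hiveserver2.enabled=true" <same --conf and --jars options as above> --class com.example.HWCApp hwc-app.jar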

  3. Run the following code in the Scala shell to view the table data:
import com.hortonworks.hwc.HiveWarehouseSession
val hive = HiveWarehouseSession.session(spark).build()
hive.execute("show tables").show
hive.executeQuery("select * from employee").show
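
HWC can also write back to Hive from the same session; a minimal sketch of saving a DataFrame through the connector (the target table employee_copy is only an example, and requires write access in Hive):

val df = hive.executeQuery("select * from employee")
df.write.format(HiveWarehouseSession.HIVE_WAREHOUSE_CONNECTOR).option("table", "employee_copy").save()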
  4. To apply the common properties by default, add the following settings to the Ambari Spark2 custom conf (Custom spark2-defaults):
spark.sql.hive.hiveserver2.jdbc.url=jdbc:hive2://c420-node2.squadron-labs.com:2181,c420-node3.squadron-labs.com:2181,c420-node4.squadron-labs.com:2181/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2-interactive;principal=hive/_HOST@HWX.COM
spark.datasource.hive.warehouse.metastoreUri=thrift://c420-node3.squadron-labs.com:9083
spark.datasource.hive.warehouse.load.staging.dir=/tmp/
spark.hadoop.hive.llap.daemon.service.hosts=@llap0
spark.hadoop.hive.zookeeper.quorum=c420-node2.squadron-labs.com:2181,c420-node3.squadron-labs.com:2181,c420-node4.squadron-labs.com:2181
  5. Run spark-shell:
spark-shell --master yarn  --conf "spark.security.credentials.hiveserver2.enabled=false" --jars /usr/hdp/current/hive_warehouse_connector/hive-warehouse-connector-assembly-1.0.0.3.0.1.0-187.jar

Note: The common properties are now read from the Spark default properties.
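
To confirm the defaults were picked up, the values can be read back inside the shell; a quick sketch:

spark.sparkContext.getConf.get("spark.sql.hive.hiveserver2.jdbc.url")
spark.sparkContext.getConf.get("spark.hadoop.hive.llap.daemon.service.hosts")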

Pyspark example:

pyspark --master yarn --jars /usr/hdp/current/hive_warehouse_connector/hive-warehouse-connector-assembly-1.0.0.3.0.1.0-187.jar  --py-files  /usr/hdp/current/hive_warehouse_connector/pyspark_hwc-1.0.0.3.0.1.0-187.zip --conf spark.security.credentials.hiveserver2.enabled=false

Paste this code into the shell:

from pyspark_llap.sql.session import HiveWarehouseSession
hive = HiveWarehouseSession.session(spark).build()
  6. Run the following code in the pyspark shell to view the Hive table data:
hive.execute("show tables").show()
hive.executeQuery("select * from employee").show()
  7. To integrate HWC with Livy2:

a. Add the following property to Custom livy2-conf:

livy.file.local-dir-whitelist=/usr/hdp/current/hive_warehouse_connector/

b. Add hive-site.xml to /usr/hdp/current/spark2-client/conf on all cluster nodes.

c. Ensure hadoop.proxyuser.hive.hosts=* exists in core-site.xml; see the Custom core-site section in the HDFS configs.

d. Log in to Zeppelin and add the following to the livy2 interpreter settings:

livy.spark.hadoop.hive.llap.daemon.service.hosts	@llap0
livy.spark.security.credentials.hiveserver2.enabled	true
livy.spark.sql.hive.hiveserver2.jdbc.url	jdbc:hive2://c420-node2.squadron-labs.com:2181,c420-node3.squadron-labs.com:2181,c420-node4.squadron-labs.com:2181/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2-interactive
livy.spark.sql.hive.hiveserver2.jdbc.url.principal	hive/_HOST@HWX.COM
livy.spark.yarn.security.credentials.hiveserver2.enabled	true
livy.spark.jars file:///usr/hdp/current/hive_warehouse_connector/hive-warehouse-connector-assembly-1.0.0.3.0.1.0-187.jar

Note: Ensure the hive-warehouse-connector-assembly version matches the HWC version installed on your cluster.

e. Restart the livy2 interpreter.

f. In the first paragraph, add:

%livy2
import com.hortonworks.hwc.HiveWarehouseSession
val hive = HiveWarehouseSession.session(spark).build()

g. In the second paragraph, add:

%livy2
hive.executeQuery("select * from employee").show

Note: There is an Ambari defect, AMBARI-22801, which resets the proxy configs on keytab regeneration or service addition. Repeat step 7.c in such scenarios.
