GitHub gists from Vinod KC (vinodkc), Databricks
import java.io.{File, FileFilter}
import scala.collection.mutable.HashMap

// Collect the config files found under SPARK_CONF_DIR.
val hadoopConfFiles = new HashMap[String, File]()
sys.env.get("SPARK_CONF_DIR").foreach { localConfDir =>
  println("localConfDir : " + localConfDir)
  val dir = new File(localConfDir)
  if (dir.isDirectory) {
    val files = dir.listFiles(new FileFilter {
      // The gist is truncated here; accepting plain files is an assumption.
      override def accept(pathname: File): Boolean = pathname.isFile
    })
    files.foreach(file => hadoopConfFiles(file.getName) = file)
  }
}

import requests
import html
import json

# Define the Texgen API endpoint
HOST = 'cmlllm-textgenuiurl'
URI = f'https://{HOST}/api/v1/chat'
Please try the following steps to test HWC read and write from Oozie.

Step 1:
In Hive, log in as the hive user and create a test database and table:
-----------------
create database db_hwc_test;
use db_hwc_test;
CREATE TABLE demo_input_table (
  id int,
  name varchar(10)
);
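
Before wiring HWC into an Oozie action, it can help to sanity-check HWC read and write from spark-shell first. A minimal sketch, assuming HWC is already configured for the session (assembly jar on the classpath and spark.sql.hive.hiveserver2.jdbc.url set), using the table from Step 1:

import com.hortonworks.hwc.HiveWarehouseSession
val hive = HiveWarehouseSession.session(spark).build()
// Write a test row through HWC, then read it back.
hive.executeUpdate("insert into db_hwc_test.demo_input_table values (1, 'row1')")
hive.executeQuery("select * from db_hwc_test.demo_input_table").show()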

CDP Livy ThriftServer Example

You can connect to the Apache Livy Thrift Server using the Beeline client that is included with Apache Hive.

The Livy Thrift Server is disabled by default.

a) To enable the Livy Thrift Server (livy.server.thrift.enabled), check the box labeled Enable Livy Thrift Server in Cloudera Manager.

b) To use the Hive catalog, enable the HMS Service in the Livy service configuration in Cloudera Manager.
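
With both settings in place, you can connect with Beeline. A sketch, assuming a placeholder hostname and the default Thrift port (10090 is assumed here; verify livy.server.thrift.port in Cloudera Manager):

beeline -u "jdbc:hive2://<livy-thrift-host>:10090/default"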

HWC-Oozie integration

The hive-warehouse-connector jar released as part of HDP 3.1.5 embeds many third-party jars that conflict with Oozie's classpath. To solve this issue, you have to get the HWC dev jar or a hotfix jar that does not contain the conflicting classes. Internal JIRAs tracking this issue: BUG-122013, BUG-122269.

e.g.:

199679223 2021-01-17 08:48  hive-warehouse-connector-assembly-1.0.0.3.1.5.0-152.jar      // actual jar
 56340621 2021-01-17 08:36  hive-warehouse-connector-assembly-1.0.0.3.1.5.0-152_dev.jar  // dev jar
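
A quick way to check which assembly you have is to compare the number of entries in the two jars; the dev/hotfix jar should contain far fewer classes because the conflicting third-party packages are stripped out:

jar tf hive-warehouse-connector-assembly-1.0.0.3.1.5.0-152.jar | wc -l
jar tf hive-warehouse-connector-assembly-1.0.0.3.1.5.0-152_dev.jar | wc -l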

Step 1: Log in to the LLAP host node.

Step 2: Download and run the HWC info collection script:

cd /tmp
wget https://raw.githubusercontent.com/dbompart/hive_warehouse_connector/master/hwc_info_collect.sh
chmod +x  hwc_info_collect.sh
./hwc_info_collect.sh
mkdir -p ~/mytools/yarn && cd ~/mytools/yarn

wget https://raw.githubusercontent.com/vinodkc/myscripts/main/yarn-extract-logs.py

python yarn-extract-logs.py <fill path to yarn application log> <name of new output directory>

Spark Event Log Job Trimmer

There are many instances where the Spark event log grows very large, especially in the case of streaming jobs, and it is difficult to transfer such a big file to another, smaller cluster for offline analysis. The following shell script helps you reduce the Spark event log size by excluding old jobs from the event log file, so that you can still analyze issues with recent jobs.

After running this shell script in a Linux/Mac terminal, the trimmed output is saved in the input folder with the suffix _trimmed; use that file for further analysis.

Usage instructions:

  1. Copy & paste the code snippet below into a file named trimsparkeventlog.sh
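
A minimal sketch of such a trimmer (not the original script), assuming the event log is plain newline-delimited JSON and that keeping the leading application metadata plus everything from the Nth-most-recent SparkListenerJobStart onward is acceptable:

#!/bin/bash
# Sketch: trim a Spark event log to its metadata plus the last N jobs.
# Usage: ./trimsparkeventlog.sh <event-log-file> [jobs-to-keep]
INPUT="$1"
KEEP="${2:-10}"

# Line numbers of every job-start event in the log.
JOB_LINES=$(grep -n '"Event":"SparkListenerJobStart"' "$INPUT" | cut -d: -f1)
FIRST=$(echo "$JOB_LINES" | head -n 1)
START=$(echo "$JOB_LINES" | tail -n "$KEEP" | head -n 1)
[ -z "$FIRST" ] && { echo "No jobs found in $INPUT"; exit 1; }

# Keep everything before the first job (application metadata),
# then everything from the Nth-most-recent job onward.
{
  head -n $((FIRST - 1)) "$INPUT"
  tail -n +"$START" "$INPUT"
} > "${INPUT}_trimmed"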

How to use a Hive builtin UDF in Spark SQL

./spark-shell --jars /usr/hdp/current/hive-server2/lib/hive-exec.jar

// In the spark-shell session:
import org.apache.spark.sql.functions.col
val data = (1 to 10).toDF("col1").withColumn("col2", col("col1"))
data.createOrReplaceTempView("table1")
spark.sql("CREATE TEMPORARY FUNCTION genericUDFAbsFromHive AS 'org.apache.hadoop.hive.ql.udf.generic.GenericUDFAbs'")
spark.sql("select genericUDFAbsFromHive(col1 - 2000) as absCol1, col2 from table1").show(false)