Vinod KC vinodkc

Secure Kafka.md (last active August 16, 2024 21:42)

Secure Kafka

Steps to Enable Secure SSL Kafka brokers

1) Kerberize Cluster

1.1) Log in to a Kafka client node and export KAFKA_CLIENT_KERBEROS_PARAMS:

export KAFKA_CLIENT_KERBEROS_PARAMS="-Djava.security.auth.login.config=/usr/hdp/current/kafka-broker/config/kafka_client_jaas.conf" 

1.2) Let's test the Kafka producer and consumer without a keytab

Use this JAAS conf file:
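As an illustration, a minimal kafka_client_jaas.conf that authenticates from the ticket cache rather than a keytab could look like the following (a sketch using the standard Krb5LoginModule options; the exact settings depend on your environment):

```
KafkaClient {
   com.sun.security.auth.module.Krb5LoginModule required
   useTicketCache=true
   renewTicket=true
   serviceName="kafka";
};
```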

import java.io.{File, FileFilter}
import scala.collection.mutable.HashMap

val hadoopConfFiles = new HashMap[String, File]()
sys.env.get("SPARK_CONF_DIR").foreach { localConfDir =>
  println("localConfDir : " + localConfDir)
  val dir = new File(localConfDir)
  if (dir.isDirectory) {
    // The preview is truncated here; completing the filter is an assumption:
    // collect every regular file under SPARK_CONF_DIR.
    val files = dir.listFiles(new FileFilter {
      override def accept(pathname: File): Boolean = pathname.isFile
    })
    files.foreach(f => hadoopConfFiles.put(f.getName, f))
  }
}

Spark Event Log Job Trimmer

Spark event logs can grow very large, especially for streaming jobs, and transferring such a big file to a small cluster for offline analysis is difficult. The following shell script reduces the event log size by excluding old jobs from the log file, so that you can still analyze issues in recent jobs.

After running this shell script in a Linux/Mac terminal, the trimmed output is saved in the input folder with the suffix _trimmed; use that file for further analysis.
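The script itself is cut off in this preview. As an illustration only, the same trimming idea can be sketched in Python, under the assumptions that the event log is JSON-lines and that filtering job-level events is enough; `trim_event_log` is a hypothetical helper, not the gist's script:

```python
import json

def trim_event_log(lines, keep_last_n_jobs=100):
    """Keep only the most recent N jobs' JobStart/JobEnd events.

    Simplified sketch: a real trimmer would also drop the stage and
    task events belonging to old jobs; here only job-level events
    are filtered to keep the example short.
    """
    events = [json.loads(line) for line in lines]
    job_ids = [e["Job ID"] for e in events
               if e.get("Event") == "SparkListenerJobStart"]
    cutoff = max(job_ids) - keep_last_n_jobs + 1 if job_ids else 0
    kept = []
    for line, ev in zip(lines, events):
        if ev.get("Event") in ("SparkListenerJobStart", "SparkListenerJobEnd") \
                and ev.get("Job ID", 0) < cutoff:
            continue  # old job: exclude from the trimmed log
        kept.append(line)
    return kept
```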

Usage instructions:

  1. Copy and paste the code snippet below into a file named trimsparkeventlog.sh
import requests
import html
import json

# Define the Textgen API endpoint
HOST = 'cmlllm-textgenuiurl'
URI = f'https://{HOST}/api/v1/chat'
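The preview cuts off here. As a hedged sketch, a chat request to a text-generation-webui-style endpoint could be assembled as below; `build_chat_request` and the payload field names are assumptions, not verified against this deployment:

```python
# Hedged sketch: the payload shape follows the text-generation-webui chat
# API; field names are assumptions and build_chat_request is a
# hypothetical helper, not part of the original gist.
def build_chat_request(user_input, history=None):
    return {
        "user_input": user_input,
        "history": history or {"internal": [], "visible": []},
        "mode": "chat",
    }

# The request itself would then be sent with requests, e.g.:
# response = requests.post(URI, json=build_chat_request("Hello"))
```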

Spark Listener Demo

This demonstrates Spark Job, Stage and Task listeners

1) Start spark-shell

Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\

CDP Livy ThriftServer Example

You can connect to the Apache Livy Thrift Server using the Beeline client that is included with Apache Hive.

The Livy Thrift Server is disabled by default.

a) To enable the Livy Thrift Server (livy.server.thrift.enabled), check the box labeled Enable Livy Thrift Server in Cloudera Manager

b) To use the Hive catalog, enable the HMS service in the Livy configuration in Cloudera Manager
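For reference, the equivalent livy.conf entries look like the following sketch (Cloudera Manager normally manages these for you; the port shown is the assumed default, so verify it in your deployment):

```
livy.server.thrift.enabled = true
livy.server.thrift.port = 10090
```

Once enabled, you can connect with Beeline, e.g. beeline -u "jdbc:hive2://<livy-host>:10090/default", where the host and database are placeholders for your environment.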

cd /usr/hdp/current/kafka-broker/bin/
[kafka@c220-node2 bin]$ ./kafka-topics.sh --create --zookeeper c220-node2.squadron-labs.com:2181 --replication-factor 2 --partitions 3 --topic source1
Created topic "source1".
[kafka@c220-node2 bin]$ ./kafka-topics.sh --create --zookeeper c220-node2.squadron-labs.com:2181 --replication-factor 2 --partitions 3 --topic dest1
Created topic "dest1".
[kafka@c220-node2 bin]$ ./kafka-console-producer.sh --broker-list c220-node2.squadron-labs.com:6667 --topic source1
[kafka@c220-node4 bin]$ ./kafka-console-consumer.sh --bootstrap-server c220-node4.squadron-labs.com:6667 --topic dest1
Please try the following steps to test HWC read and write from Oozie.
Step 1:
In Hive, log in as the hive user
-----------------
create database db_hwc_test;
use db_hwc_test;
CREATE TABLE demo_input_table (
  id int,
  name varchar(10)
);

Spark HWC integration - HDP 3 Secure cluster

Prerequisites :

  • Kerberized Cluster

  • Enable Hive Interactive Server in Hive

  • Get the following details from Hive for Spark, or try this HWC Quick Test Script
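As a sketch, the Spark configuration for HWC on a secure HDP 3 cluster typically includes properties like the ones below; the host names, ZooKeeper quorum, jar version, and Kerberos principal are placeholders, not values from this document:

```
spark.sql.hive.hiveserver2.jdbc.url=jdbc:hive2://<hs2-interactive-host>:10500/
spark.sql.hive.hiveserver2.jdbc.url.principal=hive/_HOST@EXAMPLE.COM
spark.datasource.hive.warehouse.metastoreUri=thrift://<metastore-host>:9083
spark.datasource.hive.warehouse.load.staging.dir=/tmp
spark.hadoop.hive.llap.daemon.service.hosts=@llap0
spark.hadoop.hive.zookeeper.quorum=<zk-host>:2181
spark.jars=/usr/hdp/current/hive_warehouse_connector/hive-warehouse-connector-assembly-<version>.jar
```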

mkdir -p ~/mytools/yarn && cd  ~/mytools/yarn

wget https://raw.githubusercontent.com/vinodkc/myscripts/main/yarn-extract-logs.py

python yarn-extract-logs.py <path to YARN application log> <name of new output directory>
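The core idea of extracting per-container logs can be sketched as below; this is an illustration, not the gist's script, and it assumes the "Container: <id> on <host>" header lines that the yarn logs CLI prints between containers:

```python
import re

def split_yarn_log(text):
    """Split an aggregated YARN application log into per-container sections.

    Sketch only: assumes each container's output is preceded by a header
    line of the form 'Container: <container-id> on <host>'.
    """
    parts = re.split(r"(?m)^(Container: \S+ on \S+)$", text)
    sections = {}
    # parts looks like [preamble, header1, body1, header2, body2, ...]
    for header, body in zip(parts[1::2], parts[2::2]):
        container_id = header.split()[1]
        sections[container_id] = body.strip()
    return sections
```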