
The difference between map and flatMap is a little confusing for beginners - this example might help.
It can be tested in the Spark shell or the Scala REPL:
scala> val l = List(1,2,3,4,5)
scala> l.map(x => List(x-1, x, x+1))
res1: List[List[Int]] = List(List(0, 1, 2), List(1, 2, 3), List(2, 3, 4), List(3, 4, 5), List(4, 5, 6))
scala> l.flatMap(x => List(x-1, x, x+1))
res2: List[Int] = List(0, 1, 2, 1, 2, 3, 2, 3, 4, 3, 4, 5, 4, 5, 6)
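To see how the two relate (a small sketch continuing the same session): flatMap is simply map followed by flatten:
scala> l.map(x => List(x-1, x, x+1)).flatten
res3: List[Int] = List(0, 1, 2, 1, 2, 3, 2, 3, 4, 3, 4, 5, 4, 5, 6)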
idris75 / RDD-To-DF-To-DS
Created March 2, 2020 13:37 — forked from jamesrajendran/RDD-To-DF-To-DS
Rdd --> DF --> Table --> SQL --> DS
val lrdd = sc.parallelize(List(1,2,3,4,5,3,5))
// without a case class
val namedDF = sqlContext.createDataFrame(lrdd.map(Tuple1.apply)).toDF("Id")
// with a case class
case class Dummy(Id: Int)
val namedDF = lrdd.map(x => Dummy(x)).toDF()
// one-liner DF
val ldf = List(1,2,3,4,5,3,5).toDS().toDF()
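The remaining steps in the title (Table --> SQL --> DS) are not shown above; a minimal sketch, assuming Spark 2.x in spark-shell (SparkSession named spark, implicits imported) and the Dummy case class defined above - the view name is just a placeholder:
namedDF.createOrReplaceTempView("dummy_table")
val sqlDF = spark.sql("select Id from dummy_table where Id > 2")
val ds = sqlDF.as[Dummy]   // Dataset[Dummy]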
idris75 / reduceVsreduceByKey
Created March 2, 2020 13:37 — forked from jamesrajendran/reduceVsreduceByKey
Demystifying reduce and reduceByKey
val lrdd = sc.parallelize(List(1,2,3,4,5,3,5))
lrdd.reduce((a,b) => a+b)
// shorthand for the syntax above
lrdd.reduce(_ + _)
// changing reduce to reduceByKey will fail here,
// as reduceByKey applies only to key-value (pair) RDDs
lrdd.reduceByKey(_ + _)
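On a pair RDD it does work; a small sketch (the key-value list here is only for illustration):
val pairRdd = sc.parallelize(List(("a", 1), ("b", 2), ("a", 3)))
pairRdd.reduceByKey(_ + _).collect()   // e.g. Array((a,4), (b,2))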
idris75 / SparkStreaming
Created March 2, 2020 13:37 — forked from jamesrajendran/SparkStreaming
Spark Streaming - quick tips
# Analysing data from a Kafka topic:
# The script below can be saved as a Scala script (name.scala) and run from spark-shell.
# It can also be built as a Scala project: remove these comments, add the dependencies below to build.sbt and compile.
import kafka.serializer.StringDecoder
import org.apache.spark.streaming._
import org.apache.spark.streaming.kafka._
import org.apache.spark.SparkConf
import org.apache.spark.streaming.dstream.InputDStream
import org.apache.spark.sql._
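Given those imports, a minimal sketch of the (old) Kafka 0.8 direct-stream API, run from spark-shell - broker address and topic name are placeholders:
val ssc = new StreamingContext(sc, Seconds(5))   // reuse the shell's SparkContext
val kafkaParams = Map("metadata.broker.list" -> "localhost:9092")
val topics = Set("test")
val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topics)
stream.map(_._2).print()   // print the message values of each micro-batch
ssc.start()
ssc.awaitTermination()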
idris75 / Hive performance Tuning
Created March 2, 2020 13:36 — forked from jamesrajendran/Hive performance Tuning
hive tuning hints mapjoin bucketmapjoin - partition-bucket design
1. MapJoin:
   Small tables can be loaded in memory and joined with bigger tables.
   1. use the hint /*+ MAPJOIN(table_name) */
   2. 'better' option - let Hive do it automatically by setting these properties:
      hive.auto.convert.join = true
      hive.mapjoin.smalltable.filesize = <>   (default is 25MB)
   (see the sketch after this list)
2. Partition design
   Choose a low-cardinality column - e.g. region, year
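As an illustration only (not part of the original notes): the same settings and hint can be exercised from a Spark shell with Hive support enabled; the table names are placeholders:
spark.sql("SET hive.auto.convert.join=true")
spark.sql("SET hive.mapjoin.smalltable.filesize=25000000")   // ~25MB, the default
spark.sql("SELECT /*+ MAPJOIN(s) */ b.id, s.name FROM big_table b JOIN small_table s ON b.id = s.id")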
idris75 / kafka gist
Created March 2, 2020 13:35 — forked from jamesrajendran/kafka gist
kafka notes concepts points to remember
-------------kafka notes-----------
Why Kafka?
  - better throughput
  - replication
  - built-in partitioning
  - fault tolerance
Topic names are unique.
Location of a message -> topic -> partition -> offset
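A small sketch of that addressing with the Kafka consumer API (assuming a kafka-clients 2.x dependency and a broker on localhost:9092; group and topic names are placeholders) - every record carries its topic, partition and offset:
import java.util.Properties
import scala.collection.JavaConverters._
import org.apache.kafka.clients.consumer.KafkaConsumer

val props = new Properties()
props.put("bootstrap.servers", "localhost:9092")
props.put("group.id", "demo-group")
props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")

val consumer = new KafkaConsumer[String, String](props)
consumer.subscribe(java.util.Arrays.asList("test"))
val records = consumer.poll(java.time.Duration.ofSeconds(1))
for (r <- records.asScala)
  println(s"${r.topic()} / ${r.partition()} / ${r.offset()} -> ${r.value()}")
consumer.close()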
idris75 / Kafka performance tuning
Created March 2, 2020 13:35 — forked from jamesrajendran/Kafka performance tuning
kafka producer - consumer - broker tuning
1. Producer
   1. request.required.acks = [0, 1, all/-1]: 0 - no acknowledgement but very fast; 1 - acknowledged once the leader commits; all/-1 - acknowledged once all replicas commit.
   2. Use the async producer (property producer.type=async) and a callback for the acknowledgement - see the sketch after this list.
   3. Batch data - send multiple messages together:
      batch.num.messages
      queue.buffering.max.ms
   4. Compression for large messages - gzip and snappy are supported.
      Very large files can be stored in a shared location and just the file path sent by the Kafka producer.
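A sketch of the same knobs with the newer KafkaProducer API (property names differ from the legacy producer discussed above; broker and topic are placeholders):
import java.util.Properties
import org.apache.kafka.clients.producer.{Callback, KafkaProducer, ProducerRecord, RecordMetadata}

val props = new Properties()
props.put("bootstrap.servers", "localhost:9092")
props.put("acks", "all")                   // 0 / 1 / all, as in point 1
props.put("batch.size", "16384")           // batch messages together (point 3)
props.put("linger.ms", "5")                // wait a few ms to fill a batch
props.put("compression.type", "snappy")    // point 4
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

val producer = new KafkaProducer[String, String](props)
producer.send(new ProducerRecord[String, String]("test", "key", "value"), new Callback {
  // async send: the callback fires when the broker acknowledges (point 2)
  override def onCompletion(md: RecordMetadata, e: Exception): Unit =
    if (e != null) e.printStackTrace() else println(s"acked at offset ${md.offset()}")
})
producer.close()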
idris75 / LinuxCommands
Created March 2, 2020 13:35 — forked from jamesrajendran/LinuxCommands
Useful linux commands
env # to get all env variables
# Find which classes live in which jar under /usr/hdp/current
all_hdp_classes () {
  find -L /usr/hdp/current -maxdepth 20 -name "*.jar" -print | while read line; do
    for i in `jar tf "$line" | grep '\.class'`
    do
      echo "$line : $i"
    done
  done
}
# usage: all_hdp_classes | grep <ClassName>   # find which jar contains a class
idris75 / Spark Tuning
Created March 2, 2020 13:34 — forked from jamesrajendran/Spark Tuning
Spark performance Tuning
1. mapPartitions() instead of map() - when an expensive initialization (e.g. a DB connection) is needed once per partition rather than once per record. See the sketch after this list.
2. RDD parallelism: for RDDs with no parent (e.g. sc.parallelize(..., 4)), unless specified, Spark will try to use as many CPU cores as YARN makes available.
   This can be tuned with the spark.default.parallelism property.
   - to find the default parallelism use sc.defaultParallelism
   - to check a particular RDD:
     val rdd = sc.parallelize(<value>, 4)
     rdd.getNumPartitions   // returns 4
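A minimal sketch of point 1 (ConnectionPool and its methods are hypothetical placeholders, not a real library):
val rdd = sc.parallelize(1 to 10, 4)
val doubled = rdd.mapPartitions { iter =>
  val conn = ConnectionPool.getConnection()   // hypothetical helper: one connection per partition, not per record
  val out = iter.map(x => x * 2).toList       // placeholder work that would normally use conn
  conn.close()
  out.iterator
}
doubled.collect()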
# to work as root
su -
# ifconfig synonyms
ip address show    # or: ip a s / ip a s eth0
# file name with a date suffix
cp a.txt a_$(date +%F).txt