jamesrajendran/Hive performance Tuning

## Hive performance Tuning

	1.MapJoin:
	small tables can be loaded in memory and joined with bigger tables.
		1. use hint /*+ MAPJOIN(table_name)  */
		2. 'better' option - let hive do automatically by setting these properties:
		hive.auto.convert.join	-	true
		hive.mapjoin.smalltable.filesize = <>	default is 25MB

	2.Partition Design
		Low cardinality column -eg, regiou, year

	3.Bucket Design:
		For high cardinality columns like itineraryIds, and joining tables with the similar Ids.
			both tables be bucketed on the same column.
			The number of buckets in both tables be strict multiple of each other.
			1.set hive.enforce.bucketing=true
			2.set hive.optimize.bucketmapjoin=true
			3.use hint - /*+ MAPJOIN(a,b) */
	4.BucketSampling:
			Due to hashing, simiar ids will be grouped in the similar bucket ids.
			use tablesample clause and pass same bucket number for both tables - only those two buckets will be sampled and they will have related rows as well.
			select ... from <table1> TABLESAMPLE(bucket 30 out of 64 on <bucketed column>)
							, <table2> TABLESAMPLE(bucket 30 out of 64 on <bucketed column>)
						where ...
	5.BlockSampling:
		For single table analysis - no buckets
		can sample with % of bytes, number of bytes, number of rows
		eg: TABLESAMPLE(1 PERCENT), TABLESAMPLE(10 ROWS), TABLESAMPLE(100M)


	6. 	Hive parallel Executeion:
			Hive genrally breaks down the work into multiple MRs and run them sequentially - set hive.exec.parallel=false
			But some of these MRs can run independently, they will run in parallel - set hive.exec.parallel = true

	1.MapJoin:
	small tables can be loaded in memory and joined with bigger tables.
	1. use hint /+ MAPJOIN(table_name) /
	2. 'better' option - let hive do automatically by setting these properties:
	hive.auto.convert.join - true
	hive.mapjoin.smalltable.filesize = <> default is 25MB

	2.Partition Design
	Low cardinality column -eg, regiou, year

	3.Bucket Design:
	For high cardinality columns like itineraryIds, and joining tables with the similar Ids.
	both tables be bucketed on the same column.
	The number of buckets in both tables be strict multiple of each other.
	1.set hive.enforce.bucketing=true
	2.set hive.optimize.bucketmapjoin=true
	3.use hint - /+ MAPJOIN(a,b) /
	4.BucketSampling:
	Due to hashing, simiar ids will be grouped in the similar bucket ids.
	use tablesample clause and pass same bucket number for both tables - only those two buckets will be sampled and they will have related rows as well.
	select ... from <table1> TABLESAMPLE(bucket 30 out of 64 on <bucketed column>)
	, <table2> TABLESAMPLE(bucket 30 out of 64 on <bucketed column>)
	where ...
	5.BlockSampling:
	For single table analysis - no buckets
	can sample with % of bytes, number of bytes, number of rows
	eg: TABLESAMPLE(1 PERCENT), TABLESAMPLE(10 ROWS), TABLESAMPLE(100M)


	6. Hive parallel Executeion:
	Hive genrally breaks down the work into multiple MRs and run them sequentially - set hive.exec.parallel=false
	But some of these MRs can run independently, they will run in parallel - set hive.exec.parallel = true