@jamesrajendran
Created June 17, 2017 05:36
General tunable resources:
memory, disk I/O, network bandwidth, CPU.
Most Hadoop tasks are not CPU-bound.
Network bandwidth tuning potential is quite limited (around 2%).
1. Memory tuning
general rule - use as much memory as is available without triggering swapping
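As a rough illustration only (a sketch assuming a YARN-based Hadoop 2.x cluster; the property values below are placeholders, not recommendations from these notes), container and heap sizes can be set in the driver:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class MemoryTuningSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Give tasks as much container memory as the node can spare
        // without pushing it into swap (values are examples only).
        conf.setInt("mapreduce.map.memory.mb", 2048);
        conf.setInt("mapreduce.reduce.memory.mb", 4096);
        // JVM heap is conventionally set to roughly 80% of the container size.
        conf.set("mapreduce.map.java.opts", "-Xmx1638m");
        conf.set("mapreduce.reduce.java.opts", "-Xmx3276m");
        Job job = Job.getInstance(conf, "memory-tuned-job");
        // ... set mapper/reducer/input/output, then job.waitForCompletion(true);
    }
}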
2. Disk I/O:
the biggest bottleneck.
- compress mapper output - try to reduce mapper output size as much as possible
- filter out unnecessary data
- prune the key and value - make them as narrow as possible
- use ~70% of heap memory for the spill buffer in the mapper
properties: mapred.compress.map.output - true
mapred.map.output.compression.codec - com.hadoop.compression.lzo.LzoCodec
io.sort.mb - 800
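A minimal driver sketch for these settings (assuming the Hadoop 2.x API and the hadoop-lzo library on the classpath; the newer property names are used, with the older names listed above noted in comments):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class MapOutputCompressionSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Compress intermediate map output to cut spill and shuffle I/O
        // (old name: mapred.compress.map.output).
        conf.setBoolean("mapreduce.map.output.compress", true);
        // LZO codec from hadoop-lzo (old name: mapred.map.output.compression.codec).
        conf.set("mapreduce.map.output.compress.codec",
                 "com.hadoop.compression.lzo.LzoCodec");
        // Larger in-memory sort buffer so mappers spill to disk less often
        // (old name: io.sort.mb).
        conf.setInt("mapreduce.task.io.sort.mb", 800);
        Job job = Job.getInstance(conf, "compressed-map-output");
        // ... set mapper/reducer/input/output, then job.waitForCompletion(true);
    }
}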
3. Tuning mapper tasks:
the number of mappers is set implicitly by the block size / input split size.
Each mapper uses one JVM - the fewer the mappers, the fewer JVMs are created and destroyed - increase mapred.min.split.size (see the sketch after this item).
But if you have more idle mapper slots, a smaller split size is better.
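A sketch of raising the minimum split size through the new API (the 256 MB figure is only an example, not a value from these notes):

import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class SplitSizeSketch {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance();
        // Larger minimum split size -> fewer mappers -> fewer JVMs
        // created and destroyed.
        FileInputFormat.setMinInputSplitSize(job, 256L * 1024 * 1024);
        // Equivalent property: mapreduce.input.fileinputformat.split.minsize
        // (mapred.min.split.size in the old API, as referenced above).
    }
}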
4. Use CombineFileInputFormat for very small files
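For example (a sketch assuming text input and Hadoop 2.x's CombineTextInputFormat, a concrete subclass of CombineFileInputFormat; the 128 MB cap is an assumption):

import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;

public class SmallFilesSketch {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance();
        // Pack many small files into combined splits instead of
        // launching one mapper per tiny file.
        job.setInputFormatClass(CombineTextInputFormat.class);
        // Cap each combined split at 128 MB (example value).
        CombineTextInputFormat.setMaxInputSplitSize(job, 128L * 1024 * 1024);
    }
}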
5. Reducer balancing:
if one or two reducers take a lot longer than the others, try a better partitioner that distributes the workload more evenly.
mapreduce.job.reduces - property to modify the number of reducers as needed
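A minimal sketch of a custom partitioner (the class name and key/value types are assumptions for illustration); the goal is simply to spread keys so no single reducer dominates the runtime:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Spreads keys across reducers by hash; replace the hash with domain
// knowledge (e.g. splitting known hot keys) if the load is still skewed.
public class BalancedPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}

Wire it up in the driver with job.setPartitionerClass(BalancedPartitioner.class) and job.setNumReduceTasks(...), or set mapreduce.job.reduces directly.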