Created: June 17, 2017 05:36
General tunable resources:
memory, disk I/O, network bandwidth, CPU.
Most Hadoop tasks are not CPU-bound.
Network bandwidth tuning potential is quite limited (around 2%).
1. Memory tuning
general rule - use as much memory as is available without triggering swapping
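As a rough sketch, per-task heap in classic MapReduce is set through mapred.child.java.opts; the -Xmx value below is an illustrative assumption, not a recommendation:

```xml
<!-- mapred-site.xml: illustrative per-task heap setting (classic MR1 name) -->
<property>
  <name>mapred.child.java.opts</name>
  <!-- per-task JVM heap; size it so tasks use available RAM without swapping -->
  <value>-Xmx1024m</value>
</property>
```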
2. Disk IO:
the biggest bottleneck.
-compress mapper output - try to reduce mapper output size as much as possible
-filter out unnecessary data
-prune the key and value - make them as narrow as possible
-use up to 70% of heap memory for the spill buffer in the mapper
properties: mapred.compress.map.output - true
mapred.map.output.compression.codec - com.hadoop.compression.lzo.LzoCodec
io.sort.mb - 800
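The three properties above, written out in mapred-site.xml form (classic MR1 property names; the LZO codec additionally requires the hadoop-lzo library to be installed on the cluster):

```xml
<!-- mapred-site.xml: compress map output and enlarge the spill buffer -->
<property>
  <name>mapred.compress.map.output</name>
  <value>true</value>
</property>
<property>
  <name>mapred.map.output.compression.codec</name>
  <value>com.hadoop.compression.lzo.LzoCodec</value>
</property>
<property>
  <name>io.sort.mb</name>
  <value>800</value> <!-- MB of heap used for the map-side sort/spill buffer -->
</property>
```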
3. Tuning Mapper Tasks:
the number of mappers is set implicitly by the block size / input split size.
each mapper uses one JVM - the fewer the mappers, the fewer JVMs created and destroyed - increase mapred.min.split.size
But if you have idle map slots, a smaller split size is better.
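The effect of raising the minimum split size can be sketched with plain arithmetic (sizes below are illustrative assumptions, not from these notes):

```java
// Sketch: how split size controls the number of map tasks.
public class SplitMath {
    // Hadoop picks splitSize = max(minSplitSize, min(maxSplitSize, blockSize)).
    static long splitSize(long blockSize, long minSplit, long maxSplit) {
        return Math.max(minSplit, Math.min(maxSplit, blockSize));
    }

    // Mapper count is roughly ceil(inputSize / splitSize).
    static long mapperCount(long inputSize, long splitSize) {
        return (inputSize + splitSize - 1) / splitSize; // ceiling division
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long input = 10_000L * mb;   // 10 GB of input (illustrative)
        long block = 128L * mb;      // common default block size

        // Default min split: one mapper per 128 MB block -> 79 mappers.
        System.out.println(mapperCount(input, splitSize(block, 1, Long.MAX_VALUE)));

        // Raising mapred.min.split.size to 512 MB -> 20 mappers, far fewer JVMs.
        System.out.println(mapperCount(input, splitSize(block, 512 * mb, Long.MAX_VALUE)));
    }
}
```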
4. Use CombineFileInputFormat (e.g. CombineTextInputFormat) for very small files, so many files are packed into each split.
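A hedged driver fragment showing the idea (assumes Hadoop 2.x MapReduce client libraries on the classpath; not runnable standalone, and the 256 MB cap is an illustrative value):

```java
// Fragment of a job driver: pack many small files into fewer splits.
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;

Job job = Job.getInstance(conf, "small-files");
job.setInputFormatClass(CombineTextInputFormat.class);
// Cap each combined split at 256 MB so one mapper handles many small files.
CombineTextInputFormat.setMaxInputSplitSize(job, 256L * 1024 * 1024);
```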
5. Reducer balancing.
if one or two reducers take a lot longer than the others, try a better partitioner that distributes the workload evenly.
mapreduce.job.reduces - property to set the number of reducers as needed
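In Hadoop a custom partitioner extends org.apache.hadoop.mapreduce.Partitioner, but the core decision is just a key-to-reducer mapping. A stdlib-only sketch of the default hash-style logic (class and key names here are illustrative, not from these notes):

```java
// Stdlib-only sketch of hash-partitioning logic; the real Hadoop version
// extends org.apache.hadoop.mapreduce.Partitioner<K, V> and overrides
// getPartition(K key, V value, int numReduceTasks).
public class HashPartitionSketch {
    // Mirrors HashPartitioner: mask off the sign bit, then mod by reducer count,
    // so every key lands deterministically on one reducer in [0, numReduceTasks).
    static int getPartition(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        for (String key : new String[] {"alpha", "beta", "gamma", "delta"}) {
            System.out.println(key + " -> reducer " + getPartition(key, 4));
        }
    }
}
```

If one reducer is still overloaded, the fix is a partitioner whose mapping spreads the hot keys, since the same key must always go to the same reducer.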