@jamesrajendran
Created June 17, 2017 05:36
General tunable resources:
memory, disk I/O, network bandwidth, CPU.
Most Hadoop tasks are not CPU-bound.
Network bandwidth tuning potential is quite limited (around 2%).
1. Memory tuning
general rule - use as much memory as is available without triggering swapping
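As a rough illustration only (a sketch assuming a YARN-based Hadoop 2.x cluster; the property values below are placeholders, not recommendations from these notes), container and heap sizes can be set in the driver:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class MemoryTuningSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Give tasks as much container memory as the node can spare
        // without pushing it into swap (values are examples only).
        conf.setInt("mapreduce.map.memory.mb", 2048);
        conf.setInt("mapreduce.reduce.memory.mb", 4096);
        // JVM heap is conventionally set to roughly 80% of the container size.
        conf.set("mapreduce.map.java.opts", "-Xmx1638m");
        conf.set("mapreduce.reduce.java.opts", "-Xmx3276m");
        Job job = Job.getInstance(conf, "memory-tuned-job");
        // ... set mapper/reducer/input/output, then job.waitForCompletion(true);
    }
}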
2. Disk I/O:
the biggest bottleneck.
- compress mapper output - try to reduce mapper output size as much as possible
- filter out unnecessary data
- prune the key and value - make them as narrow as possible
- use ~70% of heap memory for the spill buffer in the mapper
properties: mapred.compress.map.output - true
mapred.map.output.compression.codec - com.hadoop.compression.lzo.LzoCodec
io.sort.mb - 800
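A minimal driver sketch for these settings (assuming the Hadoop 2.x API and the hadoop-lzo library on the classpath; the newer property names are used, with the older names listed above noted in comments):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class MapOutputCompressionSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Compress intermediate map output to cut spill and shuffle I/O
        // (old name: mapred.compress.map.output).
        conf.setBoolean("mapreduce.map.output.compress", true);
        // LZO codec from hadoop-lzo (old name: mapred.map.output.compression.codec).
        conf.set("mapreduce.map.output.compress.codec",
                 "com.hadoop.compression.lzo.LzoCodec");
        // Larger in-memory sort buffer so mappers spill to disk less often
        // (old name: io.sort.mb).
        conf.setInt("mapreduce.task.io.sort.mb", 800);
        Job job = Job.getInstance(conf, "compressed-map-output");
        // ... set mapper/reducer/input/output, then job.waitForCompletion(true);
    }
}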
3. Tuning mapper tasks:
the number of mappers is set implicitly by the block size / input split size.
Each mapper uses one JVM - the fewer the mappers, the fewer JVMs are created and destroyed - increase mapred.min.split.size (see the sketch after this item).
But if you have more idle mapper slots, a smaller split size is better.
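A sketch of raising the minimum split size through the new API (the 256 MB figure is only an example, not a value from these notes):

import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class SplitSizeSketch {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance();
        // Larger minimum split size -> fewer mappers -> fewer JVMs
        // created and destroyed.
        FileInputFormat.setMinInputSplitSize(job, 256L * 1024 * 1024);
        // Equivalent property: mapreduce.input.fileinputformat.split.minsize
        // (mapred.min.split.size in the old API, as referenced above).
    }
}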
4. Use CombineFileInputFormat for very small files
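For example (a sketch assuming text input and Hadoop 2.x's CombineTextInputFormat, a concrete subclass of CombineFileInputFormat; the 128 MB cap is an assumption):

import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;

public class SmallFilesSketch {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance();
        // Pack many small files into combined splits instead of
        // launching one mapper per tiny file.
        job.setInputFormatClass(CombineTextInputFormat.class);
        // Cap each combined split at 128 MB (example value).
        CombineTextInputFormat.setMaxInputSplitSize(job, 128L * 1024 * 1024);
    }
}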
5. Reducer balancing:
if one or two reducers take a lot longer than the others, try a better partitioner that distributes the workload more evenly.
mapreduce.job.reduces - property to modify the number of reducers as needed
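A minimal sketch of a custom partitioner (the class name and key/value types are assumptions for illustration); the goal is simply to spread keys so no single reducer dominates the runtime:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Spreads keys across reducers by hash; replace the hash with domain
// knowledge (e.g. splitting known hot keys) if the load is still skewed.
public class BalancedPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}

Wire it up in the driver with job.setPartitionerClass(BalancedPartitioner.class) and job.setNumReduceTasks(...), or set mapreduce.job.reduces directly.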