In a classic Hadoop job, you've got mappers and reducers. The "things" being mapped and reduced are key-value pairs, for some arbitrary pair of types. Most of your parallelism comes from the mappers, since they can (ideally) split the data and transform it without any coordination with other processes.
By contrast, the amount of parallelism in the reduction phase has an important limitation: although you may have many reducers, all of the values for any particular key are guaranteed to arrive at a single reducer.
So if there are a HUGE number of values for some particular key, you're going to have a bottleneck because they're all going to be processed by a single reducer.
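To make that bottleneck concrete, here's a minimal sketch of the map and shuffle steps in plain Python. The function names and the CRC-based partitioner are my own illustration, not Hadoop's actual code, but the routing rule is the same idea: a key's hash decides its reducer, so every value for that key goes to the same place.

```python
import zlib
from collections import defaultdict

def map_phase(records, mapper):
    """Run the mapper over every input record, yielding (key, value) pairs."""
    for record in records:
        yield from mapper(record)

def shuffle(pairs, num_reducers):
    """Route each pair to a reducer by hashing its key, so all the values
    for any given key land on the same reducer."""
    buckets = defaultdict(list)
    for key, value in pairs:
        buckets[zlib.crc32(key.encode()) % num_reducers].append((key, value))
    return buckets

# Hypothetical word-count mapper: emit one (word, 1) pair per word.
def word_mapper(line):
    for word in line.split():
        yield (word, 1)

lines = ["the cat", "the dog", "the bird"]
buckets = shuffle(map_phase(lines, word_mapper), num_reducers=2)
# All three ("the", 1) pairs land in the same bucket. If one key
# dominated the input, one reducer would end up doing most of the work.
```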
However, there is another way! Certain types of data fit into a pattern:
- they can be combined with other values of the same type to form new values.
- the combining operation is associative. For example, integer addition: ((1 + 2) + 3) == (1 + (2 + 3)).
- they have an identity value. (For integer addition, that's 0: adding 0 to any value leaves it unchanged.)
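These properties are exactly what let the reduction itself be parallelized: because the operation is associative, each chunk of values can be folded independently (the way a Hadoop combiner pre-aggregates on each mapper), and the partial results merged in one cheap final step. A minimal sketch in Python, with integer addition as the combining operation (the `IntSumMonoid` and `parallel_reduce` names are illustrative, not part of any Hadoop API):

```python
from functools import reduce

class IntSumMonoid:
    """The pattern above for integers: an associative combine plus its identity."""
    identity = 0

    @staticmethod
    def combine(a, b):
        return a + b

def parallel_reduce(chunks, monoid):
    """Fold each chunk independently (as a combiner would on each mapper),
    then merge the partial results. Associativity guarantees the answer
    matches a single sequential fold over all the values."""
    partials = [reduce(monoid.combine, chunk, monoid.identity) for chunk in chunks]
    return reduce(monoid.combine, partials, monoid.identity)

values = [1, 2, 3, 4, 5, 6]
chunks = [values[0:2], values[2:4], values[4:6]]
assert parallel_reduce(chunks, IntSumMonoid) == sum(values)  # 21
```

The identity value handles the edge cases for free: an empty chunk folds to the identity, which then disappears in the merge.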