Hadoop Tutorial Series, Issue #4: To Use Or Not To Use A Combiner

Welcome to the fourth issue of the Hadoop Tutorial Series. Combiners are another important Hadoop feature that every Hadoop developer should be aware of. The primary goal of combiners is to minimize the number of key/value pairs that will be shuffled across the network between mappers and reducers, and thus to save as much bandwidth as possible.

Indeed, to give you an intuition of why a combiner helps reduce the amount of data sent to the reducers, imagine the word count example on a text containing one million occurrences of the word “the”. Without a combiner the mapper will send one million key/value pairs of the form <the,1>. With combiners, it will potentially send far fewer key/value pairs of the form <the,N>, with N a number potentially much bigger than 1. That’s just the intuition (see the references at the end of the post for more details).

Simply speaking, a combiner can be considered a “mini reducer” that is potentially applied several times during the map phase, before the new (hopefully reduced) set of key/value pairs is sent to the reducer(s). This is why a combiner must implement the Reducer interface (or extend the Reducer class as of hadoop 0.20).
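
To make this concrete, here is a minimal sketch (assuming the hadoop 0.20 API and the usual word count job; the class name Reduce is just the conventional one) of a reduce class that works unchanged as both reducer and combiner, since all it does is sum values:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Works as the reducer AND as the combiner: it just sums the values
// for a key, so applying it on partial groups first is harmless.
public class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
  @Override
  protected void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable v : values) {
      sum += v.get();  // partial or final sum, depending on where it runs
    }
    context.write(key, new IntWritable(sum));
  }
}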

In general you can even use the same reduce method as both your reducer and your combiner. This is the case for the word count example, where using a combiner amounts to adding a single line of code in your main method:

conf.setCombinerClass(Reduce.class);

where conf is your JobConf, or, if you use hadoop 0.20.1:

job.setCombinerClass(Reduce.class);

where job is your Job built with a customized Configuration.
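
For context, here is a minimal driver sketch (hadoop 0.20.1 API) showing where that line fits; the Map and Reduce class names are the conventional word count ones and are assumptions of this sketch:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "wordcount");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(Map.class);       // hypothetical word count mapper
    job.setCombinerClass(Reduce.class);  // the single line that enables the combiner
    job.setReducerClass(Reduce.class);   // same class serves as the reducer
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}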

That sounds pretty simple and useful, and at first glance you would be tempted to use combiners all the time by adding this single line. But there is a small catch. The classic counter example is the “mean reducer”, which computes the mean of all the values associated with a given key.

Indeed, suppose 5 key/value pairs are emitted from the mapper for a given key k: <k,40>, <k,30>, <k,20>, <k,2>, <k,8>. Without a combiner, the reducer receives the list <k,{40,30,20,2,8}> and outputs the mean, 20. But if a combiner were first applied to the two sets (<k,40>, <k,30>, <k,20>) and (<k,2>, <k,8>) separately, the reducer would receive the list <k,{30,5}> and output 17.5, which is an unexpected result.
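
You can check the arithmetic with a few lines of plain Java (not Hadoop code, just the numbers from the example above):

public class MeanPitfall {
  public static void main(String[] args) {
    // direct mean over all five values, as the reducer sees them without a combiner
    double direct = (40 + 30 + 20 + 2 + 8) / 5.0;  // 20.0
    // what happens if a mean "combiner" pre-averages the two partitions
    double group1 = (40 + 30 + 20) / 3.0;          // 30.0
    double group2 = (2 + 8) / 2.0;                 // 5.0
    double meanOfMeans = (group1 + group2) / 2.0;  // 17.5, not 20.0
    System.out.println(direct + " vs " + meanOfMeans);
  }
}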

More generally, combiners can be used when the function you want to apply is both commutative and associative: the combiner may run zero, one, or many times, on arbitrary groupings of the values, so the final result must not depend on the order or grouping. That’s the case for the addition function, which is why the word count example can benefit from combiners, but not for the mean function (which is not associative, as shown in the counter example above).

Note that for the mean function there is a workaround that lets you use a combiner: split the work into two separate reduce methods. The first one acts as the addition function (and thus can be set as the combiner): for each key it emits the intermediate sum together with the number of values that went into it. The second reduce method then computes the mean from the total sum and the total count (see the references for more details on that).
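
Here is a hedged sketch of that workaround (hadoop 0.20 API). The class and method names are illustrative, and for brevity the (count, sum) pair is packed into a Text as "count,sum", with the mapper assumed to emit "1,<value>"; a custom Writable would be the cleaner choice in real code:

import java.io.IOException;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class MeanExample {

  // Combiner: adds up partial counts and sums, and re-emits a pair of the
  // same "count,sum" shape, so it can safely run any number of times.
  public static class SumCountCombiner extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
        throws IOException, InterruptedException {
      long count = 0, sum = 0;
      for (Text v : values) {
        String[] parts = v.toString().split(",");  // "count,sum"
        count += Long.parseLong(parts[0]);
        sum += Long.parseLong(parts[1]);
      }
      context.write(key, new Text(count + "," + sum));
    }
  }

  // Reducer: only here is the division performed, once all pairs are merged.
  public static class MeanReducer extends Reducer<Text, Text, Text, DoubleWritable> {
    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
        throws IOException, InterruptedException {
      long count = 0, sum = 0;
      for (Text v : values) {
        String[] parts = v.toString().split(",");
        count += Long.parseLong(parts[0]);
        sum += Long.parseLong(parts[1]);
      }
      context.write(key, new DoubleWritable((double) sum / count));
    }
  }
}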

As usual in this series, let’s observe the lesson learned in action using our learning playground. For that you can take the original word count example (or its hadoop 0.20.1 version that we used in the previous issue), add the single combiner line mentioned earlier in the post and run it on our moby-dick mascot. Here is what we can see at the end of the execution:

[Figure: output of the word count example when using a combiner.]

Now that you understand what counters are, look at the job output above and you’ll see the value of two counters: Combine input records=215137 and Combine output records=33783. That’s a reduction by a factor of more than 6 in the number of key/value pairs to send to the reducers. You can easily imagine the impact for much larger jobs (see the references below for real numbers).

Enjoy combiners, whenever you can…

References

  • See the 4th tip of this must-read blog post by Todd Lipcon to get a better feel for the benefit of combiners on a 40GB wordcount job benchmark.
  • For a deeper understanding of when and how combiners are used in the MapReduce data flow, check this section of the (quite heavy but) excellent Yahoo! hadoop tutorial.
  • To extend the intuition given in the post on why combiners help, you can go over this walk-through.
  • Both Hadoop: The Definitive Guide and Hadoop in Action contain interesting information on combiners (parts of both inspired this post). In particular, the first contains a great section on when exactly combiners come into play in the MapReduce data flow, and the second contains full code for the mean function workaround mentioned above.
