{"id":848,"date":"2010-01-14T06:12:31","date_gmt":"2010-01-14T11:12:31","guid":{"rendered":"http:\/\/philippeadjiman.com\/blog\/?p=848"},"modified":"2025-07-18T13:49:12","modified_gmt":"2025-07-18T13:49:12","slug":"hadoop-tutorial-series-issue-4-to-use-or-not-to-use-a-combiner","status":"publish","type":"post","link":"https:\/\/philippeadjiman.com\/blog\/2010\/01\/14\/hadoop-tutorial-series-issue-4-to-use-or-not-to-use-a-combiner\/","title":{"rendered":"Hadoop Tutorial Series, Issue #4: To Use Or Not To Use A Combiner"},"content":{"rendered":"\n<figure class=\"wp-block-image aligncenter size-full is-resized\"><img data-recalc-dims=\"1\" decoding=\"async\" width=\"800\" height=\"800\" loading=\"lazy\" src=\"https:\/\/i0.wp.com\/philippeadjiman.com\/blog\/wp-content\/uploads\/2010\/01\/global-professional-uhf-8-way-passive-splittercombiner.png?resize=800%2C800&#038;ssl=1\" alt=\"\" class=\"wp-image-1910\" style=\"width:432px;height:auto\" srcset=\"https:\/\/i0.wp.com\/philippeadjiman.com\/blog\/wp-content\/uploads\/2010\/01\/global-professional-uhf-8-way-passive-splittercombiner.png?w=800&amp;ssl=1 800w, https:\/\/i0.wp.com\/philippeadjiman.com\/blog\/wp-content\/uploads\/2010\/01\/global-professional-uhf-8-way-passive-splittercombiner.png?resize=300%2C300&amp;ssl=1 300w, https:\/\/i0.wp.com\/philippeadjiman.com\/blog\/wp-content\/uploads\/2010\/01\/global-professional-uhf-8-way-passive-splittercombiner.png?resize=150%2C150&amp;ssl=1 150w, https:\/\/i0.wp.com\/philippeadjiman.com\/blog\/wp-content\/uploads\/2010\/01\/global-professional-uhf-8-way-passive-splittercombiner.png?resize=768%2C768&amp;ssl=1 768w\" sizes=\"auto, (max-width: 800px) 100vw, 800px\" \/><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">Welcome to the fourth issue of the <a href=\"http:\/\/philippeadjiman.com\/blog\/the-hadoop-tutorial-series\/\" target=\"_blank\" rel=\"noreferrer noopener\">Hadoop Tutorial Series<\/a>. 
Combiners are another important Hadoop feature that every Hadoop developer should be aware of. The primary goal of a combiner is to minimize the number of key\/value pairs shuffled across the network between mappers and reducers, and thus to save as much bandwidth as possible.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">To get an intuition for why a combiner reduces the amount of data sent to the reducers, imagine the <a href=\"http:\/\/hadoop.apache.org\/common\/docs\/current\/mapred_tutorial.html#Example%3A+WordCount+v1.0\" target=\"_blank\" rel=\"noreferrer noopener\">word count example<\/a> on a text containing the word &#8220;the&#8221; one million times. Without a combiner, the mappers will send one million key\/value pairs of the form <strong>&lt;the,1&gt;<\/strong>. With a combiner, they will potentially send far fewer key\/value pairs of the form <strong>&lt;the,N&gt;<\/strong>, with <em>N<\/em> potentially much bigger than 1. That&#8217;s just the intuition (see the references at the end of the post for more details).<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Simply put, a combiner can be thought of as a &#8220;mini reducer&#8221; that is applied during the map phase, potentially several times or not at all, before the new (hopefully smaller) set of key\/value pairs is sent to the reducer(s). This is why a combiner must implement the <a href=\"http:\/\/hadoop.apache.org\/common\/docs\/current\/api\/org\/apache\/hadoop\/mapred\/Reducer.html\" target=\"_blank\" rel=\"noreferrer noopener\">Reducer interface<\/a> (or extend the <a href=\"http:\/\/hadoop.apache.org\/common\/docs\/current\/api\/org\/apache\/hadoop\/mapreduce\/Reducer.html\" target=\"_blank\" rel=\"noreferrer noopener\">Reducer class<\/a> as of Hadoop 0.20).<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">In many cases you can even use the same reduce implementation as both your reducer and your combiner. 
This is the case for the word count example, where using a combiner comes down to adding a single line of code in your main method:<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">conf.setCombinerClass(Reduce.class);<\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">where conf is your <a href=\"http:\/\/hadoop.apache.org\/common\/docs\/current\/api\/org\/apache\/hadoop\/mapred\/JobConf.html\" target=\"_blank\" rel=\"noreferrer noopener\">JobConf<\/a>, or, if you use Hadoop 0.20.1:<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">job.setCombinerClass(Reduce.class);<\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">where job is your <a href=\"http:\/\/hadoop.apache.org\/common\/docs\/r0.20.1\/api\/org\/apache\/hadoop\/mapreduce\/Job.html\" target=\"_blank\" rel=\"noreferrer noopener\">Job<\/a> built with a customized <a href=\"http:\/\/hadoop.apache.org\/common\/docs\/r0.20.1\/api\/org\/apache\/hadoop\/conf\/Configuration.html\" target=\"_blank\" rel=\"noreferrer noopener\">Configuration<\/a>.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">That sounds simple and useful, and at first glance you might be tempted to use combiners all the time by adding this single line, but there is a catch. The most natural counterexample is the &#8220;mean reducer&#8221;, which computes the mean of all the values associated with a given key.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Suppose the mapper emits 5 key\/value pairs for a given key <em>k<\/em>: <strong>&lt;k,40&gt;<\/strong>, <strong>&lt;k,30&gt;<\/strong>, <strong>&lt;k,20&gt;<\/strong>, <strong>&lt;k,2&gt;<\/strong>, <strong>&lt;k,8&gt;<\/strong>. 
Without a combiner, the reducer receives the list <strong>&lt;k,{40,30,20,2,8}&gt;<\/strong> and outputs a mean of <strong>20<\/strong>. But if a combiner is first applied separately to the two sets (<strong>&lt;k,40&gt;<\/strong>, <strong>&lt;k,30&gt;<\/strong>, <strong>&lt;k,20&gt;<\/strong>) and (<strong>&lt;k,2&gt;<\/strong>, <strong>&lt;k,8&gt;<\/strong>), the reducer receives the list <strong>&lt;k,{30,5}&gt;<\/strong> and outputs <strong>17.5<\/strong>, which is wrong.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">More generally, a reducer can double as a combiner when the function it applies is both <a href=\"http:\/\/mathworld.wolfram.com\/Commutative.html\" target=\"_blank\" rel=\"noreferrer noopener\">commutative<\/a> and <a href=\"http:\/\/mathworld.wolfram.com\/Associative.html\" target=\"_blank\" rel=\"noreferrer noopener\">associative<\/a>: the order and grouping in which partial results are merged must not change the final value. That&#8217;s the case for addition, which is why the word count example can benefit from a combiner, but not for the mean (the mean of partial means is not the mean of all the values, as the counterexample above shows).<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Note that you can still use a combiner for the mean with a workaround based on two separate reduce implementations. The first one, set as the combiner, performs the addition: it keeps the key unchanged and emits as its value the intermediate sum together with the number of values that were added. The second one, the actual reducer, adds up these partial sums and counts and computes the mean by dividing the total sum by the total count. Since a combiner&#8217;s input and output types must match, the mapper has to emit its values in the same sum-and-count form (see the references for more details on that).<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">As usual in this series, let&#8217;s observe the lesson learned in action using our <a href=\"http:\/\/philippeadjiman.com\/blog\/2009\/12\/07\/hadoop-tutorial-part-1-setting-up-your-mapreduce-learning-playground\/\" target=\"_blank\" 
rel=\"noreferrer noopener\">learning playground<\/a>. For that, take the original <a href=\"http:\/\/hadoop.apache.org\/common\/docs\/current\/mapred_tutorial.html#Example%3A+WordCount+v1.0\" target=\"_blank\" rel=\"noreferrer noopener\">word count example<\/a> (or its Hadoop 0.20.1 version that we used in the <a href=\"http:\/\/philippeadjiman.com\/blog\/2010\/01\/07\/hadoop-tutorial-series-issue-3-counters-in-action\/\" target=\"_blank\" rel=\"noreferrer noopener\">previous issue<\/a>), add the single combiner line shown earlier in the post, and run it on our <a href=\"http:\/\/www.gutenberg.org\/files\/2701\/2701.txt\" target=\"_blank\" rel=\"noreferrer noopener\">moby-dick<\/a> mascot. Here is what we see at the end of the execution:<\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter size-full\"><img data-recalc-dims=\"1\" decoding=\"async\" width=\"897\" height=\"515\" loading=\"lazy\" src=\"https:\/\/i0.wp.com\/philippeadjiman.com\/blog\/wp-content\/uploads\/2010\/01\/combiner1-1.jpg?resize=897%2C515&#038;ssl=1\" alt=\"\" class=\"wp-image-1902\" srcset=\"https:\/\/i0.wp.com\/philippeadjiman.com\/blog\/wp-content\/uploads\/2010\/01\/combiner1-1.jpg?w=897&amp;ssl=1 897w, https:\/\/i0.wp.com\/philippeadjiman.com\/blog\/wp-content\/uploads\/2010\/01\/combiner1-1.jpg?resize=300%2C172&amp;ssl=1 300w, https:\/\/i0.wp.com\/philippeadjiman.com\/blog\/wp-content\/uploads\/2010\/01\/combiner1-1.jpg?resize=768%2C441&amp;ssl=1 768w\" sizes=\"auto, (max-width: 897px) 100vw, 897px\" \/><figcaption class=\"wp-element-caption\">Output of the word count example when using a combiner.<\/figcaption><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">Now that you <a href=\"http:\/\/philippeadjiman.com\/blog\/2010\/01\/07\/hadoop-tutorial-series-issue-3-counters-in-action\/\" target=\"_blank\" rel=\"noreferrer noopener\">understand what counters are<\/a>, if you enlarge the picture, you&#8217;ll see the value of two counters: <strong>Combine input 
records=215137<\/strong> and <strong>Combine output records=33783<\/strong>. That&#8217;s a serious reduction: the combiner cut the number of key\/value pairs sent to the reducers by more than a factor of six. You can easily imagine the impact for much larger jobs (see the first reference below for real numbers).<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Enjoy combiners, whenever you can&#8230;<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>References<\/strong><\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li class=\"\">See the 4th tip of this must-read <a href=\"http:\/\/www.cloudera.com\/blog\/2009\/12\/17\/7-tips-for-improving-mapreduce-performance\/\" target=\"_blank\" rel=\"noreferrer noopener\">blog post<\/a> by Todd Lipcon to get a better feel for the benefit of combiners on a 40GB word count benchmark.<\/li>\n\n\n\n<li class=\"\">For a deeper understanding of when and how combiners are used in the MapReduce data flow, check <a href=\"http:\/\/developer.yahoo.com\/hadoop\/tutorial\/module4.html#dataflow\" target=\"_blank\" rel=\"noreferrer noopener\">this section<\/a> of the (quite heavy but) excellent <a href=\"http:\/\/developer.yahoo.com\/hadoop\/tutorial\/index.html\" target=\"_blank\" rel=\"noreferrer noopener\">Yahoo! Hadoop tutorial<\/a>.<\/li>\n\n\n\n<li class=\"\">To extend the intuition given in the post on why combiners help, you can go over this <a href=\"http:\/\/hadoop.apache.org\/common\/docs\/current\/mapred_tutorial.html#Walk-through\" target=\"_blank\" rel=\"noreferrer noopener\">walk-through<\/a>.<\/li>\n\n\n\n<li class=\"\">Both <a href=\"http:\/\/www.amazon.com\/Hadoop-Definitive-Guide-Tom-White\/dp\/0596521979\" target=\"_blank\" rel=\"noreferrer noopener\">Hadoop: The Definitive Guide<\/a> and <a href=\"http:\/\/www.manning.com\/lam\/\" target=\"_blank\" rel=\"noreferrer noopener\">Hadoop in Action<\/a> contain interesting information on combiners (parts of both inspired this post). 
In particular, the first contains a great section on exactly when combiners come into play in the MapReduce data flow. The second contains the full code for the mean function workaround mentioned above.<\/li>\n<\/ul>\n","protected":false},"excerpt":{"rendered":"<p>Explains when Hadoop Combiners help (or hurt) performance and correctness, with code\u2011level guidance. <\/p>\n","protected":false},"author":1,"featured_media":1910,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"nf_dc_page":"","_jetpack_newsletter_access":"","_jetpack_dont_email_post_to_subs":false,"_jetpack_newsletter_tier_id":0,"_jetpack_memberships_contains_paywalled_content":false,"_jetpack_memberships_contains_paid_content":false,"footnotes":""},"categories":[10,17],"tags":[26,37],"class_list":["post-848","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-hadoop","category-tutorial","tag-hadoop","tag-tutorial"],"jetpack_featured_media_url":"https:\/\/i0.wp.com\/philippeadjiman.com\/blog\/wp-content\/uploads\/2010\/01\/global-professional-uhf-8-way-passive-splittercombiner.png?fit=800%2C800&ssl=1","jetpack_sharing_enabled":true,"_links":{"self":[{"href":"https:\/\/philippeadjiman.com\/blog\/wp-json\/wp\/v2\/posts\/848","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/philippeadjiman.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/philippeadjiman.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/philippeadjiman.com\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/philippeadjiman.com\/blog\/wp-json\/wp\/v2\/comments?post=848"}],"version-history":[{"count":3,"href":"https:\/\/philippeadjiman.com\/blog\/wp-json\/wp\/v2\/posts\/848\/revisions"}],"predecessor-versio
n":[{"id":1961,"href":"https:\/\/philippeadjiman.com\/blog\/wp-json\/wp\/v2\/posts\/848\/revisions\/1961"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/philippeadjiman.com\/blog\/wp-json\/wp\/v2\/media\/1910"}],"wp:attachment":[{"href":"https:\/\/philippeadjiman.com\/blog\/wp-json\/wp\/v2\/media?parent=848"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/philippeadjiman.com\/blog\/wp-json\/wp\/v2\/categories?post=848"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/philippeadjiman.com\/blog\/wp-json\/wp\/v2\/tags?post=848"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}