{"id":664,"date":"2009-12-20T11:27:44","date_gmt":"2009-12-20T16:27:44","guid":{"rendered":"http:\/\/philippeadjiman.com\/blog\/?p=664"},"modified":"2025-07-18T13:53:33","modified_gmt":"2025-07-18T13:53:33","slug":"hadoop-tutorial-series-issue-2-getting-started-with-customized-partitioning","status":"publish","type":"post","link":"https:\/\/philippeadjiman.com\/blog\/2009\/12\/20\/hadoop-tutorial-series-issue-2-getting-started-with-customized-partitioning\/","title":{"rendered":"Hadoop Tutorial Series, Issue #2: Getting Started With (Customized) Partitioning"},"content":{"rendered":"\n<p class=\"wp-block-paragraph\">In the <a href=\"http:\/\/philippeadjiman.com\/blog\/2009\/12\/07\/hadoop-tutorial-part-1-setting-up-your-mapreduce-learning-playground\/\" target=\"_blank\" rel=\"noreferrer noopener\">Issue #1<\/a> of <a href=\"http:\/\/philippeadjiman.com\/blog\/the-hadoop-tutorial-series\/\" target=\"_blank\" rel=\"noreferrer noopener\">this series<\/a>, we set up the &#8220;learning playground&#8221; (based on the <a href=\"http:\/\/www.cloudera.com\/hadoop-training-virtual-machine\" target=\"_blank\" rel=\"noreferrer noopener\">Cloudera Virtual Machine<\/a>) in order to enjoy hands-on learning experiences around Hadoop. In this issue, we&#8217;ll use our playground to investigate the partitioning features offered by Hadoop.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>What is it all about?<\/strong><\/p>\n\n\n\n<p class=\"wp-block-paragraph\">As you may know, a map\/reduce job will contains most of the time more than&nbsp; 1 reducer.&nbsp; So basically, when a mapper emits a key value pair, it has to be sent to one of the reducers. Which one? The mechanism sending specific key-value pairs to specific reducers is called <strong>partitioning<\/strong> (the key-value pairs space is partitioned among the reducers). A <a href=\"http:\/\/hadoop.apache.org\/common\/docs\/current\/api\/org\/apache\/hadoop\/mapreduce\/Partitioner.html\" target=\"_blank\" rel=\"noreferrer noopener\">Partitioner <\/a>is responsible to perform the partitioning.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">In Hadoop, the default partitioner is <a href=\"http:\/\/hadoop.apache.org\/common\/docs\/current\/api\/org\/apache\/hadoop\/mapreduce\/lib\/partition\/HashPartitioner.html\" target=\"_blank\" rel=\"noreferrer noopener\">HashPartitioner<\/a>, which hashes a record&#8217;s key to determine which partition (and thus which reducer) the record belongs in.The number of partition is then equal to the number of reduce tasks for the job.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Why is it important?<\/strong><\/p>\n\n\n\n<p class=\"wp-block-paragraph\">First, it has a direct impact on the overall performance of your job: a poorly designed partitioning function will not evenly distributes the charge over the reducers, potentially loosing all the interest of the map\/reduce distributed infrastructure.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Second, it maybe sometimes necessary to control the key\/value pairs partitioning over the reducers. Let&#8217;s illustrate it on a simple example. Suppose that your job&#8217;s input is a (huge) set of tokens and their number of occurrences (for instance the output of the canonical <a href=\"http:\/\/hadoop.apache.org\/common\/docs\/current\/mapred_tutorial.html#Example%3A+WordCount+v1.0\" target=\"_blank\" rel=\"noreferrer noopener\">word count hadoop example<\/a>) and that you want to sort them by number of occurrences. Let&#8217;s also suppose that your job will be handled by 2 reducers. If you run your job without using any customized partitioner, you&#8217;ll get something like this:<\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter size-full\"><img data-recalc-dims=\"1\" decoding=\"async\" width=\"839\" height=\"483\" loading=\"lazy\" src=\"https:\/\/i0.wp.com\/philippeadjiman.com\/blog\/wp-content\/uploads\/2009\/12\/partialSortOn2Reducers2.jpg?resize=839%2C483&#038;ssl=1\" alt=\"\" class=\"wp-image-1928\" srcset=\"https:\/\/i0.wp.com\/philippeadjiman.com\/blog\/wp-content\/uploads\/2009\/12\/partialSortOn2Reducers2.jpg?w=839&amp;ssl=1 839w, https:\/\/i0.wp.com\/philippeadjiman.com\/blog\/wp-content\/uploads\/2009\/12\/partialSortOn2Reducers2.jpg?resize=300%2C173&amp;ssl=1 300w, https:\/\/i0.wp.com\/philippeadjiman.com\/blog\/wp-content\/uploads\/2009\/12\/partialSortOn2Reducers2.jpg?resize=768%2C442&amp;ssl=1 768w\" sizes=\"auto, (max-width: 839px) 100vw, 839px\" \/><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">As you can see, the tokens are correctly ordered by number of occurrences on each reducer (which is what hadoop guarantees by default) but this is not what you need! You&#8217;d rather expect something like:<\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter size-full\"><img data-recalc-dims=\"1\" decoding=\"async\" width=\"794\" height=\"445\" loading=\"lazy\" src=\"https:\/\/i0.wp.com\/philippeadjiman.com\/blog\/wp-content\/uploads\/2009\/12\/TotalSortOn2Reducers.jpg?resize=794%2C445&#038;ssl=1\" alt=\"\" class=\"wp-image-1929\" srcset=\"https:\/\/i0.wp.com\/philippeadjiman.com\/blog\/wp-content\/uploads\/2009\/12\/TotalSortOn2Reducers.jpg?w=794&amp;ssl=1 794w, https:\/\/i0.wp.com\/philippeadjiman.com\/blog\/wp-content\/uploads\/2009\/12\/TotalSortOn2Reducers.jpg?resize=300%2C168&amp;ssl=1 300w, https:\/\/i0.wp.com\/philippeadjiman.com\/blog\/wp-content\/uploads\/2009\/12\/TotalSortOn2Reducers.jpg?resize=768%2C430&amp;ssl=1 768w\" sizes=\"auto, (max-width: 794px) 100vw, 794px\" \/><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">where tokens are totally ordered over the reducers, from 1 to 30 occurrences on the first reducer and from 31 to 14620 on the second. This would happen as a result of a correct partitioning function: all the tokens having a number of occurrences inferior to N (here 30) are sent&nbsp; to reducer 1 and the others are sent to reducer 2, resulting in two partitions. Since the tokens are sorted on each partition, you get the expected total order on the number of occurrences.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Below, we&#8217;ll use our playground to observe the issue happening&nbsp; on real data and see how we solve it using customized partitioners.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Also, as a second example of use of customized partitioning functions, let&#8217;s cite the <a href=\"http:\/\/labs.google.com\/papers\/mapreduce.html\" target=\"_blank\" rel=\"noreferrer noopener\">original map\/reduce google paper<\/a>: &#8220;sometimes the output keys are URLs, and we want all entries for a single host to end up in the same output. To support situations like this, the user of the MapReduce library can provide a special partitioning function. For example, using &#8220;hash(Hostname(urlkey)) mod R&#8221; as the partitioning function causes all URLs from the same host to end up in the same output&#8221;.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Feeling the partitions in our playground<\/strong><\/p>\n\n\n\n<p class=\"wp-block-paragraph\">If your playground is not yet set up, check the <a href=\"http:\/\/philippeadjiman.com\/blog\/2009\/12\/07\/hadoop-tutorial-part-1-setting-up-your-mapreduce-learning-playground\/\" target=\"_blank\" rel=\"noreferrer noopener\">Issue #1 of this series<\/a>. As an input for our job, we&#8217;ll use a tsv file containing the list of tokens and their number of occurrences extracted from (<a href=\"http:\/\/philippeadjiman.com\/blog\/2009\/10\/26\/drawing-the-long-tail-of-a-zipf-law-using-gnuplot-java-and-moby-dick\/\" target=\"_blank\" rel=\"noreferrer noopener\">once again<\/a>) the full moby dick text. <a href=\"http:\/\/www.philippeadjiman.com\/blog\/wp-content\/uploads\/2009\/12\/wordcount_moby-dick1.tsv\" target=\"_blank\" rel=\"noreferrer noopener\">Click here<\/a> to download this input. You&#8217;ll notice that the pairs (tokens, #occurrences) are alphanumerically sorted on tokens value.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">First, we&#8217;ll use a very simple pre-processing job to transform the input data into a more convenient format to use within hadoop: the Sequence File Output Format. Sequence files are a basic file based data structure persisting the key\/value pairs in a binary format and allowing you to interact more easily with basic hadoop data types (e.g <a href=\"http:\/\/hadoop.apache.org\/common\/docs\/r0.17.0\/api\/org\/apache\/hadoop\/io\/IntWritable.html\" target=\"_blank\" rel=\"noreferrer noopener\">IntWritable<\/a>, <a href=\"http:\/\/hadoop.apache.org\/common\/docs\/r0.17.0\/api\/org\/apache\/hadoop\/io\/LongWritable.html\" target=\"_blank\" rel=\"noreferrer noopener\">LongWritable<\/a>, etc&#8230;). Here is the simple pre-processing job:<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">package com.philippeadjiman.hadooptraining;\n\nimport java.io.IOException;\n\nimport org.apache.hadoop.fs.Path;\nimport org.apache.hadoop.io.IntWritable;\nimport org.apache.hadoop.io.LongWritable;\nimport org.apache.hadoop.io.Text;\nimport org.apache.hadoop.mapred.FileInputFormat;\nimport org.apache.hadoop.mapred.FileOutputFormat;\nimport org.apache.hadoop.mapred.JobClient;\nimport org.apache.hadoop.mapred.JobConf;\nimport org.apache.hadoop.mapred.MapReduceBase;\nimport org.apache.hadoop.mapred.Mapper;\nimport org.apache.hadoop.mapred.OutputCollector;\nimport org.apache.hadoop.mapred.Reporter;\nimport org.apache.hadoop.mapred.SequenceFileOutputFormat;\n\npublic class SortDataPreprocessor {\n\n\tstatic class PreprocessorMapper extends MapReduceBase implements Mapper {\n\n\t\tprivate Text word = new Text();\n\n\t\tpublic void map(LongWritable key, Text value,\n\t\t\t\tOutputCollector output, Reporter reporter) throws IOException {\n\t\t\tString line = value.toString();\n\t\t\tString[] tokens = line.split(\"t\");\n\t\t\tif( tokens == null || tokens.length != 2 ){\n\t\t\t\tSystem.err.print(\"Problem with input line: \"+line+\"n\");\n\t\t\t\treturn;\n\t\t\t}\n\t\t\tint nbOccurences = Integer.parseInt(tokens[1]);\n\t\t\tword.set(tokens[0]);\n\t\t\toutput.collect(new IntWritable(nbOccurences),word );\n\t\t}\n\t}\n\n\tpublic static void main(String[] args) throws IOException {\n\t\tJobConf conf = new JobConf(SortDataPreprocessor.class);\n\n\t\tFileInputFormat.setInputPaths(conf, new Path(args[0]));\n\t\tFileOutputFormat.setOutputPath(conf, new Path(args[1]));\n\n\t\tconf.setMapperClass(PreprocessorMapper.class);\n\t\tconf.setOutputKeyClass(IntWritable.class);\n\t\tconf.setOutputValueClass(Text.class);\n\t\tconf.setNumReduceTasks(0);\n\t\tconf.setOutputFormat(SequenceFileOutputFormat.class);\n\t\tJobClient.runJob(conf);\n\t}\n}<\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">You&#8217;ll notice that:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li class=\"\">it contains only a mapper (no reducer),<\/li>\n\n\n\n<li class=\"\">a basic error management is performed for potential malformed lines,<\/li>\n\n\n\n<li class=\"\">the output key is the number of occurrences (as an IntWritable) and the output value is the associated token,<\/li>\n\n\n\n<li class=\"\">The sequence file output format is specified using setOutputFormat(SequenceFileOutputFormat.class);<\/li>\n<\/ul>\n\n\n\n<p class=\"wp-block-paragraph\">It is sildenafil citrate medicine, which enhances blood flow near reproductive area and makes the organ becoming erect with fuller, viagra soft tabs <a href=\"http:\/\/icks.org\/n\/bbs\/content.php?co_id=2017\">icks.org<\/a> thicker and firm erections, so theseare erection-helping medicines. Not just children can be determined to have a dysfunctional <a href=\"http:\/\/icks.org\/n\/bbs\/content.php?co_id=INAUGURAL_ISSUE_1997\">http:\/\/icks.org\/n\/bbs\/content.php?co_id=INAUGURAL_ISSUE_1997<\/a> best generic viagra behavior. The unhealthful eating habits and cialis price in canada <a href=\"http:\/\/icks.org\/n\/bbs\/content.php?co_id=2013&amp;mcode=30&amp;smcode=3070\">discount here<\/a> lifestyle choices that led to a decreased utilization of higher expense processes. As we age the prenatal life force or chi is drained out from the naval and kidney areas as we move further into the ego, the illusion, and dysfunctional breathing patterns. <a href=\"http:\/\/icks.org\/n\/bbs\/content.php?co_id=FALL_WINTER_2018\">buy soft cialis<\/a><br>To run it, package the job using maven (see <a href=\"http:\/\/philippeadjiman.com\/blog\/2009\/12\/07\/hadoop-tutorial-part-1-setting-up-your-mapreduce-learning-playground\/\" target=\"_blank\" rel=\"noreferrer noopener\">Issue #1<\/a>), put the input file on hdfs in an input directory (let&#8217;s call it input) and execute:<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">hadoop jar playing-with-partitions.jar com.philippeadjiman.hadooptraining.SortDataPreprocessor \/user\/training\/input \/user\/training\/pre_process<\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">This will create a directory called &#8220;pre_process&#8221; on hdfs containing a set of pairs (#occurrences,token), respectively of format <a href=\"http:\/\/hadoop.apache.org\/common\/docs\/r0.17.0\/api\/org\/apache\/hadoop\/io\/IntWritable.html\" target=\"_blank\" rel=\"noreferrer noopener\">IntWritable<\/a> and <a href=\"http:\/\/hadoop.apache.org\/common\/docs\/r0.17.0\/api\/org\/apache\/hadoop\/io\/Text.html\" target=\"_blank\" rel=\"noreferrer noopener\">Text<\/a>, in a <a href=\"http:\/\/hadoop.apache.org\/common\/docs\/current\/api\/org\/apache\/hadoop\/mapred\/SequenceFileOutputFormat.html\" target=\"_blank\" rel=\"noreferrer noopener\">SequenceFileOutputFormat<\/a>.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Now we can, perform the sort based on this new input. Writing a job for such a task is actually trivial since this is primarily what hadoop is doing by default, so here it is:<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">package com.philippeadjiman.hadooptraining;\n\nimport java.io.IOException;\n\nimport org.apache.hadoop.fs.Path;\nimport org.apache.hadoop.io.IntWritable;\nimport org.apache.hadoop.io.Text;\nimport org.apache.hadoop.mapred.FileInputFormat;\nimport org.apache.hadoop.mapred.FileOutputFormat;\nimport org.apache.hadoop.mapred.JobClient;\nimport org.apache.hadoop.mapred.JobConf;\nimport org.apache.hadoop.mapred.SequenceFileInputFormat;\n\npublic class SortExample {\n\tpublic static void main(String[] args) throws IOException {\n\t\tJobConf conf = new JobConf(SortExample.class);\n\t\tconf.setJobName(\"sortexample\");\n\n\t\tFileInputFormat.setInputPaths(conf, new Path(args[0]));\n\t\tFileOutputFormat.setOutputPath(conf, new Path(args[1]));\n\n\t\tconf.setInputFormat(SequenceFileInputFormat.class);\n\t\tconf.setOutputKeyClass(IntWritable.class);\n\t\tconf.setOutputValueClass(Text.class);\n\n\t\tconf.setNumReduceTasks(2);\n\n\t\tJobClient.runJob(conf);\n\t}\n}<\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">You&#8217;ll notice that:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li class=\"\">There is neither map nor reduce methods! This is because sorting is a default behavior so we don&#8217;t have to do anything (we&#8217;re just interested here to see how it&#8217;ll be partitioned),<\/li>\n\n\n\n<li class=\"\">The input\/output formats are specified based on the output of our pre-processing job,<\/li>\n\n\n\n<li class=\"\">We explicitly set the number of reducer to 2, which is the important part here since we want to observe how the output will be partitioned (without specifying it, the output will be generated using only one reducer).<\/li>\n<\/ul>\n\n\n\n<p class=\"wp-block-paragraph\">Just run it using:<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">hadoop jar playing-with-partitions.jar com.philippeadjiman.hadooptraining.SortExample \/user\/training\/pre_process \/user\/training\/output<\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">Once completed, an output directory will be created on hdfs with two files, one for each reducer that were used. You can observe the content of the output using commands like:<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">hadoop fs -cat output\/part-00000 | less\nhadoop fs -cat output\/part-00001 | less<\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">As you&#8217;ll see, the two outputs are sorted but do not represent a total order, as explained above. Let&#8217;s fix it.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>How to implement your own partitioning function<\/strong><\/p>\n\n\n\n<p class=\"wp-block-paragraph\">So how do we create a total order out of those two reducers?<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">A first solution is to create our own partitionner which is as simple as implementing the <a href=\"http:\/\/hadoop.apache.org\/common\/docs\/r0.18.3\/api\/org\/apache\/hadoop\/mapred\/Partitioner.html\" target=\"_blank\" rel=\"noreferrer noopener\">Partitioner&lt;K,V&gt;<\/a> interface:<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">package com.philippeadjiman.hadooptraining;\n\npackage com.philippeadjiman.hadooptraining;\n\nimport org.apache.hadoop.io.IntWritable;\nimport org.apache.hadoop.io.Text;\nimport org.apache.hadoop.mapred.JobConf;\nimport org.apache.hadoop.mapred.Partitioner;\n\npublic class MyPartitioner implements Partitioner&lt;IntWritable,Text&gt; {\n\t@Override\n\tpublic int getPartition(IntWritable key, Text value, int numPartitions) {\n\t\t\/* Pretty ugly hard coded partitioning function. Don't do that in practice, it is just for the sake of understanding. *\/\n\t\tint nbOccurences = key.get();\n\n\t\tif( nbOccurences &lt; 3 )\n\t\t\treturn 0;\n\t\telse\n\t\t\treturn 1;\n\t}\n\n\t@Override\n\tpublic void configure(JobConf arg0) {\n\n\t}\n\n}<\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">This implementation of getPartition specifies to put all the pairs having a key (which is here the number of occurrences) being less than 3 into the first partition and the other into the second one. This is of course a pretty bad practice to hard code like that a partitioning function but this is simply for the sake of understanding.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">To use this created partition just add the following line to the main method of the previous SortExample class:<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">conf.setPartitionerClass(MyPartitioner.class);<\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">Why did I choose 3? Because as a side effect of the Zipf law, the number of tokens having a number of occurrences of 1 and 2 will be as much as big that all the others together! (to see a Zipf Law in action, <a href=\"http:\/\/philippeadjiman.com\/blog\/2009\/10\/26\/drawing-the-long-tail-of-a-zipf-law-using-gnuplot-java-and-moby-dick\/\" target=\"_blank\" rel=\"noreferrer noopener\">check this post<\/a>). So 3 was chosen just for balancing a little bit the partitions.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Re-package the code with the customized partition, remove the old output, run it again and check that our problem is solved: there is a total order over the partitions.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>How to automatically find &#8220;good&#8221; partitioning function using sampling<\/strong><\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Now, as I mentioned above, it is a pretty bad practice to hard code how to partition the keys. But on the other hand, how would I know automatically and in advance how to divide the partition in the general case?<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Hadoop provide a nice way to approximate <em>a priori<\/em> a good partitioning function using an <a href=\"http:\/\/hadoop.apache.org\/common\/docs\/current\/api\/org\/apache\/hadoop\/mapred\/lib\/InputSampler.html\" target=\"_blank\" rel=\"noreferrer noopener\">InputSampler<\/a>. For instance, a <a href=\"http:\/\/hadoop.apache.org\/common\/docs\/current\/api\/org\/apache\/hadoop\/mapred\/lib\/InputSampler.RandomSampler.html\" target=\"_blank\" rel=\"noreferrer noopener\">RandomSampler<\/a>, will sample the input at random to estimate what is the best way to partition. The sampler will wrote into a file called, by default, _patition.lst describing the partition that the job will automatically use to decide which key\/value pairs to send to which reducers. This mechanism has to be used in combination with a <a href=\"http:\/\/hadoop.apache.org\/common\/docs\/r0.19.0\/api\/org\/apache\/hadoop\/mapred\/lib\/TotalOrderPartitioner.html\" target=\"_blank\" rel=\"noreferrer noopener\">TotalOrderPartitioner<\/a>.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Here is a code sample using such a sampler with a total order partitioner:<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">InputSampler.Sampler&lt;IntWritable, Text&gt; sampler =\n\tnew InputSampler.RandomSampler&lt;IntWritable, Text&gt;(0.1, 100);\nInputSampler.writePartitionFile(conf, sampler);\nconf.setPartitionerClass(TotalOrderPartitioner.class);<\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">Sometimes, there are some issues with the file _partition.lst that is not found. It always worked for me when I specified explicitly where to find the file using the <a href=\"http:\/\/hadoop.apache.org\/common\/docs\/r0.19.0\/api\/org\/apache\/hadoop\/mapred\/lib\/TotalOrderPartitioner.html#setPartitionFile%28org.apache.hadoop.mapred.JobConf,%20org.apache.hadoop.fs.Path%29\" target=\"_blank\" rel=\"noreferrer noopener\">TotalOrderPartitioner.setPartitionFile();<\/a> method. Also pay attention to invoke this method before the call to writePartitionFile. Also, note that the sampling mechanism is necessary since considering all the input to compute the partition would be inefficient for large files.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Some remarks<\/strong><\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li class=\"\">A customized partitioning would not have been necessary if we had only one reducer since all the key\/value pairs would have ended into the same output file. It is easy to understand that such a constraint is a nonsense and that using more than one reducer is most of the time necessary, else the map\/reduce concept would not be very useful&#8230;<\/li>\n\n\n\n<li class=\"\">Even if we used a small dataset on the semi-distributed infrastructure of the <a href=\"http:\/\/www.cloudera.com\/hadoop-training-virtual-machine\" target=\"_blank\" rel=\"noreferrer noopener\">cloudera virtual machine<\/a> to observe partitioning in action and to learn how to customize it, the same concepts can be applied to a larger infrastructure. To see a very interesting use case of customized partitioning strategy for sorting purpose on a big infrastructure, check the famous <a href=\"http:\/\/sortbenchmark.org\/YahooHadoop.pdf\" target=\"_blank\" rel=\"noreferrer noopener\">TeraByte sort on hadoop<\/a>.<\/li>\n<\/ul>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Conclusion<\/strong><\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Partitioning in map\/reduce is a fairly simple concept but that is important to get correctly. Most of the time, the default partitioning based on an hash function can be sufficient. But as we illustrated in this Issue, you&#8217;ll need sometime to modify the default behavior and to customize your own partitioning suited for your needs.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">If you have some questions, or if you have experimented other use cases of customized partitioning in your application, please comment\/share. See you soon for another Issue.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Teaches key partitioning patterns (e.g., partial sorts to specific reducers) to control data flow in MapReduce jobs.<\/p>\n","protected":false},"author":1,"featured_media":1932,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"nf_dc_page":"","_jetpack_newsletter_access":"","_jetpack_dont_email_post_to_subs":false,"_jetpack_newsletter_tier_id":0,"_jetpack_memberships_contains_paywalled_content":false,"_jetpack_memberships_contains_paid_content":false,"footnotes":""},"categories":[10,17],"tags":[26,37],"class_list":["post-664","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-hadoop","category-tutorial","tag-hadoop","tag-tutorial"],"jetpack_featured_media_url":"https:\/\/i0.wp.com\/philippeadjiman.com\/blog\/wp-content\/uploads\/2009\/12\/withcounts-flat.png.webp?fit=630%2C543&ssl=1","jetpack_sharing_enabled":true,"_links":{"self":[{"href":"https:\/\/philippeadjiman.com\/blog\/wp-json\/wp\/v2\/posts\/664","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/philippeadjiman.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/philippeadjiman.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/philippeadjiman.com\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/philippeadjiman.com\/blog\/wp-json\/wp\/v2\/comments?post=664"}],"version-history":[{"count":2,"href":"https:\/\/philippeadjiman.com\/blog\/wp-json\/wp\/v2\/posts\/664\/revisions"}],"predecessor-version":[{"id":1964,"href":"https:\/\/philippeadjiman.com\/blog\/wp-json\/wp\/v2\/posts\/664\/revisions\/1964"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/philippeadjiman.com\/blog\/wp-json\/wp\/v2\/media\/1932"}],"wp:attachment":[{"href":"https:\/\/philippeadjiman.com\/blog\/wp-json\/wp\/v2\/media?parent=664"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/philippeadjiman.com\/blog\/wp-json\/wp\/v2\/categories?post=664"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/philippeadjiman.com\/blog\/wp-json\/wp\/v2\/tags?post=664"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}