{"id":818,"date":"2010-01-07T11:40:45","date_gmt":"2010-01-07T16:40:45","guid":{"rendered":"http:\/\/philippeadjiman.com\/blog\/?p=818"},"modified":"2025-07-18T13:49:35","modified_gmt":"2025-07-18T13:49:35","slug":"hadoop-tutorial-series-issue-3-counters-in-action","status":"publish","type":"post","link":"https:\/\/philippeadjiman.com\/blog\/2010\/01\/07\/hadoop-tutorial-series-issue-3-counters-in-action\/","title":{"rendered":"Hadoop Tutorial Series, Issue #3: Counters In Action"},"content":{"rendered":"\n<p class=\"wp-block-paragraph\"><strong>Note<\/strong>: This post has been updated with code that works with Hadoop 0.20.1.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">In this third issue of the <a href=\"http:\/\/philippeadjiman.com\/blog\/the-hadoop-tutorial-series\/\" target=\"_blank\" rel=\"noreferrer noopener\">Hadoop tutorial series<\/a>, we&#8217;ll look at a very simple but very useful Hadoop feature: counters.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Even if you have never defined any counters in Hadoop, you see some of them each time you run a Hadoop job. 
Indeed, here is what the client console shows at the end of a job (the same information is available from the web interface):<\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter size-full\"><img data-recalc-dims=\"1\" decoding=\"async\" width=\"613\" height=\"461\" loading=\"lazy\" src=\"https:\/\/i0.wp.com\/philippeadjiman.com\/blog\/wp-content\/uploads\/2010\/01\/counters-1.jpg?resize=613%2C461&#038;ssl=1\" alt=\"\" class=\"wp-image-1913\" srcset=\"https:\/\/i0.wp.com\/philippeadjiman.com\/blog\/wp-content\/uploads\/2010\/01\/counters-1.jpg?w=613&amp;ssl=1 613w, https:\/\/i0.wp.com\/philippeadjiman.com\/blog\/wp-content\/uploads\/2010\/01\/counters-1.jpg?resize=300%2C226&amp;ssl=1 300w\" sizes=\"auto, (max-width: 613px) 100vw, 613px\" \/><figcaption class=\"wp-element-caption\">Hadoop internal counters at the end of a job<\/figcaption><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">As you can see, 18 internal counters are presented in several groups. For instance, the &#8220;Job Counters&#8221; section contains three counters giving basic information about the job, such as the number of mappers and reducers. In that example, &#8220;Job Counters&#8221; is the <em>group<\/em> of the counter and &#8220;Launched reduce tasks&#8221; (for instance) is the <em>name<\/em> of the counter.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">It is very handy to define your own counters to track any kind of statistics about the records you are manipulating in the mapper and the reducer. The most natural use is tracking the number of malformed records.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">If you run a job and see an abnormally high number of malformed records, that is a good hint that you may have a bug in your code or a problem with your data (this is a much simpler way to spot issues than hunting for error messages across a distributed set of log files). 
But you can use counters for any other kind of statistics on your records.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">One easy way to define your own counters from your Java code is to:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li class=\"\">Declare an enum representing your counters. The enum name is the group of the counter, and each constant of the enum is the name of a counter reported in that group<\/li>\n\n\n\n<li class=\"\">Increment the desired counters from your map and reduce methods through the Context of your mapper or reducer (with the older API this was done through the Reporter.incrCounter() method; the new API introduced in Hadoop 0.20 replaces the Reporter with the Context)<\/li>\n<\/ul>\n\n\n\n<p class=\"wp-block-paragraph\">Let&#8217;s look at an example. We&#8217;ll use the word count example <a href=\"http:\/\/cxwangyi.blogspot.com\/2009\/12\/wordcount-tutorial-for-hadoop-0201.html\" target=\"_blank\" rel=\"noreferrer noopener\">revised for version 0.20.1<\/a> to illustrate the use of counters. 
We will create a counter group called WordsNature that counts how many unique tokens there are in all, how many unique tokens start with a digit and how many unique tokens start with a letter.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">So our enum declaration will look like this:<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\"> static enum WordsNature { STARTS_WITH_DIGIT, STARTS_WITH_LETTER, ALL }<\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">We will also need a very basic StringUtils class:<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">package com.philippeadjiman.hadooptraining;\n\npublic class StringUtils {\n\n\t\/\/ True if s is non-empty and its first character is a digit.\n\tpublic static boolean startsWithDigit(String s){\n\t\tif( s == null || s.length() == 0 )\n\t\t\treturn false;\n\n\t\treturn Character.isDigit(s.charAt(0));\n\t}\n\n\t\/\/ True if s is non-empty and its first character is a letter.\n\tpublic static boolean startsWithLetter(String s){\n\t\tif( s == null || s.length() == 0 )\n\t\t\treturn false;\n\n\t\treturn Character.isLetter(s.charAt(0));\n\t}\n\n}<\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">Since we are interested in unique tokens, and reduce is called once per distinct key, we will put the counter code in the reduce method. 
Here is what the reduce method looks like:<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">public void reduce(Text key, Iterable&lt;IntWritable&gt; values, Context context)\n\tthrows IOException, InterruptedException {\n\n\tint sum = 0;\n\tString token = key.toString();\n\t\/\/ Each call handles one distinct token, so these counters count unique tokens.\n\tif( StringUtils.startsWithDigit(token) ){\n\t\tcontext.getCounter(WordsNature.STARTS_WITH_DIGIT).increment(1);\n\t}\n\telse if( StringUtils.startsWithLetter(token) ){\n\t\tcontext.getCounter(WordsNature.STARTS_WITH_LETTER).increment(1);\n\t}\n\tcontext.getCounter(WordsNature.ALL).increment(1);\n\t\/\/ Standard word count: sum the occurrences of this token.\n\tfor (IntWritable value : values) {\n\t\tsum += value.get();\n\t}\n\tcontext.write(key, new IntWritable(sum));\n}<\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">Here is the code of <a href=\"http:\/\/philippeadjiman.com\/blog\/wp-content\/uploads\/2010\/01\/WordCountWithCounter.java\">WordCountWithCounter<\/a>, which includes this method.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">If you want to run it inside our <a href=\"http:\/\/philippeadjiman.com\/blog\/2009\/12\/07\/hadoop-tutorial-part-1-setting-up-your-mapreduce-learning-playground\/\" target=\"_blank\" rel=\"noreferrer noopener\">learning playground<\/a>, you&#8217;ll just have to update the pom to use the latest Hadoop version:<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><em>&lt;dependency&gt;<br>&lt;groupId&gt;org.apache.mahout.hadoop&lt;\/groupId&gt;<br>&lt;artifactId&gt;hadoop-core&lt;\/artifactId&gt;<br>&lt;version&gt;0.20.1&lt;\/version&gt;<br>&lt;\/dependency&gt;<\/em><\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Here is the result of running the code on the <a href=\"https:\/\/gist.github.com\/StevenClontz\/4445774\" target=\"_blank\" rel=\"noreferrer noopener\">whole text of Moby Dick<\/a>:<\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter size-full\"><img data-recalc-dims=\"1\" decoding=\"async\" width=\"827\" height=\"528\" loading=\"lazy\" 
src=\"https:\/\/i0.wp.com\/philippeadjiman.com\/blog\/wp-content\/uploads\/2010\/01\/jobResultsWithCounters.jpg?resize=827%2C528&#038;ssl=1\" alt=\"\" class=\"wp-image-1914\" srcset=\"https:\/\/i0.wp.com\/philippeadjiman.com\/blog\/wp-content\/uploads\/2010\/01\/jobResultsWithCounters.jpg?w=827&amp;ssl=1 827w, https:\/\/i0.wp.com\/philippeadjiman.com\/blog\/wp-content\/uploads\/2010\/01\/jobResultsWithCounters.jpg?resize=300%2C192&amp;ssl=1 300w, https:\/\/i0.wp.com\/philippeadjiman.com\/blog\/wp-content\/uploads\/2010\/01\/jobResultsWithCounters.jpg?resize=768%2C490&amp;ssl=1 768w\" sizes=\"auto, (max-width: 827px) 100vw, 827px\" \/><figcaption class=\"wp-element-caption\">We can now see our homemade counters.<\/figcaption><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">We can now see that we have 33783 unique tokens, 32511 starting with a letter and 263 starting with a digit. What about the 1009 others? Because the word count example uses a basic StringTokenizer that splits tokens at spaces, a lot of tokens simply start with a &#8216;(&#8217; or a &#8216;[&#8217;, or even with &#8216;&#8211;&#8217;. To solve that you can, for instance, use a Lucene <a href=\"http:\/\/lucene.apache.org\/java\/2_2_0\/api\/org\/apache\/lucene\/analysis\/standard\/StandardAnalyzer.html\" target=\"_blank\" rel=\"noreferrer noopener\">StandardAnalyzer<\/a>.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">You should now be able to easily implement your own counters for tracking bad records\/missing values, debugging or gathering any kind of statistics about your job.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">See you soon for another issue&#8230;<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Shows how to instrument MapReduce jobs with Hadoop Counters to track custom metrics during large\u2011scale processing. 
<\/p>\n","protected":false},"author":1,"featured_media":1917,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"nf_dc_page":"","_jetpack_memberships_contains_paid_content":false,"footnotes":""},"categories":[10,17],"tags":[26,37],"class_list":["post-818","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-hadoop","category-tutorial","tag-hadoop","tag-tutorial"],"jetpack_featured_media_url":"https:\/\/i0.wp.com\/philippeadjiman.com\/blog\/wp-content\/uploads\/2010\/01\/ler0131_trans_cntrs_3_sh_web.jpg?fit=700%2C700&ssl=1","jetpack_sharing_enabled":true,"_links":{"self":[{"href":"https:\/\/philippeadjiman.com\/blog\/wp-json\/wp\/v2\/posts\/818","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/philippeadjiman.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/philippeadjiman.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/philippeadjiman.com\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/philippeadjiman.com\/blog\/wp-json\/wp\/v2\/comments?post=818"}],"version-history":[{"count":3,"href":"https:\/\/philippeadjiman.com\/blog\/wp-json\/wp\/v2\/posts\/818\/revisions"}],"predecessor-version":[{"id":1962,"href":"https:\/\/philippeadjiman.com\/blog\/wp-json\/wp\/v2\/posts\/818\/revisions\/1962"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/philippeadjiman.com\/blog\/wp-json\/wp\/v2\/media\/1917"}],"wp:attachment":[{"href":"https:\/\/philippeadjiman.com\/blog\/wp-json\/wp\/v2\/media?parent=818"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/philippeadjiman.com\/blog\/wp-json\/wp\/v2\/categories?post=818"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/philippeadjiman.com\/blog\/wp-json\/wp\/v2\/tags?post=818"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}