{"id":488,"date":"2009-10-26T05:15:25","date_gmt":"2009-10-26T10:15:25","guid":{"rendered":"http:\/\/philippeadjiman.com\/blog\/?p=488"},"modified":"2025-07-19T19:23:07","modified_gmt":"2025-07-19T19:23:07","slug":"drawing-the-long-tail-of-a-zipf-law-using-gnuplot-java-and-moby-dick","status":"publish","type":"post","link":"https:\/\/philippeadjiman.com\/blog\/2009\/10\/26\/drawing-the-long-tail-of-a-zipf-law-using-gnuplot-java-and-moby-dick\/","title":{"rendered":"Drawing A Zipf Law Using Gnuplot, Java and Moby-Dick"},"content":{"rendered":"\n<p class=\"wp-block-paragraph\">There are many tools out there for building, more or less quickly, any kind of graph. Depending on your needs, one tool may be better suited than another. When it comes to drawing graphs from a set of generated coordinates, I love the simplicity of <a href=\"http:\/\/www.gnuplot.info\/\" target=\"_blank\" rel=\"noreferrer noopener\">gnuplot<\/a>.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Let&#8217;s walk through a simple example that explains how to draw the <a href=\"http:\/\/en.wikipedia.org\/wiki\/Zipf%27s_law\" target=\"_blank\" rel=\"noreferrer noopener\">Zipf law<\/a> observed in a long English text.<br>If you&#8217;re not familiar with Zipf&#8217;s law, simply put, it states that the product of the rank (R) of a word and its frequency (F) is roughly constant. 
This law is often associated with the &#8220;principle of least effort&#8221;, because people tend to reuse the same words often and rarely use new or different ones.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\"><strong>Step 1: Install gnuplot<\/strong><\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">For Mac, <a href=\"http:\/\/lee-phillips.org\/info\/Macintosh\/gnuplot.html\" target=\"_blank\" rel=\"noreferrer noopener\">check this<\/a>.<br>For Linux, depending on your distribution, it should be as simple as an apt-get install (for Ubuntu you can check <a href=\"http:\/\/www.miscdebris.net\/blog\/2007\/04\/27\/install-gnuplot-on-ubuntu\/\" target=\"_blank\" rel=\"noreferrer noopener\">this howto<\/a>).<br>For Windows, you can either go the &#8220;hard&#8221; way with Cygwin + X11 (see Parts 1, 4 and 5 of <a href=\"http:\/\/www2.warwick.ac.uk\/fac\/sci\/moac\/currentstudents\/peter_cock\/cygwin\/part1\/\" target=\"_blank\" rel=\"noreferrer noopener\">those instructions<\/a>) or the easy way by running pgnuplot.exe from the gpXXXwin32.zip archive available <a href=\"http:\/\/sourceforge.net\/projects\/gnuplot\/files\/\" target=\"_blank\" rel=\"noreferrer noopener\">here<\/a> (this last solution may also be easier if you want to copy\/paste between the gnuplot terminal and other windows).<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Step 2: Generate the Zipf Law data using Java and Moby Dick!<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">As mentioned above, gnuplot makes it particularly simple to draw a set of generated coordinates. All you have to do is generate a file containing one pair of coordinates per line.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">For the sake of the example, I will use the <a href=\"http:\/\/www.gutenberg.org\/etext\/2701\" target=\"_blank\" rel=\"noreferrer noopener\">full raw text of Moby Dick<\/a> to generate the points. 
The goal is to generate a list of points of the form x y where x represents the rank of the word (the most frequent word has rank 1, the second most frequent has rank 2, and so on) and y represents its number of occurrences.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Find below the Java code I used to do that. If you want to execute it, you will need the <a href=\"http:\/\/lucene.apache.org\/\" target=\"_blank\" rel=\"noreferrer noopener\">lucene <\/a>and <a href=\"http:\/\/code.google.com\/p\/google-collections\/\" target=\"_blank\" rel=\"noreferrer noopener\">google collections<\/a> (soon to become part of <a href=\"http:\/\/code.google.com\/p\/guava-libraries\/\" target=\"_blank\" rel=\"noreferrer noopener\">guava<\/a>) libraries.<\/p>\n\n\n\n<div class=\"hcb_wrap\"><pre class=\"prism line-numbers lang-js\" data-lang=\"JavaScript\"><code>import java.io.File;\nimport java.io.FileReader;\nimport java.io.IOException;\nimport java.util.ArrayList;\nimport java.util.Collections;\nimport java.util.Comparator;\nimport java.util.List;\n\nimport org.apache.lucene.analysis.Token;\nimport org.apache.lucene.analysis.TokenStream;\nimport org.apache.lucene.analysis.standard.StandardAnalyzer;\n\nimport com.google.common.collect.HashMultiset;\nimport com.google.common.collect.Multiset;\nimport com.google.common.collect.Multiset.Entry;\n\npublic class ZipfLawOnMobyDick {\n\tpublic static void main(String[] args) throws IOException {\n\n\t\t\/\/Multiset for storing word occurrences\n\t\tMultiset&lt;String&gt; multiset = HashMultiset.create();\n\n\t\t\/\/Creating a standard analyzer with no stop words (we need them to observe the zipf law)\n\t\tString[] STOP_WORDS = {};\n\t\tStandardAnalyzer analyzer = new StandardAnalyzer(STOP_WORDS);\n\n\t\t\/\/Initializing the multiset by parsing the whole content of Moby Dick\n\t\tTokenStream stream = analyzer.tokenStream(&quot;content&quot;, new FileReader(new File(&quot;C:moby_dick.txt&quot;)));\n\t\tToken token = new Token();\n\t\twhile ((token = stream.next(token)) != 
null){\n\t\t\tmultiset.add(token.term());\n\t\t}\n\n\t\t\/\/Sorting the multiset by decreasing number of occurrences using a comparator on the Entries of the multiset\n\t\tList&lt;Multiset.Entry&lt;String&gt;&gt; l = new ArrayList&lt;Multiset.Entry&lt;String&gt;&gt;(multiset.entrySet());\n\t\tComparator&lt;Multiset.Entry&lt;String&gt;&gt; occurrence_comparator = new Comparator&lt;Multiset.Entry&lt;String&gt;&gt;() {\n\t\t\tpublic int compare(Multiset.Entry&lt;String&gt; e1, Multiset.Entry&lt;String&gt; e2) {\n\t\t\t\treturn e2.getCount() - e1.getCount();\n\t\t\t}\n\t\t};\n\t\tCollections.sort(l,occurrence_comparator);\n\n\t\t\/\/Printing one &quot;rank count&quot; pair per line, separated by a tab\n\t\tint rank = 1;\n\t\tfor( Multiset.Entry&lt;String&gt; e : l ){\n\t\t\tSystem.out.println(rank+&quot;\\t&quot;+e.getCount());\n\t\t\trank++;\n\t\t}\n\t}\n}<\/code><\/pre><\/div>\n\n\n\n<pre class=\"wp-block-preformatted\"><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">This will generate the following <a href=\"http:\/\/www.philippeadjiman.com\/blog\/wp-content\/uploads\/2009\/10\/zipfModyDick1.txt\" target=\"_blank\" rel=\"noreferrer noopener\">output <\/a>(the set of coordinates), which you can save in a file called moby_dick.gp. If you&#8217;re curious about the 100 most frequent words of the whole text, you can check them <a href=\"http:\/\/www.philippeadjiman.com\/blog\/wp-content\/uploads\/2009\/10\/mobyDickTop100WordOccurrences1.txt\" target=\"_blank\" rel=\"noreferrer noopener\">here<\/a>.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Step 3: Drawing using gnuplot<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">First, simply type the following command in the gnuplot console (you have to be in the same directory as the moby_dick.gp file):<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">plot [0:500][0:16000] \"moby_dick.gp\"<\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">It simply draws the points and restricts the x and y ranges to [0:500] and [0:16000] respectively, so that the curve is readable.<br>Play with the ranges to see the differences.<br>If you want the dots to be connected, just type:<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">plot [0:500][0:16000] \"moby_dick.gp\" with lines<\/pre>\n\n\n\n<p 
class=\"wp-block-paragraph\">If you want to annotate the graph, you can add labels and arrows.<br>Here is an example of a gnuplot script that sets axis labels and an annotation on the graph (you can simply copy\/paste it into the gnuplot console):<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">set xlabel \"word rank\"\nset ylabel \"# of occurrences\"\nset label 1 \"the word ranked #14 occurs 1753 times\" at 70,4000\nset arrow 1 from 65,3750 to 15,1800\nplot [0:500][0:16000] \"moby_dick.gp\"<\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">As you can see, it is pretty straightforward. You can play with the coordinates to adjust where the label and arrow are placed.<br>You will obtain this graph:<\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter size-full is-resized\"><img data-recalc-dims=\"1\" decoding=\"async\" width=\"300\" height=\"225\" loading=\"lazy\" src=\"https:\/\/i0.wp.com\/philippeadjiman.com\/blog\/wp-content\/uploads\/2009\/10\/moby_dick-300x225-1.png?resize=300%2C225&#038;ssl=1\" alt=\"\" class=\"wp-image-1988\" style=\"width:444px;height:auto\"\/><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">To export it as a PNG file, just type:<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">set terminal png\nset output \"moby_dick.png\"\nplot [0:500][0:16000] \"moby_dick.gp\"<\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">You might also want to try a log scale on the vertical axis so as not to waste most of the graph&#8217;s area (thanks Bob for the remark).<br>To do so, you can simply type in the gnuplot console:<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">set logscale y<\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">By plotting within the range [1:3000][5:10000], you&#8217;ll obtain:<\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter size-full is-resized\"><img data-recalc-dims=\"1\" decoding=\"async\" width=\"300\" height=\"225\" loading=\"lazy\" 
src=\"https:\/\/i0.wp.com\/philippeadjiman.com\/blog\/wp-content\/uploads\/2009\/10\/moby_dick_semilog-300x225-1.png?resize=300%2C225&#038;ssl=1\" alt=\"\" class=\"wp-image-1989\" style=\"width:361px;height:auto\"\/><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">Finally, you might want to use a log-log scale, which is traditionally used to observe such power laws. Just set the logscale for x as you did for y and you&#8217;ll obtain:<\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter size-full is-resized\"><img data-recalc-dims=\"1\" decoding=\"async\" width=\"300\" height=\"225\" loading=\"lazy\" src=\"https:\/\/i0.wp.com\/philippeadjiman.com\/blog\/wp-content\/uploads\/2009\/10\/moby_dick_loglog-300x225-1.png?resize=300%2C225&#038;ssl=1\" alt=\"\" class=\"wp-image-1990\" style=\"width:372px;height:auto\"\/><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">You can of course add as much eye candy as you want (the <a href=\"http:\/\/www.gnuplot.info\/screenshots\/index.html#demos\" target=\"_blank\" rel=\"noreferrer noopener\">demo page of the gnuplot website<\/a> gives tons of examples).<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Also, there are probably dozens of ways to draw the same thing; I just loved the fun and simplicity of this one.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Let&#8217;s use the Moby\u2011Dick text to generate word frequency plots and illustrate Zipf\u2019s law 
programmatically.<\/p>\n","protected":false},"author":1,"featured_media":1992,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"nf_dc_page":"","_jetpack_memberships_contains_paid_content":false,"footnotes":""},"categories":[8,11,17],"tags":[23,24,27,37],"class_list":["post-488","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-experiments","category-java","category-tutorial","tag-experiments","tag-gnuplot","tag-java","tag-tutorial"],"jetpack_featured_media_url":"https:\/\/i0.wp.com\/philippeadjiman.com\/blog\/wp-content\/uploads\/2009\/10\/P_Marine_Mammals.png?fit=627%2C525&ssl=1","jetpack_sharing_enabled":true,"_links":{"self":[{"href":"https:\/\/philippeadjiman.com\/blog\/wp-json\/wp\/v2\/posts\/488","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/philippeadjiman.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/philippeadjiman.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/philippeadjiman.com\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/philippeadjiman.com\/blog\/wp-json\/wp\/v2\/comments?post=488"}],"version-history":[{"count":3,"href":"https:\/\/philippeadjiman.com\/blog\/wp-json\/wp\/v2\/posts\/488\/revisions"}],"predecessor-version":[{"id":1993,"href":"https:\/\/philippeadjiman.com\/blog\/wp-json\/wp\/v2\/posts\/488\/revisions\/1993"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/philippeadjiman.com\/blog\/wp-json\/wp\/v2\/media\/1992"}],"wp:attachment":[{"href":"https:\/\/philippeadjiman.com\/blog\/wp-json\/wp\/v2\/media?parent=488"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/philippeadjiman.com\/blog\/wp-json\/wp\/v2\/categories?post=488"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/philippeadjiman.com\/blog\/wp-json\/wp\/v2\/tags?post=488"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org
\/{rel}","templated":true}]}}