<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Java. Internet. Algorithms. Ideas. &#187; gnuplot</title>
	<atom:link href="http://philippeadjiman.com/blog/tag/gnuplot/feed/" rel="self" type="application/rss+xml" />
	<link>http://philippeadjiman.com/blog</link>
	<description>Just Another Blog About Geek Stuff, by Philippe Adjiman</description>
	<lastBuildDate>Tue, 25 May 2010 06:58:19 +0000</lastBuildDate>
	<generator>http://wordpress.org/?v=2.8.4</generator>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
			<item>
		<title>Drawing A Zipf Law Using Gnuplot, Java and Moby-Dick</title>
		<link>http://philippeadjiman.com/blog/2009/10/26/drawing-the-long-tail-of-a-zipf-law-using-gnuplot-java-and-moby-dick/</link>
		<comments>http://philippeadjiman.com/blog/2009/10/26/drawing-the-long-tail-of-a-zipf-law-using-gnuplot-java-and-moby-dick/#comments</comments>
		<pubDate>Mon, 26 Oct 2009 10:15:25 +0000</pubDate>
		<dc:creator>padjiman</dc:creator>
				<category><![CDATA[experiments]]></category>
		<category><![CDATA[java]]></category>
		<category><![CDATA[tutorial]]></category>
		<category><![CDATA[gnuplot]]></category>

		<guid isPermaLink="false">http://philippeadjiman.com/blog/?p=488</guid>
		<description><![CDATA[There are many tools out there for building all kinds of graphs more or less quickly. Depending on your needs, one tool may suit you better than another. When it comes to drawing graphs from a set of generated coordinates, I love the simplicity of gnuplot.
Let&#8217;s walk through a simple example that explains how to [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://en.wikipedia.org/wiki/Moby-Dick" target="_blank"><img title="whale" src="http://philippeadjiman.com/blog/wp-content/uploads/2009/10/whale-300x214.jpg" alt="whale" hspace="15" width="300" height="214" align="left" /></a>There are many tools out there for building all kinds of graphs more or less quickly. Depending on your needs, one tool may suit you better than another. When it comes to drawing graphs from a set of generated coordinates, I love the simplicity of <a href="http://www.gnuplot.info/" target="_blank">gnuplot</a>.</p>
<p>Let&#8217;s walk through a simple example that explains how to draw the <a href="http://en.wikipedia.org/wiki/Zipf%27s_law" target="_blank">Zipf law</a> observed on a long English text.<br />
If you&#8217;re not familiar with the Zipf law, simply put, it states that the product of the rank (R) of a word and its frequency (F) is roughly constant. This law is also known as the &#8220;principle of least effort&#8221; because people tend to use the same words often and rarely use new or different words.</p>
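<p>To make the "rank times frequency is roughly constant" statement concrete, here is a minimal Java sketch. The class name ZipfProductSketch and the counts are made up for illustration (an idealized F = 1200 / R, not numbers measured on any real text); on real data the products only stay in the same ballpark.</p>

```java
import java.util.Arrays;

class ZipfProductSketch {

    // Returns rank * frequency for each entry. The counts must already be
    // sorted in descending order, so index 0 holds the word of rank 1.
    static long[] rankTimesFrequency(int[] sortedCounts) {
        long[] products = new long[sortedCounts.length];
        for (int i = 0; i < sortedCounts.length; i++) {
            int rank = i + 1;
            products[i] = (long) rank * sortedCounts[i];
        }
        return products;
    }

    public static void main(String[] args) {
        // Assumption: idealized, made-up counts following F = 1200 / R.
        int[] idealCounts = {1200, 600, 400, 300, 240};
        System.out.println(Arrays.toString(rankTimesFrequency(idealCounts)));
    }
}
```

<p>With these idealized counts every product equals 1200. A real text will show some spread around a typical value instead.</p>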
<h2>Step 1: Install gnuplot</h2>
<p>For Mac, <a href="http://lee-phillips.org/info/Macintosh/gnuplot.html" target="_blank">check this</a>.<br />
For Linux, depending on your distribution, it should be as simple as an apt-get install (for Ubuntu you can check <a href="http://www.miscdebris.net/blog/2007/04/27/install-gnuplot-on-ubuntu/" target="_blank">this howto</a>).<br />
For Windows you can either go the &#8220;hard&#8221; way with Cygwin + X11 (see Parts 1, 4 and 5 of <a href="http://www2.warwick.ac.uk/fac/sci/moac/currentstudents/peter_cock/cygwin/part1/" target="_blank">those instructions</a>) or the easy way by running pgnuplot.exe, found in the gpXXXwin32.zip archive available <a href="http://sourceforge.net/projects/gnuplot/files/" target="_blank">here</a> (this last solution may also be easier if you want to copy/paste between the gnuplot terminal and other windows).</p>
<h2>Step 2: Generate the Zipf Law data using Java and Moby Dick!</h2>
<p>As I said above, gnuplot makes it particularly simple to draw a set of generated coordinates. All you have to do is generate a file containing one pair of coordinates per line.</p>
<p>For the sake of the example, I will use the <a href="http://www.gutenberg.org/etext/2701" target="_blank">full raw text of Moby Dick</a> to generate the points. The goal is to generate a list of points of the form x y, where x represents the rank of the word (rank 1 being the most frequent word, rank 2 the second most frequent, and so on) and y represents its number of occurrences.</p>
<p>Below is the Java code I used to do that. If you want to run it, you will need the <a href="http://lucene.apache.org/" target="_blank">lucene</a> and <a href="http://code.google.com/p/google-collections/" target="_blank">google collections</a> (soon to become part of <a href="http://code.google.com/p/guava-libraries/" target="_blank">guava</a>) libraries.</p>

<div class="wp_syntax"><div class="code"><pre class="java" style="font-family:monospace;"><span style="color: #000000; font-weight: bold;">import</span> <span style="color: #006699;">java.io.File</span><span style="color: #339933;">;</span>
<span style="color: #000000; font-weight: bold;">import</span> <span style="color: #006699;">java.io.FileReader</span><span style="color: #339933;">;</span>
<span style="color: #000000; font-weight: bold;">import</span> <span style="color: #006699;">java.io.IOException</span><span style="color: #339933;">;</span>
<span style="color: #000000; font-weight: bold;">import</span> <span style="color: #006699;">java.util.ArrayList</span><span style="color: #339933;">;</span>
<span style="color: #000000; font-weight: bold;">import</span> <span style="color: #006699;">java.util.Collections</span><span style="color: #339933;">;</span>
<span style="color: #000000; font-weight: bold;">import</span> <span style="color: #006699;">java.util.Comparator</span><span style="color: #339933;">;</span>
<span style="color: #000000; font-weight: bold;">import</span> <span style="color: #006699;">java.util.List</span><span style="color: #339933;">;</span>
&nbsp;
<span style="color: #000000; font-weight: bold;">import</span> <span style="color: #006699;">org.apache.lucene.analysis.Token</span><span style="color: #339933;">;</span>
<span style="color: #000000; font-weight: bold;">import</span> <span style="color: #006699;">org.apache.lucene.analysis.TokenStream</span><span style="color: #339933;">;</span>
<span style="color: #000000; font-weight: bold;">import</span> <span style="color: #006699;">org.apache.lucene.analysis.standard.StandardAnalyzer</span><span style="color: #339933;">;</span>
&nbsp;
<span style="color: #000000; font-weight: bold;">import</span> <span style="color: #006699;">com.google.common.collect.HashMultiset</span><span style="color: #339933;">;</span>
<span style="color: #000000; font-weight: bold;">import</span> <span style="color: #006699;">com.google.common.collect.Multiset</span><span style="color: #339933;">;</span>
<span style="color: #000000; font-weight: bold;">import</span> <span style="color: #006699;">com.google.common.collect.Multiset.Entry</span><span style="color: #339933;">;</span>
&nbsp;
<span style="color: #000000; font-weight: bold;">public</span> <span style="color: #000000; font-weight: bold;">class</span> ZipfLawOnMobyDick <span style="color: #009900;">&#123;</span>
	<span style="color: #000000; font-weight: bold;">public</span> <span style="color: #000000; font-weight: bold;">static</span> <span style="color: #000066; font-weight: bold;">void</span> main<span style="color: #009900;">&#40;</span><span style="color: #003399;">String</span><span style="color: #009900;">&#91;</span><span style="color: #009900;">&#93;</span> args<span style="color: #009900;">&#41;</span> <span style="color: #000000; font-weight: bold;">throws</span> <span style="color: #003399;">IOException</span> <span style="color: #009900;">&#123;</span>
&nbsp;
		<span style="color: #666666; font-style: italic;">//Multiset for storing word occurrences</span>
		Multiset&lt;String&gt; multiset <span style="color: #339933;">=</span> HashMultiset.<span style="color: #006633;">create</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
&nbsp;
		<span style="color: #666666; font-style: italic;">//Creating a standard analyzer with no stop words (we need them to observe the zipf law)</span>
		<span style="color: #003399;">String</span><span style="color: #009900;">&#91;</span><span style="color: #009900;">&#93;</span> STOP_WORDS <span style="color: #339933;">=</span> <span style="color: #009900;">&#123;</span><span style="color: #009900;">&#125;</span><span style="color: #339933;">;</span>
		StandardAnalyzer analyzer <span style="color: #339933;">=</span> <span style="color: #000000; font-weight: bold;">new</span> StandardAnalyzer<span style="color: #009900;">&#40;</span>STOP_WORDS<span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
&nbsp;
		<span style="color: #666666; font-style: italic;">//Initializing the multiset by parsing the whole content of Moby Dick</span>
		TokenStream stream <span style="color: #339933;">=</span> analyzer.<span style="color: #006633;">tokenStream</span><span style="color: #009900;">&#40;</span><span style="color: #0000ff;">&quot;content&quot;</span>, <span style="color: #000000; font-weight: bold;">new</span> <span style="color: #003399;">FileReader</span><span style="color: #009900;">&#40;</span><span style="color: #000000; font-weight: bold;">new</span> <span style="color: #003399;">File</span><span style="color: #009900;">&#40;</span><span style="color: #0000ff;">&quot;C:<span style="color: #000099; font-weight: bold;">\\</span>moby_dick.txt&quot;</span><span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
		Token token <span style="color: #339933;">=</span> <span style="color: #000000; font-weight: bold;">new</span> Token<span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
		<span style="color: #000000; font-weight: bold;">while</span> <span style="color: #009900;">&#40;</span><span style="color: #009900;">&#40;</span>token <span style="color: #339933;">=</span> stream.<span style="color: #006633;">next</span><span style="color: #009900;">&#40;</span>token<span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span> <span style="color: #339933;">!=</span> <span style="color: #000066; font-weight: bold;">null</span><span style="color: #009900;">&#41;</span><span style="color: #009900;">&#123;</span>
			multiset.<span style="color: #006633;">add</span><span style="color: #009900;">&#40;</span>token.<span style="color: #006633;">term</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
		<span style="color: #009900;">&#125;</span>
&nbsp;
		<span style="color: #666666; font-style: italic;">//Sorting the multiset by number of occurrences using a comparator on the Entries of the multiset</span>
		List&lt;Multiset.<span style="color: #006633;">Entry</span>&lt;String&gt;&gt; l <span style="color: #339933;">=</span> <span style="color: #000000; font-weight: bold;">new</span> ArrayList&lt;Multiset.<span style="color: #006633;">Entry</span>&lt;String&gt;&gt;<span style="color: #009900;">&#40;</span>multiset.<span style="color: #006633;">entrySet</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
		Comparator&lt;Multiset.<span style="color: #006633;">Entry</span>&lt;String&gt;&gt; occurence_comparator <span style="color: #339933;">=</span> <span style="color: #000000; font-weight: bold;">new</span> Comparator&lt;Multiset.<span style="color: #006633;">Entry</span>&lt;String&gt;&gt;<span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#123;</span>
			<span style="color: #000000; font-weight: bold;">public</span> <span style="color: #000066; font-weight: bold;">int</span> compare<span style="color: #009900;">&#40;</span>Multiset.<span style="color: #006633;">Entry</span>&lt;String&gt; e1, Multiset.<span style="color: #006633;">Entry</span>&lt;String&gt; e2<span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#123;</span>
				<span style="color: #000000; font-weight: bold;">return</span> e2.<span style="color: #006633;">getCount</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span> <span style="color: #339933;">-</span> e1.<span style="color: #006633;">getCount</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span> <span style="color: #339933;">;</span>
			<span style="color: #009900;">&#125;</span>
		<span style="color: #009900;">&#125;</span><span style="color: #339933;">;</span>
		<span style="color: #003399;">Collections</span>.<span style="color: #006633;">sort</span><span style="color: #009900;">&#40;</span>l,occurence_comparator<span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
&nbsp;
		<span style="color: #000066; font-weight: bold;">int</span> rank <span style="color: #339933;">=</span> <span style="color: #cc66cc;">1</span><span style="color: #339933;">;</span>
		<span style="color: #000000; font-weight: bold;">for</span><span style="color: #009900;">&#40;</span> Multiset.<span style="color: #006633;">Entry</span> e <span style="color: #339933;">:</span> l <span style="color: #009900;">&#41;</span><span style="color: #009900;">&#123;</span>
			<span style="color: #003399;">System</span>.<span style="color: #006633;">out</span>.<span style="color: #006633;">println</span><span style="color: #009900;">&#40;</span>rank<span style="color: #339933;">+</span><span style="color: #0000ff;">&quot;<span style="color: #000099; font-weight: bold;">\t</span>&quot;</span><span style="color: #339933;">+</span>e.<span style="color: #006633;">getCount</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
			rank<span style="color: #339933;">++;</span>
		<span style="color: #009900;">&#125;</span>
	<span style="color: #009900;">&#125;</span>
<span style="color: #009900;">&#125;</span></pre></div></div>

<p>This will generate the following <a href="http://philippeadjiman.com/blog/wp-content/uploads/2009/10/zipfModyDick.txt" target="_blank">output</a> (the set of coordinates), which you can save in a file called moby_dick.gp. If you&#8217;re curious about the 100 most frequent words of the whole text, you can check them <a href="http://philippeadjiman.com/blog/wp-content/uploads/2009/10/mobyDickTop100WordOccurrences.txt" target="_blank">here</a>.</p>
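<p>If you would rather not pull in the two libraries, the same rank/count lines can be sketched with the standard library only. This is a rough alternative under stated assumptions, not the code used for this post: the class name RankCountSketch is made up, and the tokenizer is a simple regular expression (a word is a run of letters or apostrophes), which will not split text exactly like the StandardAnalyzer, so the counts may differ slightly.</p>

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

class RankCountSketch {

    // Counts lower-cased words and returns "rank<TAB>count" lines,
    // most frequent word first.
    static List<String> rankCountLines(String text) {
        Map<String, Integer> counts = new HashMap<>();
        // Assumption: a word is a run of letters or apostrophes.
        for (String word : text.toLowerCase(Locale.ROOT).split("[^\\p{L}']+")) {
            if (!word.isEmpty()) {
                counts.merge(word, 1, Integer::sum);
            }
        }

        // Only the counts matter for the plot, so sort them in descending order.
        List<Integer> sorted = new ArrayList<>(counts.values());
        sorted.sort((a, b) -> Integer.compare(b, a));

        List<String> lines = new ArrayList<>();
        int rank = 1;
        for (int count : sorted) {
            lines.add(rank + "\t" + count);
            rank++;
        }
        return lines;
    }

    public static void main(String[] args) {
        // Tiny inline sample; replace it with the content of your text file.
        for (String line : rankCountLines("the whale, the sea, the whale")) {
            System.out.println(line);
        }
    }
}
```

<p>To run it on the whole book, you could read the file into a string first (for instance with Files.readString on Java 11 or later) and redirect the output to moby_dick.gp.</p>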
<h2>Step 3: Drawing using gnuplot</h2>
<p>The first thing you can do is simply type the following command in the gnuplot console (you have to be in the same directory as the moby_dick.gp file):</p>

<div class="wp_syntax"><div class="code"><pre class="none" style="font-family:monospace;">plot [0:500][0:16000] &quot;moby_dick.gp&quot;</pre></div></div>

<p>It simply draws the points and restricts the ranges of x and y to [0:500] and [0:16000] respectively, so we can see something.<br />
Play with the ranges to see the differences.<br />
If you want the dots to be connected, just type:</p>

<div class="wp_syntax"><div class="code"><pre class="none" style="font-family:monospace;">plot [0:500][0:16000] &quot;moby_dick.gp&quot; with lines</pre></div></div>

<p>If you want to add some legends, you can put some labels and arrows.<br />
Here is an example of a gnuplot script that will set some information on the graph (you can simply copy/paste it in the gnuplot console):</p>

<div class="wp_syntax"><div class="code"><pre class="none" style="font-family:monospace;">set xlabel &quot;word rank&quot;
set ylabel &quot;# of occurrences&quot;
set label 1 &quot;the word ranked #14 occurs 1753 times&quot; at 70,4000
set arrow 1 from 65,3750 to 15,1800
plot [0:500][0:16000] &quot;moby_dick.gp&quot;</pre></div></div>

<p>As you can see it is pretty straightforward. You can play with the coordinates to adjust where to put the labels and arrow.<br />
You will obtain this graph (click to enlarge):</p>
<p><a href="http://philippeadjiman.com/blog/wp-content/uploads/2009/10/moby_dick.png" target="_blank"><img class="aligncenter size-medium wp-image-504" title="moby_dick" src="http://philippeadjiman.com/blog/wp-content/uploads/2009/10/moby_dick-300x225.png" alt="moby_dick" width="300" height="225" /></a></p>
<p>To export it as a png file just type:</p>

<div class="wp_syntax"><div class="code"><pre class="none" style="font-family:monospace;">set terminal png
set output &quot;moby_dick.png&quot;
plot [0:500][0:16000] &quot;moby_dick.gp&quot;</pre></div></div>

<p>You might also want to try a log scale on the vertical axis so as not to waste most of the graph&#8217;s scale (thanks Bob for the remark).<br />
To do so, you can simply type in the gnuplot console:</p>

<div class="wp_syntax"><div class="code"><pre class="none" style="font-family:monospace;">set logscale y</pre></div></div>

<p>By plotting within the range [1:3000][5:10000], you&#8217;ll obtain:</p>
<p><a href="http://philippeadjiman.com/blog/wp-content/uploads/2009/10/moby_dick_semilog.png" target="_blank"><img class="aligncenter size-medium wp-image-592" title="moby_dick_semilog" src="http://philippeadjiman.com/blog/wp-content/uploads/2009/10/moby_dick_semilog-300x225.png" alt="moby_dick_semilog" width="300" height="225" /></a></p>
<p>Finally, you might want to use a log-log scale, which is traditionally used to observe such power laws. Just set the logscale for x as you did for y and you&#8217;ll obtain:</p>
<p><a href="http://philippeadjiman.com/blog/wp-content/uploads/2009/10/moby_dick_loglog.png" target="_blank"><img class="aligncenter size-medium wp-image-593" title="moby_dick_loglog" src="http://philippeadjiman.com/blog/wp-content/uploads/2009/10/moby_dick_loglog-300x225.png" alt="moby_dick_loglog" width="300" height="225" /></a></p>
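<p>Why the log-log scale helps: if F is roughly C / R, then log F is roughly log C - log R, which is a straight line of slope -1. The Java sketch below estimates that slope with an ordinary least-squares fit. The class name LogLogSlopeSketch and the counts are made up for illustration (again the idealized F = 1200 / R), so treat it as a way to check the idea, not as a measurement on Moby Dick.</p>

```java
class LogLogSlopeSketch {

    // Least-squares slope of log(count) against log(rank), where
    // sortedCounts[i] is the count of the word ranked i + 1.
    // Assumes at least two entries and strictly positive counts.
    static double slope(int[] sortedCounts) {
        int n = sortedCounts.length;
        double sumX = 0, sumY = 0, sumXY = 0, sumXX = 0;
        for (int i = 0; i < n; i++) {
            double x = Math.log(i + 1);
            double y = Math.log(sortedCounts[i]);
            sumX += x;
            sumY += y;
            sumXY += x * y;
            sumXX += x * x;
        }
        return (n * sumXY - sumX * sumY) / (n * sumXX - sumX * sumX);
    }

    public static void main(String[] args) {
        // Assumption: idealized, made-up counts following F = 1200 / R.
        int[] idealCounts = {1200, 600, 400, 300, 240};
        System.out.println("slope = " + slope(idealCounts));
    }
}
```

<p>On the idealized counts the slope comes out very close to -1. On real counts, expect a value that depends on which range of ranks you include in the fit.</p>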
<p>You can of course add as much eye candy as you want (the <a href="http://www.gnuplot.info/screenshots/index.html#demos" target="_blank">demo page of the gnuplot website</a> gives tons of examples).</p>
<p>Also, there are probably dozens of ways to draw the same thing. I just loved the fun and simplicity of this one.</p>
]]></content:encoded>
			<wfw:commentRss>http://philippeadjiman.com/blog/2009/10/26/drawing-the-long-tail-of-a-zipf-law-using-gnuplot-java-and-moby-dick/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
	</channel>
</rss>
