<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Java. Internet. Algorithms. Ideas. &#187; java</title>
	<atom:link href="http://philippeadjiman.com/blog/tag/java/feed/" rel="self" type="application/rss+xml" />
	<link>http://philippeadjiman.com/blog</link>
	<description>Just Another Blog About Geek Stuff, by Philippe Adjiman</description>
	<lastBuildDate>Tue, 25 May 2010 06:58:19 +0000</lastBuildDate>
	<generator>http://wordpress.org/?v=2.8.4</generator>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
			<item>
		<title>Writing A Token N-Grams Analyzer In Few Lines Of Code Using Lucene</title>
		<link>http://philippeadjiman.com/blog/2009/11/02/writing-a-token-n-grams-analyzer-in-few-lines-of-code-using-lucene/</link>
		<comments>http://philippeadjiman.com/blog/2009/11/02/writing-a-token-n-grams-analyzer-in-few-lines-of-code-using-lucene/#comments</comments>
		<pubDate>Mon, 02 Nov 2009 15:54:20 +0000</pubDate>
		<dc:creator>padjiman</dc:creator>
				<category><![CDATA[java]]></category>
		<category><![CDATA[tutorial]]></category>
		<category><![CDATA[lucene]]></category>

		<guid isPermaLink="false">http://philippeadjiman.com/blog/?p=523</guid>
		<description><![CDATA[ If you need to extract the token n-grams of a string, you can use the facilities offered by Lucene analyzers.
What you simply have to do is build your own analyzer using a ShingleMatrixFilter with the parameters that suit your needs. For instance, here are the few lines of code needed to build a token bi-grams [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://lucene.apache.org/"><img class="alignleft size-full wp-image-524" title="lucene_green_300" src="http://philippeadjiman.com/blog/wp-content/uploads/2009/11/lucene_green_300.gif" alt="lucene_green_300" hspace="15" width="300" height="46" align="left" /></a> If you need to extract the token n-grams of a string, you can use the facilities offered by Lucene analyzers.</p>
<p>What you simply have to do is build your own analyzer using a ShingleMatrixFilter with the parameters that suit your needs. For instance, here are the few lines of code needed to build a token bi-grams analyzer:</p>

<div class="wp_syntax"><div class="code"><pre class="java" style="font-family:monospace;"><span style="color: #000000; font-weight: bold;">public</span> <span style="color: #000000; font-weight: bold;">class</span> NGramAnalyzer <span style="color: #000000; font-weight: bold;">extends</span> Analyzer <span style="color: #009900;">&#123;</span>
	@Override
    <span style="color: #000000; font-weight: bold;">public</span> TokenStream tokenStream<span style="color: #009900;">&#40;</span><span style="color: #003399;">String</span> fieldName, <span style="color: #003399;">Reader</span> reader<span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#123;</span>
       <span style="color: #000000; font-weight: bold;">return</span> <span style="color: #000000; font-weight: bold;">new</span> StopFilter<span style="color: #009900;">&#40;</span><span style="color: #000000; font-weight: bold;">new</span> LowerCaseFilter<span style="color: #009900;">&#40;</span><span style="color: #000000; font-weight: bold;">new</span> ShingleMatrixFilter<span style="color: #009900;">&#40;</span><span style="color: #000000; font-weight: bold;">new</span> StandardTokenizer<span style="color: #009900;">&#40;</span>reader<span style="color: #009900;">&#41;</span>,<span style="color: #cc66cc;">2</span>,<span style="color: #cc66cc;">2</span>,<span style="color: #0000ff;">' '</span><span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span>,
           StopAnalyzer.<span style="color: #006633;">ENGLISH_STOP_WORDS</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
     <span style="color: #009900;">&#125;</span>
<span style="color: #009900;">&#125;</span></pre></div></div>

<p>The two integer parameters of the ShingleMatrixFilter simply state the minimum and maximum shingle size (both 2 here, hence bi-grams), and the last parameter is the character placed between the tokens of a shingle. &#8220;Shingle&#8221; is just another name for a token n-gram; shingles are popular as the basic units for solving problems in spell checking, near-duplicate detection and others.<br />
Note also the use of a StandardTokenizer to deal with basic special characters like hyphens or other &#8220;disturbers&#8221;.</p>
<p>To use the analyzer, you can for instance do:</p>

<div class="wp_syntax"><div class="code"><pre class="java" style="font-family:monospace;">	<span style="color: #000000; font-weight: bold;">public</span> <span style="color: #000000; font-weight: bold;">static</span> <span style="color: #000066; font-weight: bold;">void</span> main<span style="color: #009900;">&#40;</span><span style="color: #003399;">String</span><span style="color: #009900;">&#91;</span><span style="color: #009900;">&#93;</span> args<span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#123;</span>
		<span style="color: #000000; font-weight: bold;">try</span> <span style="color: #009900;">&#123;</span>
			<span style="color: #003399;">String</span> str <span style="color: #339933;">=</span> <span style="color: #0000ff;">&quot;An easy way to write an analyzer for tokens bi-gram (or even tokens n-grams) with lucene&quot;</span><span style="color: #339933;">;</span>
			Analyzer analyzer <span style="color: #339933;">=</span> <span style="color: #000000; font-weight: bold;">new</span> NGramAnalyzer<span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
&nbsp;
			TokenStream stream <span style="color: #339933;">=</span> analyzer.<span style="color: #006633;">tokenStream</span><span style="color: #009900;">&#40;</span><span style="color: #0000ff;">&quot;content&quot;</span>, <span style="color: #000000; font-weight: bold;">new</span> <span style="color: #003399;">StringReader</span><span style="color: #009900;">&#40;</span>str<span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
			Token token <span style="color: #339933;">=</span> <span style="color: #000000; font-weight: bold;">new</span> Token<span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
			<span style="color: #000000; font-weight: bold;">while</span> <span style="color: #009900;">&#40;</span><span style="color: #009900;">&#40;</span>token <span style="color: #339933;">=</span> stream.<span style="color: #006633;">next</span><span style="color: #009900;">&#40;</span>token<span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span> <span style="color: #339933;">!=</span> <span style="color: #000066; font-weight: bold;">null</span><span style="color: #009900;">&#41;</span><span style="color: #009900;">&#123;</span>
				<span style="color: #003399;">System</span>.<span style="color: #006633;">out</span>.<span style="color: #006633;">println</span><span style="color: #009900;">&#40;</span>token.<span style="color: #006633;">term</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
			<span style="color: #009900;">&#125;</span>
&nbsp;
		<span style="color: #009900;">&#125;</span> <span style="color: #000000; font-weight: bold;">catch</span> <span style="color: #009900;">&#40;</span><span style="color: #003399;">IOException</span> ie<span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#123;</span>
			<span style="color: #003399;">System</span>.<span style="color: #006633;">out</span>.<span style="color: #006633;">println</span><span style="color: #009900;">&#40;</span><span style="color: #0000ff;">&quot;IO Error &quot;</span> <span style="color: #339933;">+</span> ie.<span style="color: #006633;">getMessage</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
		<span style="color: #009900;">&#125;</span>
	<span style="color: #009900;">&#125;</span></pre></div></div>

<p>The output will print:</p>

<div class="wp_syntax"><div class="code"><pre class="none" style="font-family:monospace;">an easy
easy way
way to
to write
write an
an analyzer
analyzer for
for tokens
tokens bi
bi gram
gram or
or even
even tokens
tokens n
n grams
grams with
with lucene</pre></div></div>

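<p>Note that the text &#8220;bi-gram&#8221; was treated as two different tokens, which is the desired consequence of using a StandardTokenizer in the ShingleMatrixFilter initialization. Note also that stop words such as &#8220;an&#8221; or &#8220;to&#8221; still appear in the output: the StopFilter is applied after the shingles are built, so it only sees whole bi-grams, none of which is itself a stop word.</p>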
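<p>To make the notion of a shingle concrete without any Lucene dependency, here is a minimal plain-JDK sketch. It is not how ShingleMatrixFilter is implemented, and the class and method names (ShingleSketch, tokenize, shingles) are made up for this illustration. The tokenizer is a crude stand-in for StandardTokenizer: it assumes that lowercasing and splitting on anything that is not a letter or a digit is good enough.</p>

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Plain-JDK sketch of token n-grams ("shingles"); this is not Lucene code. */
final class ShingleSketch {

    /** Lowercases the text and splits it on every run of non-letter, non-digit characters. */
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<String>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) { // a leading separator yields one empty string
                tokens.add(t);
            }
        }
        return tokens;
    }

    /** Returns every run of n consecutive tokens, joined by a single space. */
    static List<String> shingles(String text, int n) {
        if (n < 1) {
            throw new IllegalArgumentException("n must be at least 1");
        }
        List<String> tokens = tokenize(text);
        List<String> result = new ArrayList<String>();
        for (int i = 0; i + n <= tokens.size(); i++) {
            result.add(String.join(" ", tokens.subList(i, i + n)));
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical input: a fragment of the sentence used in the post.
        for (String s : shingles("tokens bi-gram (or even tokens n-grams)", 2)) {
            System.out.println(s);
        }
    }
}
```

<p>Calling shingles(text, 3) would give token tri-grams in the same way. For real indexing work the Lucene filter remains the better choice, since it plugs directly into an analyzer chain.</p>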
]]></content:encoded>
			<wfw:commentRss>http://philippeadjiman.com/blog/2009/11/02/writing-a-token-n-grams-analyzer-in-few-lines-of-code-using-lucene/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Drawing A Zipf Law Using Gnuplot, Java and Moby-Dick</title>
		<link>http://philippeadjiman.com/blog/2009/10/26/drawing-the-long-tail-of-a-zipf-law-using-gnuplot-java-and-moby-dick/</link>
		<comments>http://philippeadjiman.com/blog/2009/10/26/drawing-the-long-tail-of-a-zipf-law-using-gnuplot-java-and-moby-dick/#comments</comments>
		<pubDate>Mon, 26 Oct 2009 10:15:25 +0000</pubDate>
		<dc:creator>padjiman</dc:creator>
				<category><![CDATA[experiments]]></category>
		<category><![CDATA[java]]></category>
		<category><![CDATA[tutorial]]></category>
		<category><![CDATA[gnuplot]]></category>

		<guid isPermaLink="false">http://philippeadjiman.com/blog/?p=488</guid>
		<description><![CDATA[There are many tools out there for building, more or less quickly, any kind of graph. Depending on your needs, one tool may be better suited than another. When it comes to drawing graphs from a set of generated coordinates, I love the simplicity of gnuplot.
Let&#8217;s go through a simple example that explains how to [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://en.wikipedia.org/wiki/Moby-Dick" target="_blank"><img title="whale" src="http://philippeadjiman.com/blog/wp-content/uploads/2009/10/whale-300x214.jpg" alt="whale" hspace="15" width="300" height="214" align="left" /></a>There are many tools out there for building, more or less quickly, any kind of graph. Depending on your needs, one tool may be better suited than another. When it comes to drawing graphs from a set of generated coordinates, I love the simplicity of <a href="http://www.gnuplot.info/" target="_blank">gnuplot</a>.</p>
<p>Let&#8217;s go through a simple example that explains how to draw a <a href="http://en.wikipedia.org/wiki/Zipf%27s_law" target="_blank">Zipf law</a> observed on a long English text.<br />
If you&#8217;re not familiar with Zipf&#8217;s law, simply put, it states that the product of the rank (R) of a word and its frequency (F) is roughly constant. This law is also known as the &#8220;principle of least effort&#8221; because people tend to use the same words often and rarely use new or different words.</p>
<h2>Step 1: Install gnuplot</h2>
<p>For Mac, <a href="http://lee-phillips.org/info/Macintosh/gnuplot.html" target="_blank">check this</a>.<br />
For Linux, depending on your distribution, it should be as simple as an apt-get install (for Ubuntu you can check <a href="http://www.miscdebris.net/blog/2007/04/27/install-gnuplot-on-ubuntu/" target="_blank">this howto</a>).<br />
For Windows, you can either go the &#8220;hard&#8221; way with cygwin + X11 (see Parts 1, 4 and 5 of <a href="http://www2.warwick.ac.uk/fac/sci/moac/currentstudents/peter_cock/cygwin/part1/" target="_blank">those instructions</a>) or the easy way by running the pgnuplot.exe found in the gpXXXwin32.zip available <a href="http://sourceforge.net/projects/gnuplot/files/" target="_blank">here</a> (this last solution may also be easier if you want to copy/paste between the gnuplot terminal and other windows).</p>
<h2>Step 2: Generate the Zipf Law data using Java and Moby Dick!</h2>
<p>As I told you above, gnuplot is particularly simple for drawing a set of generated coordinates. All you have to do is generate a file containing one pair of coordinates on each line.</p>
<p>For the sake of the example, I will use the <a href="http://www.gutenberg.org/etext/2701" target="_blank">full raw text of Moby Dick</a> to generate the points. The goal is to generate a list of points of the form x y, where x represents the rank of the word (rank 1 being the most frequent word) and y represents its number of occurrences.</p>
<p>Below is the Java code I used to do that. If you want to execute it, you will need the <a href="http://lucene.apache.org/" target="_blank">lucene </a>and <a href="http://code.google.com/p/google-collections/" target="_blank">google collections</a> (soon to become part of <a href="http://code.google.com/p/guava-libraries/" target="_blank">guava</a>) libraries.</p>

<div class="wp_syntax"><div class="code"><pre class="java" style="font-family:monospace;"><span style="color: #000000; font-weight: bold;">import</span> <span style="color: #006699;">java.io.File</span><span style="color: #339933;">;</span>
<span style="color: #000000; font-weight: bold;">import</span> <span style="color: #006699;">java.io.FileReader</span><span style="color: #339933;">;</span>
<span style="color: #000000; font-weight: bold;">import</span> <span style="color: #006699;">java.io.IOException</span><span style="color: #339933;">;</span>
<span style="color: #000000; font-weight: bold;">import</span> <span style="color: #006699;">java.util.ArrayList</span><span style="color: #339933;">;</span>
<span style="color: #000000; font-weight: bold;">import</span> <span style="color: #006699;">java.util.Collections</span><span style="color: #339933;">;</span>
<span style="color: #000000; font-weight: bold;">import</span> <span style="color: #006699;">java.util.Comparator</span><span style="color: #339933;">;</span>
<span style="color: #000000; font-weight: bold;">import</span> <span style="color: #006699;">java.util.List</span><span style="color: #339933;">;</span>
&nbsp;
<span style="color: #000000; font-weight: bold;">import</span> <span style="color: #006699;">org.apache.lucene.analysis.Token</span><span style="color: #339933;">;</span>
<span style="color: #000000; font-weight: bold;">import</span> <span style="color: #006699;">org.apache.lucene.analysis.TokenStream</span><span style="color: #339933;">;</span>
<span style="color: #000000; font-weight: bold;">import</span> <span style="color: #006699;">org.apache.lucene.analysis.standard.StandardAnalyzer</span><span style="color: #339933;">;</span>
&nbsp;
<span style="color: #000000; font-weight: bold;">import</span> <span style="color: #006699;">com.google.common.collect.HashMultiset</span><span style="color: #339933;">;</span>
<span style="color: #000000; font-weight: bold;">import</span> <span style="color: #006699;">com.google.common.collect.Multiset</span><span style="color: #339933;">;</span>
<span style="color: #000000; font-weight: bold;">import</span> <span style="color: #006699;">com.google.common.collect.Multiset.Entry</span><span style="color: #339933;">;</span>
&nbsp;
<span style="color: #000000; font-weight: bold;">public</span> <span style="color: #000000; font-weight: bold;">class</span> ZipfLawOnMobyDick <span style="color: #009900;">&#123;</span>
	<span style="color: #000000; font-weight: bold;">public</span> <span style="color: #000000; font-weight: bold;">static</span> <span style="color: #000066; font-weight: bold;">void</span> main<span style="color: #009900;">&#40;</span><span style="color: #003399;">String</span><span style="color: #009900;">&#91;</span><span style="color: #009900;">&#93;</span> args<span style="color: #009900;">&#41;</span> <span style="color: #000000; font-weight: bold;">throws</span> <span style="color: #003399;">IOException</span> <span style="color: #009900;">&#123;</span>
&nbsp;
		<span style="color: #666666; font-style: italic;">//Multiset for storing word occurrences</span>
		Multiset&lt;String&gt; multiset <span style="color: #339933;">=</span> HashMultiset.<span style="color: #006633;">create</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
&nbsp;
		<span style="color: #666666; font-style: italic;">//Creating a standard analyzer with no stop words (we need them to observe the zipf law)</span>
		<span style="color: #003399;">String</span><span style="color: #009900;">&#91;</span><span style="color: #009900;">&#93;</span> STOP_WORDS <span style="color: #339933;">=</span> <span style="color: #009900;">&#123;</span><span style="color: #009900;">&#125;</span><span style="color: #339933;">;</span>
		StandardAnalyzer analyzer <span style="color: #339933;">=</span> <span style="color: #000000; font-weight: bold;">new</span> StandardAnalyzer<span style="color: #009900;">&#40;</span>STOP_WORDS<span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
&nbsp;
		<span style="color: #666666; font-style: italic;">//Initializing the multiset by parsing the whole content of Moby Dick</span>
		TokenStream stream <span style="color: #339933;">=</span> analyzer.<span style="color: #006633;">tokenStream</span><span style="color: #009900;">&#40;</span><span style="color: #0000ff;">&quot;content&quot;</span>, <span style="color: #000000; font-weight: bold;">new</span> <span style="color: #003399;">FileReader</span><span style="color: #009900;">&#40;</span><span style="color: #000000; font-weight: bold;">new</span> <span style="color: #003399;">File</span><span style="color: #009900;">&#40;</span><span style="color: #0000ff;">&quot;C:<span style="color: #000099; font-weight: bold;">\\</span>moby_dick.txt&quot;</span><span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
		Token token <span style="color: #339933;">=</span> <span style="color: #000000; font-weight: bold;">new</span> Token<span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
		<span style="color: #000000; font-weight: bold;">while</span> <span style="color: #009900;">&#40;</span><span style="color: #009900;">&#40;</span>token <span style="color: #339933;">=</span> stream.<span style="color: #006633;">next</span><span style="color: #009900;">&#40;</span>token<span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span> <span style="color: #339933;">!=</span> <span style="color: #000066; font-weight: bold;">null</span><span style="color: #009900;">&#41;</span><span style="color: #009900;">&#123;</span>
			multiset.<span style="color: #006633;">add</span><span style="color: #009900;">&#40;</span>token.<span style="color: #006633;">term</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
		<span style="color: #009900;">&#125;</span>
&nbsp;
		<span style="color: #666666; font-style: italic;">//Sorting the multiset by number of occurrences using a comparator on the Entries of the multiset</span>
		List&lt;Multiset.Entry&lt;String&gt;&gt; l <span style="color: #339933;">=</span> <span style="color: #000000; font-weight: bold;">new</span> ArrayList&lt;Multiset.Entry&lt;String&gt;&gt;<span style="color: #009900;">&#40;</span>multiset.<span style="color: #006633;">entrySet</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
		Comparator&lt;Multiset.Entry&lt;String&gt;&gt; occurence_comparator <span style="color: #339933;">=</span> <span style="color: #000000; font-weight: bold;">new</span> Comparator&lt;Multiset.Entry&lt;String&gt;&gt;<span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#123;</span>
			<span style="color: #000000; font-weight: bold;">public</span> <span style="color: #000066; font-weight: bold;">int</span> compare<span style="color: #009900;">&#40;</span>Multiset.<span style="color: #006633;">Entry</span>&lt;String&gt; e1, Multiset.<span style="color: #006633;">Entry</span>&lt;String&gt; e2<span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#123;</span>
				<span style="color: #000000; font-weight: bold;">return</span> e2.<span style="color: #006633;">getCount</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span> <span style="color: #339933;">-</span> e1.<span style="color: #006633;">getCount</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span> <span style="color: #339933;">;</span>
			<span style="color: #009900;">&#125;</span>
		<span style="color: #009900;">&#125;</span><span style="color: #339933;">;</span>
		<span style="color: #003399;">Collections</span>.<span style="color: #006633;">sort</span><span style="color: #009900;">&#40;</span>l,occurence_comparator<span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
&nbsp;
		<span style="color: #000066; font-weight: bold;">int</span> rank <span style="color: #339933;">=</span> <span style="color: #cc66cc;">1</span><span style="color: #339933;">;</span>
		<span style="color: #000000; font-weight: bold;">for</span><span style="color: #009900;">&#40;</span> Multiset.<span style="color: #006633;">Entry</span> e <span style="color: #339933;">:</span> l <span style="color: #009900;">&#41;</span><span style="color: #009900;">&#123;</span>
			<span style="color: #003399;">System</span>.<span style="color: #006633;">out</span>.<span style="color: #006633;">println</span><span style="color: #009900;">&#40;</span>rank<span style="color: #339933;">+</span><span style="color: #0000ff;">&quot;<span style="color: #000099; font-weight: bold;">\t</span>&quot;</span><span style="color: #339933;">+</span>e.<span style="color: #006633;">getCount</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
			rank<span style="color: #339933;">++;</span>
		<span style="color: #009900;">&#125;</span>
	<span style="color: #009900;">&#125;</span>
<span style="color: #009900;">&#125;</span></pre></div></div>

<p>This will generate the following <a href="http://philippeadjiman.com/blog/wp-content/uploads/2009/10/zipfModyDick.txt" target="_blank">output </a>(the set of coordinates), which you can put in a file called moby_dick.gp. If you&#8217;re curious about the 100 most frequent words of the whole text, you can check them <a href="http://philippeadjiman.com/blog/wp-content/uploads/2009/10/mobyDickTop100WordOccurrences.txt" target="_blank">here</a>.</p>
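<p>If you do not have lucene and google collections at hand, the same kind of rank/count file can be approximated with the JDK alone. The following is only a sketch under stated assumptions: the class name RankFrequencySketch and its rankLines method are made up for this illustration, the input file is assumed to be UTF-8, and words are split on anything that is not a letter or a digit. That is cruder than what StandardAnalyzer does, so the counts will not match the linked output exactly.</p>

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** JDK-only sketch: word counts sorted by frequency, printed as "rank<TAB>count". */
final class RankFrequencySketch {

    /** Returns one "rank<TAB>count" line per distinct word, rank 1 being the most frequent. */
    static List<String> rankLines(String text) {
        // Count each lowercased word (a run of letters or digits).
        Map<String, Integer> counts = new HashMap<String, Integer>();
        for (String w : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!w.isEmpty()) {
                counts.merge(w, 1, Integer::sum);
            }
        }
        // Only the counts matter for the plot, so sort them from largest to smallest.
        List<Integer> sorted = new ArrayList<Integer>(counts.values());
        Collections.sort(sorted, Collections.reverseOrder());

        List<String> lines = new ArrayList<String>();
        for (int i = 0; i < sorted.size(); i++) {
            lines.add((i + 1) + "\t" + sorted.get(i));
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java RankFrequencySketch <text-file>");
            return;
        }
        // Assumption: the text file is encoded in UTF-8.
        String text = new String(Files.readAllBytes(Paths.get(args[0])), StandardCharsets.UTF_8);
        for (String line : rankLines(text)) {
            System.out.println(line);
        }
    }
}
```

<p>Redirecting the standard output of this program to moby_dick.gp gives a file in the two-column format that gnuplot expects.</p>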
<h2>Step 3: Drawing using gnuplot</h2>
<p>What you can do first is simply type the following command in the gnuplot console (you have to be in the same directory as the moby_dick.gp file):</p>

<div class="wp_syntax"><div class="code"><pre class="none" style="font-family:monospace;">plot [0:500][0:16000] &quot;moby_dick.gp&quot;</pre></div></div>

<p>It simply draws the points and restricts the x and y ranges to [0:500] and [0:16000] respectively, so that we can see something.<br />
Play with the ranges to see the differences.<br />
If you want the dots to be connected, just type:</p>

<div class="wp_syntax"><div class="code"><pre class="none" style="font-family:monospace;">plot [0:500][0:16000] &quot;moby_dick.gp&quot; with lines</pre></div></div>

<p>If you want to annotate the graph, you can add labels and arrows.<br />
Here is an example of a gnuplot script that sets some information on the graph (you can simply copy/paste it into the gnuplot console):</p>

<div class="wp_syntax"><div class="code"><pre class="none" style="font-family:monospace;">set xlabel &quot;word rank&quot;
set ylabel &quot;# of occurrences&quot;
set label 1 &quot;the word ranked #14 occurs 1753 times&quot; at 70,4000
set arrow 1 from 65,3750 to 15,1800
plot [0:500][0:16000] &quot;moby_dick.gp&quot;</pre></div></div>

<p>As you can see it is pretty straightforward. You can play with the coordinates to adjust where to put the labels and arrow.<br />
You will obtain this graph (click to enlarge):</p>
<p><a href="http://philippeadjiman.com/blog/wp-content/uploads/2009/10/moby_dick.png" target="_blank"><img class="aligncenter size-medium wp-image-504" title="moby_dick" src="http://philippeadjiman.com/blog/wp-content/uploads/2009/10/moby_dick-300x225.png" alt="moby_dick" width="300" height="225" /></a></p>
<p>To export it as a png file just type:</p>

<div class="wp_syntax"><div class="code"><pre class="none" style="font-family:monospace;">set terminal png
set output &quot;moby_dick.png&quot;
plot [0:500][0:16000] &quot;moby_dick.gp&quot;</pre></div></div>

<p>You might also want to try a log scale on the vertical axis, so as not to waste most of the graph&#8217;s area (thanks Bob for the remark).<br />
To do so, you can simply type in the gnuplot console:</p>

<div class="wp_syntax"><div class="code"><pre class="none" style="font-family:monospace;">set logscale y</pre></div></div>

<p>By plotting within the range [1:3000][5:10000], you&#8217;ll obtain:</p>
<p><a href="http://philippeadjiman.com/blog/wp-content/uploads/2009/10/moby_dick_semilog.png" target="_blank"><img class="aligncenter size-medium wp-image-592" title="moby_dick_semilog" src="http://philippeadjiman.com/blog/wp-content/uploads/2009/10/moby_dick_semilog-300x225.png" alt="moby_dick_semilog" width="300" height="225" /></a></p>
<p>Finally, you might want to use a log-log scale, which is traditionally used to observe such power laws. Just set the logscale for x as you did for y and you&#8217;ll obtain:</p>
<p><a href="http://philippeadjiman.com/blog/wp-content/uploads/2009/10/moby_dick_loglog.png" target="_blank"><img class="aligncenter size-medium wp-image-593" title="moby_dick_loglog" src="http://philippeadjiman.com/blog/wp-content/uploads/2009/10/moby_dick_loglog-300x225.png" alt="moby_dick_loglog" width="300" height="225" /></a></p>
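<p>Why a log-log scale? If the frequency F of a word is roughly C / R for some constant C, then log F is roughly log C minus log R, so the points should fall close to a straight line of slope -1. The sketch below estimates that slope with an ordinary least-squares fit of log(count) against log(rank). The class name ZipfSlopeSketch and the method logLogSlope are made up for this illustration, and the sample counts in main are synthetic, perfectly Zipfian values, not figures taken from Moby Dick.</p>

```java
/** Sketch: least-squares slope of the rank/count points on a log-log scale. */
final class ZipfSlopeSketch {

    /**
     * Returns the least-squares slope of log(count) against log(rank).
     * counts[i] is the number of occurrences of the word ranked i + 1.
     */
    static double logLogSlope(int[] counts) {
        int n = counts.length;
        if (n < 2) {
            throw new IllegalArgumentException("need at least two ranks");
        }
        double sumX = 0, sumY = 0, sumXX = 0, sumXY = 0;
        for (int i = 0; i < n; i++) {
            if (counts[i] <= 0) {
                throw new IllegalArgumentException("counts must be positive");
            }
            double x = Math.log(i + 1);      // log of the rank
            double y = Math.log(counts[i]);  // log of the number of occurrences
            sumX += x;
            sumY += y;
            sumXX += x * x;
            sumXY += x * y;
        }
        // Standard closed form for the slope of a simple linear regression.
        return (n * sumXY - sumX * sumY) / (n * sumXX - sumX * sumX);
    }

    public static void main(String[] args) {
        // Synthetic, perfectly Zipfian counts (1200 / rank); NOT taken from Moby Dick.
        int[] ideal = {1200, 600, 400, 300, 240, 200};
        System.out.println(logLogSlope(ideal)); // close to -1 for this ideal data
    }
}
```

<p>On real text the slope will only be approximately -1, and the fit is usually worse at the two ends of the ranking, so treat the number as a rough indicator rather than a precise measurement.</p>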
<p>You can of course add as much eye candy as you want (the <a href="http://www.gnuplot.info/screenshots/index.html#demos" target="_blank">demo page of the gnuplot website</a> gives tons of examples).</p>
<p>Also, there are probably dozens of ways to draw the same thing; I just loved the fun and simplicity of this one.</p>
]]></content:encoded>
			<wfw:commentRss>http://philippeadjiman.com/blog/2009/10/26/drawing-the-long-tail-of-a-zipf-law-using-gnuplot-java-and-moby-dick/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Flexible Java Profiling And Monitoring Using The Netbeans Profiler</title>
		<link>http://philippeadjiman.com/blog/2009/10/20/flexible-java-profiling-using-the-netbeans-profiler/</link>
		<comments>http://philippeadjiman.com/blog/2009/10/20/flexible-java-profiling-using-the-netbeans-profiler/#comments</comments>
		<pubDate>Tue, 20 Oct 2009 13:52:43 +0000</pubDate>
		<dc:creator>padjiman</dc:creator>
				<category><![CDATA[java]]></category>
		<category><![CDATA[tutorial]]></category>
		<category><![CDATA[profiling]]></category>

		<guid isPermaLink="false">http://philippeadjiman.com/blog/?p=433</guid>
		<description><![CDATA[ I have tested a lot of those open source profilers. My preference definitely goes to the integrated Netbeans profiler. It was simply the easiest and most unified solution, and it adapted to all the different settings I have met, including profiling Java applications that (i) were not developed under Netbeans, (ii) were only in the form [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://philippeadjiman.com/blog/wp-content/uploads/2009/10/ScreenShot085.jpg" target="_blank"><img class="alignleft size-medium wp-image-434" title="cpuProfile" src="http://philippeadjiman.com/blog/wp-content/uploads/2009/10/ScreenShot085-300x196.jpg" alt="cpuProfile" hspace="15" width="300" height="196" align="left" /></a> I have tested a lot of those <a href="http://java-source.net/open-source/profilers" target="_blank">open source profilers</a>. My preference definitely goes to the integrated Netbeans profiler. It was simply the easiest and most unified solution, and it adapted to all the different settings I have met, including profiling Java applications that <strong><em>(i)</em></strong> were not developed under Netbeans, <strong><em>(ii)</em></strong> were only in the form of a standalone jar, or <em><strong>(iii)</strong></em> were running on a remote Linux machine on which no X server was running (i.e. no UI), among other cases.</p>
<p>Here I describe how you can profile any Java application in 3 simple steps using the wonderful Netbeans profiler.</p>
<p><strong>Step 1: Download and install the latest Netbeans version on your machine(s)</strong></p>
<p>On the <a href="http://www.netbeans.org/downloads/index.html" target="_blank">netbeans download page</a>, choose the version adapted to your environment (Windows, Linux, Solaris, Mac&#8230;) and download/install it. All the bundles contain the profiler, so I chose the lightest one: the JavaSE. If you want to profile a program running on one or more remote machines, you&#8217;ll have to download/install it on each machine.</p>
<p><strong>Step 2: Modify the command line that runs the java application that you want to profile/monitor</strong></p>
<p>You just have to add an argument to the Java VM.<br />
On Windows, the argument to add is of the form:</p>

<div class="wp_syntax"><div class="code"><pre class="none" style="font-family:monospace;"> -agentpath:&quot;C:\Program Files\NetBeans 6.7.1\profiler3\lib\deployed\jdk16\windows\profilerinterface.dll&quot;=&quot;C:\Program Files\NetBeans 6.7.1\profiler3\lib,5140&quot;</pre></div></div>

<p>Replace the portion &#8220;C:\Program Files\NetBeans 6.7.1\profiler3&#8221; with the correct path (located where you installed Netbeans). Keep 5140; it is the port on which the application will listen for a remote profiler session (which you can also perform locally, as in this tutorial).<br />
On Linux, it is exactly the same; just look for the right path containing the profiler3 folder.<br />
So the java command line of the application to profile should look something like:</p>

<div class="wp_syntax"><div class="code"><pre class="none" style="font-family:monospace;">java -agentpath:&quot;C:\Program Files\NetBeans 6.7.1\profiler3\lib\deployed\jdk16\windows\profilerinterface.dll&quot;=&quot;C:\Program Files\NetBeans 6.7.1\profiler3\lib,5140&quot; MyApp param1 param2</pre></div></div>

<p>When launching this command, you should see on your console a message saying:<br />
<em>Profiler Agent: Waiting for connection on port 5140 (Protocol version: 9)</em><br />
meaning that the application is listening and waiting for a profiler session on port 5140.</p>
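<p>Since the same installation path appears twice in the argument, it is easy to mistype. As a small convenience, the following sketch composes the argument from the profiler3 folder and the port. It is only an illustration: the class name AgentPathSketch and the method agentArg are made up, and the hard-coded sub-path assumes the Windows layout shown above (Netbeans 6.7.1, jdk16), so it must be adapted for other platforms or versions.</p>

```java
/** Sketch: builds the -agentpath VM argument for the Windows layout shown in this post. */
final class AgentPathSketch {

    /**
     * @param profilerDir the profiler3 folder of the Netbeans installation
     * @param port        the port on which the profiled application will listen
     */
    static String agentArg(String profilerDir, int port) {
        String lib = profilerDir + "\\lib";
        // Assumption: same relative location of the agent DLL as in the command above.
        String dll = lib + "\\deployed\\jdk16\\windows\\profilerinterface.dll";
        return "-agentpath:\"" + dll + "\"=\"" + lib + "," + port + "\"";
    }

    public static void main(String[] args) {
        // Path and port taken from the example in this post.
        System.out.println(agentArg("C:\\Program Files\\NetBeans 6.7.1\\profiler3", 5140));
    }
}
```

<p>The printed string can then be pasted between java and the name of the main class, exactly as in the command line shown above.</p>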
<p><strong><span style="color: red;">Note the flexibility behind this approach</span></strong>: it allows you to add this simple argument to the existing command of <em><strong>(i)</strong></em> any Java application running inside Eclipse (in that case, just open the &#8220;Run configuration&#8221; window and, in the &#8220;Arguments&#8221; tab, add the -agentpath option to the &#8220;VM arguments&#8221; section) or in an IDE other than Netbeans, <em><strong>(ii)</strong></em> any remote Java application, <em><strong>(iii)</strong></em> any standalone jar file, or any other existing java command that runs any kind of Java application you can imagine&#8230;</p>
<p><strong>Step 3: Run the Netbeans profiler GUI</strong></p>
<p>Open NetBeans and choose Profile -&gt; Attach Profiler. Select the kind of profiling/monitoring you need; you can also configure it.</p>
<p><a href="http://philippeadjiman.com/blog/wp-content/uploads/2009/10/ScreenShot086.jpg"><img class="aligncenter size-medium wp-image-452" title="attachProfiler" src="http://philippeadjiman.com/blog/wp-content/uploads/2009/10/ScreenShot086-300x215.jpg" alt="attachProfiler" width="300" height="215" /></a></p>
<p>Press Attach. Note that the first attempt to attach a profiler may fail because the profiler has to be calibrated first (in that case, a simple dialog box tells you how; it takes a few seconds).</p>
<p>That&#8217;s it! You can now see in real time which part of your application is the heaviest, estimate its memory footprint, analyze its threads and much more.</p>
<p><a href="http://philippeadjiman.com/blog/wp-content/uploads/2009/10/ScreenShot087.jpg"><img class="aligncenter size-medium wp-image-474" title="memory" src="http://philippeadjiman.com/blog/wp-content/uploads/2009/10/ScreenShot087-300x194.jpg" alt="memory" width="300" height="194" /></a></p>
<p>If you want even more, note that there are also specialized profilers for collections (HashMap, HashSet, ArrayList, &#8230;) such as <a href="http://www.collectionspy.com/" target="_blank">CollectionSpy</a> (not free).</p>
]]></content:encoded>
			<wfw:commentRss>http://philippeadjiman.com/blog/2009/10/20/flexible-java-profiling-using-the-netbeans-profiler/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>BeanShell Tutorial: Quick Start On Invoking Your Own Or External Java Code From The Shell</title>
		<link>http://philippeadjiman.com/blog/2009/10/17/beanshell-tutorial-quick-start-on-invoking-your-own-or-external-java-code-from-the-shell/</link>
		<comments>http://philippeadjiman.com/blog/2009/10/17/beanshell-tutorial-quick-start-on-invoking-your-own-or-external-java-code-from-the-shell/#comments</comments>
		<pubDate>Sat, 17 Oct 2009 20:58:58 +0000</pubDate>
		<dc:creator>padjiman</dc:creator>
				<category><![CDATA[java]]></category>
		<category><![CDATA[tutorial]]></category>
		<category><![CDATA[BeanShell]]></category>

		<guid isPermaLink="false">http://philippeadjiman.com/blog/?p=372</guid>
		<description><![CDATA[BeanShell is a lightweight scripting language that’s compatible with the Java language.
It provides a dynamic environment for executing Java code in its standard syntax but also allows common scripting conveniences such as loose types, commands, and method closures like those in Perl and JavaScript. It is considered so useful that it should become part of [...]]]></description>
			<content:encoded><![CDATA[<p style="text-align: left;"><a href="http://www.beanshell.org/" target="_blank"><img title="beanshell" src="http://philippeadjiman.com/blog/wp-content/uploads/2009/10/bshsplash3-300x167.gif" alt="bshsplash3" hspace="15" width="300" height="167" align="left" /></a><a href="http://www.beanshell.org/" target="_blank">BeanShell</a> is a lightweight scripting language that’s compatible with the Java language.<br />
It provides a dynamic environment for executing Java code in its standard syntax but also allows common scripting conveniences such as loose types, commands, and method closures like those in Perl and JavaScript. It is considered so useful that it should become part of J2SE at some point in the future (the BeanShell Scripting Language, JSR-274, has <a href="http://jcp.org/en/jsr/results?id=3208" target="_blank">passed the voting process</a> with flying colors).</p>
<p style="text-align: left;">Here I simply describe how to call your own code or any existing external code directly from the bean shell. You first have to download the latest <a href="http://www.beanshell.org/download.html" target="_blank">bean shell jar release</a>. Let&#8217;s suppose that you put it in the directory &#8220;C:\libs&#8221; along with the famous <a href="http://commons.apache.org/lang/" target="_blank">Apache commons lang library</a>, so that &#8220;C:\libs&#8221; contains two jars called <em>bsh-2.0b4.jar</em> and <em>commons-lang-2.3.jar</em>.</p>
<p style="text-align: left;">Open a command prompt and type:</p>

<div class="wp_syntax"><div class="code"><pre class="bash" style="font-family:monospace;">java <span style="color: #660033;">-cp</span> C:\libs\bsh-2.0b4.jar;C:\libs\commons-lang-2.3.jar bsh.Interpreter</pre></div></div>

<p>You should see a prompt &#8220;bsh %&#8221; indicating that the bean shell session has started. Here is an example session using the method <strong>getLevenshteinDistance</strong> from the StringUtils utility class of the Apache Commons Lang package:</p>

<div class="wp_syntax"><div class="code"><pre class="bash" style="font-family:monospace;">bsh <span style="color: #000000; font-weight: bold;">%</span> import  org.apache.commons.lang.StringUtils;
bsh <span style="color: #000000; font-weight: bold;">%</span> d = StringUtils.getLevenshteinDistance<span style="color: #7a0874; font-weight: bold;">&#40;</span><span style="color: #ff0000;">&quot;Louisville Slugger&quot;</span>, <span style="color: #ff0000;">&quot;Lousiville Slugger&quot;</span><span style="color: #7a0874; font-weight: bold;">&#41;</span>;
bsh <span style="color: #000000; font-weight: bold;">%</span> print<span style="color: #7a0874; font-weight: bold;">&#40;</span>d<span style="color: #7a0874; font-weight: bold;">&#41;</span>;
<span style="color: #000000;">2</span></pre></div></div>
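
<p>To see where the 2 in the session above comes from, here is a minimal plain-Java sketch of the classic dynamic-programming Levenshtein distance. It is not the Commons Lang implementation, and the class name <em>LevenshteinSketch</em> is made up for this illustration; it only shows the kind of computation the library call performs for you.</p>

```java
// Minimal sketch of the classic Levenshtein distance (two-row dynamic programming).
// Hypothetical class, not the Apache Commons Lang source.
final class LevenshteinSketch {

    // Returns the minimum number of single-character insertions,
    // deletions and substitutions needed to turn s into t.
    static int distance(String s, String t) {
        int[] previous = new int[t.length() + 1];
        int[] current = new int[t.length() + 1];
        for (int j = 0; j <= t.length(); j++) {
            previous[j] = j; // turning "" into the first j chars of t takes j insertions
        }
        for (int i = 1; i <= s.length(); i++) {
            current[0] = i; // turning the first i chars of s into "" takes i deletions
            for (int j = 1; j <= t.length(); j++) {
                int cost = (s.charAt(i - 1) == t.charAt(j - 1)) ? 0 : 1;
                current[j] = Math.min(
                        Math.min(current[j - 1] + 1, previous[j] + 1),
                        previous[j - 1] + cost);
            }
            // The row just computed becomes the "previous" row of the next iteration
            int[] swap = previous;
            previous = current;
            current = swap;
        }
        return previous[t.length()];
    }

    public static void main(String[] args) {
        // Same inputs as in the BeanShell session
        System.out.println(distance("Louisville Slugger", "Lousiville Slugger")); // prints 2
    }
}
```

<p>The two strings differ only by the swapped letters &#8220;is&#8221;/&#8220;si&#8221;, which plain Levenshtein distance counts as two substitutions, hence the 2.</p>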

<p>Note that instead of typing the precise import, you can type:</p>

<div class="wp_syntax"><div class="code"><pre class="bash" style="font-family:monospace;">bsh <span style="color: #000000; font-weight: bold;">%</span> import <span style="color: #000000; font-weight: bold;">*</span>;</pre></div></div>

<p>This triggers a set of &#8220;mappings&#8221; between the shell and the external jars that you specified in your classpath. Just remember that by doing this you are importing every class accessible from the classpath, so you may be forced to type the fully qualified name of a class when two classes with the same name exist in different packages (it happens more often than one may think).</p>
<p>A good intermediate solution is to create a file called .bshrc and put in it all the specific imports that you commonly use. Then, when invoking the interpreter, just set the Java system property <strong>user.home</strong> to the directory containing the .bshrc file. For example, if it is located in &#8220;C:\app\bshconfig&#8221;, you just have to type:</p>

<div class="wp_syntax"><div class="code"><pre class="bash" style="font-family:monospace;">java -Duser.home=C:\app\bshconfig <span style="color: #660033;">-cp</span> C:\libs\bsh-2.0b4.jar;C:\libs\commons-lang-2.3.jar bsh.Interpreter</pre></div></div>

<p>Note that you can add to the java command any options that you need (for instance you can use -Xmx if you need to).</p>
<p>For complete documentation of the bean shell commands, consult the <a href="http://www.beanshell.org/docs.html" target="_blank">bean shell documentation page</a>.</p>
<p>For an Eclipse plugin that offers auto-completion from the bean shell and other nice features, take a look at <a href="http://eclipse-shell.sourceforge.net/index.html" target="_blank">EclipseShell</a> (I haven&#8217;t tested it yet, but the site contains nice screencasts and documentation).</p>
]]></content:encoded>
			<wfw:commentRss>http://philippeadjiman.com/blog/2009/10/17/beanshell-tutorial-quick-start-on-invoking-your-own-or-external-java-code-from-the-shell/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>5 Video Tutorials Of Small To Killer Eclipse Shortcuts</title>
		<link>http://philippeadjiman.com/blog/2009/10/11/5-video-tutorials-of-small-to-killer-eclipse-shortcuts/</link>
		<comments>http://philippeadjiman.com/blog/2009/10/11/5-video-tutorials-of-small-to-killer-eclipse-shortcuts/#comments</comments>
		<pubDate>Sun, 11 Oct 2009 16:29:06 +0000</pubDate>
		<dc:creator>padjiman</dc:creator>
				<category><![CDATA[java]]></category>
		<category><![CDATA[tutorial]]></category>
		<category><![CDATA[eclipse]]></category>

		<guid isPermaLink="false">http://philippeadjiman.com/blog/?p=348</guid>
		<description><![CDATA[ I believe that when you spend a significant percentage of your time in a specific piece of software, it is an obligation to become &#8220;mouse-less&#8221; using it. A few years ago, when I started to use the powerful Eclipse shortcuts, I observed that my productivity improved dramatically. You&#8217;ll be able to find a lot of posts [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://www.eclipse.org/"><img title="eclipse" src="http://philippeadjiman.com/blog/wp-content/uploads/2009/10/eclipse.png" alt="eclipse" hspace="15" width="128" height="128" align="left" /></a> I believe that when you spend a significant percentage of your time in a specific piece of software, it is an obligation to become &#8220;mouse-less&#8221; using it. A few years ago, when I started to use the powerful <a href="http://eclipse-tools.sourceforge.net/Keyboard_shortcuts_%283.0%29.pdf" target="_blank">Eclipse shortcuts</a>, I observed that my productivity improved dramatically. You&#8217;ll be able to find a lot of posts promoting lists of &#8220;Top 10 Eclipse shortcuts&#8221; (I like <a href="http://rayfd.wordpress.com/2007/05/20/10-eclipse-navigation-shortcuts-every-java-programmer-should-know/" target="_blank">this one</a>). I believe that short video tutorials show the power that some shortcuts can unleash more easily than a bunch of screenshots.</p>
<p>So here are 5 short video tutorials of shortcuts, ranging from small ones to killer ones, which together make my day in Eclipse much easier and more productive. The first two are small but still nice and useful. The remaining ones are more advanced and have a real impact, since you can potentially use them every couple of lines of code.</p>
<ol>
<h2>
<li> <strong>Ctrl + Alt + Arrow (up or down): duplicating lines.</strong></li>
</h2>
<p><strong>Impact on productivity:</strong><strong> low to medium</strong></p>
<p><!-- Smart Youtube --><span class="youtube"><object width="480" height="360"><param name="movie" value="http://www.youtube.com/v/U80IhJLLxE8&amp;rel=0&amp;color1=d6d6d6&amp;color2=f0f0f0&amp;border=0&amp;fs=1&amp;hl=en&amp;autoplay=0&amp;showinfo=0&amp;iv_load_policy=3&amp;showsearch=0&amp;ap=%2526fmt%3D22" /><param name="allowFullScreen" value="true" /><embed wmode="transparent" src="http://www.youtube.com/v/U80IhJLLxE8&amp;rel=0&amp;color1=d6d6d6&amp;color2=f0f0f0&amp;border=0&amp;fs=1&amp;hl=en&amp;autoplay=0&amp;showinfo=0&amp;iv_load_policy=3&amp;showsearch=0&amp;ap=%2526fmt%3D22" type="application/x-shockwave-flash" allowfullscreen="true" width="480" height="360" ></embed><param name="wmode" value="transparent" /></object></span></p>
<h2>
<li>Alt + Arrow (up or down): moving lines</li>
</h2>
<p><strong>Impact on productivity: low to medium</strong></p>
<p><!-- Smart Youtube --><span class="youtube"><object width="480" height="360"><param name="movie" value="http://www.youtube.com/v/9N8HUiYPAe0&amp;rel=0&amp;color1=d6d6d6&amp;color2=f0f0f0&amp;border=0&amp;fs=1&amp;hl=en&amp;autoplay=0&amp;showinfo=0&amp;iv_load_policy=3&amp;showsearch=0&amp;ap=%2526fmt%3D22" /><param name="allowFullScreen" value="true" /><embed wmode="transparent" src="http://www.youtube.com/v/9N8HUiYPAe0&amp;rel=0&amp;color1=d6d6d6&amp;color2=f0f0f0&amp;border=0&amp;fs=1&amp;hl=en&amp;autoplay=0&amp;showinfo=0&amp;iv_load_policy=3&amp;showsearch=0&amp;ap=%2526fmt%3D22" type="application/x-shockwave-flash" allowfullscreen="true" width="480" height="360" ></embed><param name="wmode" value="transparent" /></object></span></p>
<h2>
<li>Ctrl + 1: How To Directly or Indirectly Use The Power Of Quick Fixes.</li>
</h2>
<p><strong>Impact on productivity: huge</strong></p>
<p><!-- Smart Youtube --><span class="youtube"><object width="480" height="360"><param name="movie" value="http://www.youtube.com/v/rnixsV-pEYk&amp;rel=0&amp;color1=d6d6d6&amp;color2=f0f0f0&amp;border=0&amp;fs=1&amp;hl=en&amp;autoplay=0&amp;showinfo=0&amp;iv_load_policy=3&amp;showsearch=0&amp;ap=%2526fmt%3D22" /><param name="allowFullScreen" value="true" /><embed wmode="transparent" src="http://www.youtube.com/v/rnixsV-pEYk&amp;rel=0&amp;color1=d6d6d6&amp;color2=f0f0f0&amp;border=0&amp;fs=1&amp;hl=en&amp;autoplay=0&amp;showinfo=0&amp;iv_load_policy=3&amp;showsearch=0&amp;ap=%2526fmt%3D22" type="application/x-shockwave-flash" allowfullscreen="true" width="480" height="360" ></embed><param name="wmode" value="transparent" /></object></span></p>
<h2>
<li>Alt + Shift + L: Extract Local Variables</li>
</h2>
<p><strong>Impact on productivity: medium</strong></p>
<p><!-- Smart Youtube --><span class="youtube"><object width="480" height="360"><param name="movie" value="http://www.youtube.com/v/6YkAKK5XQ5w&amp;rel=0&amp;color1=d6d6d6&amp;color2=f0f0f0&amp;border=0&amp;fs=1&amp;hl=en&amp;autoplay=0&amp;showinfo=0&amp;iv_load_policy=3&amp;showsearch=0&amp;ap=%2526fmt%3D22" /><param name="allowFullScreen" value="true" /><embed wmode="transparent" src="http://www.youtube.com/v/6YkAKK5XQ5w&amp;rel=0&amp;color1=d6d6d6&amp;color2=f0f0f0&amp;border=0&amp;fs=1&amp;hl=en&amp;autoplay=0&amp;showinfo=0&amp;iv_load_policy=3&amp;showsearch=0&amp;ap=%2526fmt%3D22" type="application/x-shockwave-flash" allowfullscreen="true" width="480" height="360" ></embed><param name="wmode" value="transparent" /></object></span></p>
<h2>
<li>Ctrl + Space: Beyond Auto Completion, The Template Assistant (+ customization)</li>
</h2>
<p><strong>Impact on productivity: high if heavily customized</strong></p>
<p><!-- Smart Youtube --><span class="youtube"><object width="480" height="360"><param name="movie" value="http://www.youtube.com/v/ZYwo6mTkT7A&amp;rel=0&amp;color1=d6d6d6&amp;color2=f0f0f0&amp;border=0&amp;fs=1&amp;hl=en&amp;autoplay=0&amp;showinfo=0&amp;iv_load_policy=3&amp;showsearch=0&amp;ap=%2526fmt%3D22" /><param name="allowFullScreen" value="true" /><embed wmode="transparent" src="http://www.youtube.com/v/ZYwo6mTkT7A&amp;rel=0&amp;color1=d6d6d6&amp;color2=f0f0f0&amp;border=0&amp;fs=1&amp;hl=en&amp;autoplay=0&amp;showinfo=0&amp;iv_load_policy=3&amp;showsearch=0&amp;ap=%2526fmt%3D22" type="application/x-shockwave-flash" allowfullscreen="true" width="480" height="360" ></embed><param name="wmode" value="transparent" /></object></span></p>
</ol>
<p>Besides those, I highly recommend making heavy use of the following five (for which I think a video is less useful):</p>
<ul>
<li><strong>Ctrl + Shift + R</strong> (open resources)</li>
<li><strong>Ctrl + O</strong> (quick outline). Pressing Ctrl + O again will show inherited members.</li>
<li><strong>Ctrl + E</strong> (quick switch editor). Very handy to navigate between files.</li>
<li><strong>Alt + Shift + R</strong> (rename variable). A very powerful one, since it updates every reference to the renamed variable (it also works on file names).</li>
<li><strong>Ctrl + T</strong> (quick type hierarchy).</li>
</ul>
<p>Become as mouse-less as possible in Eclipse. Don&#8217;t try to start using all of them in one day; try to integrate one per day, or even one per week. You&#8217;ll end up much more productive anyway.</p>
]]></content:encoded>
			<wfw:commentRss>http://philippeadjiman.com/blog/2009/10/11/5-video-tutorials-of-small-to-killer-eclipse-shortcuts/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Open Calais From Java: Get Ready To Extract Entities, Facts And Events In 4 Minutes!</title>
		<link>http://philippeadjiman.com/blog/2009/09/16/open-calais-from-java-with-eclipse-extract-entities-facts-and-events-in-4-minutes/</link>
		<comments>http://philippeadjiman.com/blog/2009/09/16/open-calais-from-java-with-eclipse-extract-entities-facts-and-events-in-4-minutes/#comments</comments>
		<pubDate>Wed, 16 Sep 2009 09:28:06 +0000</pubDate>
		<dc:creator>padjiman</dc:creator>
				<category><![CDATA[java]]></category>
		<category><![CDATA[tutorial]]></category>
		<category><![CDATA[eclipse]]></category>
		<category><![CDATA[open calais]]></category>

		<guid isPermaLink="false">http://philippeadjiman.com/blog/?p=129</guid>
		<description><![CDATA[I&#8217;m a big fan of Open Calais, the well-known web service that lets you perform named entity, fact and event extraction on free English text (and now also on French text since version 4.0).
In the video tutorial below, I show you how in only 4 minutes you can build the material that allows you [...]]]></description>
			<content:encoded><![CDATA[<p>I&#8217;m a big fan of <a href="http://www.opencalais.com/" target="_blank">Open Calais</a>, the well-known web service that lets you perform named entity, fact and event extraction on free English text (and now also on French text since version 4.0).</p>
<p>In the video tutorial below, I show you how, in only 4 minutes, you can set up everything you need to call the Open Calais web service from a Java program and to perform entity, fact and event extraction on a news article taken from CNN.</p>
<p>The tutorial assumes that you already have <a href="http://java.sun.com/javase/downloads/index.jsp" target="_blank">Java</a> and <a href="http://www.eclipse.org/downloads/" target="_blank">Eclipse for Java EE</a> developers installed, along with an Open Calais API developer key (if not, go get one <a href="http://www.opencalais.com/user/register" target="_blank">here</a>; obtaining the key is a very light process).</p>
<p>Note that you can watch the tutorial in <strong>HD</strong>.</p>
<p>Also, check the remarks below to reproduce the tutorial more easily and to get more detailed explanations of what you&#8217;ll see in it.</p>
<p><strong>To see the video in its best quality, just <a href="http://www.youtube.com/watch?v=zUAvGh42tw4" target="_blank">click here</a>.</strong></p>
<p><!-- Smart Youtube --><span class="youtube"><object width="480" height="360"><param name="movie" value="http://www.youtube.com/v/zUAvGh42tw4&amp;rel=0&amp;color1=d6d6d6&amp;color2=f0f0f0&amp;border=0&amp;fs=1&amp;hl=en&amp;autoplay=0&amp;showinfo=0&amp;iv_load_policy=3&amp;showsearch=0&amp;ap=%2526fmt%3D22" /><param name="allowFullScreen" value="true" /><embed wmode="transparent" src="http://www.youtube.com/v/zUAvGh42tw4&amp;rel=0&amp;color1=d6d6d6&amp;color2=f0f0f0&amp;border=0&amp;fs=1&amp;hl=en&amp;autoplay=0&amp;showinfo=0&amp;iv_load_policy=3&amp;showsearch=0&amp;ap=%2526fmt%3D22" type="application/x-shockwave-flash" allowfullscreen="true" width="480" height="360" ></embed><param name="wmode" value="transparent" /></object></span></p>
<p><strong>Remarks/Complementary information:</strong></p>
<ul>
<li>The Open Calais web service WSDL shown in the demo is: <a href="http://api.opencalais.com/enlighten/?wsdl" target="_blank">http://api.opencalais.com/enlighten/?wsdl</a></li>
<li>The method <strong>enlighten</strong>, which calls the Open Calais web service via SOAP, takes three parameters:
<ul>
<li><em>licenseId</em>. This is your API key that you can get <a href="http://www.opencalais.com/user/register" target="_blank">here</a>.</li>
<li><em>paramsXML</em>. These are the input parameters of the service in XML format (documentation <a href="http://www.opencalais.com/documentation/calais-web-service-api/forming-api-calls/input-parameters" target="_blank">here</a>). In the tutorial, for the sake of simplicity, I pass them as a raw String; of course it is better to read them from a file. Here are the parameters that I used:  <a href="http://philippeadjiman.com/blog/wp-content/uploads/2009/09/calaisParams.xml" target="_blank">calaisParams.xml</a>.</li>
<li><em>content</em>. This is the content on which the extraction will be performed. Again, for the sake of simplicity I pass it as a raw String, and again, it is of course better to read it from a file (put whatever free text you want there). Here is the <a href="http://philippeadjiman.com/blog/wp-content/uploads/2009/09/content1.txt" target="_blank">content</a> I used (from CNN).</li>
</ul>
</li>
<li>Pasting a long text copied from the web into Java source code can be a nightmare because of the characters that need escaping. The workaround I used in the demo is this <a href="http://rishida.net/scripts/uniview/conversion.php" target="_blank">general converter</a>, which knows (among other things) how to add the &#8216;\&#8217; automatically in the right places.</li>
<li>Here is the <a href="http://philippeadjiman.com/blog/wp-content/uploads/2009/09/output.txt" target="_blank">output</a> of the tutorial.</li>
<li>Here is the list of <a href="http://www.opencalais.com/documentation/calais-web-service-api/interpreting-api-response/simple-format" target="_blank">Open Calais possible outputs</a>.</li>
</ul>
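<p>As noted in the remarks, reading <em>paramsXML</em> and <em>content</em> from files avoids both the raw Strings and the escaping problem. Here is a minimal sketch of that idea, assuming Java 7 or later and local copies of the two files linked above; the class name <em>CalaisInputs</em> and the file locations are hypothetical, and the call to the generated SOAP client is left out because its class names depend on your own project.</p>

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper: loads the enlighten() inputs from files instead of
// pasting them into the source code as escaped String literals.
final class CalaisInputs {

    // Reads a whole UTF-8 text file into a String.
    static String readText(Path file) throws IOException {
        return new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        // Assumed file names: local copies of the files linked in the remarks
        String paramsXML = readText(Paths.get("calaisParams.xml"));
        String content = readText(Paths.get("content1.txt"));
        System.out.println(paramsXML.length() + " chars of parameters, "
                + content.length() + " chars of content");
        // paramsXML and content can then be passed, together with your
        // licenseId, to the enlighten method of the generated client.
    }
}
```

<p>Since the text never goes through a Java string literal, quotes and backslashes in the article need no escaping at all.</p>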
<p>If you&#8217;re like me, you&#8217;re obviously more interested in the algorithms behind the scenes. To learn more about the methods and algorithms involved, you can read about <a href="http://en.wikipedia.org/wiki/Morphology_%28linguistics%29" target="_blank">morphological analysis</a>, <a href="http://en.wikipedia.org/wiki/Part-of-speech_tagging" target="_blank">POS tagging</a> and <a href="http://www.cs.tau.ac.il/~nachumd/NLP/shallow-parsing.pdf" target="_blank">shallow parsing</a>. On the Open Calais website, they also mention <a href="http://www.opencalais.com/how-does-calais-learn" target="_blank">in a discussion</a> that they have developed their own rule-based system with their own programming language. They also use lexicons.</p>
<p>The problems addressed by Open Calais are tough and it&#8217;s hard to be perfect, but I think they are doing a pretty good job at it. It would be interesting to compare relevance results with the <a href="http://www.alchemyapi.com/" target="_blank">Alchemy API</a> that offers pretty much the same service.</p>
]]></content:encoded>
			<wfw:commentRss>http://philippeadjiman.com/blog/2009/09/16/open-calais-from-java-with-eclipse-extract-entities-facts-and-events-in-4-minutes/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
		<item>
		<title>The Trick To Write A Fast (Universal) Java URL Expander</title>
		<link>http://philippeadjiman.com/blog/2009/09/07/the-trick-to-write-a-fast-universal-java-url-expander/</link>
		<comments>http://philippeadjiman.com/blog/2009/09/07/the-trick-to-write-a-fast-universal-java-url-expander/#comments</comments>
		<pubDate>Mon, 07 Sep 2009 10:13:13 +0000</pubDate>
		<dc:creator>padjiman</dc:creator>
				<category><![CDATA[java]]></category>
		<category><![CDATA[twitter]]></category>

		<guid isPermaLink="false">http://philippeadjiman.com/blog/?p=81</guid>
		<description><![CDATA[140 characters. Means something to you?
This is about how Twitter (and micro-blogging) was born. Even if some profane Firefox extensions try to work around this, when it comes to inserting (long) URLs you may have trouble sticking to the rule.
And here come URL shortening services.
Pretty simple: The long URL http://philippeadjiman.com/blog/2009/09/01/can-you-guess-what-is-the-hottest-trend-of-google-hot-trends/ becomes http://bit.ly/miUkz that [...]]]></description>
			<content:encoded><![CDATA[<p style="text-align: left;">140 characters. Means something to you?</p>
<p style="text-align: left;">This is about how Twitter (and micro-blogging) <a href="http://www.140characters.com/2009/01/30/how-twitter-was-born/" target="_blank">was born</a>. Even if some profane Firefox extensions try to <a href="http://shorttext.com/twitzer.aspx" target="_blank">work around this</a>, when it comes to inserting (long) URLs you may have trouble sticking to the rule.</p>
<p style="text-align: left;">And here come URL shortening services.</p>
<p style="text-align: left;">Pretty simple: the long URL <a href="http://philippeadjiman.com/blog/2009/09/01/can-you-guess-what-is-the-hottest-trend-of-google-hot-trends/" target="_blank">http://philippeadjiman.com/blog/2009/09/01/can-you-guess-what-is-the-hottest-trend-of-google-hot-trends/</a> becomes <a href="http://bit.ly/miUkz" target="_blank">http://bit.ly/miUkz</a>, which will fit nicely in your next tweet.</p>
<p style="text-align: left;">Now everyone wants to shorten URLs. Here is a list of <a href="http://mashable.com/2008/01/08/url-shortening-services/" target="_blank">90+ URL shortening services</a> (!!), not counting the ones that you can <a href="http://lifehacker.com/5335216/make-your-own-url-shortening-service" target="_blank">build yourself</a>.</p>
<p style="text-align: left;">How can we (developers) survive in this jungle if we want to retrieve the real, expanded version of those tons of URLs?</p>
<p style="text-align: left;">Well, a naive Java version would be:</p>

<div class="wp_syntax"><div class="code"><pre class="java" style="font-family:monospace;"><span style="color: #000000; font-weight: bold;">public</span> <span style="color: #003399;">String</span> NaiveURLExpander<span style="color: #009900;">&#40;</span><span style="color: #003399;">String</span> address<span style="color: #009900;">&#41;</span> <span style="color: #000000; font-weight: bold;">throws</span> <span style="color: #003399;">IOException</span> <span style="color: #009900;">&#123;</span>
        <span style="color: #003399;">String</span> result<span style="color: #339933;">;</span>
        <span style="color: #003399;">URLConnection</span> conn <span style="color: #339933;">=</span> <span style="color: #000066; font-weight: bold;">null</span><span style="color: #339933;">;</span>
        <span style="color: #003399;">InputStream</span>  in <span style="color: #339933;">=</span> <span style="color: #000066; font-weight: bold;">null</span><span style="color: #339933;">;</span>
        <span style="color: #003399;">URL</span> url <span style="color: #339933;">=</span> <span style="color: #000000; font-weight: bold;">new</span> <span style="color: #003399;">URL</span><span style="color: #009900;">&#40;</span>address<span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
        conn <span style="color: #339933;">=</span> url.<span style="color: #006633;">openConnection</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
        in <span style="color: #339933;">=</span> conn.<span style="color: #006633;">getInputStream</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
        result <span style="color: #339933;">=</span> conn.<span style="color: #006633;">getURL</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span>.<span style="color: #006633;">toString</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
        in.<span style="color: #006633;">close</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
        <span style="color: #000000; font-weight: bold;">return</span> result<span style="color: #339933;">;</span>
    <span style="color: #009900;">&#125;</span></pre></div></div>

<p>Nice. It works. But it is terribly slow.<br />
Why? Because when you analyze what happens behind the scenes, the HTTP response returned for the short URL contains the line</p>

<div class="wp_syntax"><div class="code"><pre class="bash" style="font-family:monospace;">HTTP<span style="color: #000000; font-weight: bold;">/</span><span style="color: #000000;">1.1</span> <span style="color: #000000;">301</span> Moved</pre></div></div>

<p>If you check the <a href="http://www.w3.org/Protocols/rfc2616/rfc2616-sec10.html" target="_blank">status code definitions</a> of the HTTP protocol, you will see that this means the URL has moved permanently and that the new one is given in the <strong>Location</strong> field of the HTTP header. In other words, the above Java code behaves exactly like your browser: it follows the redirection and then requests the target page as well, which is what makes it so slow.</p>
<p>So here is the trick:</p>
<ol>
<li>Use an <strong>HttpURLConnection</strong> object so that you can tell it, via the <strong>setInstanceFollowRedirects</strong> method, <span style="text-decoration: underline;">not</span> to follow redirects automatically (as a browser would) when connecting.</li>
<li>Extract the <strong>Location </strong>value in the HTTP header.</li>
</ol>
<p>Here you go:</p>

<div class="wp_syntax"><div class="code"><pre class="java" style="font-family:monospace;"> <span style="color: #000000; font-weight: bold;">public</span> <span style="color: #003399;">String</span> expandShortURL<span style="color: #009900;">&#40;</span><span style="color: #003399;">String</span> address<span style="color: #009900;">&#41;</span> <span style="color: #000000; font-weight: bold;">throws</span> <span style="color: #003399;">IOException</span> <span style="color: #009900;">&#123;</span>
        <span style="color: #003399;">URL</span> url <span style="color: #339933;">=</span> <span style="color: #000000; font-weight: bold;">new</span> <span style="color: #003399;">URL</span><span style="color: #009900;">&#40;</span>address<span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
&nbsp;
        <span style="color: #003399;">HttpURLConnection</span> connection <span style="color: #339933;">=</span> <span style="color: #009900;">&#40;</span><span style="color: #003399;">HttpURLConnection</span><span style="color: #009900;">&#41;</span> url.<span style="color: #006633;">openConnection</span><span style="color: #009900;">&#40;</span><span style="color: #003399;">Proxy</span>.<span style="color: #006633;">NO_PROXY</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span> <span style="color: #666666; font-style: italic;">//using proxy may increase latency</span>
        connection.<span style="color: #006633;">setInstanceFollowRedirects</span><span style="color: #009900;">&#40;</span><span style="color: #000066; font-weight: bold;">false</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
        connection.<span style="color: #006633;">connect</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
        <span style="color: #003399;">String</span> expandedURL <span style="color: #339933;">=</span> connection.<span style="color: #006633;">getHeaderField</span><span style="color: #009900;">&#40;</span><span style="color: #0000ff;">&quot;Location&quot;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
        connection.<span style="color: #006633;">getInputStream</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span>.<span style="color: #006633;">close</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
        <span style="color: #000000; font-weight: bold;">return</span> expandedURL<span style="color: #339933;">;</span>
    <span style="color: #009900;">&#125;</span></pre></div></div>

<p>If you are more of a PHP person, I saw a similar post that explains <a href="http://hasin.wordpress.com/2009/05/05/expanding-short-urls-to-original-urls-using-php-and-curl/" target="_blank">how to do it using PHP and curl</a>.</p>
<p>Note that for the sake of conciseness, I do not handle errors in the code. Also, since I cannot guarantee that all the URL shortening services in the world use this exact approach (but I think most of them do), to make the code really universal you have to handle the case where the Location field is null. An even better way would be to find some heuristics to detect whether the input URL is a real one (I mean not a short one), which would avoid opening a connection needlessly.</p>
<p>Finally, if some URL shortening services are not robust enough to check their own URLs, you may also have to deal with the corner case of &#8220;transitive shortening&#8221; (I&#8217;m sure there will always be some curious people who try to shorten an already shortened URL&#8230;). <strong>Update</strong>: check this example: <a href="http://bit.ly/4XzVxm" target="_blank">http://bit.ly/4XzVxm</a> points to <a href="http://tcrn.ch/6c8AU4" target="_blank">http://tcrn.ch/6c8AU4</a>, which is itself another short URL!</p>
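<p>The two caveats above (a null Location field and &#8220;transitive shortening&#8221;) can be handled with a small loop. The following is only a sketch under a few assumptions of mine: the names <em>ChainExpander</em>, <em>LocationLookup</em> and <em>HttpLookup</em> are hypothetical, the 3-second timeouts and the hop limit are arbitrary, and it sends a HEAD request on the assumption that the shortening service answers it like a GET (remove that line otherwise).</p>

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;

// Hypothetical sketch: follows a chain of short URLs, one hop at a time.
final class ChainExpander {

    // "Ask where this URL redirects to"; returns null when the response
    // carries no Location header.
    interface LocationLookup {
        String locationOf(String address) throws IOException;
    }

    // Real lookup: a single request with automatic redirects disabled.
    static final class HttpLookup implements LocationLookup {
        public String locationOf(String address) throws IOException {
            HttpURLConnection connection =
                    (HttpURLConnection) new URL(address).openConnection();
            connection.setInstanceFollowRedirects(false);
            connection.setRequestMethod("HEAD"); // assumption: HEAD is supported
            connection.setConnectTimeout(3000);  // milliseconds, arbitrary value
            connection.setReadTimeout(3000);     // milliseconds, arbitrary value
            try {
                return connection.getHeaderField("Location");
            } finally {
                connection.disconnect();
            }
        }
    }

    // Follows Location headers until none is returned or maxHops is reached.
    static String expand(String address, int maxHops, LocationLookup lookup)
            throws IOException {
        String current = address;
        for (int hop = 0; hop < maxHops; hop++) {
            String next = lookup.locationOf(current);
            if (next == null || next.equals(current)) {
                break; // not a redirect (or a self-loop): current is the answer
            }
            // A Location value may be relative: resolve it against the current URL
            current = new URL(new URL(current), next).toString();
        }
        return current;
    }
}
```

<p>A call would look like <em>ChainExpander.expand(&quot;http://bit.ly/4XzVxm&quot;, 5, new ChainExpander.HttpLookup())</em>. A URL that is not a short one simply comes back unchanged, at the cost of one request, and the hop limit protects against redirect loops.</p>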
<p>Also, to achieve real performance, such code should be multithreaded. If you have to expand millions of URLs, you will probably need many machines. A time limit should also be added to avoid overly long connections, with a mechanism similar to a <a href="http://java.sun.com/j2se/1.4.2/docs/api/java/util/TimerTask.html" target="_blank">TimerTask</a>.</p>
<p>Note that this trick makes the code <strong>5 to 6 times faster</strong>. When it comes to dealing with millions of short URLs, it makes a difference.</p>
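<p>One possible way to multithread the expansion on a single machine is a fixed thread pool with an overall time limit. This is a sketch rather than a tuned implementation: <em>ParallelExpander</em> and its <em>Expander</em> interface are hypothetical names, and the interface stands in for whatever expands one URL (for instance the expandShortURL method shown earlier).</p>

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch: expands many URLs concurrently with a fixed thread pool.
final class ParallelExpander {

    // Stand-in for the code that expands a single URL.
    interface Expander {
        String expand(String address) throws Exception;
    }

    // results[i] is the expansion of addresses[i], or null if that expansion
    // failed or had not finished when the overall time limit expired.
    static String[] expandAll(final String[] addresses, final Expander expander,
                              int threads, long timeoutSeconds)
            throws InterruptedException {
        final String[] results = new String[addresses.length];
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int i = 0; i < addresses.length; i++) {
            final int index = i;
            pool.execute(new Runnable() {
                public void run() {
                    try {
                        results[index] = expander.expand(addresses[index]);
                    } catch (Exception e) {
                        results[index] = null; // failures are reported as null
                    }
                }
            });
        }
        pool.shutdown(); // accept no new tasks; submitted ones keep running
        if (!pool.awaitTermination(timeoutSeconds, TimeUnit.SECONDS)) {
            pool.shutdownNow(); // time limit reached: interrupt what is left
        }
        return results;
    }
}
```

<p>Since the work is dominated by network latency rather than CPU, a pool noticeably larger than the number of cores is usually a reasonable starting point; the right size is something to measure, not something this sketch can tell you.</p>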
]]></content:encoded>
			<wfw:commentRss>http://philippeadjiman.com/blog/2009/09/07/the-trick-to-write-a-fast-universal-java-url-expander/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>
