<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Java. Internet. Algorithms. Ideas. &#187; lucene</title>
	<atom:link href="http://philippeadjiman.com/blog/tag/lucene/feed/" rel="self" type="application/rss+xml" />
	<link>http://philippeadjiman.com/blog</link>
	<description>Just Another Blog About Geek Stuff, by Philippe Adjiman</description>
	<lastBuildDate>Tue, 25 May 2010 06:58:19 +0000</lastBuildDate>
	<generator>http://wordpress.org/?v=2.8.4</generator>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
			<item>
		<title>Writing A Token N-Grams Analyzer In Few Lines Of Code Using Lucene</title>
		<link>http://philippeadjiman.com/blog/2009/11/02/writing-a-token-n-grams-analyzer-in-few-lines-of-code-using-lucene/</link>
		<comments>http://philippeadjiman.com/blog/2009/11/02/writing-a-token-n-grams-analyzer-in-few-lines-of-code-using-lucene/#comments</comments>
		<pubDate>Mon, 02 Nov 2009 15:54:20 +0000</pubDate>
		<dc:creator>padjiman</dc:creator>
				<category><![CDATA[java]]></category>
		<category><![CDATA[tutorial]]></category>
		<category><![CDATA[lucene]]></category>

		<guid isPermaLink="false">http://philippeadjiman.com/blog/?p=523</guid>
		<description><![CDATA[ If you need to extract the token n-grams of a string, you can use the facilities offered by Lucene analyzers.
All you have to do is build your own analyzer using a ShingleMatrixFilter with the parameters that suit your needs. For instance, here are the few lines of code needed to build a token bi-grams [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://lucene.apache.org/"><img class="alignleft size-full wp-image-524" title="lucene_green_300" src="http://philippeadjiman.com/blog/wp-content/uploads/2009/11/lucene_green_300.gif" alt="lucene_green_300" hspace="15" width="300" height="46" align="left" /></a> If you need to extract the token n-grams of a string, you can use the facilities offered by Lucene analyzers.</p>
<p>All you have to do is build your own analyzer using a ShingleMatrixFilter with the parameters that suit your needs. For instance, here are the few lines of code needed to build a token bi-gram analyzer:</p>

<div class="wp_syntax"><div class="code"><pre class="java" style="font-family:monospace;"><span style="color: #000000; font-weight: bold;">import</span> java.io.Reader<span style="color: #339933;">;</span>
<span style="color: #000000; font-weight: bold;">import</span> org.apache.lucene.analysis.*<span style="color: #339933;">;</span>
<span style="color: #000000; font-weight: bold;">import</span> org.apache.lucene.analysis.shingle.ShingleMatrixFilter<span style="color: #339933;">;</span>
<span style="color: #000000; font-weight: bold;">import</span> org.apache.lucene.analysis.standard.StandardTokenizer<span style="color: #339933;">;</span>
&nbsp;
<span style="color: #000000; font-weight: bold;">public</span> <span style="color: #000000; font-weight: bold;">class</span> NGramAnalyzer <span style="color: #000000; font-weight: bold;">extends</span> Analyzer <span style="color: #009900;">&#123;</span>
    @Override
    <span style="color: #000000; font-weight: bold;">public</span> TokenStream tokenStream<span style="color: #009900;">&#40;</span><span style="color: #003399;">String</span> fieldName, <span style="color: #003399;">Reader</span> reader<span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#123;</span>
        <span style="color: #666666; font-style: italic;">// Tokenize, build 2-token shingles joined by a space, lower-case them, then apply the stop list.</span>
        <span style="color: #000000; font-weight: bold;">return</span> <span style="color: #000000; font-weight: bold;">new</span> StopFilter<span style="color: #009900;">&#40;</span><span style="color: #000000; font-weight: bold;">new</span> LowerCaseFilter<span style="color: #009900;">&#40;</span><span style="color: #000000; font-weight: bold;">new</span> ShingleMatrixFilter<span style="color: #009900;">&#40;</span><span style="color: #000000; font-weight: bold;">new</span> StandardTokenizer<span style="color: #009900;">&#40;</span>reader<span style="color: #009900;">&#41;</span>,<span style="color: #cc66cc;">2</span>,<span style="color: #cc66cc;">2</span>,<span style="color: #0000ff;">' '</span><span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span>,
            StopAnalyzer.<span style="color: #006633;">ENGLISH_STOP_WORDS</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
    <span style="color: #009900;">&#125;</span>
<span style="color: #009900;">&#125;</span></pre></div></div>

<p>The numeric parameters of the ShingleMatrixFilter state the minimum and maximum shingle size, and the last parameter is the character used to join the tokens of a shingle. &#8220;Shingle&#8221; is just another name for a token n-gram; shingles are commonly used as the basic units in problems such as spell checking and near-duplicate detection.<br />
Note also the use of a StandardTokenizer to deal with basic special characters like hyphens or other &#8220;disturbers&#8221;.</p>
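<p>To make the minimum and maximum shingle sizes concrete, here is a minimal plain-Java sketch that does not use Lucene at all. The class <code>ShingleSketch</code> and its <code>shingles</code> method are hypothetical names chosen for this illustration, and the order in which shingles are emitted is an arbitrary choice that may differ from what ShingleMatrixFilter produces.</p>

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Lucene: illustrates what token shingles are.
class ShingleSketch {

    // Returns every run of minSize..maxSize consecutive tokens, joined by the spacer.
    static List<String> shingles(List<String> tokens, int minSize, int maxSize, char spacer) {
        List<String> result = new ArrayList<>();
        for (int start = 0; start < tokens.size(); start++) {
            StringBuilder shingle = new StringBuilder();
            // Grow the shingle one token at a time, up to maxSize or the end of the input.
            for (int size = 1; size <= maxSize && start + size <= tokens.size(); size++) {
                if (size > 1) {
                    shingle.append(spacer);
                }
                shingle.append(tokens.get(start + size - 1));
                if (size >= minSize) {
                    result.add(shingle.toString());
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> tokens = List.of("an", "easy", "way", "to");
        // Bi-grams only: [an easy, easy way, way to]
        System.out.println(shingles(tokens, 2, 2, ' '));
        // Bi-grams and tri-grams: [an easy, an easy way, easy way, easy way to, way to]
        System.out.println(shingles(tokens, 2, 3, ' '));
    }
}
```

<p>With both sizes set to 2, as in the analyzer above, only bi-grams are produced; widening the range to 2..3 adds the tri-grams as well.</p>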
<p>For instance, you can use the analyzer like this:</p>

<div class="wp_syntax"><div class="code"><pre class="java" style="font-family:monospace;">	<span style="color: #000000; font-weight: bold;">public</span> <span style="color: #000000; font-weight: bold;">static</span> <span style="color: #000066; font-weight: bold;">void</span> main<span style="color: #009900;">&#40;</span><span style="color: #003399;">String</span><span style="color: #009900;">&#91;</span><span style="color: #009900;">&#93;</span> args<span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#123;</span>
		<span style="color: #000000; font-weight: bold;">try</span> <span style="color: #009900;">&#123;</span>
			<span style="color: #003399;">String</span> str <span style="color: #339933;">=</span> <span style="color: #0000ff;">&quot;An easy way to write an analyzer for tokens bi-gram (or even tokens n-grams) with lucene&quot;</span><span style="color: #339933;">;</span>
			Analyzer analyzer <span style="color: #339933;">=</span> <span style="color: #000000; font-weight: bold;">new</span> NGramAnalyzer<span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
&nbsp;
			TokenStream stream <span style="color: #339933;">=</span> analyzer.<span style="color: #006633;">tokenStream</span><span style="color: #009900;">&#40;</span><span style="color: #0000ff;">&quot;content&quot;</span>, <span style="color: #000000; font-weight: bold;">new</span> <span style="color: #003399;">StringReader</span><span style="color: #009900;">&#40;</span>str<span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
			Token token <span style="color: #339933;">=</span> <span style="color: #000000; font-weight: bold;">new</span> Token<span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
			<span style="color: #000000; font-weight: bold;">while</span> <span style="color: #009900;">&#40;</span><span style="color: #009900;">&#40;</span>token <span style="color: #339933;">=</span> stream.<span style="color: #006633;">next</span><span style="color: #009900;">&#40;</span>token<span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span> <span style="color: #339933;">!=</span> <span style="color: #000066; font-weight: bold;">null</span><span style="color: #009900;">&#41;</span><span style="color: #009900;">&#123;</span>
				<span style="color: #003399;">System</span>.<span style="color: #006633;">out</span>.<span style="color: #006633;">println</span><span style="color: #009900;">&#40;</span>token.<span style="color: #006633;">term</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
			<span style="color: #009900;">&#125;</span>
&nbsp;
		<span style="color: #009900;">&#125;</span> <span style="color: #000000; font-weight: bold;">catch</span> <span style="color: #009900;">&#40;</span><span style="color: #003399;">IOException</span> ie<span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#123;</span>
			<span style="color: #003399;">System</span>.<span style="color: #006633;">out</span>.<span style="color: #006633;">println</span><span style="color: #009900;">&#40;</span><span style="color: #0000ff;">&quot;IO Error &quot;</span> <span style="color: #339933;">+</span> ie.<span style="color: #006633;">getMessage</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
		<span style="color: #009900;">&#125;</span>
	<span style="color: #009900;">&#125;</span></pre></div></div>

<p>This prints:</p>

<div class="wp_syntax"><div class="code"><pre class="none" style="font-family:monospace;">an easy
easy way
way to
to write
write an
an analyzer
analyzer for
for tokens
tokens bi
bi gram
gram or
or even
even tokens
tokens n
n grams
grams with
with lucene</pre></div></div>

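<p>Note that the text &#8220;bi-gram&#8221; was split into two tokens, a desired consequence of feeding the ShingleMatrixFilter from a StandardTokenizer. Stop words such as &#8220;an&#8221; and &#8220;to&#8221; still appear in the output because the StopFilter wraps the ShingleMatrixFilter: it only sees whole bi-grams, none of which is itself a stop word.</p>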
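<p>As a rough illustration of the near-duplicate detection use case mentioned earlier, the following plain-Java sketch compares two strings by the Jaccard similarity of their bi-gram sets. It does not use Lucene: the class <code>NearDuplicateSketch</code> and its methods are hypothetical, and the <code>tokenize</code> method is only a crude stand-in for the analyzer (lower-casing and splitting on anything that is not a letter or digit).</p>

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical sketch, not part of Lucene: bi-gram sets compared with Jaccard similarity.
class NearDuplicateSketch {

    // Crude stand-in for the analyzer: lower-case, then split on non-letter, non-digit characters.
    static List<String> tokenize(String text) {
        return Arrays.stream(text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+"))
                .filter(t -> !t.isEmpty())
                .collect(Collectors.toList());
    }

    // Set of distinct token bi-grams, joined by a single space.
    static Set<String> bigrams(String text) {
        List<String> tokens = tokenize(text);
        Set<String> result = new HashSet<>();
        for (int i = 0; i + 1 < tokens.size(); i++) {
            result.add(tokens.get(i) + " " + tokens.get(i + 1));
        }
        return result;
    }

    // Jaccard similarity: size of the intersection divided by size of the union.
    // Defined here as 1.0 when both sets are empty.
    static double jaccard(Set<String> a, Set<String> b) {
        if (a.isEmpty() && b.isEmpty()) {
            return 1.0;
        }
        Set<String> intersection = new HashSet<>(a);
        intersection.retainAll(b);
        Set<String> union = new HashSet<>(a);
        union.addAll(b);
        return (double) intersection.size() / union.size();
    }

    public static void main(String[] args) {
        Set<String> first = bigrams("An easy way to write a bi-gram analyzer");
        Set<String> second = bigrams("An easy way to write a tri-gram analyzer");
        // The two sets share 6 bi-grams out of 10 distinct ones, so this prints 0.6.
        System.out.println(jaccard(first, second));
    }
}
```

<p>A score close to 1 suggests the two texts are near-duplicates; where to put the threshold depends on the application.</p>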
]]></content:encoded>
			<wfw:commentRss>http://philippeadjiman.com/blog/2009/11/02/writing-a-token-n-grams-analyzer-in-few-lines-of-code-using-lucene/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>
