{"id":523,"date":"2009-11-02T10:54:20","date_gmt":"2009-11-02T15:54:20","guid":{"rendered":"http:\/\/philippeadjiman.com\/blog\/?p=523"},"modified":"2025-07-19T19:06:51","modified_gmt":"2025-07-19T19:06:51","slug":"writing-a-token-n-grams-analyzer-in-few-lines-of-code-using-lucene","status":"publish","type":"post","link":"https:\/\/philippeadjiman.com\/blog\/2009\/11\/02\/writing-a-token-n-grams-analyzer-in-few-lines-of-code-using-lucene\/","title":{"rendered":"Writing A Token N-Grams Analyzer In Few Lines Of Code Using Lucene"},"content":{"rendered":"\n<figure class=\"wp-block-image aligncenter size-full\"><img data-recalc-dims=\"1\" decoding=\"async\" width=\"1024\" height=\"159\" loading=\"lazy\" src=\"https:\/\/i0.wp.com\/philippeadjiman.com\/blog\/wp-content\/uploads\/2009\/11\/Apache_Lucene_logo_pre_2021.svg_.png?resize=1024%2C159&#038;ssl=1\" alt=\"\" class=\"wp-image-1982\" srcset=\"https:\/\/i0.wp.com\/philippeadjiman.com\/blog\/wp-content\/uploads\/2009\/11\/Apache_Lucene_logo_pre_2021.svg_.png?w=1024&amp;ssl=1 1024w, https:\/\/i0.wp.com\/philippeadjiman.com\/blog\/wp-content\/uploads\/2009\/11\/Apache_Lucene_logo_pre_2021.svg_.png?resize=300%2C47&amp;ssl=1 300w, https:\/\/i0.wp.com\/philippeadjiman.com\/blog\/wp-content\/uploads\/2009\/11\/Apache_Lucene_logo_pre_2021.svg_.png?resize=768%2C119&amp;ssl=1 768w\" sizes=\"auto, (max-width: 1000px) 100vw, 1000px\" \/><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">If you need to extract the token n-grams of a string, you can use the facilities offered by <a href=\"https:\/\/lucene.apache.org\/\">Apache Lucene<\/a> analyzers.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">All you have to do is build your own analyzer using a ShingleMatrixFilter with the parameters that suit your needs. 
For instance, here are the few lines of code needed to build a token bi-gram analyzer:<br><\/p>\n\n\n\n<div class=\"hcb_wrap\"><pre class=\"prism line-numbers lang-js\" data-lang=\"JavaScript\"><code>public class NGramAnalyzer extends Analyzer {\n\t@Override\n    public TokenStream tokenStream(String fieldName, Reader reader) {\n       return new StopFilter(new LowerCaseFilter(new ShingleMatrixFilter(new StandardTokenizer(reader),2,2,&#39; &#39;)),\n           StopAnalyzer.ENGLISH_STOP_WORDS);\n     }\n}<\/code><\/pre><\/div>\n\n\n\n<p class=\"wp-block-paragraph\">The parameters passed to the ShingleMatrixFilter are the minimum shingle size, the maximum shingle size, and the spacer character used to join the tokens of a shingle. &#8220;Shingle&#8221; is just another name for a token n-gram; shingles are commonly used as the basic units in problems such as spell checking and near-duplicate detection.<br>Note also the use of a StandardTokenizer to handle basic special characters such as hyphens and other &#8220;disturbers&#8221;.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">To use the analyzer, you can, for instance, do the following:<\/p>\n\n\n\n<div class=\"hcb_wrap\"><pre class=\"prism line-numbers lang-js\" data-lang=\"JavaScript\"><code>public static void main(String[] args) {\n\t\ttry {\n\t\t\tString str = &quot;An easy way to write an analyzer for tokens bi-gram (or even tokens n-grams) with lucene&quot;;\n\t\t\tAnalyzer analyzer = new NGramAnalyzer();\n\t\t\t\n\t\t\tTokenStream stream = analyzer.tokenStream(&quot;content&quot;, new StringReader(str));\n\t\t\tToken token = new Token();\n\t\t\twhile ((token = stream.next(token)) != null){\n\t\t\t\tSystem.out.println(token.term());\n\t\t\t}\n\t\t\t\n\t\t} catch (IOException ie) {\n\t\tSystem.out.println(&quot;IO Error &quot; + ie.getMessage());\n\t\t}\n\t}<\/code><\/pre><\/div>\n\n\n\n<pre class=\"wp-block-preformatted\">\t<br><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">This prints:<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>an easy\neasy way\nway to\nto write\nwrite an\nan 
analyzer\nanalyzer for\nfor tokens\ntokens bi\nbi gram\ngram or\nor even\neven tokens\ntokens n\nn grams\ngrams with\nwith lucene<\/code><\/pre>\n\n\n\n<pre class=\"wp-block-preformatted\"><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">Note that the text &#8220;bi-gram&#8221; was treated like two different tokens, as a desired consequence of using a StandardTokenizer in the ShingleMatrixFilter initialization.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Leverages Lucene analyzers to emit token n\u2011grams for downstream text mining or search tasks with minimal Java.<\/p>\n","protected":false},"author":1,"featured_media":1982,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"nf_dc_page":"","_jetpack_newsletter_access":"","_jetpack_dont_email_post_to_subs":false,"_jetpack_newsletter_tier_id":0,"_jetpack_memberships_contains_paywalled_content":false,"_jetpack_memberships_contains_paid_content":false,"footnotes":""},"categories":[11,17],"tags":[27,28],"class_list":["post-523","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-java","category-tutorial","tag-java","tag-lucene"],"jetpack_featured_media_url":"https:\/\/i0.wp.com\/philippeadjiman.com\/blog\/wp-content\/uploads\/2009\/11\/Apache_Lucene_logo_pre_2021.svg_.png?fit=1024%2C159&ssl=1","jetpack_sharing_enabled":true,"_links":{"self":[{"href":"https:\/\/philippeadjiman.com\/blog\/wp-json\/wp\/v2\/posts\/523","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/philippeadjiman.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/philippeadjiman.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/philippeadjiman.com\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/philippeadjiman.com\/blog\/wp-json\/wp\/v2\/comments?post=523"}],"version-history":[{"count":5,"href":"https:\/\/philippeadjiman.com\/blog\/wp-json\/wp\/v
2\/posts\/523\/revisions"}],"predecessor-version":[{"id":1986,"href":"https:\/\/philippeadjiman.com\/blog\/wp-json\/wp\/v2\/posts\/523\/revisions\/1986"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/philippeadjiman.com\/blog\/wp-json\/wp\/v2\/media\/1982"}],"wp:attachment":[{"href":"https:\/\/philippeadjiman.com\/blog\/wp-json\/wp\/v2\/media?parent=523"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/philippeadjiman.com\/blog\/wp-json\/wp\/v2\/categories?post=523"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/philippeadjiman.com\/blog\/wp-json\/wp\/v2\/tags?post=523"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}