If you need to extract the token n-grams of a string, you can use the facilities offered by Lucene analyzers.
All you have to do is build your own analyzer using a ShingleMatrixFilter with the parameters that suit your needs. For instance, here are the few lines of code needed to build a token bi-gram analyzer:
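    import java.io.Reader;

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.LowerCaseFilter;
    import org.apache.lucene.analysis.StopAnalyzer;
    import org.apache.lucene.analysis.StopFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.shingle.ShingleMatrixFilter;
    import org.apache.lucene.analysis.standard.StandardTokenizer;

    public class NGramAnalyzer extends Analyzer {

        @Override
        public TokenStream tokenStream(String fieldName, Reader reader) {
            // Filters apply inside-out: tokenize, build 2-token shingles
            // joined by a space, lower-case, then drop stop words.
            return new StopFilter(
                    new LowerCaseFilter(
                            new ShingleMatrixFilter(new StandardTokenizer(reader), 2, 2, ' ')),
                    StopAnalyzer.ENGLISH_STOP_WORDS);
        }
    }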
The two integer parameters of the ShingleMatrixFilter are the minimum and maximum shingle size; the last parameter is the spacer character placed between the tokens of a shingle. “Shingle” is just another name for a token n-gram. Shingles are popular as basic units for solving problems such as spell checking and near-duplicate detection.
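To make the meaning of the minimum and maximum shingle size concrete, here is a minimal plain-Java sketch that builds shingles from a list of tokens. It does not use Lucene and is not how ShingleMatrixFilter is implemented; the class name `ShingleSketch` and the method `shingles` are made up for this illustration, and the order in which shingles of different sizes are emitted is a choice of the sketch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustration only: a hypothetical helper, not part of Lucene.
final class ShingleSketch {

    // Returns every run of minSize..maxSize consecutive tokens,
    // joined with the given spacer character.
    static List<String> shingles(List<String> tokens, int minSize, int maxSize, char spacer) {
        List<String> result = new ArrayList<>();
        for (int start = 0; start < tokens.size(); start++) {
            StringBuilder current = new StringBuilder();
            // Grow the shingle one token at a time, starting at 'start'.
            for (int size = 1; size <= maxSize && start + size <= tokens.size(); size++) {
                if (size > 1) {
                    current.append(spacer);
                }
                current.append(tokens.get(start + size - 1));
                // Keep only shingles that reached the minimum size.
                if (size >= minSize) {
                    result.add(current.toString());
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("an", "easy", "way", "to");
        // Bi-grams only: prints [an easy, easy way, way to]
        System.out.println(shingles(tokens, 2, 2, ' '));
        // Bi-grams and tri-grams: prints [an easy, an easy way, easy way, easy way to, way to]
        System.out.println(shingles(tokens, 2, 3, ' '));
    }
}
```

With both sizes set to 2, as in the analyzer above, only bi-grams are produced; widening the range to 2 and 3 adds the tri-grams as well.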
Note also the use of a StandardTokenizer to deal with basic special characters like hyphens or other “disturbers”.
To use the analyzer, you can for instance do:
    public static void main(String[] args) {
        try {
            String str = "An easy way to write an analyzer for tokens bi-gram (or even tokens n-grams) with lucene";
            Analyzer analyzer = new NGramAnalyzer();
            TokenStream stream = analyzer.tokenStream("content", new StringReader(str));
            // Walk the stream, reusing a single Token instance, and print each shingle.
            Token token = new Token();
            while ((token = stream.next(token)) != null) {
                System.out.println(token.term());
            }
        } catch (IOException ie) {
            System.out.println("IO Error " + ie.getMessage());
        }
    }
This prints:
    an easy
    easy way
    way to
    to write
    write an
    an analyzer
    analyzer for
    for tokens
    tokens bi
    bi gram
    gram or
    or even
    even tokens
    tokens n
    n grams
    grams with
    with lucene
Note that the text “bi-gram” was split into two separate tokens, which is the desired consequence of passing a StandardTokenizer to the ShingleMatrixFilter constructor.
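The effect of the tokenization step on hyphenated words can be illustrated without Lucene. The sketch below is only a rough stand-in: it assumes a tokenizer that lower-cases the text and splits on anything that is not a letter or a digit, whereas StandardTokenizer follows a more elaborate grammar. The class name `SimpleTokenizerSketch` and the method `tokenize` are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Illustration only: a simplified tokenizer, not Lucene's StandardTokenizer.
final class SimpleTokenizerSketch {

    // Lower-cases the text and splits it on every run of characters
    // that are neither letters nor digits (spaces, hyphens, parentheses...).
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String piece : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            // split() can yield an empty first piece when the text starts with a separator.
            if (!piece.isEmpty()) {
                tokens.add(piece);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [tokens, bi, gram, or, even, tokens, n, grams]
        System.out.println(tokenize("tokens bi-gram (or even tokens n-grams)"));
    }
}
```

Because the hyphen acts as a separator here, “bi” and “gram” reach the shingle step as two tokens, which is why shingles such as “tokens bi” and “bi gram” show up in the output.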
