
If you need to extract the token n-grams of a string, you can use the facilities offered by Apache Lucene analyzers.
All you have to do is build your own analyzer using a ShingleMatrixFilter with the parameters that suit your needs. For instance, here are the few lines of code needed to build a token bi-gram analyzer:
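    import java.io.Reader;
    import org.apache.lucene.analysis.*;
    import org.apache.lucene.analysis.shingle.ShingleMatrixFilter;
    import org.apache.lucene.analysis.standard.StandardTokenizer;

    public class NGramAnalyzer extends Analyzer {
        @Override
        public TokenStream tokenStream(String fieldName, Reader reader) {
            // Tokenize, build shingles of size 2 to 2 (bi-grams) joined by a space, then lowercase
            TokenStream shingles = new LowerCaseFilter(
                    new ShingleMatrixFilter(new StandardTokenizer(reader), 2, 2, ' '));
            return new StopFilter(shingles, StopAnalyzer.ENGLISH_STOP_WORDS);
        }
    }

The two integer parameters of the ShingleMatrixFilter state the minimum and maximum shingle size, and the last parameter is the spacer character placed between the tokens of a shingle. “Shingle” is just another name for a token n-gram; shingles are popular as basic units for problems such as spell checking and near-duplicate detection.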
Note also the use of a StandardTokenizer to deal with basic special characters like hyphens or other “disturbers”.
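To make the minimum and maximum shingle size parameters more concrete, here is a minimal plain-Java sketch of what shingling does. It does not use Lucene: the class ShingleSketch and its shingles method are hypothetical names introduced for illustration only, and the order in which Lucene emits shingles of different sizes may differ from this sketch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ShingleSketch {

    // Hypothetical helper, not part of Lucene: returns every token n-gram whose
    // size lies between minSize and maxSize (inclusive), joined by the spacer.
    public static List<String> shingles(List<String> tokens, int minSize, int maxSize, char spacer) {
        List<String> result = new ArrayList<>();
        for (int start = 0; start < tokens.size(); start++) {
            for (int size = minSize; size <= maxSize && start + size <= tokens.size(); size++) {
                result.add(String.join(String.valueOf(spacer), tokens.subList(start, start + size)));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("an", "easy", "way", "to", "write");
        // min = max = 2 yields only bi-grams
        System.out.println(shingles(tokens, 2, 2, ' '));
        // min = 2, max = 3 yields bi-grams and tri-grams
        System.out.println(shingles(tokens, 2, 3, ' '));
    }
}
```

With the minimum and maximum both set to 2, as in the analyzer above, only pairs of adjacent tokens are produced; widening the range to 2 and 3 adds the tri-grams as well.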
To use the analyzer, you can for instance do:
    // Requires java.io.IOException, java.io.StringReader and
    // org.apache.lucene.analysis.{Analyzer, Token, TokenStream}
    public static void main(String[] args) {
        try {
            String str = "An easy way to write an analyzer for tokens bi-gram (or even tokens n-grams) with lucene";
            Analyzer analyzer = new NGramAnalyzer();
            TokenStream stream = analyzer.tokenStream("content", new StringReader(str));
            // Reuse a single Token instance while iterating over the stream
            Token token = new Token();
            while ((token = stream.next(token)) != null) {
                System.out.println(token.term());
            }
        } catch (IOException ie) {
            System.out.println("IO Error " + ie.getMessage());
        }
    }

The output will be:
an easy
easy way
way to
to write
write an
an analyzer
analyzer for
for tokens
tokens bi
bi gram
gram or
or even
even tokens
tokens n
n grams
grams with
with lucene
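Note that the text “bi-gram” was treated as two separate tokens, “bi” and “gram”, which is the desired consequence of passing a StandardTokenizer to the ShingleMatrixFilter. Note also that shingles containing stop words, such as “an easy”, are still printed: the StopFilter runs after shingling, so it compares whole shingles, not individual words, against the stop list.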
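The effect of the tokenizer on the resulting bi-grams can be illustrated without Lucene. The sketch below is only an approximation under stated assumptions: the class TokenizerEffectSketch and its methods are hypothetical, and the regular expression is a crude stand-in for the StandardTokenizer, whose actual grammar is more elaborate.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class TokenizerEffectSketch {

    // Hypothetical helper: splits on the given regex, lowercases, drops empty tokens.
    public static List<String> tokenize(String text, String separatorRegex) {
        List<String> tokens = new ArrayList<>();
        for (String part : text.split(separatorRegex)) {
            if (!part.isEmpty()) {
                tokens.add(part.toLowerCase(Locale.ROOT));
            }
        }
        return tokens;
    }

    // Joins each pair of adjacent tokens with a single space.
    public static List<String> bigrams(List<String> tokens) {
        List<String> result = new ArrayList<>();
        for (int i = 0; i + 1 < tokens.size(); i++) {
            result.add(tokens.get(i) + " " + tokens.get(i + 1));
        }
        return result;
    }

    public static void main(String[] args) {
        String text = "tokens bi-gram (or even";
        // Whitespace only: "bi-gram" and "(or" stay as single tokens.
        System.out.println(bigrams(tokenize(text, "\\s+")));
        // Any run of non-alphanumeric characters separates tokens:
        // "bi-gram" becomes "bi" and "gram", and the parenthesis disappears.
        System.out.println(bigrams(tokenize(text, "[^A-Za-z0-9]+")));
    }
}
```

The second call reproduces the “tokens bi”, “bi gram”, “gram or”, “or even” lines of the output above, whereas splitting on whitespace alone would keep “bi-gram” and “(or” intact.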
