You have a collection of text documents, and you want to build their TF-IDF weight vectors, probably before clustering the collection or doing other related tasks.
For instance, you would like to be able to see which tokens have the biggest TF-IDF weights in any given document of the collection.
Lucene and Mahout can help you do that almost in a snap.
Step 1: Build a Lucene index out of your document collection
If you don’t know how to build a Lucene index, check the links at the end of the post.
The only two important things in that step are to have a field in your index that can serve as a document id, and to enable term vectors on the text field representing the content of your documents.
So your indexing code should contain at least two lines similar to:
doc.add(new Field("documentId", documentId, Field.Store.YES, Field.Index.NOT_ANALYZED));
doc.add(new Field("content", content, Field.Store.YES, Field.Index.ANALYZED,TermVector.YES));
Step 2: Use Mahout's lucene.vector driver to generate weighted vectors from your Lucene index
That step is well described here. It also explains how to generate the vectors from a directory of text documents. I used Lucene because my documents were in a data store, and building the Lucene index out of it was just much more flexible and convenient.
You should then end up executing a command similar to:
./mahout lucene.vector --dir "myLuceneIndexDirectory" --output "outputVectorPathAndFilename" --dictOut "outputDictionaryPathAndFilename" -f content -i documentId -w TFIDF
Mahout will generate for you:
- a dictionary of all tokens found in the document collection (tokenized with the Tokenizer you used in step 1, which you might tune depending on your needs)
- a binary SequenceFile (a Hadoop class) that contains all the TF-IDF weighted vectors.
Step 3: Play with the generated vector file
Now, let’s say that for a given document id you want to see which tokens received the biggest weights, in order to get a feel for the most significant tokens of that document (as the weighting scheme sees them).
To do so, you can for instance load the content of the generated dictionary file into a Map with the token indexes as keys and the tokens as values. Let’s call that map dictionaryMap.
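Here is a minimal sketch of that loading step; it assumes the dictionary was written as a plain-text file with the token as the first field and its index as the last field on each line (check the exact format produced by your Mahout version and adjust the parsing accordingly):
Map<Integer, String> dictionaryMap = new HashMap<Integer, String>();
BufferedReader dictReader = new BufferedReader(new FileReader("outputDictionaryPathAndFilename"));
String line;
while ((line = dictReader.readLine()) != null) {
    // expected format (assumption): token ... index, separated by tabs
    String[] parts = line.split("\t");
    if (parts.length < 2) {
        continue; // skip empty or malformed lines
    }
    try {
        dictionaryMap.put(Integer.parseInt(parts[parts.length - 1].trim()), parts[0]);
    } catch (NumberFormatException nfe) {
        // not a data line (e.g. a header), ignore it
    }
}
dictReader.close();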
Then you’ll have to walk through the generated binary file containing the vectors. By playing a little bit with the sequence file and the Mahout source code, you quickly figure out which objects you have to manipulate in order to access the vectors' content in a structured way:
Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(conf);
String vectorsPath = args[1];
Path path = new Path(vectorsPath);
// the vectors are stored as <LongWritable, VectorWritable> pairs in the sequence file
SequenceFile.Reader reader = new SequenceFile.Reader(fs, path, conf);
LongWritable key = new LongWritable();
VectorWritable value = new VectorWritable();
while (reader.next(key, value)) {
    // each value wraps a NamedVector whose name is the documentId field from the index
    NamedVector namedVector = (NamedVector) value.get();
    RandomAccessSparseVector vect = (RandomAccessSparseVector) namedVector.getDelegate();
    for (Element e : vect) {
        System.out.println("Token: " + dictionaryMap.get(e.index()) + ", TF-IDF weight: " + e.get());
    }
}
reader.close();
The important things to note in that code are the following:
- namedVector.getName() contains the documentId
- e.index() contains the index of the token as present in the dictionary output file, so you can get the token itself with dictionaryMap.get(e.index())
- e.get() contains the weight itself
From there you’ll easily be able to plug in your own code to do whatever you want with the tokens and their weights, like printing the tokens with the biggest weights in a given document.
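For example, a minimal sketch of that last idea could look like this (it reuses dictionaryMap and vect from the snippets above; topN is an arbitrary cut-off, and the weights are copied into a map because Mahout’s vector iterators may reuse the Element instance):
final int topN = 10;
final Map<Integer, Double> weights = new HashMap<Integer, Double>();
for (Element e : vect) {
    if (e.get() != 0) {
        weights.put(e.index(), e.get());
    }
}
List<Integer> indexes = new ArrayList<Integer>(weights.keySet());
Collections.sort(indexes, new Comparator<Integer>() {
    public int compare(Integer a, Integer b) {
        return Double.compare(weights.get(b), weights.get(a)); // biggest weights first
    }
});
for (Integer index : indexes.subList(0, Math.min(topN, indexes.size()))) {
    System.out.println(dictionaryMap.get(index) + " -> " + weights.get(index));
}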
This can give you insights to tune your weighting model. For example, you will quickly observe that typing errors often get a very high weight, which makes sense in the TF-IDF weighting scheme (unless the same typing error is very frequent in your document collection), and thus you might want to fix that.
It is also useful just to understand a little bit more about how Mahout represents the data internally.
Useful links:
- Not updated for the latest Lucene version, but this blog post should get you started quickly on how to build a Lucene index.
- My favorite Lucene book: Lucene in Action, Second Edition. Already a reference.
- The best book on Mahout so far: Mahout in Action. Each chapter is pure gold.
- How to build the latest version of Mahout from source.