{"id":1084,"date":"2010-12-30T09:56:25","date_gmt":"2010-12-30T14:56:25","guid":{"rendered":"http:\/\/philippeadjiman.com\/blog\/?p=1084"},"modified":"2025-07-18T13:47:34","modified_gmt":"2025-07-18T13:47:34","slug":"how-to-easily-build-and-observe-tf-idf-weight-vectors-with-lucene-and-mahout","status":"publish","type":"post","link":"https:\/\/philippeadjiman.com\/blog\/2010\/12\/30\/how-to-easily-build-and-observe-tf-idf-weight-vectors-with-lucene-and-mahout\/","title":{"rendered":"How To Easily Build And Observe TF-IDF Weight Vectors With Lucene And Mahout"},"content":{"rendered":"\n<figure class=\"wp-block-image aligncenter size-large is-resized\"><img data-recalc-dims=\"1\" decoding=\"async\" width=\"1024\" height=\"259\" loading=\"lazy\" src=\"https:\/\/i0.wp.com\/philippeadjiman.com\/blog\/wp-content\/uploads\/2010\/12\/c4fd8809-a608-4da3-9893-cd7bd0d6b21f.png?resize=1024%2C259&#038;ssl=1\" alt=\"\" class=\"wp-image-1890\" style=\"width:380px;height:auto\" srcset=\"https:\/\/i0.wp.com\/philippeadjiman.com\/blog\/wp-content\/uploads\/2010\/12\/c4fd8809-a608-4da3-9893-cd7bd0d6b21f.png?resize=1024%2C259&amp;ssl=1 1024w, https:\/\/i0.wp.com\/philippeadjiman.com\/blog\/wp-content\/uploads\/2010\/12\/c4fd8809-a608-4da3-9893-cd7bd0d6b21f.png?resize=300%2C76&amp;ssl=1 300w, https:\/\/i0.wp.com\/philippeadjiman.com\/blog\/wp-content\/uploads\/2010\/12\/c4fd8809-a608-4da3-9893-cd7bd0d6b21f.png?resize=768%2C194&amp;ssl=1 768w, https:\/\/i0.wp.com\/philippeadjiman.com\/blog\/wp-content\/uploads\/2010\/12\/c4fd8809-a608-4da3-9893-cd7bd0d6b21f.png?resize=1536%2C388&amp;ssl=1 1536w, https:\/\/i0.wp.com\/philippeadjiman.com\/blog\/wp-content\/uploads\/2010\/12\/c4fd8809-a608-4da3-9893-cd7bd0d6b21f.png?w=1780&amp;ssl=1 1780w\" sizes=\"auto, (max-width: 1000px) 100vw, 1000px\" \/><\/figure>\n\n\n<p>You have a collection of text documents, and you want to build their <a href=\"http:\/\/en.wikipedia.org\/wiki\/Tf%E2%80%93idf\">TF-IDF<\/a> weight vectors, probably before doing 
some <a href=\"http:\/\/alias-i.com\/lingpipe\/demos\/tutorial\/cluster\/read-me.html\" target=\"_blank\" rel=\"noopener\">clustering<\/a> on the collection or other related tasks.<\/p>\n<p>You would like, for instance, to see which tokens have the highest TF-IDF weights in any given document of the collection.<\/p>\n<p><a href=\"http:\/\/lucene.apache.org\/java\/docs\/index.html\" target=\"_blank\" rel=\"noopener\">Lucene<\/a> and <a href=\"http:\/\/mahout.apache.org\/\" target=\"_blank\" rel=\"noopener\">Mahout<\/a> can help you do that almost in a snap.<\/p>\n<h3><strong>Step 1: Build a Lucene index out of your document collection<\/strong><\/h3>\n<p>If you don&#8217;t know how to build a Lucene index, check the links at the end of the post.<\/p>\n<p>The only two important things in that step are to have a field in your index that can serve as a document id, and to enable term vectors on the text field holding the content of your documents.<\/p>\n<p>So your indexing code should contain at least two lines similar to:<\/p>\n<pre lang=\"java\">\/\/ Stored, non-analyzed field that serves as the document id\ndoc.add(new Field(\"documentId\", documentId, Field.Store.YES, Field.Index.NOT_ANALYZED));\n\/\/ Analyzed content field with term vectors enabled\ndoc.add(new Field(\"content\", content, Field.Store.YES, Field.Index.ANALYZED, Field.TermVector.YES));<\/pre>\n<h3><strong>Step 2: Use the Mahout lucene.vector driver to generate weighted vectors from your Lucene index<\/strong><\/h3>\n<p>That step is well described <a href=\"https:\/\/cwiki.apache.org\/confluence\/display\/MAHOUT\/Creating+Vectors+from+Text\">here<\/a>. It also explains how to generate the vectors from a directory of text documents. 
I used Lucene because my documents were in a data store, and building the Lucene index from it was more flexible and convenient.<\/p>\n<p>You should then end up running a command similar to:<\/p>\n<pre lang=\"script\"> .\/mahout lucene.vector --dir \"myLuceneIndexDirectory\" --output \"outputVectorPathAndFilename\" --dictOut \"outputDictionaryPathAndFilename\" -f content -i documentId -w TFIDF<\/pre>\n<p>Mahout will generate for you:<\/p>\n<ul>\n<li>A dictionary of all tokens found in the document collection (tokenized with the <a href=\"http:\/\/lucene.apache.org\/java\/3_0_3\/api\/core\/org\/apache\/lucene\/analysis\/Tokenizer.html\" target=\"_blank\" rel=\"noopener\">Tokenizer<\/a> you used in step 1, which you can tune to your needs).<\/li>\n<li>A binary <a href=\"http:\/\/hadoop.apache.org\/common\/docs\/current\/api\/org\/apache\/hadoop\/io\/SequenceFile.html\" target=\"_blank\" rel=\"noopener\">SequenceFile<\/a> (a class from <a href=\"http:\/\/hadoop.apache.org\/\" target=\"_blank\" rel=\"noopener\">Hadoop<\/a>) that contains all the TF-IDF weighted vectors.<\/li>\n<\/ul>\n
<h3><strong>Step 3: Play with the generated vector file<\/strong><\/h3>\n<p>Now, let&#8217;s say that, for a given document id, you want to see which tokens received the highest weights, to get a feel for the most significant tokens of that document (as the weighting scheme sees it).<\/p>\n<p>To do so, you can for instance load the content of the generated dictionary file into a Map with token indexes as keys and tokens as values. Let&#8217;s call that map dictionaryMap.<\/p>\n<p>Then you&#8217;ll have to walk through the generated binary file containing the vectors. By playing a little with the sequence file and the Mahout source code, you quickly find the important objects you have to manipulate in order to access the vectors&#8217; content in a structured way:<\/p>\n<pre lang=\"java\">import org.apache.hadoop.conf.Configuration;\nimport org.apache.hadoop.fs.FileSystem;\nimport org.apache.hadoop.fs.Path;\nimport org.apache.hadoop.io.LongWritable;\nimport org.apache.hadoop.io.SequenceFile;\nimport org.apache.mahout.math.NamedVector;\nimport org.apache.mahout.math.RandomAccessSparseVector;\nimport org.apache.mahout.math.Vector.Element;\nimport org.apache.mahout.math.VectorWritable;\n\nConfiguration conf = new Configuration();\nFileSystem fs = FileSystem.get(conf);\nString vectorsPath = args[1];\nPath path = new Path(vectorsPath);\n\n\/\/ Each record of the sequence file holds one document vector\nSequenceFile.Reader reader = new SequenceFile.Reader(fs, path, conf);\nLongWritable key = new LongWritable();\nVectorWritable value = new VectorWritable();\nwhile (reader.next(key, value)) {\n\tNamedVector namedVector = (NamedVector) value.get();\n\tRandomAccessSparseVector vect = (RandomAccessSparseVector) namedVector.getDelegate();\n\n\t\/\/ Print each token of the document with its weight\n\tfor (Element e : vect) {\n\t\tSystem.out.println(\"Token: \" + dictionaryMap.get(e.index()) + \", TF-IDF weight: \" + e.get());\n\t}\n}\nreader.close();<\/pre>\n<p>The important things to get in that code are the following:<\/p>\n<ul>\n<li>namedVector.getName() will 
contain the documentId<\/li>\n<li><span style=\"font-family: Consolas, Monaco, 'Courier New', Courier, monospace;\">e.index()<\/span> contains the index of the token as present in the dictionary output file, so you can get the token itself using<br \/><span style=\"font-family: Consolas, Monaco, 'Courier New', Courier, monospace;\"> dictionaryMap.get(e.index())<\/span><\/li>\n<li><span style=\"font-family: Consolas, Monaco, 'Courier New', Courier, monospace;\"> e.get()<\/span> contains the weight itself<\/li>\n<\/ul>\n<p>From there you can easily plug in your own code to do whatever you want with the tokens and their weights, like printing the tokens with the highest weights in a given document.<\/p>\n<p>It can be insightful for tuning your weighting model. E.g. you can quickly observe that typing errors often get a very high weight, which makes sense in the TF-IDF weighting scheme: a misspelled token appears in very few documents, so its inverse document frequency is high (unless the typing error is very frequent in your document collection). You might therefore want to fix those errors.<\/p>\n<p>It is also useful just to understand a little more of how Mahout represents the data internally.<\/p>\n<p><strong>Useful links:<\/strong><\/p>\n<ul>\n<li>Not updated for the latest Lucene version, but <a href=\"http:\/\/www.lucenetutorial.com\/lucene-in-5-minutes.html\" target=\"_blank\" rel=\"noopener\">that blog post<\/a> should get you started quickly on how to build a Lucene index.<\/li>\n<li>My favorite Lucene book: <a href=\"http:\/\/www.manning.com\/hatcher3\/\" target=\"_blank\" rel=\"noopener\">Lucene in Action, Second Edition<\/a>. Already a reference.<\/li>\n<li>The best book around Mahout so far: <a href=\"http:\/\/www.manning.com\/owen\/\" target=\"_blank\" rel=\"noopener\">Mahout in Action<\/a>. 
Each chapter is pure gold.<\/li>\n<li>How to <a href=\"https:\/\/cwiki.apache.org\/MAHOUT\/buildingmahout.html\" target=\"_blank\" rel=\"noopener\">build the latest code version of mahout<\/a>.<\/li>\n<\/ul>\n<p><\/p>","protected":false},"excerpt":{"rendered":"<p>Want to peek inside TF-IDF weights? Here\u2019s a quick way to build and analyze them without the headache.<\/p>\n","protected":false},"author":1,"featured_media":1890,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"nf_dc_page":"","_jetpack_memberships_contains_paid_content":false,"footnotes":""},"categories":[1],"tags":[],"class_list":["post-1084","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-uncategorized"],"jetpack_featured_media_url":"https:\/\/i0.wp.com\/philippeadjiman.com\/blog\/wp-content\/uploads\/2010\/12\/c4fd8809-a608-4da3-9893-cd7bd0d6b21f.png?fit=1780%2C450&ssl=1","jetpack_sharing_enabled":true,"_links":{"self":[{"href":"https:\/\/philippeadjiman.com\/blog\/wp-json\/wp\/v2\/posts\/1084","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/philippeadjiman.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/philippeadjiman.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/philippeadjiman.com\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/philippeadjiman.com\/blog\/wp-json\/wp\/v2\/comments?post=1084"}],"version-history":[{"count":4,"href":"https:\/\/philippeadjiman.com\/blog\/wp-json\/wp\/v2\/posts\/1084\/revisions"}],"predecessor-version":[{"id":1959,"href":"https:\/\/philippeadjiman.com\/blog\/wp-json\/wp\/v2\/posts\/1084\/revisions\/1959"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/philippeadjiman.com\/blog\/wp-json\/wp\/v2\/media\/1890"}],"wp:attachment":[{"href":"https:\/\/philippeadjiman.com\/blog\/wp-json\/wp\/v2\/media?parent=1084"}],"wp:term":[{"taxonomy":"category",
"embeddable":true,"href":"https:\/\/philippeadjiman.com\/blog\/wp-json\/wp\/v2\/categories?post=1084"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/philippeadjiman.com\/blog\/wp-json\/wp\/v2\/tags?post=1084"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}