{"id":535,"date":"2009-11-11T10:27:43","date_gmt":"2009-11-11T15:27:43","guid":{"rendered":"http:\/\/philippeadjiman.com\/blog\/?p=535"},"modified":"2025-07-18T13:56:48","modified_gmt":"2025-07-18T13:56:48","slug":"flexible-collaborative-filtering-in-java-with-mahout-taste","status":"publish","type":"post","link":"https:\/\/philippeadjiman.com\/blog\/2009\/11\/11\/flexible-collaborative-filtering-in-java-with-mahout-taste\/","title":{"rendered":"Flexible Collaborative Filtering In JAVA With Mahout Taste"},"content":{"rendered":"\n<figure class=\"wp-block-image size-full\"><img data-recalc-dims=\"1\" decoding=\"async\" width=\"956\" height=\"400\" loading=\"lazy\" src=\"https:\/\/i0.wp.com\/philippeadjiman.com\/blog\/wp-content\/uploads\/2009\/11\/mahout-logo-transparent-400.png?resize=956%2C400&#038;ssl=1\" alt=\"\" class=\"wp-image-1936\" srcset=\"https:\/\/i0.wp.com\/philippeadjiman.com\/blog\/wp-content\/uploads\/2009\/11\/mahout-logo-transparent-400.png?w=956&amp;ssl=1 956w, https:\/\/i0.wp.com\/philippeadjiman.com\/blog\/wp-content\/uploads\/2009\/11\/mahout-logo-transparent-400.png?resize=300%2C126&amp;ssl=1 300w, https:\/\/i0.wp.com\/philippeadjiman.com\/blog\/wp-content\/uploads\/2009\/11\/mahout-logo-transparent-400.png?resize=768%2C321&amp;ssl=1 768w\" sizes=\"auto, (max-width: 956px) 100vw, 956px\" \/><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">I recently had to quickly build a prototype of a recommendation engine for a promising start-up company. I wanted to first test state-of-the-art <a href=\"http:\/\/www.google.com\/search?q=collaborative+filtering\" target=\"_blank\" rel=\"noreferrer noopener\">collaborative filtering<\/a> algorithms before building a customized solution (potentially on top of those algorithms). Most importantly, I wanted to be able to quickly compare all the different algorithm configurations I would come up with. 
<a href=\"http:\/\/lucene.apache.org\/mahout\/taste.html\" target=\"_blank\" rel=\"noreferrer noopener\">Mahout Taste<\/a> (previously a <a href=\"http:\/\/taste.sourceforge.net\/\" target=\"_blank\" rel=\"noreferrer noopener\">SourceForge project<\/a>, but recently promoted to the <a href=\"http:\/\/lucene.apache.org\/mahout\/\" target=\"_blank\" rel=\"noreferrer noopener\">Apache Mahout project<\/a>) answered all those needs in one place.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">I describe below how, in a few easy steps, I got set up to express my creativity without having to reinvent the wheel. This tutorial is based on the 0.2 release of Mahout.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Step 1: Set up your environment with Mahout Taste<\/strong><\/p>\n\n\n\n<p class=\"wp-block-paragraph\">I usually use Eclipse with Maven and simply add a dependency, but the Mahout POM had some repository issues at the time I tried, so I worked around them by adding the required libraries to Eclipse manually (all the libraries found in the directory lucene\/mahout\/trunk\/core\/target\/dependency of their <a href=\"http:\/\/lucene.apache.org\/mahout\/releases.html#Downloads\" target=\"_blank\" rel=\"noreferrer noopener\">latest release<\/a>).<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Step 2: Familiarize yourself by building a simple recommendation engine based on the MovieLens data<\/strong><\/p>\n\n\n\n<p class=\"wp-block-paragraph\">To see a recommender engine in action, you can for instance download one of the <a href=\"http:\/\/www.grouplens.org\/node\/73\" target=\"_blank\" rel=\"noreferrer noopener\">MovieLens ratings data sets<\/a> (I chose the one with one million ratings). Unzip the archive somewhere. The file that will interest you is ratings.dat. 
Its format is as follows:<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">userId::movieId::rating::timestamp<\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">The basic Mahout Taste FileDataModel only accepts the following simple format:<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">userId,movieId,rating<\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">There are many ways to transform your original file into that format; I used the following simple Perl command:<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">perl -F\"::\" -alne 'print \"@F[0],@F[1],@F[2]\"' ratings.dat &gt; ratingsForMahout.dat<\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">You may be thinking: &#8220;what about the timestamp information?&#8221; You&#8217;re right, it is pretty crucial information, given that temporal dynamics is what made the difference for the winning team of the <a href=\"http:\/\/www.netflixprize.com\/\" target=\"_blank\" rel=\"noreferrer noopener\">Netflix Prize<\/a> (BTW, if you&#8217;re interested in the subject, you must see <a href=\"http:\/\/videolectures.net\/kdd09_koren_cftd\/\" target=\"_blank\" rel=\"noreferrer noopener\">this video<\/a> of Yehuda Koren&#8217;s lecture at KDD).<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">So don&#8217;t worry: you can later write your own DataModel class that parses any information you want. You&#8217;ll just have to implement the DataModel interface (you can also extend the FileDataModel class).<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">To obtain your first recommendations in a few lines of code, you can use<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">import java.io.File;\nimport java.io.FileNotFoundException;\nimport java.util.List;\n\nimport org.apache.mahout.cf.taste.common.TasteException;\nimport org.apache.mahout.cf.taste.impl.model.file.FileDataModel;\nimport org.apache.mahout.cf.taste.impl.recommender.CachingRecommender;\nimport org.apache.mahout.cf.taste.impl.recommender.slopeone.SlopeOneRecommender;\nimport 
org.apache.mahout.cf.taste.model.DataModel;\nimport org.apache.mahout.cf.taste.recommender.RecommendedItem;\n\npublic class MahoutPlaying {\n\tpublic static void main(String[] args) throws FileNotFoundException, TasteException {\n\t\t\/\/ load the userId,movieId,rating file produced by the Perl command\n\t\tDataModel model = new FileDataModel(new File(\"\/home\/padjiman\/data\/movieLens\/mahout\/ratingsForMahout.dat\"));\n\t\t\/\/ wrap the slope one recommender in a cache\n\t\tCachingRecommender cachingRecommender = new CachingRecommender(new SlopeOneRecommender(model));\n\n\t\t\/\/ ask for 10 recommendations for user 1\n\t\tList&lt;RecommendedItem&gt; recommendations = cachingRecommender.recommend(1, 10);\n\t\tfor (RecommendedItem recommendedItem : recommendations) {\n\t\t\tSystem.out.println(recommendedItem);\n\t\t}\n\n\t}\n}<\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">which creates, in a few lines of code, a <a href=\"http:\/\/www.daniel-lemire.com\/fr\/abstracts\/SDM2005.html\" target=\"_blank\" rel=\"noreferrer noopener\">slope one<\/a> recommendation engine and prints the first 10 recommendations for user 1. You&#8217;ll only see movieIds there, so you&#8217;ll have to check the file movies.dat to see the actual movie titles (you can also write a simple method or script that directly shows you the movie title if you want to play with several users or create your own user).<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">You can replace the slope one recommender with any other recommendation engine provided in the package. 
For instance, let&#8217;s say you want to use a classic user-based recommender algorithm using the Pearson correlation similarity with a neighborhood of the 3 nearest users: simply replace the line that builds the recommender in the above code with the code below:<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">UserSimilarity userSimilarity = new PearsonCorrelationSimilarity(model);\nUserNeighborhood neighborhood = new NearestNUserNeighborhood(3, userSimilarity, model);\nRecommender recommender = new GenericUserBasedRecommender(model, neighborhood, userSimilarity);\nRecommender cachingRecommender = new CachingRecommender(recommender);<\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">A few issues you might run into during step 2:<br>&#8211; OutOfMemoryError: the slope one recommender is pretty memory-greedy, and on the one-million-rating MovieLens data set you may have to set the -Xmx VM option to 1024m (in Eclipse, just add -Xmx1024m to the VM arguments in the run configuration options).<br>&#8211; Some errors during the FileDataModel initialization: make sure that the directory containing the file to parse does not contain other files starting with the same name; for some reason this disturbs the DataModel initialization in some cases.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Step 3: Test the relevance of the algorithms<\/strong><\/p>\n\n\n\n<p class=\"wp-block-paragraph\">In my opinion, this is the most valuable part of the whole process. To get an immediate feel for whether your intuition in choosing a particular algorithm is a good one, or to see the good or bad impact of your own customized algorithm, you need a way to evaluate and compare them on the data.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">You can easily do that with Mahout&#8217;s RecommenderEvaluator interface. Two different implementations of that interface are provided: AverageAbsoluteDifferenceRecommenderEvaluator and RMSRecommenderEvaluator. 
The first one computes the average absolute difference between predicted and actual ratings for users, and the second one computes the classic <a href=\"http:\/\/en.wikipedia.org\/wiki\/Root_mean_squared_error\" target=\"_blank\" rel=\"noreferrer noopener\">RMSE<\/a> (a.k.a. RMSD).<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Since I&#8217;m playing with a movie data set and the Netflix evaluation process was based on RMSE, here is an example that uses the RMSRecommenderEvaluator:<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">import java.io.File;\nimport java.io.IOException;\n\nimport org.apache.commons.cli2.OptionException;\nimport org.apache.mahout.cf.taste.common.TasteException;\nimport org.apache.mahout.cf.taste.eval.RecommenderBuilder;\nimport org.apache.mahout.cf.taste.eval.RecommenderEvaluator;\nimport org.apache.mahout.cf.taste.impl.eval.RMSRecommenderEvaluator;\nimport org.apache.mahout.cf.taste.impl.model.file.FileDataModel;\nimport org.apache.mahout.cf.taste.impl.recommender.CachingRecommender;\nimport org.apache.mahout.cf.taste.impl.recommender.slopeone.SlopeOneRecommender;\nimport org.apache.mahout.cf.taste.model.DataModel;\nimport org.apache.mahout.cf.taste.recommender.Recommender;\n\npublic final class EvaluationExample{\n\tpublic static void main(String... 
args) throws IOException, TasteException, OptionException {\n\n\t\tRecommenderBuilder builder = new RecommenderBuilder() {\n\t\t\tpublic Recommender buildRecommender(DataModel model) throws TasteException{\n\t\t\t\t\/\/ build any existing or customized recommendation algorithm here\n\t\t\t\treturn new CachingRecommender(new SlopeOneRecommender(model));\n\t\t\t}\n\t\t};\n\n\t\tRecommenderEvaluator evaluator = new RMSRecommenderEvaluator();\n\t\tDataModel model = new FileDataModel(new File(\"\/home\/padjiman\/data\/movieLens\/mahout\/ratingsForMahout.dat\"));\n\t\tdouble score = evaluator.evaluate(builder,\n\t\t\t\tnull,\n\t\t\t\tmodel,\n\t\t\t\t0.9, \/\/ fraction of each user preference list used to produce recommendations\n\t\t\t\t1);  \/\/ fraction of users included in the evaluation (here: all of them)\n\n\t\tSystem.out.println(score);\n\t}\n}<\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">Note that the evaluator needs a RecommenderBuilder, provided here as an inline (anonymous) implementation of the interface.<br>For a detailed description of the parameters of the evaluator, look at the javadoc in the source code (as of today, the one that you&#8217;ll find on the web is outdated since it covers Mahout release 0.1). 
But basically:<br>&#8211; 0.9 here is the fraction of each user&#8217;s preferences used to produce recommendations; the rest are compared to the estimated preference values to compute the score.<br>&#8211; 1 is the fraction of users to use in the evaluation (so here, all users).<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Result?<\/strong><\/p>\n\n\n\n<p class=\"wp-block-paragraph\">RMSE = 0.8988.<br>To give you a point of comparison, the Netflix baseline predictor (called Cinematch) had an RMSE of 0.9514 and the Grand Prize was for the team providing an improvement of 10% (note that this tutorial is not based on Netflix data but on MovieLens data).<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">The number itself doesn&#8217;t really matter here: the important thing is that the evaluator gives you an easy way to compare different algorithms, or the same algorithm with different settings (thresholds or other parameters).<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Step 4: Now start the real work&#8230;<\/strong><\/p>\n\n\n\n<p class=\"wp-block-paragraph\">You&#8217;ve probably guessed that you won&#8217;t win any prize using the recommenders provided by Mahout as-is :).<br>Depending on your data and on your needs, you may have to customize an existing algorithm, plug in a specific similarity measure, or create your very own recommender from scratch. All of those are pretty easy to do in Mahout.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Let&#8217;s say for instance that you want to exploit the category of the movies to build a specific user similarity that includes this information.<br>The first thing you will have to do is capture the new information about categories.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">To do so, you can for instance extend the FileDataModel class with another class that also parses the movies.dat file and builds relevant data structures to store the data about categories. I found it more convenient to build my own Statistics object. 
Then you will have to build a new user similarity. It is as simple as this:<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">import java.util.Collection;\n\nimport org.apache.mahout.cf.taste.common.Refreshable;\nimport org.apache.mahout.cf.taste.common.TasteException;\nimport org.apache.mahout.cf.taste.model.DataModel;\nimport org.apache.mahout.cf.taste.similarity.PreferenceInferrer;\nimport org.apache.mahout.cf.taste.similarity.UserSimilarity;\n\nimport com.padjiman.algo.Statistics;\n\npublic class ProfileSimilarity implements UserSimilarity {\n\n\tprivate final Statistics stats;\n\tprivate final DataModel dataModel;\n\n\tpublic ProfileSimilarity(Statistics stats, DataModel dataModel) {\n\t\tif (stats == null) {\n\t\t\tthrow new IllegalArgumentException(\"stats is null\");\n\t\t}\n\t\tif (dataModel == null) {\n\t\t\tthrow new IllegalArgumentException(\"dataModel is null\");\n\t\t}\n\t\tthis.dataModel = dataModel;\n\t\tthis.stats = stats;\n\t}\n\n\t@Override\n\tpublic double userSimilarity(long userID1, long userID2) throws TasteException {\n\t\t\/\/ build your similarity function here,\n\t\t\/\/ exploiting the stats and dataModel objects as you wish\n\t\t\/\/ (placeholder so that the class compiles; replace it with your own computation)\n\t\tthrow new UnsupportedOperationException(\"similarity not implemented yet\");\n\t}\n\n\t@Override\n\tpublic void refresh(Collection&lt;Refreshable&gt; alreadyRefreshed) {\n\t\t\/\/ TODO Auto-generated method stub\n\t}\n\n\t@Override\n\tpublic void setPreferenceInferrer(PreferenceInferrer inferrer) {\n\t\t\/\/ TODO Auto-generated method stub\n\t}\n}<\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">Complete the userSimilarity method with your own secret sauce. Et voila: you can now plug your new user similarity measure into a GenericUserBasedRecommender, for instance instead of the Pearson correlation similarity measure (shown in step 2), and simply compare which one is the best using your evaluator.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">You&#8217;re not satisfied with the GenericUserBasedRecommender or any other recommender provided by Mahout? No problem, try to implement your own. 
You&#8217;ll just have to start with a class declaration of this kind:<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">public class MostPopularItemUserBasedCombinedRecommender extends AbstractRecommender implements Recommender {\n\t\/\/ override the necessary methods\n}<\/pre>\n\n\n\n<p class=\"wp-block-paragraph\">Here again, you can use as a member of the class any customized object containing whatever statistics you judge relevant to build a better recommender. Then, again, plug your new recommender into the evaluator and compare, experiment, improve.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Conclusion<\/strong><\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Mahout Taste is a very flexible platform for experimenting with collaborative filtering algorithms. It certainly won&#8217;t provide you with an immediate solution to your recommendation problem, but you&#8217;ll easily be able to either tune the existing algorithms or plug your own creative ones into the Mahout Taste set of interfaces.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">By doing so, you&#8217;ll immediately get the benefit of a platform that lets you compare, tune and iteratively improve the results of your different algorithm configurations. Last but not least, Mahout Taste provides an external server that exposes recommendation logic to your application via web services and HTTP.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Other resources:<\/strong><\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li class=\"\">After reading this quick start guide, I recommend that you have a look at the official <a href=\"http:\/\/lucene.apache.org\/mahout\/taste.html\" target=\"_blank\" rel=\"noreferrer noopener\">Mahout Taste documentation<\/a>. 
As of today, it has not been updated for release 0.2, so you might find some old method signatures there, but you&#8217;ll find useful and complementary information about the big picture of the Mahout Taste design.<\/li>\n\n\n\n<li class=\"\">A <a href=\"http:\/\/www.ibm.com\/developerworks\/java\/library\/j-mahout\/\" target=\"_blank\" rel=\"noreferrer noopener\">nice article<\/a> on Mahout in general (not only the Taste part). I felt that Taste was not covered in enough detail there, in particular the testing part, which is why I wrote this tutorial.<\/li>\n<\/ul>\n\n\n\n<p class=\"wp-block-paragraph\"><\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Rapid prototyping approach to a recommendation engine using Mahout Taste\u2019s pluggable similarity and scoring components.<\/p>\n","protected":false},"author":1,"featured_media":1936,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"nf_dc_page":"","_jetpack_memberships_contains_paid_content":false,"footnotes":""},"categories":[11,17],"tags":[23,30,35],"class_list":["post-535","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-java","category-tutorial","tag-experiments","tag-mahout","tag-recommendation-engine"],"jetpack_featured_media_url":"https:\/\/i0.wp.com\/philippeadjiman.com\/blog\/wp-content\/uploads\/2009\/11\/mahout-logo-transparent-400.png?fit=956%2C400&ssl=1","jetpack_sharing_enabled":true,"_links":{"self":[{"href":"https:\/\/philippeadjiman.com\/blog\/wp-json\/wp\/v2\/posts\/535","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/philippeadjiman.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/philippeadjiman.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/philippeadjiman.com\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/philippeadjiman.com\/blog\/wp-json\/wp\/v2\/
comments?post=535"}],"version-history":[{"count":2,"href":"https:\/\/philippeadjiman.com\/blog\/wp-json\/wp\/v2\/posts\/535\/revisions"}],"predecessor-version":[{"id":1966,"href":"https:\/\/philippeadjiman.com\/blog\/wp-json\/wp\/v2\/posts\/535\/revisions\/1966"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/philippeadjiman.com\/blog\/wp-json\/wp\/v2\/media\/1936"}],"wp:attachment":[{"href":"https:\/\/philippeadjiman.com\/blog\/wp-json\/wp\/v2\/media?parent=535"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/philippeadjiman.com\/blog\/wp-json\/wp\/v2\/categories?post=535"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/philippeadjiman.com\/blog\/wp-json\/wp\/v2\/tags?post=535"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}