Archive for the ‘java’ Category

A Generic Method For Sorting (Google Collections) Multiset Per Entry Count

Saturday, February 20th, 2010

I’m regularly using the excellent google collections library (now final and part of the more general guava libraries). One of the data structure I’m using the most is probably the multiset (a.k.a bag). But most of the time, when I need a multiset to track the number of occurrences of particular entries, I almost always also need to know what is the most occurring entry (or the top N occurring entries).

Let’s take a canonical example: as you are parsing a text, you’re inserting each tokens into a multiset to track their number of occurrences and you simply want to know what are the top N most occurring tokens (ok, if you want to do it on terabytes of data, you might want to start learning hadoop :) ).

I need those kind of statistics  so frequently that I was surprised to not find an existing utility method allowing to sort the entries of a multiset per entry count (or number of occurrences). Here is my attempt to do it in a generic short and efficient way:

public <T> List<Entry<T>> sortMultisetPerEntryCount(Multiset<T> multiset){
	Comparator<Multiset.Entry<T>> occurence_comparator = new Comparator<Multiset.Entry<T>>() {
		public int compare(Multiset.Entry<T> e1, Multiset.Entry<T> e2) {
			return e2.getCount() - e1.getCount() ;
		}
	};
	List<Entry<T>> sortedByCount = new ArrayList<Entry<T>>(multiset.entrySet());
	Collections.sort(sortedByCount,occurence_comparator);
 
	return sortedByCount;
}

If you got any other better or most efficient way to do it (or if you know an existing utility method that does it), please share.

If you never used google collections, in addition to the official website, you might find those tutorials (1 and 2)  useful for an introduction.

Flexible Collaborative Filtering In JAVA With Mahout Taste

Wednesday, November 11th, 2009

Mahout-logo-164x200 I recently had to build quickly a prototype of recommendation engine for a promising start-up company. I wanted to first test state of the art collaborative filtering algorithms before to build a customized solution (potentially on top of those algorithms). Most importantly, I wanted to be able to compare quickly all the different algorithm configurations with which I would come up with. Mahout Taste (previously a sourceforge project but recently promoted to the Apache Mahout project) was simply answering all those needs in one place.

I describe below how in few easy steps, I was set up to express my creativity without having to reinvent the wheel. This tutorial is based on the 0.2 release of Mahout.

Step 1: Set up your environment with mahout taste

I usually use Eclipse with Maven to simply add a dependency but the mahout pom had some repository issues by the time I tried, so I worked around it by adding the required libraries in eclipse manually (all the libraries found in the directory lucene/mahout/trunk/core/target/dependency of their latest release).

Step 2: Familiarize yourself by building a simple recommendation engine based on the movie lens data

To see a recommender engine in action, you can for instance download one of the movie Lens ratings data sets (I choose the one with one million ratings). Unzip the archive somewhere. The file that will interest you is ratings.dat. Its format is as follows:

userId::movieId::rating::timestamp

The basic mahout taste FileDataModel only accept the simple following format:

userId,movieId,rating

There are many ways to transform your original file in that format, I used the simple following perl command:

perl -F"::" -alne 'print "@F[0],@F[1],@F[2]"' ratings.dat > ratingsForMahout.dat

You think: “what about the timestamp information???”. Yes, you right, it is a pretty crucial information given that it is based on temporal dynamics that the winning team of the Netflix prize made the difference (BTW, if you’re interested in the subject, you must see this video of Yehuda Koren’s lecture at KDD).

So, don’t worry, you can customize later your own DataModel class that parse any information you want, you’ll just have to implement the DataModel interface (you can also extends the FileDataModel class).

To obtain your first recommendations in few lines of code, you can use

import java.io.File;
import java.io.FileNotFoundException;
import java.util.List;
 
import org.apache.mahout.cf.taste.common.TasteException;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.recommender.CachingRecommender;
import org.apache.mahout.cf.taste.impl.recommender.slopeone.SlopeOneRecommender;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
 
public class MahoutPlaying {
	public static void main(String[] args) throws FileNotFoundException, TasteException {
		DataModel model;
		model = new FileDataModel(new File("/home/padjiman/data/movieLens/mahout/ratingsForMahout.dat"));
		CachingRecommender cachingRecommender = new CachingRecommender(new SlopeOneRecommender(model));
 
		List recommendations = cachingRecommender.recommend(1, 10);
		for (RecommendedItem recommendedItem : recommendations) {
			System.out.println(recommendedItem);
		}
 
	}
}

which creates in few lines of code a slope one recommendation engine and print the 10 first recommendations for user 1. You’ll see there only movieIds so you’ll have to check the file movies.dat to see the actual movie title (you can also write a simple method or script that shows you directly the movie title if you want to play with several users or to create your own user).

You can replace the slope one recommender with whatever other recommendation engine provided in the package. For instance, let’s say you want to use a classic user based recommender algorithm using the Pearson correlation similarity with a nearest 3 users neighborhood, simply replace the line that build the recommender in the above code by the code below:

UserSimilarity userSimilarity = new PearsonCorrelationSimilarity(model);
UserNeighborhood neighborhood = new NearestNUserNeighborhood(3, userSimilarity, model);
Recommender recommender = new GenericUserBasedRecommender(model, neighborhood, userSimilarity);
Recommender cachingRecommender = new CachingRecommender(recommender);

Few issues you might have during step 2:
- OutOfMemoryError: the slope recommender is pretty greedy and on the 1 Million Movie Lens Dataset, you may have to set the -Xmx VM option to 1024m (in eclipse, just add -Xmx1024m to the VM arguments in the run configuration options).
- Some errors during the FileDataModel initialization: pay attention that the directory containing your file to parse does not contains other files starting with the same name; for some reasons it disturbs the DataModel initialization in some cases.

Step 3: Test the relevance of the algorithms

In my opinion the most valuable part of the whole process. To feel immediately if your intuition of choosing a particular algorithm is a good one, or to see the good or bad impact of your own customized algorithm, you need a way to evaluate and compare them on the data.

You can easily do that with mahout RecommenderEvaluator interface. Two different implementations of that interface are given: AverageAbsoluteDifferenceRecommenderEvaluator and RMSRecommenderEvaluator. The first one is the average absolute difference between predicted and actual ratings for users and the second one is the classic RMSE (a.k.a. RMSD).

Since I’m playing with a movie dataset and that Netflix evaluation process was based on RMSE, I put here an example of use of the RMSRecommenderEvaluator:

import java.io.File;
import java.io.IOException;
 
import org.apache.commons.cli2.OptionException;
import org.apache.mahout.cf.taste.common.TasteException;
import org.apache.mahout.cf.taste.eval.RecommenderBuilder;
import org.apache.mahout.cf.taste.eval.RecommenderEvaluator;
import org.apache.mahout.cf.taste.impl.eval.RMSRecommenderEvaluator;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.recommender.CachingRecommender;
import org.apache.mahout.cf.taste.impl.recommender.slopeone.SlopeOneRecommender;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.recommender.Recommender;
 
public final class EvaluationExample{
	public static void main(String... args) throws IOException, TasteException, OptionException {
 
		RecommenderBuilder builder = new RecommenderBuilder() {
			public Recommender buildRecommender(DataModel model) throws TasteException{
				//build here whatever existing or customized recommendation algorithm
				return new CachingRecommender(new SlopeOneRecommender(model));
			}
		};
 
		RecommenderEvaluator evaluator = new RMSRecommenderEvaluator();
		DataModel model = new FileDataModel(new File("/home/padjiman/data/movieLens/mahout/ratingsForMahout.dat"));
		double score = evaluator.evaluate(builder,
				null,
				model,
				0.9,
				1);
 
		System.out.println(score);
	}
}

Note that the evaluator need a RecommenderBuilder provided here as an inline implementation of the interface.
For a detailed description of the parameter of the evaluator, look at the javadoc in the sourcecode (as of today, the one that you’ll found on the web is outdated since it concern mahout release 0.1). But basically:
- 0.9 here represents the percentage of each user’s preferences to use to produce recommendations, the rest are compared to estimated preference values to evaluate.
- 1 represent the percentage of users to use in evaluation (so here all users).

Result?

RMSE = 0.8988.
To give you a point of comparison, the Netflix baseline predictor (called Cinematch) had an RMSE of 0.9514 and the Grand Prize was for the team providing an improvement of 10% (not that this tutorial is not based on netflix data but on Movie Lens data).

The number not really matters here: the important thing is that it provide you an easy way to compare different algorithms or the same algorithm with different settings (thresholds or other parameters).

Step 4: Now start the real work…

You guessed that you won’t win any Prize using the recommenders given by Mahout as-is :) .
Depending on your data and on your needs, you may have either to simply customize an existing algorithm or to plug any specific similarity measure or to create your very own recommender from scratch. All of those are pretty easy to do in Mahout.

Let’s say for instance that you want to exploit the category of the movies to build a specific user similarity that includes this information.
What you will have to do is first to be able to capture the new information about categories.

To do so you can for instance extends the FileDataModel class to another class that also parses the movies.dat file and build relevant data structures to store the data about categories. I found more convenient to build my own Statistics object. Then you will have to build a new User similarity. It is as simple as that:

import org.apache.mahout.cf.taste.common.Refreshable;
import org.apache.mahout.cf.taste.common.TasteException;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.similarity.PreferenceInferrer;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;
 
import com.padjiman.algo.Statistics;
 
public class ProfileSimilarity implements UserSimilarity {
 
	private final Statistics stats;
	private final DataModel dataModel;
 
        public ProfileSimilarity(Statistics stats, DataModel dataModel) {
		super();
		if (stats == null) {
			throw new IllegalArgumentException("stats is null");
		}
		if (dataModel == null) {
		      throw new IllegalArgumentException("dataModel is null");
		    }
		this.dataModel = dataModel;
		this.stats = stats;
	}
 
	@Override
	public double userSimilarity(long userID1, long userID2)
	throws TasteException {
		//build your similarity function here
		//exploiting the stats and dataModel object as you wish
	}
 
	@Override
	public void refresh(Collection alreadyRefreshed) {
		// TODO Auto-generated method stub
	}
 
	@Override
	public void setPreferenceInferrer(PreferenceInferrer inferrer) {
		// TODO Auto-generated method stub
	}
}

Complete the method userSimilarity with you own secret sauce. Et voila: you can now plug your new user similarity measure in a GenericUserBasedRecommender, for instance instead of the Pearson correlation similarity measure (showed in step 2) and simply compare which one is the best using your evaluator.

You’re not satisfied with the GenericUserBasedRecommender or any other recommender provided by Mahout? No problem, try to implement your own. You’ll just have to start with a class declaration of this kind:

public class MostPopularItemUserBasedCombinedRecommender extends AbstractRecommender implements Recommender {
       //override the necessary methods
}

Here again, you can use as member of the class any customized object containing any statistics that you would judge relevant to build a better recommender. Then, again, plug your new recommender in the evaluator and compare, experiment, improve.

Conclusion

Mahout Taste is a very flexible platform to experiment collaborative filtering algorithms. It certainly won’t provide you an immediate solution to your recommendation problem, but you’ll be easily able to either tune the existing algorithms or plug your own creative ones into the mahout taste interfaces set.

By doing so, you’ll immediately get the benefit of a platform allowing you to compare, tune and improve iteratively the results of your different algorithm configurations. Last but not least, Mahout taste provide an external server which exposes recommendation logic to your application via web services and HTTP.

Other ressources:

  • After reading this quick start guide, I recommend you to have a look at the official mahout taste documentation. As of today it is not updated with the release 0.2 so you might find some old method signatures there but you’ll find useful and complementary information about the big picture of Mahout Taste design.
  • A nice article on mahout in general (not only the taste part). I felt that Taste was not enough detailed there, in particular on the testing part, this is why I wrote this tutorial.

Writing A Token N-Grams Analyzer In Few Lines Of Code Using Lucene

Monday, November 2nd, 2009

lucene_green_300 If you need to parse the tokens n-grams of a string, you may use the facilities offered by lucene analyzers.

What you simply have to do is to build you own analyzer using a ShingleMatrixFilter with the parameters that suits you needs. For instance, here the few lines of code to build a token bi-grams analyzer:

public class NGramAnalyzer extends Analyzer {
	@Override
    public TokenStream tokenStream(String fieldName, Reader reader) {
       return new StopFilter(new LowerCaseFilter(new ShingleMatrixFilter(new StandardTokenizer(reader),2,2,' ')),
           StopAnalyzer.ENGLISH_STOP_WORDS);
     }
}

The parameters of the ShingleMatrixFilter simply states the minimum and maximum shingle size. “Shingle” is just another name for token N-Grams and is popular to be the basic units to help solving problems in spell checking, near-duplicate detection and others.
Note also the use of a StandardTokenizer to deal with basic special characters like hyphens or other “disturbers”.

To use the analyzer, you can for instance do:

	public static void main(String[] args) {
		try {
			String str = "An easy way to write an analyzer for tokens bi-gram (or even tokens n-grams) with lucene";
			Analyzer analyzer = new NGramAnalyzer();
 
			TokenStream stream = analyzer.tokenStream("content", new StringReader(str));
			Token token = new Token();
			while ((token = stream.next(token)) != null){
				System.out.println(token.term());
			}
 
		} catch (IOException ie) {
			System.out.println("IO Error " + ie.getMessage());
		}
	}

The output will print:

an easy
easy way
way to
to write
write an
an analyzer
analyzer for
for tokens
tokens bi
bi gram
gram or
or even
even tokens
tokens n
n grams
grams with
with lucene

Note that the text “bi-gram” was treated like two different tokens, as a desired consequence of using a StandardTokenizer in the ShingleMatrixFilter initialization.

Drawing A Zipf Law Using Gnuplot, Java and Moby-Dick

Monday, October 26th, 2009

whaleThere are many tools out there to build more or less quickly any kind of graphs. Depending on your needs a tool may be more suited than another. When it comes to draw graphs from a set of generated coordinates, I love the simplicity of gnuplot.

Let’s see together a simple example that explain how to draw a zipf law observed on a long english text.
If you’re not familiar with zipf law, simply put it states that the product of the rank (R) of a word and its frequency (F) is roughly constant. This law is also know under the name “principle of the least effort” because people tends to use the same words often and rarely use new or different words.

Step 1 : Install gnuplot

For mac, check this.
For linux, depending on your distrib it should be as simple as an apt-get install (for ubuntu you can check this howto).
For windows you can either go the “hard” way with cygwin + X11 (see Part 1,4 and 5 of those instructions) or the easy way by clicking on pgnuplot.exe located in the gpXXXwin32.zip located here (this last solution may be also easier if you want to have copy/paste between the gnuplot terminal and other windows).

Step 2: Generate the Zipf Law data using Java and Moby Dick!

As I told you above, gnuplot is particularly simple for drawing a set of generated coordinates. All you have to do is to generated a file containing on each line a couple of coordinates.

For the sake of the example, I will use the full raw text of Moby Dick to generate the points. The goal is to generate a list of points of the form x y where x represents the rank of the word (the more frequent the word is, the higher its rank) and y represents its number of occurrences.

Find below the java code I used to do that. If you want to execute it, you will need lucene and the google collections (soon to become part of guava) libraries.

import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
 
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
 
import com.google.common.collect.HashMultiset;
import com.google.common.collect.Multiset;
import com.google.common.collect.Multiset.Entry;
 
public class ZipfLawOnMobyDick {
	public static void main(String[] args) throws IOException {
 
		//Multiset for storing word occurrences
		Multiset multiset = HashMultiset.create();
 
		//Creating a standard analyzer with no stop words (we need them to observe the zipf law)
		String[] STOP_WORDS = {};
		StandardAnalyzer analyzer = new StandardAnalyzer(STOP_WORDS);
 
		//Initializing the multiset by parsing the whole content of Moby Dick
		TokenStream stream = analyzer.tokenStream("content", new FileReader(new File("C:\\moby_dick.txt")));
		Token token = new Token();
		while ((token = stream.next(token)) != null){
			multiset.add(token.term());
		}
 
		//Sorting the multiset by number of occurrences using a comparator on the Entries of the multiset
		List&gt; l = new ArrayList&gt;(multiset.entrySet());
		Comparator&gt; occurence_comparator = new Comparator&gt;() {
			public int compare(Multiset.Entry e1, Multiset.Entry e2) {
				return e2.getCount() - e1.getCount() ;
			}
		};
		Collections.sort(l,occurence_comparator);
 
		int rank = 1;
		for( Multiset.Entry e : l ){
			System.out.println(rank+"\t"+e.getCount());
			rank++;
		}
	}
}

This will generate the following output (the set of coordinates) that you can put in a file called moby_dick.gp. If you’re curious about what are the 100 hottest keywords of the whole text you can check them here.

Step 3: Drawing using gnuplot

What you can do first is simply to type the following command in the gnuplot console (you have to be on the same directory as the moby_dick.gp file):

plot [0:500][0:16000] "moby_dick.gp"

It simply draws the points and rescale the range of x and y respectively to [0:500] and [0:16000] so we can see something.
Play with the ranges to see the differences.
If you want the dots to be connected, just type:

plot [0:500][0:16000] "moby_dick.gp" with lines

If you want to add some legends, you can put some labels and arrows.
Here is an example of a gnuplot script that will set some information on the graph (you can simply copy/paste it in the gnuplot console):

set xlabel "word rank"
set ylabel "# of occurrences"
set label 1 "the word ranked #14 occurs 1753 times" at 70,4000
set arrow 1 from 65,3750 to 15,1800
plot [0:500][0:16000] "moby_dick.gp

As you can see it is pretty straightforward. You can play with the coordinates to adjust where to put the labels and arrow.
You will obtain this graph (click to enlarge):

moby_dick

To export it as a png file just type:

set terminal png
set output "moby_dick.png"
plot [0:500][0:16000] "moby_dick.gp"

You also might want to try a log scale on the vertical axis as to not waste the majority of the graph’s scale (thanks Bob for the remark).
To do so, you can simply type in the gnuplot console:

set logscale y

by plotting within the range [1:3000][5:10000], you’ll obtain:

moby_dick_semilog

Finally, you might want to use a log-log scale that are traditionally used to observe such power laws. Just set the logscale for x as you did for y and you’ll obtain:

moby_dick_loglog

You can of course add as much eye candies as you want (the demo page of the gnuplot website gives tons of example).

Also, there are probably dozens of ways to draw the same thing, I just loved the fun and simplicity of that one.

Flexible Java Profiling And Monitoring Using The Netbeans Profiler

Tuesday, October 20th, 2009

cpuProfile I have tested a lot of those open source profiler. My preference goes definitely to the integrated Netbeans profiler. It was simply the easiest and unified solution adapted to all the different settings I ever met, including profiling java applications that (i) were not developed under netbeans (ii) were only in the form of standalone jar (iii) were running on a remote Linux machine for which no X server were running (i.e. no UI), and other cases.

Here I describe how in 3 simple steps you can profile any java application using the wonderful Netbeans profiler.

Step 1: Download and install the latest Netbeans version on your machine(s)

On the netbeans download page choose the version adapted for your environment (Windows,Linux,Solaris,Mac…) and download/install it. All the bundles contain the profiler so I choose the lightest one: the JavaSE. If you want to profile a program running on a remote machine(s), you’ll have to download/install it on each machine.

Step 2: Modify the command line that runs the java application that you want to profile/monitor

You just have to add an argument to the Java VM.
On windows, the argument to add is of the form:

 -agentpath:"C:\Program Files\NetBeans 6.7.1\profiler3\lib\deployed\jdk16\windows\profilerinterface.dll"="C:\Program Files\NetBeans 6.7.1\profiler3\lib,5140"

Replace the portion “C:\Program Files\NetBeans 6.7.1\profiler3″ by the correct path (located where you installed Netbeans). Keep 5140, it is the port on which the application will listen for a remote profiler session (that you can also perform locally, as in this tutorial).
On Linux, it is exactly the same, just look for the right path containing the profiler3 folder.
So the java command line of the application to profile should look something like:

java -agentpath:"C:\Program Files\NetBeans 6.7.1\profiler3\lib\deployed\jdk16\windows\profilerinterface.dll"="C:\Program Files\NetBeans 6.7.1\profiler3\lib,5140" MyApp param1 param2

When launching this command, you should see on your console a message saying:
Profiler Agent: Waiting for connection on port 5140 (Protocol version: 9)
meaning that the application is listening and waiting for a profiler session on port 5140.

Note the flexibility behind this approach: it allows you to add this simple argument to the exsiting command of (i) any java applications running inside eclipse (in that case just open the “Run configuration” windows, in the “Arguments” tab just add the -agentpath option in the “VM arguments” section) or other IDE than Netbeans, (ii) any remote java applications (iii) any standalone jar file, or whatever existing java command that runs any kind of java application you can imagine…

Step 3: Run the Netbeans profiler GUI

Just open Netbeans, profile -> attach profiler. Choose which kind of profiling/monitoring you need, you can also configure it.

attachProfiler

Press Attach. Note that the first time you attach a profiler it may fail since you have to calibrate the profiler (in that case, a simple textbox will tell you how, it takes seconds).

That’s it!! You can now see in real time which part of your application is the heaviest, estimate what its memory footprint, analyze the threads and much more.

memory

If you want even more, note that it also exists specific profilers for collections (HashMap, HashSet, ArrayList, …) like collection spy (not free).