How To Easily Build And Observe TF-IDF Weight Vectors With Lucene And Mahout

December 30th, 2010

tfidfYou have a collection of text documents, and you want to build their TF-IDF weight vectors, probably before doing some clustering on the collection or other related tasks.

You would like to be able for instance to see what are the tokens with the biggest TF-IDF weights in any given document of the collection.

Lucene and  Mahout can help you to do that almost in a snap.

Step 1 : Build a Lucene Index out of your document collection

If you don’t know how to build a Lucene index, check the links at the end of the post.

The two only important things in that step are to have in your index a field that can serve as a document id and to enable term vectors on the text field representing the content of your documents.

So your indexing code should contains at least two lines similar to:

doc.add(new Field("documentId", documentId, Field.Store.YES, Field.Index.NOT_ANALYZED));
doc.add(new Field("content", content, Field.Store.YES, Field.Index.ANALYZED,TermVector.YES));

Step 2 : Use Mahout lucene.vector driver to generate weighted vectors from your lucene index

That step is well described here. It also explains how to generate the vectors from a directory of text documents. I used lucene because my documents were in a data store and building the lucene index out of it was just much more flexible and convenient.

You then should end up executing a command similar to:

 ./mahout lucene.vector --dir "myLucenIndexDirectory" --output "outputVectorPathAndFilename" --dictOut "outputDictionnaryPathAndFilename" -f content -i documentId -w TFIDF

Mahout will generate for you:

  • a dictionary of all tokens found in the document collection (tokenized with the Tokenizer you used in step 1 and that you might tune depending on your needs)
  • A binary SequenceFile (a class coming from hadoop) that will contains all the TF-IDF weighted vectors.

Step 3: Play with the generated vector file

Now, let’s say that you want for a given document id, to see what are the tokens that received the biggest weights in order to feel what are the most significant tokens of that document (as the weighting scheme sees it).

To do so, you can for instance easily load the content of the generated dictionary file into a Map with token index as keys and the tokens as values. Let’s call that map dictionaryMap.

Then you’ll have to walk through the generated binary file containing the vectors. By playing a little bit  with the sequence file and the Mahout source code, you get pretty quickly what are the important objects you have to manipulate in order to access vectors content in a structured way:

Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(conf);
String vectorsPath = args[1];
Path path = new Path(vectorsPath);
 
SequenceFile.Reader reader = new SequenceFile.Reader(fs, path, conf);
LongWritable key = new LongWritable();
VectorWritable value = new VectorWritable();
while (reader.next(key, value)) {
	NamedVector namedVector = (NamedVector)value.get();
	RandomAccessSparseVector vect = (RandomAccessSparseVector)namedVector.getDelegate();
 
	for( Element  e : vect ){
		System.out.println("Token: "+dictionaryMap.get(e.index())+", TF-IDF weight: "+e.get()) ;
	}
}
reader.close();

The important things to get in that code are the following:

  • namedVector.getName() will contains the documentId
  • e.index() will ontains the index of the token as present in the dictionary output file, so you can get the token itself using
    dictionaryMap.get(e.index())
  • e.get() contains the weight itself

From there you’ll be able easily to plug your code to do whatever you want with the tokens and their weights, like printing the token having the biggest weights in a given document.

It can be insightful to tune your weighting model. E.g. you can quickly observe that typing errors are often getting a super high weight, which makes sense in the TF-IDF weighting scheme (unless the typing error is very frequent in your document collection), and thus you might want to fix that.

It is also useful just to understand a little bit more of how mahout represents the data internally.

Useful links:

A Generic Method For Sorting (Google Collections) Multiset Per Entry Count

February 20th, 2010

I’m regularly using the excellent google collections library (now final and part of the more general guava libraries). One of the data structure I’m using the most is probably the multiset (a.k.a bag). But most of the time, when I need a multiset to track the number of occurrences of particular entries, I almost always also need to know what is the most occurring entry (or the top N occurring entries).

Let’s take a canonical example: as you are parsing a text, you’re inserting each tokens into a multiset to track their number of occurrences and you simply want to know what are the top N most occurring tokens (ok, if you want to do it on terabytes of data, you might want to start learning hadoop :) ).

I need those kind of statistics  so frequently that I was surprised to not find an existing utility method allowing to sort the entries of a multiset per entry count (or number of occurrences). Here is my attempt to do it in a generic short and efficient way:

public <T> List<Entry<T>> sortMultisetPerEntryCount(Multiset<T> multiset){
	Comparator<Multiset.Entry<T>> occurence_comparator = new Comparator<Multiset.Entry<T>>() {
		public int compare(Multiset.Entry<T> e1, Multiset.Entry<T> e2) {
			return e2.getCount() - e1.getCount() ;
		}
	};
	List<Entry<T>> sortedByCount = new ArrayList<Entry<T>>(multiset.entrySet());
	Collections.sort(sortedByCount,occurence_comparator);
 
	return sortedByCount;
}

If you got any other better or most efficient way to do it (or if you know an existing utility method that does it), please share.

If you never used google collections, in addition to the official website, you might find those tutorials (1 and 2)  useful for an introduction.

What Are The 10 Most Cited Websites On Twitter When Tweeting About Hot Trends?

February 6th, 2010

Lately I wrote a post on how to build a relevant real time search engine prototype in few hundreds lines of code.  Using a tailored ranking algorithm based on link popularity in twitter,  I showed that the prototype was able to return very relevant answers in response to very hot queries like the ones that can be found in the hourly updated list of google hot trends.

I wrote a small program on top of this prototype to run an experiment: each hour, the program crawl the new list of hot queries from google hot trends, then it runs the prototype on each of those queries and keep the hottest link found in twitter for the corresponding hot query. I wanted to see which websites were mostly cited in those tweets talking about hot trends.

So I let ran the program for a week, collected the  links (more than a thousand), expanded all those into their long URLs version (using an improved version of my java universal URL expander),  extracted the domain names and compiled the whole into a top 10 list of the most cited websites. Here it is (click to enlarge):

top10twitterBuzzWebsites

The Most Cited Websites When Tweeting About Hot Trends. Click to enlarge.

I was surprised to see some websites that I’ve never heard about before (like wpparty.com or actionnewsblast.com).

To have a better idea for which kind of hot queries/topics those websites are most cited in twitter, find below, for each of those top website, a sample of 5 google hot trends query they covered last week.

Website Sample of 5 covered google hot trends of this past week
www.cnn.com 2011 budget
ipad tablet
cnn.com/haiti360
concorde crash
federer murray
sports.espn.go.com federer tsonga australian open
aaron miles
tom brookshier
jackson jeffcoat
paul pierce
wpparty.com jackson jeffcoat
leon russell
wanamaker mile
buffalo exchange
recalled toyotas
www.huffingtonpost.com governor of virginia
obama republican retreat
obama gop
apple tablet announcement
groundhog prediction
twitpic.com miss america 2010 winner
what celeb do i look like
footprints in the sand
apple itablet
itablet
www.youtube.com i pad
grammy awards 2010
bob kellar
lakers celtics
ipad a disappointment
www.facebook.com general beauregard lee
roberta flack
action express racing
slightly stoopid
rolex 24 hours daytona
www.actionnewsblast.com codswallop meaning
jonathan antin
fred baron
codswallop definition
stevie nicks
www.netnewsticker.com arc energy
ego ferguson
kim burrell
reserveamerica
ivan mccartney
mashable.com national lady gaga day
ipad tablet
ipad thoughts
doppelganger week facebook
tebow super bowl ad

Few remarks:

  • All the links spotted by my prototype and that appear in the table are coming from real tweets around those google hot trends queries.
  • You’ll notice that apple iPad announcement is a theme that was covered by 4 of those top 10 websites!
  • I recommend you to have a look on the youtube video in the table around the google hot trend “ipad a disappointment” :) .
  • I also recommend you to have a look at the haiti 360 view covered by cnn.
  • For twitpic, it is only pics, so what you’ll find there is a sample of “trendy pics” (see below for more on that…)
  • Sometimes the hot query seems to be not connected with the related article at first view (like with fred baron). But when you take a closer look, there is always a connection! This is not for nothing that people tweet about a link with the text of the hot query in the tweet…

To finish, find below a picasa collage that I built using the most cited twitpic pictures in twitter for this past week of hot trends (not only the 5 cited in the table). You’ll identify easily some sarcastic pictures before the iPad announcement or pics around the election of Miss USA. Click the picture to enlarge.

picasaCollageTopPics

Collage of the most cited twitpic links in twitter for a week of google hot trends (Click to enlarge)

If you’re curious to map some pictures with its related hot topic, click the collage to enlarge it and try to guess which pics correspond to which google hot query below :) .

miss america 2010 winner, what celeb do i look like, miss america 2010, roberta flack, lady gaga and elton john, addicted to love, jim florentine, apple itablet, lost season 6 premiere, candy crowley, to make you feel my love, swagger crew, footprints in the sand, gasparilla, miss virginia, duke georgetown, celebrity look alike, katherine putnam, itablet, andrea bocelli, monster diesel, peta ad.

Hadoop Tutorial Series, Issue #4: To Use Or Not To Use A Combiner

January 14th, 2010

combinerWelcome to the fourth issue of the Hadoop Tutorial Series. Combiners are another important Hadoop’s feature that every hadoop developer should be aware of. The primary goal of combiners is to optimize/minimize the number of key value pairs that will be shuffled accross the network between mappers and reducers and thus to save as most bandwidth as possible.

Indeed, to give you the intuition of why combiner helps reducing the number of data sent to the reducers, imagine the word count example on a text containing one million times the word “the”. Without combiner the mapper will send one million key/value pairs of the form <the,1>. With combiners, it will potentially send much less key/value pairs of the form <the,N> with N a number potentially much bigger than 1. That’s just the intuition (see the references at the end of the post for more details).

Simply speaking a combiner can be considered as a “mini reducer” that will be applied potentially several times still during the map phase before to send the new (hopefully reduced) set of key/value pairs to the reducer(s). This is why a combiner must implement the Reducer interface (or extend the Reducer class as of hadoop 0.20).

In general you can even use the same reducer method as both your reducer and your combiner. This is the case for the word count example where using a combiner remains to add a single line of code in your main method:

conf.setCombinerClass(Reduce.class);

where conf is your JobConf, or, if you use hadoop 0.20.1:

job.setCombinerClass(Reduce.class);

where job is your Job built with a customized Configuration.

That sounds pretty simple and useful and at first look you would be ready to use combiners all the time by adding this simple line, but there is a small catch. The first kind of reducers that comes naturally as a counter example of using combiner is the “mean reducer” that computes the mean of all the values associated with an given key.

Indeed, suppose 5 key/value pairs emitted from the mapper for a given key k: <k,40>, <k,30>, <k,20>, <k,2>, <k,8>. Without combiner, when the reducer will receive the list <k,{40,30,20,2,8}>, the mean output will be 20, but if a combiner were applied before on the two sets (<k,40>, <k,30>, <k,20>) and (<k,2>, <k,8>) separately, then the reducer would have received the list <k,{30,5}> and the output would have been different (17.5) which is an unexpected behavior.

More generally, combiners can be used when the function you want to apply is both commutative and associative (that’s pretty intuitive to understand why). That’s the case for the addition function, this is why the word count example can benefit from combiners but not for the mean function (which is not associative as shown in the counter example above).

Note that for the mean function you can use a workaround for using combiners by using two separate reduce methods, a first one that would be used as the addition function (and thus that can be set as the combiner) that would emit the intermediate sum as the key and the number of addition involved as the value, and a second reduce function that would compute the mean by taking into account the number of addition involved (see the references for more details on that).

As usual in this series, let’s observe the lesson learned in action using our learning playground. For that you can use the original word count example (or its hadoop 0.20.1 version that we used in the previous issue), add it the single combine line as specified earlier in the post and run it on our moby-dick mascot. Here what we can see at the end of the execution:

combiner

Output of the word count example when using a combiner. Click to enlarge.

Now that you understand what counters are, if you click to enlarge the picture, you’ll see the value of two counters: Combine input records=215137 and Combine output records=33783. That’s a pretty serious reduction of the number of key/value pairs to send to the reducers. You can easily imagine the impact for much larger jobs (see the reference below for real numbers).

Enjoy combiners, whenever you can…

References

  • See the 4th tip of this must read blog post by Todd Lipcon for feeling better the benefit of combiners on a 40GB wordcount job benchmark.
  • For a deeper understanding of when and how combiners are used in the mapReduce data flow, check this section of the (quiet heavy but) excellent Yahoo! hadoop tutorial.
  • To extend the intuition given in the post on why combiners help, you can go over this walk-through.
  • Both Hadoop the definitive guide and Hadoop in Action contains interesting information on combiners (part of both of them inspired this post). In particular the first contains a great section on when exactly the combiners comes into play in the mapReduce data flow. The second contains a full code of the mean function workaround mentioned above.

Hadoop Tutorial Series, Issue #3: Counters In Action

January 7th, 2010

Note: This post has been updated with a code working for hadoop 0.20.1.

In this 3rd issue of the hadoop tutorial series, we’ll speak about a very simple but very useful hadoop’s feature: counters.

Even if you have never defined any counters in hadoop, you can see some of them each time you are running an hadoop job. Indeed, here is what you can see from the client console at the end of the execution of a job (can also be seen from the web interface):

counters

Hadoop internal counters at the end of a job (Click to enlarge).

As you can see, 18 internal counters are presented inside different groups. For instance, you can see a section “Job Counters” with three different counters giving basic information about the job like the number of mappers and reducers. In that example, “Job Counters” is called the group of the counter and “Launched reduce tasks” (for instance) is properly the name of the counter.

It is very handy to define your own counters to track any kind of statistics about the records you are manipulating in the mapper and the reducer. The most natural use of that is to use counters to track the number of malformed records.

If you are executing a job  and you see an abnormally high number of malformed records, it can give a good hint that you perhaps have a bug in your code or some problem with your data (note this is actually a much simpler way to spot issues than tracking error messages in a distributed set of log files). But you can actually use counters for any kind of other statistics on your records.

One easy way to define your own counters from your Java code is:

  • Declaring an enum representing your counters. The enum name is the group of the counter, and each field of the enum is the name of the counter that will be reported in this same group
  • Incrementing the desired counters from your map and reduce methods through the Context of your mapper or reducer (in previous hadoop version it was through the Reporter.incrCounter() method, but the reporter no longer exists in hadoop 0.20)

So let’s see an example. We’ll take the word count example revised for version 0.20.1 to illustrate the use of counters. We will create a counter group called WordsNature that will count how many unique tokens there is in all, how many unique tokens starts with a digit and how many unique tokens starts with a letter.

So our enum declaration will look like that:

 static enum WordsNature { STARTS_WITH_DIGIT, STARTS_WITH_LETTER, ALL }

We will also need a very basic StringUtils class:

package com.philippeadjiman.hadooptraining;
 
public class StringUtils {
 
	public static boolean startsWithDigit(String s){
		if( s == null || s.length() == 0 )
			return false;
 
		return Character.isDigit(s.charAt(0));
	}
 
	public static boolean startsWithLetter(String s){
		if( s == null || s.length() == 0 )
			return false;
 
		return Character.isLetter(s.charAt(0));
	}
 
}

Since we are interested in unique tokens, we will put the code related with the counter into the reduce method. So here how the reduce method will look like:

public void reduce(Text key, Iterable values, Context context)
	throws IOException, InterruptedException {
 
	int sum = 0;
	String token = key.toString();
	if( StringUtils.startsWithDigit(token) ){
		context.getCounter(WordsNature.STARTS_WITH_DIGIT).increment(1);
	}
	else if( StringUtils.startsWithLetter(token) ){
		context.getCounter(WordsNature.STARTS_WITH_LETTER).increment(1);
	}
	context.getCounter(WordsNature.ALL).increment(1);
	for (IntWritable value : values) {
		sum += value.get();
	}
	context.write(key, new IntWritable(sum));
}

Here is the code of the WordCountWithCounter that include this code.

If you want to run it inside our learning playground you’ll just have to update the pom with hadoop latest version:

<dependency>
<groupId>org.apache.mahout.hadoop</groupId>
<artifactId>hadoop-core</artifactId>
<version>0.20.1</version>
</dependency>

So here is the result after running the code with, as input, the whole text of moby dick:

jobResultsWithCounters

We can now see our home made counters. (Click to enlarge)

So we can see now that we have 33783 unique tokens, 32511 starting with a letter and 263 starting with a digit. What about the 1009 others?? Well, because the word count example use a basic StringTokenizer that splits tokens at spaces, a lot of words simply starts with a ‘(’ or with a ‘[’ and even with ‘–’. To solve that you can for instance use a lucene StandardAnalyzer.

You should now be able to easily implements your own counters for tracking bad records/missing values, debugging or gathering any kind of statistics around your job.

See you soon for another issue…