<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Java. Internet. Algorithms. Ideas. &#187; twitter</title>
	<atom:link href="http://philippeadjiman.com/blog/tag/twitter/feed/" rel="self" type="application/rss+xml" />
	<link>http://philippeadjiman.com/blog</link>
	<description>Just Another Blog About Geek Stuff, by Philippe Adjiman</description>
	<lastBuildDate>Tue, 25 May 2010 06:58:19 +0000</lastBuildDate>
	<generator>http://wordpress.org/?v=2.8.4</generator>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
			<item>
		<title>What Are The 10 Most Cited Websites On Twitter When Tweeting About Hot Trends?</title>
		<link>http://philippeadjiman.com/blog/2010/02/06/what-are-the-10-most-cited-websites-on-twitter-when-tweeting-about-hot-trends/</link>
		<comments>http://philippeadjiman.com/blog/2010/02/06/what-are-the-10-most-cited-websites-on-twitter-when-tweeting-about-hot-trends/#comments</comments>
		<pubDate>Sat, 06 Feb 2010 18:15:09 +0000</pubDate>
		<dc:creator>padjiman</dc:creator>
				<category><![CDATA[experiments]]></category>
		<category><![CDATA[google trends]]></category>
		<category><![CDATA[twitter]]></category>

		<guid isPermaLink="false">http://philippeadjiman.com/blog/?p=907</guid>
		<description><![CDATA[Lately I wrote a post on how to build a relevant real time search engine prototype in few hundreds lines of code.  Using a tailored ranking algorithm based on link popularity in twitter,  I showed that the prototype was able to return very relevant answers in response to very hot queries like the ones that can [...]]]></description>
			<content:encoded><![CDATA[<p>I recently wrote a post on <a href="http://philippeadjiman.com/blog/2010/01/06/how-to-build-a-relevant-real-time-search-engine-prototype-in-few-hundred-lines-of-code/" target="_blank">how to build a relevant real time search engine prototype in a few hundred lines of code</a>. Using a tailored ranking algorithm based on link popularity on twitter, I showed that the prototype could return very relevant answers to very hot queries, such as those found in the hourly updated list of <a href="http://www.google.com/trends/hottrends?sa=X" target="_blank">google hot trends</a>.</p>
<p>I wrote a small program on top of this prototype to run an experiment: each hour, the program crawls the new list of hot queries from google hot trends, then runs the prototype on each of those queries and keeps the hottest link found on twitter for that query. I wanted to see which websites were cited most often in tweets about hot trends.</p>
<p>So I let the program run for a week, collected the links (more than a thousand), expanded them all into their long URL versions (using an improved version of my <a href="http://philippeadjiman.com/blog/2009/09/07/the-trick-to-write-a-fast-universal-java-url-expander/" target="_blank">java universal URL expander</a>), extracted the domain names and compiled the whole into a top 10 list of the most cited websites. Here it is (click to enlarge):</p>
<div id="attachment_966" class="wp-caption aligncenter" style="width: 244px"><a href="http://philippeadjiman.com/blog/wp-content/uploads/2010/02/top10twitterBuzzWebsites.jpg" target="_blank"><img class="size-medium wp-image-966 " title="top10twitterBuzzWebsites" src="http://philippeadjiman.com/blog/wp-content/uploads/2010/02/top10twitterBuzzWebsites-234x300.jpg" alt="top10twitterBuzzWebsites" width="234" height="300" /></a><p class="wp-caption-text">The Most Cited Websites When Tweeting About Hot Trends. Click to enlarge.</p></div>
<p>I was surprised to see some websites that I had never heard of before (like wpparty.com or actionnewsblast.com).</p>
<p>To give a better idea of the kinds of hot queries/topics for which those websites are most cited on twitter, the table below lists, for each of the top websites, a sample of 5 <a href="http://www.google.com/trends/hottrends?sa=X" target="_blank">google hot trends</a> queries they covered last week.</p>
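<p>For the curious, the last two steps of that pipeline (extracting the domain names and compiling the top 10) can be sketched roughly as below. This is only a minimal illustration and not the code I actually ran: the class <em>TopDomains</em> and its method names are hypothetical, it assumes Java 8 or later, and it assumes the URLs were already expanded.</p>

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper for illustration only; not part of the original program. */
final class TopDomains {

    /** Returns the lower-cased host of a URL, or null if the URL cannot be parsed. */
    static String domainOf(String url) {
        try {
            String host = URI.create(url.trim()).getHost();
            return host == null ? null : host.toLowerCase();
        } catch (IllegalArgumentException e) {
            return null; // malformed URL: the caller skips it
        }
    }

    /** Counts links per domain and returns the n most cited domains, most cited first. */
    static List<Map.Entry<String, Integer>> topDomains(List<String> expandedUrls, int n) {
        Map<String, Integer> counts = new HashMap<>();
        for (String url : expandedUrls) {
            String domain = domainOf(url);
            if (domain != null) {
                counts.merge(domain, 1, Integer::sum);
            }
        }
        List<Map.Entry<String, Integer>> sorted = new ArrayList<>(counts.entrySet());
        // Highest count first; ties are broken alphabetically so the output is deterministic.
        sorted.sort((a, b) -> {
            int byCount = b.getValue().compareTo(a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        return new ArrayList<>(sorted.subList(0, Math.min(n, sorted.size())));
    }
}
```

<p>Calling <em>TopDomains.topDomains(links, 10)</em> on the expanded links would give a list comparable to the one in the picture above. Note that the sketch keeps the full host name (www.cnn.com, sports.espn.go.com), as the table below does.</p>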
<div>
<table class="pretty" border="0" align="center">
<tbody>
<tr>
<th>Website</th>
<th>Sample of 5 google hot trends covered this past week</th>
</tr>
<tr>
<td style="text-align: center; "><a href="http://edition.cnn.com/" target="_blank">www.cnn.com</a></td>
<td style="text-align: center; "><a href="http://www.cnn.com/2010/POLITICS/02/01/obama.budget.explainer/index.html?eref=rss_topstories&amp;utm_source=twitterfeed&amp;utm_medium=twitter&amp;utm_campaign=Feed%3A+rss%2Fcnn_topstories+%28RSS%3A+Top+Stories%29" target="_blank">2011 budget</a><br />
<a href="http://www.cnn.com/video/?utm_source=twitterfeed&amp;utm_medium=twitter&amp;utm_campaign=Feed%3A+rss%2Fcnn_freevideo+%28RSS%3A+Video%29#/video/tech/2010/01/27/barnett.ipad.specs.strategy.cnn" target="_blank">ipad tablet</a><br />
<a href="http://www.cnn.com/interactive/2010/01/world/haiti.360/index.html?video=haiti.flv" target="_blank">cnn.com/haiti360</a><br />
<a href="http://www.cnn.com/2010/WORLD/europe/02/02/france.concorde.trial/index.html?eref=rss_topstories&amp;utm_source=feedburner&amp;utm_medium=feed&amp;utm_campaign=Feed%3A+rss%2Fcnn_topstories+%28RSS%3A+Top+Stories%29" target="_blank">concorde crash</a><br />
<a href="http://www.cnn.com/2010/SPORT/01/31/tennis.australia.open.final.federer.murray/index.html" target="_blank">federer murray</a></td>
</tr>
<tr>
<td style="text-align: center;"><a href="http://sports.espn.go.com/" target="_blank">sports.espn.go.com</a></td>
<td style="text-align: center; "><a href="http://sports.espn.go.com/sports/tennis/aus10/news/story?id=4867619&amp;campaign=rss&amp;source=ESPNHeadlines&amp;utm_source=twitterfeed&amp;utm_medium=twitter" target="_blank">federer tsonga australian open</a><br />
<a href="http://sports.espn.go.com/mlb/news/story?id=4877065&amp;campaign=rss&amp;source=MLBHeadlines" target="_blank">aaron miles</a><br />
<a href="http://sports.espn.go.com/nfl/news/story?id=4872278&amp;campaign=rss&amp;source=twitter&amp;ex_cid=Twitter_espn_4872278" target="_blank">tom brookshier</a><br />
<a href="http://sports.espn.go.com/dallas/news/story?id=4869265&amp;campaign=rss&amp;source=ESPNHeadlines" target="_blank">jackson jeffcoat</a><br />
<a href="http://sports.espn.go.com/boston/nba/news/story?id=4881306&amp;campaign=rss&amp;source=ESPNHeadlines&amp;utm_source=twitterfeed&amp;utm_medium=twitter" target="_blank">paul pierce</a></td>
</tr>
<tr>
<td style="text-align: center;"><a href="http://wpparty.com/" target="_blank">wpparty.com</a></td>
<td style="text-align: center; "><a href="http://wpparty.com/2010/01/29/espn-football-henderson-jeffcoat-and-more-battles-usa-news/" target="_blank">jackson jeffcoat</a><br />
<a href="http://wpparty.com/2010/02/01/foghat-and-leon-russell-coming-to-spotlight-29-casino/" target="_blank">leon russell</a><br />
<a href="http://wpparty.com/2010/01/30/lagat-wins-6th-wanamaker-mile/" target="_blank">wanamaker mile</a><br />
<a href="http://wpparty.com/2010/01/26/hey-its-that-studded-blazer-again/" target="_blank">buffalo exchange</a><br />
<a href="http://wpparty.com/2010/02/03/lahood-tells-owners-of-recalled-toyotas-to-stop-driving-vehicles/" target="_blank">recalled toyotas</a></td>
</tr>
<tr>
<td style="text-align: center;"><a href="http://www.huffingtonpost.com" target="_blank">www.huffingtonpost.com</a></td>
<td style="text-align: center; "><a href="http://www.huffingtonpost.com/2010/01/27/bob-mcdonnell-speech-full_n_439508.html" target="_blank">governor of virginia</a><br />
<a href="http://www.huffingtonpost.com/thenewswire/archive/../../2010/01/29/transcript-of-president-o_n_442423.html" target="_blank">obama republican retreat</a><br />
<a href="http://www.huffingtonpost.com/thenewswire/archive/../../2010/01/29/obama-goes-to-the-gop-lio_n_442331.html" target="_blank">obama gop</a><br />
<a href="http://www.huffingtonpost.com/2010/01/26/apple-tablet-announcement_n_436859.html" target="_blank">apple tablet announcement</a><br />
<a href="http://www.huffingtonpost.com/2010/02/02/groundhog-day-prediction-_n_445601.html" target="_blank">groundhog prediction</a></td>
</tr>
<tr>
<td style="text-align: center;"><a href="http://twitpic.com/" target="_blank">twitpic.com</a></td>
<td style="text-align: center; "><a href="http://twitpic.com/10mo6p" target="_blank">miss america 2010 winner</a><br />
<a href="http://twitpic.com/10wusc" target="_blank">what celeb do i look like</a><br />
<a href="http://twitpic.com/10szff" target="_blank">footprints in the sand</a><br />
<a href="http://twitpic.com/zzq98" target="_blank">apple itablet</a><br />
<a href="http://twitpic.com/zz5by" target="_blank">itablet</a></td>
</tr>
<tr>
<td style="text-align: center;"><a href="http://www.youtube.com/" target="_blank">www.youtube.com</a></td>
<td style="text-align: center;"><a href="http://www.youtube.com/watch?v=YFNQE_TzQNI&amp;feature=youtu.be" target="_blank">i pad</a><br />
<a href="http://www.youtube.com/watch?v=XDCeXrZgbjs&amp;feature=youtu.be" target="_blank">grammy awards 2010</a><br />
<a href="http://www.youtube.com/watch?v=mfZ60a1QbCY" target="_blank">bob kellar</a><br />
<a href="http://www.youtube.com/watch?v=KQmtKOOBO2I" target="_blank">lakers celtics</a><br />
<a href="http://www.youtube.com/watch?v=lQnT0zp8Ya4" target="_blank">ipad a disappointment</a></td>
</tr>
<tr>
<td style="text-align: center;"><a href="http://www.facebook.com/" target="_blank">www.facebook.com</a></td>
<td style="text-align: center; "><a href="http://www.facebook.com/AtlantaHistoryCenter/posts/306004651553" target="_blank">general beauregard lee</a><br />
<a href="http://www.facebook.com/photo.php?pid=3865141&amp;l=4a527701cc&amp;id=93944052260" target="_blank">roberta flack</a><br />
<a href="http://www.facebook.com/GRANDAMRoadRacing/posts/312377461577" target="_blank">action express racing</a><br />
<a href="http://www.facebook.com/JimmyKimmelLive/posts/272447853002" target="_blank">slightly stoopid</a><br />
<a href="http://www.facebook.com/permalink.php?story_fbid=273782412571&amp;id=218464326195" target="_blank">rolex 24 hours daytona</a></td>
</tr>
<tr>
<td style="text-align: center;"><a href="http://www.actionnewsblast.com/" target="_blank">www.actionnewsblast.com</a></td>
<td style="text-align: center; "><a href="http://www.actionnewsblast.com/codswallop-codswallop-meaning" target="_blank">codswallop meaning</a><br />
<a href="http://www.actionnewsblast.com/blow-out-star-antin-joins-bravos-shear-genius-as-judge" target="_blank">jonathan antin</a><br />
<a href="http://www.actionnewsblast.com/ex-edwards-aide-money-was-no-object" target="_blank">fred baron</a><br />
<a href="http://www.actionnewsblast.com/codswallop-codswallop-meaning" target="_blank">codswallop definition</a><br />
<a href="http://www.actionnewsblast.com/taylor-swift-grammy-fascinating-fact" target="_blank">stevie nicks</a></td>
</tr>
<tr>
<td style="text-align: center;"><a href="http://www.netnewsticker.com/" target="_blank">www.netnewsticker.com</a></td>
<td style="text-align: center; "><a href="http://www.netnewsticker.com/how-do-i-get-more-energy" target="_blank">arc energy</a><br />
<a href="http://www.netnewsticker.com/jr-ego-ferguson-goes-for-lsu-the-composed-gentleman" target="_blank">ego ferguson</a><br />
<a href="http://www.netnewsticker.com/schoolcraft-womens-team-keeps-rolling" target="_blank">kim burrell</a><br />
<a href="http://www.netnewsticker.com/how-do-i-know-if-i-reserved-a-camping-spot-correctly-online" target="_blank">reserveamerica</a><br />
<a href="http://www.netnewsticker.com/andy-staples-running-analysis-for-2010-signing-day" target="_blank">ivan mccartney</a></td>
</tr>
<tr>
<td style="text-align: center;"><a href="http://mashable.com/" target="_blank">mashable.com</a></td>
<td style="text-align: center; "><a href="http://mashable.com/2010/01/29/national-lady-gaga-day/" target="_blank">national lady gaga day</a><br />
<a href="http://mashable.com/apple-tablet/" target="_blank">ipad tablet</a><br />
<a href="http://mashable.com/2010/01/27/ipad-whats-missing/#comment-31633181" target="_blank">ipad thoughts</a><br />
<a href="http://mashable.com/2010/01/29/doppelganger-week-facebook/" target="_blank">doppelganger week facebook</a><br />
<a href="http://mashable.com/2010/01/26/tim-tebow-super-bowl-ad/" target="_blank">tebow super bowl ad</a></td>
</tr>
</tbody>
</table>
</div>
<p>A few remarks:</p>
<ul>
<li>All the links spotted by <a href="http://philippeadjiman.com/blog/2010/01/06/how-to-build-a-relevant-real-time-search-engine-prototype-in-few-hundred-lines-of-code/" target="_blank">my prototype</a> that appear in the table come from real tweets about those google hot trends queries.</li>
<li>You&#8217;ll notice that the apple iPad announcement is a theme that was covered by 4 of those top 10 websites!</li>
<li>I recommend having a look at the youtube video in the table for the google hot trend &#8220;ipad a disappointment&#8221; <img src='http://philippeadjiman.com/blog/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' /> .</li>
<li>I also recommend having a look at the haiti 360 view covered by cnn.</li>
<li>For <a href="http://twitpic.com/" target="_blank">twitpic</a>, it is only pics, so what you&#8217;ll find there is a sample of &#8220;trendy pics&#8221; (see below for more on that&#8230;)</li>
<li>Sometimes the hot query does not seem connected to the linked article at first sight (like with <a style="color: #114477; text-decoration: underline;" onclick="javascript:pageTracker._trackPageview('/outbound/article/www.actionnewsblast.com');" href="http://www.actionnewsblast.com/ex-edwards-aide-money-was-no-object" target="_blank">fred baron</a>). But when you take a closer look, there is always a connection! It is not for nothing that people tweet a link with the text of the hot query in the tweet&#8230;</li>
</ul>
<p>To finish, below is a picasa collage that I built from the most cited twitpic pictures on twitter for this past week of hot trends (not only the 5 cited in the table). You&#8217;ll easily identify some sarcastic pictures from before the iPad announcement, or pics around the election of Miss America. Click the picture to enlarge.</p>
<div id="attachment_984" class="wp-caption aligncenter" style="width: 310px"><a href="http://philippeadjiman.com/blog/wp-content/uploads/2010/02/picasaCollageTopPics.jpg" target="_blank"><img class="size-medium wp-image-984" title="picasaCollageTopPics" src="http://philippeadjiman.com/blog/wp-content/uploads/2010/02/picasaCollageTopPics-300x225.jpg" alt="picasaCollageTopPics" width="300" height="225" /></a><p class="wp-caption-text">Collage of the most cited twitpic links in twitter for a week of google hot trends (Click to enlarge) </p></div>
<p>If you&#8217;re curious to match the pictures with their related hot topics, click the collage to enlarge it and try to guess which pics correspond to which google hot query below <img src='http://philippeadjiman.com/blog/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' /> .</p>
<p><a href="http://twitpic.com/10mo6p" target="_blank">miss america 2010 winner</a>, <a href="http://twitpic.com/10wusc" target="_blank">what celeb do i look like</a>, <a href="http://twitpic.com/10llra" target="_blank">miss america 2010</a>, <a href="http://twitpic.com/10drx8" target="_blank">roberta flack</a>, <a href="http://twitpic.com/10seq4" target="_blank">lady gaga and elton john</a>, <a href="http://twitpic.com/108d4s" target="_blank">addicted to love</a>, <a href="http://twitpic.com/zuomi" target="_blank">jim florentine</a>, <a href="http://twitpic.com/zzq98" target="_blank">apple itablet</a>, <a href="http://twitpic.com/111q7d" target="_blank">lost season 6 premiere</a>, <a href="http://twitpic.com/10pd73" target="_blank">candy crowley</a>, <a href="http://twitpic.com/zrbar" target="_blank">to make you feel my love</a>, <a href="http://twitpic.com/101scz" target="_blank">swagger crew</a>, <a href="http://twitpic.com/10szff" target="_blank">footprints in the sand</a>, <a href="http://twitpic.com/10j2ao" target="_blank">gasparilla</a>, <a href="http://twitpic.com/10m77i" target="_blank">miss virginia</a>, <a href="http://twitpic.com/10ja2k" target="_blank">duke georgetown</a>, <a href="http://twitpic.com/107dml" target="_blank">celebrity look alike</a>, <a href="http://twitpic.com/10ly58" target="_blank">katherine putnam</a>, <a href="http://twitpic.com/zz5by" target="_blank">itablet</a>, <a href="http://twitpic.com/10swcm" target="_blank">andrea bocelli</a>, <a href="http://twitpic.com/wygnz" target="_blank">monster diesel</a>, <a href="http://twitpic.com/z2wv5" target="_blank">peta ad</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://philippeadjiman.com/blog/2010/02/06/what-are-the-10-most-cited-websites-on-twitter-when-tweeting-about-hot-trends/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>How To Build A Relevant Real Time Search Engine Prototype In Few Hundreds Lines Of Code</title>
		<link>http://philippeadjiman.com/blog/2010/01/06/how-to-build-a-relevant-real-time-search-engine-prototype-in-few-hundred-lines-of-code/</link>
		<comments>http://philippeadjiman.com/blog/2010/01/06/how-to-build-a-relevant-real-time-search-engine-prototype-in-few-hundred-lines-of-code/#comments</comments>
		<pubDate>Wed, 06 Jan 2010 13:30:26 +0000</pubDate>
		<dc:creator>padjiman</dc:creator>
				<category><![CDATA[experiments]]></category>
		<category><![CDATA[google trends]]></category>
		<category><![CDATA[real time web]]></category>
		<category><![CDATA[twitter]]></category>
		<category><![CDATA[twitter API]]></category>

		<guid isPermaLink="false">http://philippeadjiman.com/blog/?p=746</guid>
		<description><![CDATA[By the end of the post you&#8217;ll find the code along with a small command line JAVA program to play with, but let me first describe the specifications of the real time search engine prototype that I&#8217;m targeting here.
Basically it should take as input a  search query and return as output a ranked set of [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://philippeadjiman.com/blog/wp-content/uploads/2010/01/gootter.jpg"><img class="alignleft size-medium wp-image-750" title="gootter" src="http://philippeadjiman.com/blog/wp-content/uploads/2010/01/gootter-300x78.jpg" alt="gootter" hspace="15" width="300" height="78" /></a>At the end of this post you&#8217;ll find the code along with a small command line Java program to play with, but let me first describe the specification of the real time search engine prototype that I&#8217;m targeting here.</p>
<p>Basically, it should take a search query as input and return a ranked set of URLs corresponding to the latest hot news around that query.</p>
<p>In some ways it is similar to what you would expect to find on google news or in one of the dozens of real time search engines that were released last year (to name a few: <a href="http://www.oneriot.com/" target="_blank">oneriot</a>, <a href="http://www.crowdeye.com/" target="_blank">crowdeye</a> and <a href="http://collecta.com/" target="_blank">collecta</a>).</p>
<p>The goal of my prototype is to demonstrate how to leverage <a href="http://twitter.com/" target="_blank">twitter</a> and a simple ranking algorithm to obtain, most of the time, relevant URLs in response to hot queries, without having to crawl a single web page! As my primary target is relevancy, I won&#8217;t invest any effort in the performance or scalability of the prototype (results are built at query time).</p>
<h2><strong>High level description of the prototype</strong></h2>
<p>Basically, I used the <a href="http://apiwiki.twitter.com/" target="_blank">twitter API</a> through a java library called <a href="http://yusuke.homeip.net/twitter4j/en/index.html" target="_blank">twitter4j</a> to retrieve all the latest tweets that contain the input query <span style="text-decoration: underline;">and</span> a link. For very hot queries, you&#8217;re likely to get a lot of those (I set a limit of the last 150, but you&#8217;ll be able to change it). Once I have my &#8220;link farm&#8221;, I apply a basic ranking algorithm that ranks the most referenced URLs first.</p>
<p>As most of the URLs in tweets are <a href="http://mashable.com/2008/01/08/url-shortening-services/" target="_blank">shortened URLs</a>, the trick is to spot identical URLs that were shortened by different shortening services. For instance, both of the following shortened URLs point to the same page of my blog: <a href="http://tinyurl.com/yajkgeg" target="_blank">http://tinyurl.com/yajkgeg</a> and <a href="http://bit.ly/SmHw6" target="_blank">http://bit.ly/SmHw6</a>. It may sound like a corner case, but it actually happens all the time on hot queries. So the idea is to convert all the short URLs into their expanded versions. To see how to write a universal URL expander in Java that works for the <a href="http://mashable.com/2008/01/08/url-shortening-services/" target="_blank">90+ existing URL shortening services</a>, check the post that is referenced by the two short URLs above.</p>
<p>Note that you can improve the ranking algorithm in tons of ways, by exploiting the text of the tweets, who actually wrote the tweet (reputation), other sources like <a href="http://digg.com/" target="_blank">digg</a>, and much more, but as we&#8217;ll see, even in its simplest form, the ranking algorithm presented above works pretty well.</p>
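<p>To make the idea concrete, here is a minimal sketch of that ranking step: expand every tweeted link, count how many times each expanded URL is referenced, and return the most referenced ones first. It is an illustration written for this description rather than an excerpt of the prototype; the class <em>LinkRanker</em>, its <em>rank</em> method and the pluggable <em>expander</em> function are assumed names, and it assumes Java 8 or later.</p>

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical sketch of count-based link ranking; names are not taken from the prototype. */
final class LinkRanker {

    /**
     * Ranks URLs by how often they were tweeted.
     *
     * @param tweetedUrls links extracted from the retrieved tweets (often shortened)
     * @param expander    maps a short URL to its long form; an identity function skips expansion
     * @param maxResults  maximum number of URLs to return
     */
    static List<String> rank(List<String> tweetedUrls,
                             Function<String, String> expander,
                             int maxResults) {
        // LinkedHashMap remembers first-seen order, which serves as the tie-breaker below.
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String url : tweetedUrls) {
            String expanded = expander.apply(url);
            // Fall back to the original URL if the expander could not resolve it.
            counts.merge(expanded != null ? expanded : url, 1, Integer::sum);
        }
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        // List.sort is stable, so URLs with equal counts keep their first-seen order.
        entries.sort((a, b) -> b.getValue().compareTo(a.getValue()));

        List<String> ranked = new ArrayList<>();
        for (Map.Entry<String, Integer> entry : entries) {
            if (ranked.size() >= maxResults) {
                break;
            }
            ranked.add(entry.getKey());
        }
        return ranked;
    }
}
```

<p>With a real expander, the two short URLs above would be counted as two references to the same page; passing <em>Function.identity()</em> counts the raw short URLs instead, which is faster but only an approximation.</p>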
<h2><strong>Playing with some hot queries</strong></h2>
<p>To find some hot queries to play with, you can for instance take one of the <a href="http://www.google.com/trends/hottrends" target="_blank">google hot trends queries</a> (unfortunately down from <a href="http://philippeadjiman.com/blog/2009/09/27/google-hot-trends-clustering-the-100-hottest-queries-tell-you-about-67-76-stories-in-average/" target="_blank">100</a> to 40 to 20). Let&#8217;s try a very hot topic at the time of writing: the google Nexus One phone, which was about to be presented to the press two days after I started writing this post.</p>
<p>Below I have compiled the results obtained respectively by Google News, OneRiot and my toy prototype on the query &#8220;nexus one&#8221;. Click the picture to enlarge.</p>
<div id="attachment_770" class="wp-caption aligncenter" style="width: 310px"><a href="http://philippeadjiman.com/blog/wp-content/uploads/2010/01/OneRiotGoogleNewsProto_NexusOne.jpg.jpg" target="_blank"><img class="aligncenter size-medium wp-image-782" title="OneRiotGoogleNewsProto_NexusOne.jpg" src="http://philippeadjiman.com/blog/wp-content/uploads/2010/01/OneRiotGoogleNewsProto_NexusOne.jpg-300x182.jpg" alt="OneRiotGoogleNewsProto_NexusOne.jpg" width="300" height="182" /></a><p class="wp-caption-text">Comparing the results on Nexus One. Click to enlarge.</p></div>
<p>I hope you enjoyed my killer UI <img src='http://philippeadjiman.com/blog/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' /> . But let&#8217;s focus on the three URLs corresponding to the first result of each one:</p>
<ul>
<li>Google news: <a href="http://business.timesonline.co.uk/tol/business/industry_sectors/technology/article6974688.ece" target="_blank">An article from timesonline</a></li>
<li>OneRiot: <a href="http://news.sky.com/skynews/Home/Technology/Google-Set-To-Reveal-Its-First-Smartphone---The-Nexus-One---With-Hope-To-Challenge-IPhone/Article/201001115513838?f=rss" target="_blank">An article from sky news</a></li>
<li>The prototype: <a href="http://www.engadget.com/2010/01/02/exclusive-google-nexus-one-hands-on-video-and-first-impressio/" target="_blank">An article from engadget</a></li>
</ul>
<p>Given that the Nexus One was not yet released at the time I issued the query, I would say that the article found by the prototype is the best one, since it is the only one that presents an exclusive video demonstrating the not yet released phone. This is also why so many people were tweeting about this link: because it was the best at that precise time! We&#8217;ll see even more in the next section.</p>
<p>Before that, let&#8217;s try another hot query of today (in the top 20 hottest queries of google hot trends): &#8220;byron de la beckwith&#8221;.</p>
<p>This time, it is not clear what story/news is hidden behind the hot query, but running it through the prototype gives the article below as the first link (click on the picture if you want to see the full article).</p>
<div id="attachment_789" class="wp-caption aligncenter" style="width: 310px"><a href="http://www.cbsnews.com/blogs/2010/01/04/crimesider/entry6053374.shtml" target="_blank"><img class="size-medium wp-image-789" title="byron de la beckwith" src="http://philippeadjiman.com/blog/wp-content/uploads/2010/01/byron-de-la-beckwith-300x106.jpg" alt="byron de la beckwith" width="300" height="106" /></a><p class="wp-caption-text">First ranked result by the prototype for the query &quot;byron de la beckwith&quot;. Click to follow the article.</p></div>
<p>Again this is a very relevant result (oneRiot and Google News gave the same one at that time).</p>
<h2><strong>The temporal aspect of hot queries</strong></h2>
<p>What is interesting with hot queries is that you expect the results to change even within a short amount of time. Indeed, any story or breaking news generally evolves as new elements come in. As promised, let&#8217;s follow our &#8220;nexus one&#8221; query.</p>
<p>In the previous section, the prototype&#8217;s first result was a very relevant <a href="http://www.engadget.com/2010/01/02/exclusive-google-nexus-one-hands-on-video-and-first-impressio/" target="_blank">article from engadget</a>. I relaunched the same query 12 hours later. The first ranked result returned by my prototype was now different: another article from engadget (see picture below), but this time with a much more in-depth review of the phone and more videos, including a very funny comparison between the android, iphone and nexus one.</p>
<p>Then I waited for Google to hold its press conference one day later. I issued the query again. Can you guess what the first link given by my prototype was? You got it, <a href="http://www.google.com/phone/" target="_blank">the official Google Nexus One website</a>.</p>
<div id="attachment_791" class="wp-caption aligncenter" style="width: 310px"><a href="http://www.engadget.com/2010/01/04/nexus-one-review/" target="_blank"><img class="size-medium wp-image-791" title="nexusOneAfter12hours" src="http://philippeadjiman.com/blog/wp-content/uploads/2010/01/nexusOneAfter12hours-300x212.jpg" alt="nexusOneAfter12hours" width="300" height="212" /></a><p class="wp-caption-text">The first link given by the prototype on &quot;nexus one&quot; about one day before its official presentation by Google. Click to follow the article</p></div>
<p>Again, this is not a corner case. This temporal aspect shows up all the time, for any type of breaking news or event. As a last example of that phenomenon, let&#8217;s take the movie <a href="http://www.avatarmovie.com/" target="_blank">avatar</a>. In the first days before and after the movie was released, all you got were links to see the trailer or even the movie. Now, a few weeks later, what you get is a very fast changing list of links around fun pictures and parodies of the movie, with titles like &#8220;Do you want to date my avatar&#8221; (picture below), or a letter attempting to prove that <a href="http://i.imgur.com/JmRmb.jpg" target="_blank">avatar is actually Pocahontas in 3d</a> <img src='http://philippeadjiman.com/blog/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' /> .</p>
<div id="attachment_792" class="wp-caption aligncenter" style="width: 310px"><a href="http://farm3.static.flickr.com/2674/4246692670_d2a18a0c67_b.jpg" target="_blank"><img class="size-medium wp-image-792 " title="wantYouDataMyAvatar" src="http://philippeadjiman.com/blog/wp-content/uploads/2010/01/wantYouDataMyAvatar-300x240.jpg" alt="wantYouDataMyAvatar" width="300" height="240" /></a><p class="wp-caption-text">A few weeks after the release of the Avatar movie, the first links are a fast changing list of parodies</p></div>
<p style="text-align: center;">
<h2><strong>Playing with the prototype yourself</strong></h2>
<p><strong>If you just want to run the prototype through the command line</strong></p>
<p>You must have java 6 installed (you can check by opening a console and typing java -version). On a recent mac, see <a href="http://gephi.org/support/install-java-6-mac-os-x-leopard" target="_blank">these instructions</a> to have java 6 ready to use in a snap.<br />
Then just download this zip archive: <a href="http://philippeadjiman.com/blog/wp-content/uploads/2010/01/JarDependencies.zip">jarsDependencies.zip</a>.<br />
Save it and extract it somewhere on your computer. It will create a directory named <em>prototypeJars</em>.<br />
Open a command prompt and go inside the prototypeJars directory.</p>
<p>If you are on windows, just type:</p>

<div class="wp_syntax"><div class="code"><pre class="none" style="font-family:monospace;">java -cp &quot;*;&quot; com.philippeadjiman.rtseproto.RealTimeSEPrototype &quot;nexus one&quot; 150 OFF</pre></div></div>

<p>If you are on Linux or Mac just type:</p>

<div class="wp_syntax"><div class="code"><pre class="none" style="font-family:monospace;">java -cp &quot;*&quot; com.philippeadjiman.rtseproto.RealTimeSEPrototype &quot;nexus one&quot; 150 OFF</pre></div></div>

<p>You&#8217;ll notice the last three arguments (all are mandatory):</p>
<ul>
<li>&#8220;nexus one&#8221;: the query. Type whatever you want here but keep the quotes.</li>
<li>150: the maximum number of tweets to retrieve from the timeline. Use any number between 1 and 1000, but 150 is good enough.</li>
<li>OFF: whether or not you want the prototype to expand the short URLs. If you put ON, be patient, it may take a while. Even though duplicate short URLs happen all the time, going with OFF gives a good approximation of the leading results. Unless there is a problem with Twitter, OFF should give you the results within a few seconds.</li>
</ul>
<p>Only the top 20 results will be printed.</p>
<p><strong>If you want to play with the code</strong></p>
<p>As the title suggests, it is just a few hundred lines of (Java) code. As it is a toy project, and to keep things simple, I deliberately didn&#8217;t use any DI framework like spring or guice and tried to use as few external libraries as possible (not even log4j!). I did write a minimal amount of unit tests since I cannot code without them, and I did use the google-collections library for the same reason <img src='http://philippeadjiman.com/blog/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' /> .</p>
<p>I also tried to write at least a minimal amount of comments, in particular where I think the code should be improved a lot for better performance, but <strong>remember</strong>: the prototype is of course not scalable, as it does not rely on any indexing strategy (it computes the results at query time). Building a real search engine would first involve building an index offline (using <a href="http://lucene.apache.org/" target="_blank">lucene</a> for instance).</p>
<p>You&#8217;ll find the source code here <a href="http://philippeadjiman.com/blog/wp-content/uploads/2010/01/prototype_src.zip">prototype_src.zip</a>.</p>
<p>If you are using maven and eclipse (or another popular IDE), you should be ready to go in less than a minute: unpack the zip, type &#8220;mvn eclipse:eclipse&#8221; and import the existing project.</p>
<h2>Some final remarks</h2>
<p>What I mainly wanted to prove here is that, without crawling a single web page, you can answer &#8220;hot queries&#8221; with a relevance comparable to what you can find on google news or any &#8220;real time search engine&#8221;. This is made possible by judiciously using the tremendous power that twitter provides with its open API.</p>
<p>Of course building a real &#8220;real time search engine&#8221; would require much more than a few hundred lines of code <img src='http://philippeadjiman.com/blog/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' />  and hundreds of features could be added to that prototype, but I would keep two core principles:</p>
<ul>
<li>real time search results should be links and not micro blogging text like tweets. The text of some tweets can be relevant but as a secondary level of information.</li>
<li>let the &#8220;real time crowd&#8221; do the ranking for you. If a link is related in some way to your query and was highly <span style="text-decoration: underline;">and</span> recently tweeted or digged (you name it), then there is a good chance that it will be a relevant &#8220;real time&#8221; result.</li>
</ul>
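<p>One possible way to make the second principle concrete is to count how often each link is mentioned while letting every mention lose half of its weight after a fixed half-life. This is an assumption for illustration, not the ranking the prototype actually uses; <code>CrowdRanker</code>, <code>Mention</code> and <code>halfLifeMinutes</code> are hypothetical names.</p>

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: rank links by how often and how recently they were mentioned. */
final class CrowdRanker {

    /** One sighting of a link; ageMinutes is how long ago the tweet was posted. */
    static final class Mention {
        final String url;
        final double ageMinutes;

        Mention(String url, double ageMinutes) {
            this.url = url;
            this.ageMinutes = ageMinutes;
        }
    }

    /**
     * Each mention contributes 0.5^(age / halfLife): a brand new mention
     * counts 1.0, a mention that is one half-life old counts 0.5, and so on.
     */
    static Map<String, Double> score(List<Mention> mentions, double halfLifeMinutes) {
        Map<String, Double> scores = new HashMap<>();
        for (Mention m : mentions) {
            double weight = Math.pow(0.5, m.ageMinutes / halfLifeMinutes);
            scores.merge(m.url, weight, Double::sum);
        }
        return scores;
    }

    /** Returns the distinct links, best score first. */
    static List<String> rank(List<Mention> mentions, double halfLifeMinutes) {
        Map<String, Double> scores = score(mentions, halfLifeMinutes);
        List<String> urls = new ArrayList<>(scores.keySet());
        urls.sort((a, b) -> Double.compare(scores.get(b), scores.get(a)));
        return urls;
    }
}
```

<p>With a half-life of 60 minutes, a link tweeted once just now (weight 1) outranks a link tweeted three times two hours ago (3 x 0.25 = 0.75), which is the &#8220;highly and recently&#8221; trade-off in a nutshell.</p>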
<p>In that sense, among the dozens of real time search engines I have tested, my favorite one remains <a href="http://www.oneriot.com/" target="_blank">oneriot</a>.</p>
<p>This covers the &#8220;pull&#8221; side of things (when the user knows what to search for). I did not talk about the &#8220;push&#8221; side of the real time web here; probably in another post&#8230;</p>
<p>If you have issues running the prototype or any other question/remark, please shoot a comment.</p>
]]></content:encoded>
			<wfw:commentRss>http://philippeadjiman.com/blog/2010/01/06/how-to-build-a-relevant-real-time-search-engine-prototype-in-few-hundred-lines-of-code/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>The Trick To Write A Fast (Universal) Java URL Expander</title>
		<link>http://philippeadjiman.com/blog/2009/09/07/the-trick-to-write-a-fast-universal-java-url-expander/</link>
		<comments>http://philippeadjiman.com/blog/2009/09/07/the-trick-to-write-a-fast-universal-java-url-expander/#comments</comments>
		<pubDate>Mon, 07 Sep 2009 10:13:13 +0000</pubDate>
		<dc:creator>padjiman</dc:creator>
				<category><![CDATA[java]]></category>
		<category><![CDATA[twitter]]></category>

		<guid isPermaLink="false">http://philippeadjiman.com/blog/?p=81</guid>
		<description><![CDATA[140 characters. Means something to you?
This is about how twitter (and micro-blogging) was born. Even if some profane firefox extensions try to work around this, when it comes to inserting (long) URLs you may have trouble sticking to the rule.
And here come URL shortening services.
Pretty simple: The long URL http://philippeadjiman.com/blog/2009/09/01/can-you-guess-what-is-the-hottest-trend-of-google-hot-trends/ becomes http://bit.ly/miUkz that [...]]]></description>
			<content:encoded><![CDATA[<p style="text-align: left;">140 characters. Means something to you?</p>
<p style="text-align: left;">This is about how twitter (and micro-blogging) <a href="http://www.140characters.com/2009/01/30/how-twitter-was-born/" target="_blank">was born</a>. Even if some profane firefox extensions try to <a href="http://shorttext.com/twitzer.aspx" target="_blank">work around this</a>, when it comes to inserting (long) URLs you may have trouble sticking to the rule.</p>
<p style="text-align: left;">And here come URL shortening services.</p>
<p style="text-align: left;">Pretty simple: the long URL <a href="http://philippeadjiman.com/blog/2009/09/01/can-you-guess-what-is-the-hottest-trend-of-google-hot-trends/" target="_blank">http://philippeadjiman.com/blog/2009/09/01/can-you-guess-what-is-the-hottest-trend-of-google-hot-trends/</a> becomes <a href="http://bit.ly/miUkz" target="_blank">http://bit.ly/miUkz</a>, which will nicely fit in your next tweet.</p>
<p style="text-align: left;">Now everyone wants to shorten URLs. Here is a list of <a href="http://mashable.com/2008/01/08/url-shortening-services/" target="_blank">90 + URL shortening services</a> (!!) without counting the ones that you can <a href="http://lifehacker.com/5335216/make-your-own-url-shortening-service" target="_blank">build by yourself</a>.</p>
<p style="text-align: left;">How can we (developers) survive in this jungle if we want to retrieve the real expanded version of those tons of URLs?</p>
<p style="text-align: left;">Well, a naive Java version would be:</p>

<div class="wp_syntax"><div class="code"><pre class="java" style="font-family:monospace;"><span style="color: #000000; font-weight: bold;">public</span> <span style="color: #003399;">String</span> NaiveURLExpander<span style="color: #009900;">&#40;</span><span style="color: #003399;">String</span> address<span style="color: #009900;">&#41;</span> <span style="color: #000000; font-weight: bold;">throws</span> <span style="color: #003399;">IOException</span> <span style="color: #009900;">&#123;</span>
        <span style="color: #003399;">String</span> result<span style="color: #339933;">;</span>
        <span style="color: #003399;">URLConnection</span> conn <span style="color: #339933;">=</span> <span style="color: #000066; font-weight: bold;">null</span><span style="color: #339933;">;</span>
        <span style="color: #003399;">InputStream</span>  in <span style="color: #339933;">=</span> <span style="color: #000066; font-weight: bold;">null</span><span style="color: #339933;">;</span>
        <span style="color: #003399;">URL</span> url <span style="color: #339933;">=</span> <span style="color: #000000; font-weight: bold;">new</span> <span style="color: #003399;">URL</span><span style="color: #009900;">&#40;</span>address<span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
        conn <span style="color: #339933;">=</span> url.<span style="color: #006633;">openConnection</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
        in <span style="color: #339933;">=</span> conn.<span style="color: #006633;">getInputStream</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
        result <span style="color: #339933;">=</span> conn.<span style="color: #006633;">getURL</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span>.<span style="color: #006633;">toString</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
        in.<span style="color: #006633;">close</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
        <span style="color: #000000; font-weight: bold;">return</span> result<span style="color: #339933;">;</span>
    <span style="color: #009900;">&#125;</span></pre></div></div>

<p>Nice. It works. But it is terribly slow.<br />
Why? Because when you analyze what happens behind the scenes, the HTTP response for the newly created short URL starts with the line</p>

<div class="wp_syntax"><div class="code"><pre class="bash" style="font-family:monospace;">HTTP<span style="color: #000000; font-weight: bold;">/</span><span style="color: #000000;">1.1</span> <span style="color: #000000;">301</span> Moved</pre></div></div>

<p>If you check the <a href="http://www.w3.org/Protocols/rfc2616/rfc2616-sec10.html" target="_blank">status code definition</a> of the HTTP protocol, you will see that this means the URL has moved permanently and that the new one should be located in the <strong>Location</strong> field of the HTTP header. In other words, the above Java code behaves exactly like your browser: it follows the redirection and then requests the target page, which is terribly slow.</p>
<p>So here is the trick:</p>
<ol>
<li>Use an <strong>HttpURLConnection </strong>object so that you can tell it, via the <strong>setInstanceFollowRedirects </strong>method, <span style="text-decoration: underline;">not</span> to redirect automatically (as a browser would) when connecting.</li>
<li>Extract the <strong>Location </strong>value in the HTTP header.</li>
</ol>
<p>Here you go:</p>

<div class="wp_syntax"><div class="code"><pre class="java" style="font-family:monospace;"> <span style="color: #000000; font-weight: bold;">public</span> <span style="color: #003399;">String</span> expandShortURL<span style="color: #009900;">&#40;</span><span style="color: #003399;">String</span> address<span style="color: #009900;">&#41;</span> <span style="color: #000000; font-weight: bold;">throws</span> <span style="color: #003399;">IOException</span> <span style="color: #009900;">&#123;</span>
        <span style="color: #003399;">URL</span> url <span style="color: #339933;">=</span> <span style="color: #000000; font-weight: bold;">new</span> <span style="color: #003399;">URL</span><span style="color: #009900;">&#40;</span>address<span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
&nbsp;
        <span style="color: #003399;">HttpURLConnection</span> connection <span style="color: #339933;">=</span> <span style="color: #009900;">&#40;</span><span style="color: #003399;">HttpURLConnection</span><span style="color: #009900;">&#41;</span> url.<span style="color: #006633;">openConnection</span><span style="color: #009900;">&#40;</span><span style="color: #003399;">Proxy</span>.<span style="color: #006633;">NO_PROXY</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span> <span style="color: #666666; font-style: italic;">//using proxy may increase latency</span>
        connection.<span style="color: #006633;">setInstanceFollowRedirects</span><span style="color: #009900;">&#40;</span><span style="color: #000066; font-weight: bold;">false</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
        connection.<span style="color: #006633;">connect</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
        <span style="color: #003399;">String</span> expandedURL <span style="color: #339933;">=</span> connection.<span style="color: #006633;">getHeaderField</span><span style="color: #009900;">&#40;</span><span style="color: #0000ff;">&quot;Location&quot;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
        connection.<span style="color: #006633;">getInputStream</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span>.<span style="color: #006633;">close</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
        <span style="color: #000000; font-weight: bold;">return</span> expandedURL<span style="color: #339933;">;</span>
    <span style="color: #009900;">&#125;</span></pre></div></div>

<p>If you are more of a PHP guy, I saw a similar post that explains <a href="http://hasin.wordpress.com/2009/05/05/expanding-short-urls-to-original-urls-using-php-and-curl/" target="_blank">how to do it using PHP and curl</a>.</p>
<p>Note that for the sake of conciseness, I do not handle errors in the code. Also, since I cannot guarantee that all the URL shortening services in the world use this exact approach (but I think most of them do), to make the code really universal you have to handle the case where the Location field is null. A further improvement would be to find some heuristics to detect whether the input URL is already a real one (I mean not a short one); that would avoid opening a connection needlessly.</p>
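<p>Both remarks can be sketched as follows. This is only an illustration under stated assumptions: the class <code>SafeUrlExpander</code>, its method names and the length thresholds in <code>looksShortened</code> are hypothetical, and the heuristic is a crude guess that will produce false positives (a short regular URL costs one needless request, then falls back to itself).</p>

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.Proxy;
import java.net.URI;

/** Hypothetical sketch: expand a short URL without failing when there is no redirect. */
final class SafeUrlExpander {

    /** Returns the redirect target, or the input itself when the response is not a usable redirect. */
    static String chooseResult(String address, int status, String location) {
        boolean redirect = status >= 300 && status < 400;
        return (redirect && location != null && !location.isEmpty()) ? location : address;
    }

    /**
     * Crude guess (assumption): a short host name, one short path segment
     * and no query string. The thresholds of 10 characters are arbitrary.
     */
    static boolean looksShortened(String address) {
        try {
            URI uri = URI.create(address);
            String host = uri.getHost();
            String path = uri.getPath();
            if (host == null || path == null || uri.getQuery() != null) {
                return false;
            }
            String slug = path.startsWith("/") ? path.substring(1) : path;
            return host.length() <= 10
                    && !slug.isEmpty()
                    && slug.length() <= 10
                    && !slug.contains("/");
        } catch (IllegalArgumentException e) {
            return false; // not even a valid URI
        }
    }

    /** Skips the network entirely when the URL does not look shortened. */
    static String expand(String address) throws IOException {
        if (!looksShortened(address)) {
            return address;
        }
        HttpURLConnection connection =
                (HttpURLConnection) URI.create(address).toURL().openConnection(Proxy.NO_PROXY);
        try {
            connection.setInstanceFollowRedirects(false);
            int status = connection.getResponseCode(); // this is where the request is sent
            return chooseResult(address, status, connection.getHeaderField("Location"));
        } finally {
            connection.disconnect();
        }
    }
}
```

<p>Keeping the decision logic in <code>chooseResult</code> and <code>looksShortened</code> separate from the network call makes both easy to unit test without touching any shortening service.</p>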
<p>Finally, if some URL shortening services are not robust enough to check their own URLs, you may also have to deal with a corner case of &#8220;transitive shortening&#8221;  (I&#8217;m sure there will always be some curious people who will try to shorten an already shortened URL&#8230;). <strong>Update</strong>: check this example: <a href="http://bit.ly/4XzVxm" target="_blank">http://bit.ly/4XzVxm</a> points to <a href="http://tcrn.ch/6c8AU4" target="_blank">http://tcrn.ch/6c8AU4</a> which is itself another short url!</p>
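<p>A minimal sketch of how such chains could be followed, one hop at a time. The class <code>ChainExpander</code> and the <code>maxHops</code> limit are hypothetical; <code>singleHop</code> stands for any function that returns the Location of a URL, or null when there is no redirect (a method like expandShortURL above would fit once its IOException is wrapped).</p>

```java
import java.util.HashSet;
import java.util.Set;
import java.util.function.UnaryOperator;

/** Hypothetical sketch: follow a chain of short URLs with a hop limit and loop detection. */
final class ChainExpander {

    /**
     * Applies singleHop repeatedly until it returns null (no more redirect),
     * a URL is seen twice (redirect loop) or maxHops is reached.
     */
    static String expandFully(String address, UnaryOperator<String> singleHop, int maxHops) {
        Set<String> seen = new HashSet<>();
        String current = address;
        for (int hop = 0; hop < maxHops; hop++) {
            if (!seen.add(current)) {
                break; // already visited: two short URLs point at each other
            }
            String next = singleHop.apply(current);
            if (next == null) {
                break; // current is not a redirect, so it is the final URL
            }
            current = next;
        }
        return current;
    }
}
```

<p>The hop limit matters as much as the loop detection: every extra hop costs one more round trip, so a small limit keeps the worst case bounded.</p>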
<p>Also, to achieve real performance, such code should be multithreaded. If you have to expand millions of URLs, you would probably need many machines. A time limit should also be added to avoid overly long connections, with a mechanism similar to a <a href="http://java.sun.com/j2se/1.4.2/docs/api/java/util/TimerTask.html" target="_blank">TimerTask</a>.</p>
<p>Note that this trick makes the code <strong>5 to 6 times faster</strong>. When it comes to dealing with millions of short URLs, it makes a difference.</p>
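<p>On a single machine, the multithreading and the time limit could look like the sketch below. It swaps the TimerTask idea for an <code>ExecutorService</code>, whose <code>invokeAll</code> accepts a deadline for the whole batch (not per URL) and cancels whatever is still running. <code>ParallelExpander</code> and its parameters are hypothetical names, and <code>expander</code> is any function that expands one URL. For per-connection limits, <code>setConnectTimeout</code> and <code>setReadTimeout</code> on the connection are another option.</p>

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.CancellationException;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.function.Function;

/** Hypothetical sketch: expand many URLs in parallel, with a deadline for the whole batch. */
final class ParallelExpander {

    /** Maps each short URL to its expansion, or to null if it failed or missed the deadline. */
    static Map<String, String> expandAll(List<String> shortUrls,
                                         Function<String, String> expander,
                                         int threads,
                                         long timeoutMillis) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Callable<String>> tasks = new ArrayList<>();
            for (String url : shortUrls) {
                tasks.add(() -> expander.apply(url));
            }
            // Blocks until all tasks are done or the deadline passes;
            // tasks that are not finished by then are cancelled.
            List<Future<String>> futures =
                    pool.invokeAll(tasks, timeoutMillis, TimeUnit.MILLISECONDS);

            Map<String, String> results = new LinkedHashMap<>();
            for (int i = 0; i < shortUrls.size(); i++) {
                String expanded;
                try {
                    expanded = futures.get(i).get();
                } catch (ExecutionException | CancellationException e) {
                    expanded = null; // the task threw, or it was cancelled at the deadline
                }
                results.put(shortUrls.get(i), expanded);
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while expanding URLs", e);
        } finally {
            pool.shutdownNow();
        }
    }
}
```

<p>Since the work is almost entirely waiting on the network, the number of threads can usually be much higher than the number of cores; the right value depends on how much load the shortening services tolerate.</p>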
]]></content:encoded>
			<wfw:commentRss>http://philippeadjiman.com/blog/2009/09/07/the-trick-to-write-a-fast-universal-java-url-expander/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>
