Thursday, January 25, 2007

Leaking Information

On Tuesday of this week, I gave a presentation on the topic of Comparing Information Without Leaking It.  This is a class of problems in which two parties want to verify that they both know some secret, so they can then safely talk about it.  Wouter Teepe wrote his Ph.D. thesis on an extension of this problem, and has a number of excellent ideas about where algorithms that solve this class of problems might be used.  A particularly memorable one concerned the recent EU - USA spat about air passenger data.  The EU wants to avoid leaking it, and the US wants to compare it to their terror lists.  Well, that's just an instance of Comparing Information Without Leaking It, so with existing algorithms that can be achieved - technically.  I'm not holding my breath on the politics, however, especially since I suspect the US is actually just as interested in data mining this information (and using it for a variety of not entirely noble causes) as it is in identifying known suspects.  Of course, data mining isn't a perfect science for finding terrorists...
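To make the idea a bit more concrete, here is a minimal sketch of the simplest variant I can think of: each party sends a keyed hash of a shared fresh nonce, using its secret as the key, and the other side checks it against its own secret. This is an illustration only and not the protocol from Teepe's thesis; the function names (`fresh_nonce`, `proof`, `same_secret`) and the role labels are made up for this example. It also assumes the secret is hard to guess: for a low-entropy secret such as a passenger name, whoever receives a proof could simply try candidates offline, which is exactly the kind of leak the real algorithms are designed to avoid.

```python
import hashlib
import hmac
import secrets


def fresh_nonce() -> bytes:
    """Random value both parties agree on, so old proofs cannot be replayed."""
    return secrets.token_bytes(16)


def proof(secret: bytes, nonce: bytes, role: bytes) -> bytes:
    """Keyed hash of (role, nonce) with the secret as the HMAC key.

    The role label (hypothetical, e.g. b"alice" or b"bob") keeps one party
    from just echoing the other party's proof back.
    """
    return hmac.new(secret, role + b"|" + nonce, hashlib.sha256).digest()


def same_secret(my_secret: bytes, nonce: bytes,
                their_role: bytes, their_proof: bytes) -> bool:
    """Check the other party's proof against my own copy of the secret."""
    expected = proof(my_secret, nonce, their_role)
    return hmac.compare_digest(expected, their_proof)


if __name__ == "__main__":
    nonce = fresh_nonce()
    alice_proof = proof(b"a long, hard-to-guess secret", nonce, b"alice")
    # Bob learns only whether his secret matches, not what Alice's secret is
    # (again: assuming the secret cannot be guessed by brute force).
    print(same_secret(b"a long, hard-to-guess secret", nonce, b"alice", alice_proof))
    print(same_secret(b"something else entirely", nonce, b"alice", alice_proof))
```

In a real exchange both sides would contribute randomness to the nonce and both would send a proof, each under its own role label.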

In any case, I've been casually considering various other information leaks as a result.  Here's an irritating one I can't fix easily:  PicasaWeb uses your Google username as an element of your web album URL.  For example, a recently uploaded panorama is at http://picasaweb.google.com/eamon.nerbonne/Panoramas.  I can't easily change that URL (and certainly not after posting links to it everywhere), and the only options you can change it to are other Google usernames you have.  Well, that's great, but those also coincide with your Gmail address.  In conclusion, I'm looking forward to promptly receiving more spam, because it's now trivial to harvest valid Gmail addresses from PicasaWeb.

Thursday, January 04, 2007

Google isn't evil (part 372928)

A long, long time ago in a galaxy far, far away there once was a great evil empire.  Well, I'm not sure if it was evil, nor whether it was an empire, but nobody cares - if it was great, it must have been!  As such, the Great Google is being carefully scrutinized all around the web for symptoms: has Google finally succumbed?

I recently read yet another worry by Dare Obasanjo about how Google has clearly drifted from the path of the light, and that this is to be expected given its conflicts of interest (which he calls a strategy tax).  Three examples are given.  First, Google's income is dependent on ad-clicks, and these would potentially decline should normal search results become better, discouraging Google from improving search any more.  Second, Google's (temporary?) use of "tips" - which are essentially ads that function like Google's normal AdWords-based advertisements, but point to Google's own properties like Picasa Web Albums - conflicts with its core advertising business.  Finally, Dare suggests that Google may be fudging its search results, because how else could a search for "mail" return Gmail as a top hit, which is only linked 26500 times, while Yahoo Mail is on spot 2 with 400000 links?

It's hard to imagine any objective metric that should make Gmail show up ahead of Yahoo! Mail in a search for the word "mail". Of course, this doesn't mean that Google is tampering with search results "by hand". Their algorithm can simply have allowances to rank sites in their domain or linked from their domain higher without having to actually sully their hands by tweaking individual results by hand. Still, if Google is how the world finds information and we are increasingly being pointed to information that financially benefits Google, doesn't that taint the much vaunted claim of the integrity of their search results even if it is being done in an automated manner?

On the topic of the search results for "mail": I think it's just much ado about nothing! Looking at the top search results on Yahoo, Google and Live, Google's are clearly more relevant, being the only ones to name Hotmail, Yahoo Mail, Gmail and mail.com right near the top. Google is growing very quickly and it's reasonable to be suspicious, but a far simpler explanation is that Google's ranking algorithm is different from - and superior to (or maybe not?) - that of the others for most queries.

There's also a simple reason why mail.yahoo.com has an enormous number of irrelevant links: Yahoo spams the mail of its own users by adding a tagline which contains a backlink to mail.yahoo.com. Furthermore, it's listed on all kinds of yahoo.com subdomains. If you just page through the actual results, most of the results past the first few pages are yahoo.com's own pages, which don't contain any content whatsoever related to Yahoo Mail beyond a menu button getting you there. There are only 268 results Google actually displays linking to mail.yahoo.com - the rest it considers duplicates. Links to Gmail make it to 604, and far more of those are "real" sites, quite a few of which are blog posts quite like Dare's - or this one ;-). Any good ranking algorithm _should_ weigh Gmail as more controversial, recently talked about, mentioned, linked... and thus more relevant.

MapQuest vs. Google Maps (another example Dare lists) follows in a similar vein: although it's very likely that MapQuest, with its longer history, has many more links, what does that mean?  100000 links from 2001 (ancient history), on sites which are out of date and possibly full of dead links, and on which potentially the only updated content is the ad-banner Google itself provides, do not imply a great deal of relevance. It's not surprising that hype-master Google doesn't need to fudge its search results to achieve that kind of relevance - people just don't shut up about them!

Finally, let's not forget that extracting meaning from a huge number of badly formatted, search-engine-"optimized", Google-gaming HTML documents isn't an exact science. It's not particularly convincing to see large swaths of the internet trying to interpret every piece of circumstantial, superficial evidence they find as some form of corruption.

It's ironic and slightly disturbing that Google is adding ads to its own websites, since the relative calm of Google's site was its most obvious killer feature at first. But that's pretty much as bad as it gets: Slightly Disturbing.  Or should I say: Mostly Harmless?

If you're that worried about corruption, a better place to look would be a certain government, which is adept at wasting almost 6 times the market capitalization of Google each year (while pretending to be losing merely 2).  I guess the motto of the outgoing Republican congress can best be described as So long, and thanks for all the fish!  Now we all get to see whether the other guys can do any better...