January 24, 2011


There is of course the option of revenge, which works only if carried off with finesse. A recent news item reported such a turning of tables by a parent woken up at 4 AM by an automated phone call from the school system. More appropriate to this post are the many celebrations of revenge in book and film, an exemplar of which is Sydney Pollack's Absence of Malice.

The temptation to begin this comment "Before I stopped robbing banks..." is strong, but ultimately I'm a wimp. I haven't stopped robbing banks. I mean, I never started robbing them! Oh, crap.

How can I disagree with a call for better education and for a clearer awareness of what makes search engines tick? There is also a need for a complementary adjustment: the technology itself needs to work better. Lately, Google itself has taken a few hits (pun?) for producing results laden with spam, a problem purportedly addressed more thoughtfully by young upstart engines. But the fact of the matter is that these search engines have never been useful outside a fairly narrow range of simple research projects, such as shopping.

The call for "balance," however, is misplaced. It can't be produced technologically, at least not without also misrepresenting the data. The state of affairs in the world is what your diagram depicts. That is balance. Nobody is "bombarded" with anything. These are just the results. Compare shopping online. If I'm shopping for coffee makers and my searching produces dozens or hundreds of models--multiple color options, functions and features, price points--have I been bombarded with results? Or have I been presented with an opportunity for choice?

But a picture of Dawkins on your desk at a childcare facility. That's weird.

Here's a personal counter-example from the last two days - of bad news being superseded by good. Please excuse it if you find it tangential.
I did a search on 'Manu Joseph' merely to check on the man's credentials and was surprised to find a link to my adverse review of his book at the bottom of the first page of results. A few hours later I repeated the search and the link had risen to number two. A few minutes ago I found that it was now at the bottom of the second page.
What's going on, Cyrus? Does this mean that the rank of a link in Internet search results is to be modeled as a random walk process? Could the process of ranking be Markovian and, by implication, memoryless?
Perhaps an easier tactic than revenge is to flood the net with feel-good items in sufficient number. Would that be a workable remedy? If so, what are its parameters?
If you have a scientific explanation I'll gladly have it off-line.

Since I have had no response from Cyrus, let me add that I don't know what the fuss is about. Google comes up with NOTHING pertinent when I search with every possible combination of keywords from my Sotomayor post, which is three years old or less. In effect, it is lost to posterity.

Sorry not to get back to you until now, Narayan. To answer your question, the way that Google handles requests can produce different results, even from second to second. Basically, queries are handled by various sets of machines, each holding a slightly different copy of the index used to generate results. Each query is routed semi-deterministically to one of these clusters, so the same query can produce different results at different times. If a query doesn't have many hits, or if the hits are similarly ranked, small differences between these clusters can account for the changes in rank you report.
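For the curious, here is a toy sketch of that effect. Everything in it is invented for illustration: the two replicas, their document names and scores, and the hash-based routing rule are assumptions, not a description of Google's actual machinery. It only shows how two copies of an index that differ in the third decimal place can reorder near-tied results.

```python
import hashlib

# Two hypothetical index replicas. The scores differ only slightly,
# as they might if the replicas were refreshed at different times.
replica_scores = [
    {"review": 0.501, "publisher": 0.500, "interview": 0.499},
    {"review": 0.499, "publisher": 0.500, "interview": 0.501},
]


def rank_on(replica_index):
    """Return document names from one replica, best score first."""
    scores = replica_scores[replica_index]
    return sorted(scores, key=scores.get, reverse=True)


def pick_replica(query, time_bucket):
    """Assumed routing rule: hash the query together with a time bucket."""
    digest = hashlib.sha256(f"{query}|{time_bucket}".encode("utf-8")).digest()
    return digest[0] % len(replica_scores)


def search(query, time_bucket):
    """Answer the query from whichever replica the router selects."""
    return rank_on(pick_replica(query, time_bucket))


# The same query, issued in different time buckets, may be answered
# by different replicas and so come back in a different order.
for bucket in range(6):
    print(bucket, search("Manu Joseph", bucket))
```

On the first replica "review" comes out on top, and on the second it comes out last, even though no score moved by more than 0.002. Which replica answers a given request depends only on the made-up routing rule.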

It's interesting that you bring up the concept of random walks, as they play the central role in how Google's PageRank algorithm works. Indeed, the PageRank of a document is, more or less, the probability that an infinitely long random walk across the entire web will finish on that particular document. This is a memoryless process in the statistical sense, but that can give the false impression that topology doesn't matter. On the contrary, topology plays a very important role, as a document with many incoming links is much more likely to be hit by a random walk than one with only a few. If a document is closely connected to a highly connected site, then it too is more likely to be reached by the walk.
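A minimal sketch of the random-walk idea, for anyone who wants to see it run. It computes scores by repeated averaging (power iteration) on a made-up five-page web. Two things here are assumptions that do not come from the comment above: the page names, and the damping factor of 0.85 (the chance that the surfer follows a link rather than jumping to a random page), which is the value usually quoted for the published PageRank formulation. It is not a claim about what Google runs today.

```python
def pagerank(links, damping=0.85, iterations=100):
    """Approximate PageRank by power iteration.

    links maps each page name to the list of pages it links to.
    """
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page gets a small share from surfers who jump at random.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if not targets:
                # A page with no outgoing links: the surfer jumps anywhere.
                targets = pages
            share = damping * rank[page] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical toy web: every page links to "hub"; nothing links to "loner".
web = {
    "hub": ["a", "b", "c"],
    "a": ["hub"],
    "b": ["hub"],
    "c": ["hub", "a"],
    "loner": ["hub"],
}

ranks = pagerank(web)
for page, score in sorted(ranks.items(), key=lambda item: item[1], reverse=True):
    print(f"{page:6s} {score:.3f}")
```

In this toy web "hub" ends up with the highest score because every other page links to it, and "loner", which nothing links to, ends up with the lowest. Page "a" beats "b" only because it has one extra incoming link, which is the topology effect described above.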

This provides a partial answer to your question. Flooding is hard, because for it to be effective two things need to be true. First, the pages you add need to be linked to by others. If they sit there, all by themselves, they'll have almost no effect on the final PageRank of the target document. Second, they need to be linked to by more than just the other flood documents, or else they will still have very little effect. Indeed, Google filters for graph components that are internally well connected but contain almost no links from the outside. I'm not familiar with how that process works, but the idea is to protect against flooding, so that what you suggest isn't possible.
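The two conditions can be checked on a toy graph. The sketch below makes a simplifying assumption of its own: the random surfer restarts only at a few established "seed" pages, so a page nobody links to is never visited at all. Plain PageRank with uniform restarts would give every page a small baseline share instead, one that shrinks as the web grows. The page names, the five flood pages, and the helper functions are all hypothetical, and this is not meant to describe Google's filtering.

```python
def seeded_walk_scores(links, seeds, damping=0.85, iterations=200):
    """Random-walk scores where the surfer restarts only at seed pages."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    score = {p: (1.0 / len(seeds) if p in seeds else 0.0) for p in pages}
    for _ in range(iterations):
        restart = (1.0 - damping) / len(seeds)
        new_score = {p: (restart if p in seeds else 0.0) for p in pages}
        for page in pages:
            # A dead-end page sends the surfer back to the seeds.
            targets = links.get(page) or seeds
            share = damping * score[page] / len(targets)
            for target in targets:
                new_score[target] += share
        score = new_score
    return score


def add_flood(links, count, endorsed_by=None):
    """Add flood pages that link to 'target' and to each other in a ring."""
    flooded = {page: list(targets) for page, targets in links.items()}
    names = [f"flood{i}" for i in range(count)]
    for i, name in enumerate(names):
        flooded[name] = ["target", names[(i + 1) % count]]
    if endorsed_by is not None:
        # One genuine outside link into the flood.
        flooded[endorsed_by].append(names[0])
    return flooded


# Hypothetical established web with one page we would like to boost.
base = {
    "news": ["blog", "forum", "target"],
    "blog": ["news", "forum"],
    "forum": ["news", "blog"],
    "target": ["news"],
}
seeds = ["news", "blog", "forum"]

baseline = seeded_walk_scores(base, seeds)
isolated = seeded_walk_scores(add_flood(base, 5), seeds)
endorsed = seeded_walk_scores(add_flood(base, 5, endorsed_by="blog"), seeds)

print(f"no flood:        {baseline['target']:.3f}")
print(f"isolated flood:  {isolated['target']:.3f}")
print(f"endorsed flood:  {endorsed['target']:.3f}")
```

Under these assumptions, five flood pages that link only to the target and to each other leave the target's score unchanged, because the walk never reaches them. A single link from "blog" into the flood is what lets them pass anything along.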

PageRank is only one of the many factors Google now uses to generate the listings you see, but I believe it remains one of the most important.

To summarize, a single negative post isn't going to slander someone by taking a top position in Google's results. But a media blitz of bad publicity over accusations would certainly dent search results for a long time to come, particularly if the person affected isn't normally someone who produces news (i.e., a non-celebrity). I would guess it is a marginal situation, but life must be hell for those it affects.

A bit more on the emerging European concept of the right to be forgotten:

Very interesting, Cyrus. So will this be achieved upon explicitly requesting that certain data be removed from the internet? By individuals or the courts? Will this apply to something as inconsequential (albeit embarrassing at a later date and in a new light) as a youthful photo or reckless speech? Or will it be something more life altering such as false criminal charges which later proved to be groundless? I can understand the "right to forget" being a legitimate quest for the latter scenario but not the first. Europeans may value their privacy more than others but I doubt that it is necessary to scrub the 'net of all offending (but harmless) items.

I think that what you describe as currently lacking in the media, and derivatively in Google, exists already: it's Wikipedia.

Sure, it also has its own biases, but given that the media works better on negative news than on positive follow-ups, people should just go to your Wikipedia page to check whether the accusations against your child care facility are really true or not. Wikipedia is almost obsessive in recording the most up-to-date facts.

Of course it's not the best example, since how to handle biographies of living people is quite a heated topic of debate within Wikipedia itself, but I do think that crowdsourcing would help balance the way these things are portrayed in the media.

