Is PageRank just good marketing?

Web search engines such as Google look at which pages link to which to determine the authoritative Web pages. A good algorithm in this context is one that is hard to fool: if you and your friends decide to add links to each other's pages, it should not make much of a difference. Sérgio commented earlier on this blog that PageRank is known to be just marketing. So I decided to go hunting. Up until now, I thought PageRank was a clever idea because it feels like it would be harder to fool than just counting how many inbound links a page has. It was not very long before I found a reference that supported Sérgio’s claim:

Log of indegree was highly correlated with Google-reported PageRank scores, and just as effective when predicting desirable company attributes. Further, we found that PageRank scores for sites within a known spam network were no lower than would be expected on the basis of their indegree. We encounter no compelling evidence to support the use of PageRank over indegree.

Reference: T. Upstill, N. Craswell, and D. Hawking, “Predicting fame and fortune: PageRank or indegree?”, ADCS 2003, 2003.

Does anyone know of any demonstrated benefit of PageRank over merely counting the number of inbound links? Is PageRank more resilient at all?
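For readers who want to poke at the question themselves, here is a minimal sketch that computes both scores on a toy graph. It is the textbook power-iteration form of PageRank, not whatever Google actually runs, and the graph, the page names (`target`, `spam`, `f1`...) and the damping value of 0.85 are all assumptions made up for illustration.

```python
def pagerank(links, damping=0.85, iterations=100):
    """Textbook PageRank by power iteration.

    links: dict mapping each page to the list of pages it links to.
    damping: probability that the surfer follows a link (assumed 0.85).
    """
    pages = sorted(set(links) | {q for targets in links.values() for q in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Rank sitting on pages without out-links is spread uniformly.
        dangling = sum(rank[p] for p in pages if not links.get(p))
        new_rank = {p: (1.0 - damping) / n + damping * dangling / n for p in pages}
        for p in pages:
            targets = links.get(p, [])
            for q in targets:
                new_rank[q] += damping * rank[p] / len(targets)
        rank = new_rank
    return rank


def indegree(links):
    """Number of distinct pages linking to each page."""
    pages = set(links) | {q for targets in links.values() for q in targets}
    counts = {p: 0 for p in pages}
    for targets in links.values():
        for q in set(targets):
            counts[q] += 1
    return counts


# Hypothetical toy web: "target" is linked from three interlinked pages,
# "spam" is linked from three farm pages that nobody else links to.
links = {
    "a": ["b", "c", "target"],
    "b": ["a", "c", "target"],
    "c": ["a", "b", "target"],
    "target": ["a"],
    "f1": ["spam"],
    "f2": ["spam"],
    "f3": ["spam"],
    "spam": ["f1", "f2", "f3"],
}

rank = pagerank(links)
deg = indegree(links)
for page in ("target", "spam"):
    print(page, "indegree:", deg[page], "pagerank:", round(rank[page], 4))
```

Both pages have the same indegree here by construction, so any difference in the printed PageRank values comes from the link structure alone. A toy graph proves nothing about the real Web, but it is a cheap way to try out link-farm shapes.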

Update: do read the comments! They are more interesting than my post.

7 thoughts on “Is PageRank just good marketing?”

  1. IR folks long suspected PageRank to be a red herring, but this was not confirmed until the last few years. The reference I like to use comes from MSR and was published at WWW06:

    M. Richardson, A. Prakash, and E. Brill, “Beyond pagerank: machine learning for static ranking,” in WWW ’06: Proceedings of the 15th international conference on World Wide Web, (New York, NY, USA), pp. 707–715, ACM Press, 2006.

    The authors demonstrate that structure-independent features, combined with a page’s popularity, significantly outperform PageRank. Informal conversations with engine architects and SEO folks confirm this.

    It’s helpful to interpret these results in the context of a random walk on the web graph. PageRank is the stationary distribution of a random walker on the web graph. In situations where you have no knowledge about page visitation, this is a reasonable surrogate. However, in the presence of real user data (gathered through a toolbar or OS), the random walk model seems less attractive than models that incorporate visitation data.

    That said, it also seems likely that actual effectiveness of search engines has more to do with using massive amounts of click data to train classic IR features and query triage schemes.

  2. Interesting post. I used Google Scholar to find all citations of “Predicting fame and fortune: Pagerank or indegree”. Google found 16 citations.

    I skimmed some of the citations, and two seemed particularly relevant: (1) Hits on the web: how does it compare? (2) Beyond PageRank: Machine Learning for Static Ranking. I was about to post this comment, when I saw that two previous comments gave exactly the same two references. Now I’m posting this comment anyway, to say that Google PageRank may be bogus, but Google Scholar seems to work just fine. 🙂

  3. Just to offer an anecdotal (and unconfirmed) piece of information: it is claimed that the original PageRank was not exactly the one described in the WWW97 paper.

    In the plain vanilla implementation, the underlying model of PageRank corresponds to a “random surfer” that follows hyperlinks and, with probability 0.15 (one minus the usual damping factor of 0.85), gets bored and jumps to a random page. I have heard that in the actual implementation, the random surfer jumps only to pages in the “edu” domain. (This idea is similar to the TrustRank algorithm.)

    Of course, since 1996 many things have changed and today there are so many other factors that are taken into consideration during ranking that it is almost certain that PageRank is mainly a marketing tool.

  4. I agree that PageRank has become mainly a marketing tool. However, there is a flaw in Upstill’s work. He doesn’t compare in-degree with PageRank but with the score given in Google’s Toolbar, called “PageRank”. Nobody knows what this score is exactly. In particular, nothing proves that it is the real “pure” PageRank as described in the original PageRank paper. I suspect that it is (a downgraded version of) the score that Google uses for ranking, which is a mixture of many factors, in which PageRank plays some (unknown) role.

  5. True. My comment was not in defence of PageRank. The simple fact that Google needs to supplement it with several dozen other criteria shows that it is not ideal 😉 In a way, Upstill said something right with a disputable methodology.
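The random-surfer variant described in comment 3, where the bored surfer jumps only to a trusted set of pages, can be sketched as follows. This is only an illustration of the rumoured idea (it amounts to personalized PageRank), not a description of Google's implementation; the function name `pagerank_with_teleport`, the example domains and the choice of ".edu" pages as the trusted set are all hypothetical.

```python
def pagerank_with_teleport(links, teleport, damping=0.85, iterations=100):
    """PageRank where random jumps land only on pages listed in `teleport`.

    links: dict mapping each page to the list of pages it links to.
    teleport: dict of non-negative jump weights (pages not listed get 0).
    damping: probability of following a link rather than jumping (assumed).
    """
    pages = sorted(set(links) | {q for targets in links.values() for q in targets})
    total = sum(teleport.get(p, 0.0) for p in pages)
    jump = {p: teleport.get(p, 0.0) / total for p in pages}
    rank = dict(jump)  # start the surfer on the trusted pages
    for _ in range(iterations):
        # Rank on pages without out-links is also sent to the jump set.
        dangling = sum(rank[p] for p in pages if not links.get(p))
        new_rank = {p: (1.0 - damping + damping * dangling) * jump[p] for p in pages}
        for p in pages:
            targets = links.get(p, [])
            for q in targets:
                new_rank[q] += damping * rank[p] / len(targets)
        rank = new_rank
    return rank


# Hypothetical toy web: a small link farm that no trusted page reaches.
links = {
    "uni.edu": ["lab.org"],
    "lab.org": ["uni.edu", "blog.com"],
    "blog.com": ["lab.org"],
    "farm1.com": ["spam.com"],
    "farm2.com": ["spam.com"],
    "spam.com": ["farm1.com", "farm2.com"],
}
trusted = {page: 1.0 for page in links if page.endswith(".edu")}

rank = pagerank_with_teleport(links, trusted)
for page in sorted(rank, key=rank.get, reverse=True):
    print(page, round(rank[page], 4))
```

In this toy graph the farm is unreachable from the trusted page, so it never receives any score. A real farm only needs one link from a reachable page to pick up some rank, which is why this is a sketch of the idea rather than evidence about resilience.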
