Why bother with Google? Go straight to Wikipedia!

Véronis discovered something very interesting. About a third of the time, Google’s results include the Wikipedia link as the first link. His explanation is insightful:

How can this sudden interest in Wikipedia by both engines be explained? It is undoubtedly connected with the increasing difficulty engines have in calculating satisfactory ranking. The good old days of PageRank algorithms are over. (…) The explosion of blogs and news sites has changed the situation considerably.

If Web topology cannot cope anymore, this means we need to introduce time as a factor. Any takers for a hypergraph version of PageRank? How do you call a time-varying Markov process?
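To make the question a bit more concrete, here is a minimal sketch of a PageRank-style random walk whose transition matrix changes from one step to the next. Everything in it is an assumption made for illustration: the three-page toy web, the two link snapshots, the damping factor of 0.85, and the function names `transition_matrix`, `step`, and `walk`. It is not how any search engine actually ranks pages.

```python
# Toy illustration only: a random surfer walking over link snapshots
# that change with time. Pages are numbered 0..n-1.

def transition_matrix(links, n, damping=0.85):
    """Build a row-stochastic matrix for one snapshot.

    links maps a page to the list of pages it links to.
    """
    matrix = []
    for page in range(n):
        out = links.get(page, [])
        if out:
            # With probability (1 - damping), jump to a uniformly random page.
            row = [(1.0 - damping) / n] * n
            # Otherwise, follow one of the outgoing links.
            for target in out:
                row[target] += damping / len(out)
        else:
            # Dangling page with no outgoing links: always jump uniformly.
            row = [1.0 / n] * n
        matrix.append(row)
    return matrix


def step(dist, matrix):
    """One step of the chain: new_dist = dist * matrix."""
    n = len(dist)
    return [sum(dist[i] * matrix[i][j] for i in range(n)) for j in range(n)]


def walk(snapshots, n, damping=0.85):
    """Apply one step per snapshot, starting from the uniform distribution."""
    dist = [1.0 / n] * n
    for links in snapshots:
        dist = step(dist, transition_matrix(links, n, damping))
    return dist


# Hypothetical data: page 0 first links to page 1, then switches to page 2.
snapshots = [
    {0: [1], 1: [2], 2: [0]},
    {0: [2], 1: [2], 2: [0]},
]
scores = walk(snapshots, 3)
```

If every snapshot were identical, repeating `step` would be the usual power iteration for PageRank. Letting the matrix vary with the step is the simplest way to bring time into the picture; whether it yields a useful ranking is a separate question.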

Published by

Daniel Lemire

A computer science professor at the Université du Québec (TELUQ).

6 thoughts on “Why bother with Google? Go straight to Wikipedia!”

  1. Please note that PageRank is mostly a marketing tool. It has been shown that PR is roughly equal to in-link count. Also, content-based features (like BM25) are largely superior to link-based features.

  2. Thanks for the interesting semi-anonymous comment, Sérgio, but it would be even more interesting if you provided references.

    On average, PageRank might be equal to the in-link count, but proving that it is equal to the in-link count (in some sense) with high probability seems much harder to do. You need to make assumptions about the topology of the Web. I’d be very interested in knowing what these assumptions are.

  3. Daniel,

    Link analysis uses a lot of matrix (2D) algebra. Including time as an additional dimension would require using tensor (3D) decomposition tools. This is what Jimeng Sun and Christos Faloutsos were advocating, e.g., in their paper and tutorial at the last SDM conference…

    This post by Jean Véronis identifies a *correlation* but his attempt at finding *causality* is highly speculative. Few people really know what is going on behind Google/Yahoo ranking, and those who do, apparently don’t tell easily. A more natural explanation would be that the increased coverage and “authority” of Wikipedia makes it quite naturally rank higher using a number of reasonable results. (Note that the comparison is with a Dec. 2005 study — 2 years ago is a long time in WP time)

    Also, to answer your title question: the Wikipedia search engine is pretty miserable so a regular search engine might be your best bet at efficient search in WP.

  4. How do you call a time-varying Markov process?

    Oh, this I know (for the first time in any blog comment): a nonhomogeneous Markov process. It appears that the canonical reference is: Blackwell D. (1945). Finite nonhomogeneous Markov chains. Ann. Math. 46: 594-599.

  5. If Web topology cannot cope anymore, this means we need to introduce time as a factor.

    Just like… Google Blogsearch!

    I really think the options offered in blog search will be available on the main Google page one day; how soon, however, I cannot say.
