Picking a web page at random on the Web

To do statistics over the Web, we need samples. Thus, I want to know how to pick a Web page at random, without making much effort. If you are Google or Microsoft, it is easy. But what about the rest of us? And what if I want to pick users at random on Facebook?

In effect, I want to sample a virtually infinite data set: I consider the Web to be infinite because I cannot enumerate all elements in it. Given finite resources against an infinite data set, can I sample fairly?

Thankfully, the objects in this set are linked and indexed. Hence, I came up with two sampling strategies:

  • Using the hyperlinks: Start with any Web page. Recursively follow hyperlinks until you have many Web pages. Pick a page at random in this set. (Mathematically, I am sampling nodes in an infinite graph.)
  • Using the index: Sampling items Pick a set of common words, enter them into a search engine. Pick a page at random in the result set.

Are there any other strategies? What can I say (if anything) about the biases of my sampling?

Further reading: Benoit, D. and Slauenwhite, D. and Schofield, N. and Trudel, A., World’s First Class C Web Census: The First Step in a Complete Census of the Web, Journal of Networks 2 (2), 2007.

Update: An anonymous reader points me to Ziv Bar-Yossef and Maxim Gurevich, Random sampling from a search engine’s index, JACM 55 (5), 2008.

Update 2: Chris Brew points me to Henzinger et al., On Near-Uniform URL Sampling, WWW 2000.

Update 3: Regarding Facebook, I found Gjoka et al., Unbiased Sampling of Facebook, 2009.

Published by

Daniel Lemire

A computer science professor at the University of Quebec (TELUQ).

12 thoughts on “Picking a web page at random on the Web”

  1. I think an idea may be to just search a search engine for a single letter instead of a list of words and then choose a random url from a random page from the entire set of results.

    That may avoid the issue of having the search engine assign a “relevancy” score as well being biased to sites with specific words.

  2. Davids suggestion is not really random as you will only be able to find pages that have been bit.ly’d and this is probably mainly true for relatively new sites.

  3. Rather than picking common words I have done this by picking from one to three random words, and then using the “lucky” button (via a direct url call) button on Google. This *appears* to give many more unusual pages better visibility – and it’s humorous in wondering how certain random combinations result in redirection to some pages.

  4. I think that a random walk can be a bad idea, if implemented naively. If memory serves, the web has at a high level, three types of pages
    1) pages with only out links
    2) pages with only in links
    3) pages with both

    This concept can be extended to not just pages, but entire subgraphs. Once you get into some parts of the subgraph you can never get out, thus capturing your random walker. However, without a very large index, it would be difficult to identify these parts of the graph.

    Another alternative is to buy a copy of the web. See ClueWeb09 for 1B web pages for less than $1K

  5. The idea of using a Metropolis Hastings Algorithm came to my mind.

    Say your webgraph is your Markov chain graph. And you want to sample randomly. one idea is to start with an initial seed and run your Metropolis Hastings algorithm with a burn in period and it will give you a state randomly and normally distributed around your seed.

    Similar idea can be used for uniform distribution 🙂

  6. The Yahoo! generator is pretty cool. It seems to only show top-level pages, though.

    Given that many web pages are actually generated through parameters (see http://www.epinions.com/Refrigerators for just one example), the number of different pages a site can offer rises with the number of parameter combinations. How does one define “uniform sampling” in that context?

Leave a Reply to Daniel Gayo Cancel reply

Your email address will not be published.

To create code blocks or other preformatted text, indent by four spaces:

    This will be displayed in a monospaced font. The first four 
    spaces will be stripped off, but all other whitespace
    will be preserved.
    Markdown is turned off in code blocks:
     [This is not a link](http://example.com)

To create not a block, but an inline code span, use backticks:

Here is some inline `code`.

For more help see http://daringfireball.net/projects/markdown/syntax

You may subscribe to this blog by email.