To do statistics over the Web, we need samples. So I want to know how to pick a Web page at random without much effort. If you are Google or Microsoft, it is easy. But what about the rest of us? And what if I want to pick users at random on Facebook?
In effect, I want to sample a virtually infinite data set: I consider the Web to be infinite because I cannot enumerate all elements in it. Given finite resources against an infinite data set, can I sample fairly?
Thankfully, the objects in this set are linked and indexed. Hence, I came up with two sampling strategies:
- Using the hyperlinks: Start with any Web page. Recursively follow hyperlinks until you have many Web pages. Pick a page at random in this set. (Mathematically, I am sampling nodes in an infinite graph.)
- Using the index: Pick a set of common words and enter them into a search engine. Pick a page at random in the result set.
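To make the two strategies concrete, here is a minimal Python sketch that runs them on toy data rather than on the real Web. `TOY_WEB`, `TOY_INDEX`, `crawl`, `sample_by_links` and `sample_by_index` are hypothetical names invented for this illustration; a real version would need an HTTP fetcher and a search engine API, both of which I leave out.

```python
import random
from collections import deque

# Hypothetical toy "Web": each page maps to the pages it links to.
TOY_WEB = {
    "a": ["b", "c"],
    "b": ["c", "d"],
    "c": ["a"],
    "d": ["e"],
    "e": [],
    "f": ["a"],  # "f" links out, but no page links to it
}

# Hypothetical toy index: each word maps to the pages containing it.
TOY_INDEX = {
    "the": ["a", "b", "c"],
    "and": ["b", "d"],
    "of": ["c", "e"],
}


def crawl(links, seed, max_pages):
    """Follow hyperlinks breadth-first from seed until max_pages are collected."""
    seen = {seed}
    queue = deque([seed])
    while queue and len(seen) < max_pages:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in seen and len(seen) < max_pages:
                seen.add(target)
                queue.append(target)
    return seen


def sample_by_links(links, seed, max_pages, rng=random):
    """Strategy 1: crawl, then pick uniformly among the crawled pages."""
    return rng.choice(sorted(crawl(links, seed, max_pages)))


def sample_by_index(index, words, rng=random):
    """Strategy 2: pick uniformly among pages matching any of the query words."""
    hits = set()
    for word in words:
        hits.update(index.get(word, []))
    return rng.choice(sorted(hits))  # raises IndexError if nothing matches


if __name__ == "__main__":
    print(sample_by_links(TOY_WEB, "a", max_pages=10))
    print(sample_by_index(TOY_INDEX, ["the", "and"]))
```

Even on this toy graph, page `f` can never be returned by the crawl from `a`, because nothing links to it. Likewise, a page containing none of the chosen words can never be returned by the index strategy.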
Are there any other strategies? What can I say (if anything) about the biases of my sampling?
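One variation on the hyperlink strategy is a random walk: rather than crawling and then choosing, keep following a random link and record where you land. On an undirected graph, a simple random walk visits nodes in proportion to their degree, so well-connected nodes are oversampled. The sketch below illustrates that bias on a hypothetical five-node graph, together with a Metropolis-Hastings acceptance step, a standard correction that targets the uniform distribution. The graph, the function names and the step counts are assumptions made for illustration, and the Web itself is directed, so this applies more directly to friendship graphs than to hyperlinks.

```python
import random
from collections import Counter

# Hypothetical undirected toy graph: node -> neighbours (symmetric by construction).
GRAPH = {
    "h": ["a", "b", "c", "d"],
    "a": ["h", "b"],
    "b": ["h", "a"],
    "c": ["h"],
    "d": ["h"],
}


def simple_walk(graph, start, steps, rng):
    """Simple random walk: always move to a uniformly chosen neighbour."""
    counts = Counter()
    node = start
    for _ in range(steps):
        node = rng.choice(graph[node])
        counts[node] += 1
    return counts


def metropolis_walk(graph, start, steps, rng):
    """Metropolis-Hastings walk: accept a move with probability
    min(1, deg(current) / deg(proposal)); otherwise stay put."""
    counts = Counter()
    node = start
    for _ in range(steps):
        proposal = rng.choice(graph[node])
        if rng.random() < len(graph[node]) / len(graph[proposal]):
            node = proposal
        counts[node] += 1
    return counts


rng = random.Random(0)  # fixed seed so the run is repeatable
biased = simple_walk(GRAPH, "h", 50_000, rng)
corrected = metropolis_walk(GRAPH, "h", 50_000, rng)

if __name__ == "__main__":
    print("simple walk:    ", dict(biased))
    print("metropolis walk:", dict(corrected))
```

In the simple walk, the hub `h` should show up far more often than the leaves `c` and `d`; in the corrected walk, the visit counts should be roughly equal. Note that the correction needs the degree of each proposed node, which is cheap on a toy graph but costs an extra request per step on a real network.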
Further reading: D. Benoit, D. Slauenwhite, N. Schofield, and A. Trudel, World’s First Class C Web Census: The First Step in a Complete Census of the Web, Journal of Networks 2 (2), 2007.
Update: An anonymous reader points me to Ziv Bar-Yossef and Maxim Gurevich, Random sampling from a search engine’s index, JACM 55 (5), 2008.
Update 3: Regarding Facebook, I found Gjoka et al., Unbiased Sampling of Facebook, 2009.