<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
		>
<channel>
	<title>Comments on: Picking a web page at random on the Web</title>
	<atom:link href="http://lemire.me/blog/archives/2009/08/14/picking-a-web-page-at-random-on-the-web/feed/" rel="self" type="application/rss+xml" />
	<link>http://lemire.me/blog/archives/2009/08/14/picking-a-web-page-at-random-on-the-web/</link>
	<description>Computer Scientist and Open Scholar: Databases, Information Retrieval, Business Intelligence.</description>
	<lastBuildDate>Thu, 09 Feb 2012 11:13:29 +0000</lastBuildDate>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.1</generator>
	<item>
		<title>By: Seb</title>
		<link>http://lemire.me/blog/archives/2009/08/14/picking-a-web-page-at-random-on-the-web/comment-page-1/#comment-51402</link>
		<dc:creator>Seb</dc:creator>
		<pubDate>Sun, 23 Aug 2009 17:36:28 +0000</pubDate>
		<guid isPermaLink="false">http://www.daniel-lemire.com/blog/?p=2107#comment-51402</guid>
		<description>A few good-looking hits stem from this query: http://www.google.com/search?hl=en&amp;q=how+to+sample+uniform+from+the+web&amp;aq=f&amp;oq=&amp;aqi=</description>
		<content:encoded><![CDATA[<p>A few good-looking hits stem from this query: <a href="http://www.google.com/search?hl=en&#038;q=how+to+sample+uniform+from+the+web&#038;aq=f&#038;oq=&#038;aqi=" rel="nofollow">http://www.google.com/search?hl=en&#038;q=how+to+sample+uniform+from+the+web&#038;aq=f&#038;oq=&#038;aqi=</a></p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Seb</title>
		<link>http://lemire.me/blog/archives/2009/08/14/picking-a-web-page-at-random-on-the-web/comment-page-1/#comment-51401</link>
		<dc:creator>Seb</dc:creator>
		<pubDate>Sun, 23 Aug 2009 17:30:27 +0000</pubDate>
		<guid isPermaLink="false">http://www.daniel-lemire.com/blog/?p=2107#comment-51401</guid>
		<description>The Yahoo! generator is pretty cool. It seems to only show top-level pages, though.

Given that many web pages are actually generated through parameters (see http://www.epinions.com/Refrigerators for just one example), the number of different pages a site can offer rises with the number of parameter combinations. How does one define &quot;uniform sampling&quot; in that context?</description>
		<content:encoded><![CDATA[<p>The Yahoo! generator is pretty cool. It seems to only show top-level pages, though.</p>
<p>Given that many web pages are actually generated through parameters (see <a href="http://www.epinions.com/Refrigerators" rel="nofollow">http://www.epinions.com/Refrigerators</a> for just one example), the number of different pages a site can offer rises with the number of parameter combinations. How does one define &#8220;uniform sampling&#8221; in that context?</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Daniel Gayo</title>
		<link>http://lemire.me/blog/archives/2009/08/14/picking-a-web-page-at-random-on-the-web/comment-page-1/#comment-51341</link>
		<dc:creator>Daniel Gayo</dc:creator>
		<pubDate>Sat, 15 Aug 2009 09:29:05 +0000</pubDate>
		<guid isPermaLink="false">http://www.daniel-lemire.com/blog/?p=2107#comment-51341</guid>
		<description>Hi,

You should check this Yahoo! service: http://random.yahoo.com/bin/ryl

As far as I know, it produces random pages taken from their index. I used it in the past and, although it certainly shows a bias towards .com pages, it is pretty convenient.

Best, Dani</description>
		<content:encoded><![CDATA[<p>Hi,</p>
<p>You should check this Yahoo! service: <a href="http://random.yahoo.com/bin/ryl" rel="nofollow">http://random.yahoo.com/bin/ryl</a></p>
<p>As far as I know, it produces random pages taken from their index. I used it in the past and, although it certainly shows a bias towards .com pages, it is pretty convenient.</p>
<p>Best, Dani</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Siamak F</title>
		<link>http://lemire.me/blog/archives/2009/08/14/picking-a-web-page-at-random-on-the-web/comment-page-1/#comment-51335</link>
		<dc:creator>Siamak F</dc:creator>
		<pubDate>Fri, 14 Aug 2009 16:16:59 +0000</pubDate>
		<guid isPermaLink="false">http://www.daniel-lemire.com/blog/?p=2107#comment-51335</guid>
		<description>The idea of using the Metropolis-Hastings algorithm came to mind.

Say your web graph is the state graph of your Markov chain, and you want to sample from it at random. One idea is to start from an initial seed and run the Metropolis-Hastings algorithm with a burn-in period; it will give you a random state, normally distributed around your seed.

A similar idea can be used for a uniform distribution :)</description>
		<content:encoded><![CDATA[<p>The idea of using the Metropolis-Hastings algorithm came to mind.</p>
<p>Say your web graph is the state graph of your Markov chain, and you want to sample from it at random. One idea is to start from an initial seed and run the Metropolis-Hastings algorithm with a burn-in period; it will give you a random state, normally distributed around your seed.</p>
<p>A similar idea can be used for a uniform distribution <img src='http://lemire.me/blog/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' /> </p>
]]></content:encoded>
	</item>
	<item>
		<title>By: David Buttler</title>
		<link>http://lemire.me/blog/archives/2009/08/14/picking-a-web-page-at-random-on-the-web/comment-page-1/#comment-51334</link>
		<dc:creator>David Buttler</dc:creator>
		<pubDate>Fri, 14 Aug 2009 16:06:35 +0000</pubDate>
		<guid isPermaLink="false">http://www.daniel-lemire.com/blog/?p=2107#comment-51334</guid>
		<description>I think that a random walk can be a bad idea if implemented naively. If memory serves, the web has, at a high level, three types of pages:
1) pages with only out links
2) pages with only in links
3) pages with both

This concept extends beyond single pages to entire subgraphs. Once you get into some parts of the graph, you can never get out, and your random walker is trapped. However, without a very large index, it would be difficult to identify these parts of the graph.

Another alternative is to buy a copy of the web. See ClueWeb09 for 1B web pages for less than $1K.</description>
		<content:encoded><![CDATA[<p>I think that a random walk can be a bad idea if implemented naively. If memory serves, the web has, at a high level, three types of pages:<br />
1) pages with only out links<br />
2) pages with only in links<br />
3) pages with both</p>
<p>This concept extends beyond single pages to entire subgraphs. Once you get into some parts of the graph, you can never get out, and your random walker is trapped. However, without a very large index, it would be difficult to identify these parts of the graph.</p>
<p>Another alternative is to buy a copy of the web. See ClueWeb09 for 1B web pages for less than $1K.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Chris Brew</title>
		<link>http://lemire.me/blog/archives/2009/08/14/picking-a-web-page-at-random-on-the-web/comment-page-1/#comment-51333</link>
		<dc:creator>Chris Brew</dc:creator>
		<pubDate>Fri, 14 Aug 2009 14:13:26 +0000</pubDate>
		<guid isPermaLink="false">http://www.daniel-lemire.com/blog/?p=2107#comment-51333</guid>
		<description>Also see this paper:

http://www9.org/w9cdrom/88/88.html

It includes methods for balancing out the tendency of random walks to reach well-connected pages more often than poorly connected ones.</description>
		<content:encoded><![CDATA[<p>Also see this paper:</p>
<p><a href="http://www9.org/w9cdrom/88/88.html" rel="nofollow">http://www9.org/w9cdrom/88/88.html</a></p>
<p>It includes methods for balancing out the tendency of random walks to reach well-connected pages more often than poorly connected ones.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: dan g</title>
		<link>http://lemire.me/blog/archives/2009/08/14/picking-a-web-page-at-random-on-the-web/comment-page-1/#comment-51332</link>
		<dc:creator>dan g</dc:creator>
		<pubDate>Fri, 14 Aug 2009 13:49:20 +0000</pubDate>
		<guid isPermaLink="false">http://www.daniel-lemire.com/blog/?p=2107#comment-51332</guid>
		<description>One idea may be to query a search engine for a single letter instead of a list of words, and then choose a random URL from a random page of the entire result set.

That may avoid the issue of having the search engine assign a &quot;relevancy&quot; score, as well as being biased toward sites with specific words.</description>
		<content:encoded><![CDATA[<p>One idea may be to query a search engine for a single letter instead of a list of words, and then choose a random URL from a random page of the entire result set.</p>
<p>That may avoid the issue of having the search engine assign a &#8220;relevancy&#8221; score, as well as being biased toward sites with specific words.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Anonymous</title>
		<link>http://lemire.me/blog/archives/2009/08/14/picking-a-web-page-at-random-on-the-web/comment-page-1/#comment-51331</link>
		<dc:creator>Anonymous</dc:creator>
		<pubDate>Fri, 14 Aug 2009 13:47:43 +0000</pubDate>
		<guid isPermaLink="false">http://www.daniel-lemire.com/blog/?p=2107#comment-51331</guid>
		<description>If you assume that what Google, Bing, or any other search engine indexes is representative of the web, then you can use MCMC sampling methods to produce a random sample of documents from a search engine. See http://portal.acm.org/citation.cfm?id=1411509.1411514</description>
		<content:encoded><![CDATA[<p>If you assume that what Google, Bing, or any other search engine indexes is representative of the web, then you can use MCMC sampling methods to produce a random sample of documents from a search engine. See <a href="http://portal.acm.org/citation.cfm?id=1411509.1411514" rel="nofollow">http://portal.acm.org/citation.cfm?id=1411509.1411514</a></p>
]]></content:encoded>
	</item>
	<item>
		<title>By: thrill</title>
		<link>http://lemire.me/blog/archives/2009/08/14/picking-a-web-page-at-random-on-the-web/comment-page-1/#comment-51330</link>
		<dc:creator>thrill</dc:creator>
		<pubDate>Fri, 14 Aug 2009 13:37:55 +0000</pubDate>
		<guid isPermaLink="false">http://www.daniel-lemire.com/blog/?p=2107#comment-51330</guid>
		<description>Rather than picking common words, I have done this by picking one to three random words and then using Google&#039;s &quot;lucky&quot; button (via a direct URL call). This *appears* to give many more unusual pages better visibility - and it&#039;s amusing to wonder how certain random combinations result in redirection to some pages.</description>
		<content:encoded><![CDATA[<p>Rather than picking common words, I have done this by picking one to three random words and then using Google&#8217;s &#8220;lucky&#8221; button (via a direct URL call). This *appears* to give many more unusual pages better visibility &#8211; and it&#8217;s amusing to wonder how certain random combinations result in redirection to some pages.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Thomas Deselaers</title>
		<link>http://lemire.me/blog/archives/2009/08/14/picking-a-web-page-at-random-on-the-web/comment-page-1/#comment-51329</link>
		<dc:creator>Thomas Deselaers</dc:creator>
		<pubDate>Fri, 14 Aug 2009 13:37:43 +0000</pubDate>
		<guid isPermaLink="false">http://www.daniel-lemire.com/blog/?p=2107#comment-51329</guid>
		<description>David&#039;s suggestion is not really random, as you will only be able to find pages that have been bit.ly&#039;d, and this is probably mainly true for relatively new sites.</description>
		<content:encoded><![CDATA[<p>David&#8217;s suggestion is not really random, as you will only be able to find pages that have been bit.ly&#8217;d, and this is probably mainly true for relatively new sites.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Thomas Deselaers</title>
		<link>http://lemire.me/blog/archives/2009/08/14/picking-a-web-page-at-random-on-the-web/comment-page-1/#comment-51328</link>
		<dc:creator>Thomas Deselaers</dc:creator>
		<pubDate>Fri, 14 Aug 2009 13:36:09 +0000</pubDate>
		<guid isPermaLink="false">http://www.daniel-lemire.com/blog/?p=2107#comment-51328</guid>
		<description>For Facebook users, it is quite easy to pick one at random:

1. pick a random number $RANDOMNUMBER

2. check if there is somebody at 
 http://www.facebook.com/home.php#/profile.php?id=$RANDOMNUMBER

3. if yes: done,
   if not,  goto 1</description>
		<content:encoded><![CDATA[<p>For Facebook users, it is quite easy to pick one at random:</p>
<p>1. pick a random number $RANDOMNUMBER</p>
<p>2. check if there is somebody at<br />
 <a href="http://www.facebook.com/home.php#/profile.php?id=$RANDOMNUMBER" rel="nofollow">http://www.facebook.com/home.php#/profile.php?id=$RANDOMNUMBER</a></p>
<p>3. if yes: done,<br />
   if not,  goto 1</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: David</title>
		<link>http://lemire.me/blog/archives/2009/08/14/picking-a-web-page-at-random-on-the-web/comment-page-1/#comment-51327</link>
		<dc:creator>David</dc:creator>
		<pubDate>Fri, 14 Aug 2009 13:34:12 +0000</pubDate>
		<guid isPermaLink="false">http://www.daniel-lemire.com/blog/?p=2107#comment-51327</guid>
		<description>Generate a random bit.ly URL. It usually ends with 4 to 6 letters and digits.

For instance, this page is: http://bit.ly/KSZ44

But it&#039;s probably biased toward geek topics.</description>
		<content:encoded><![CDATA[<p>Generate a random bit.ly URL. It usually ends with 4 to 6 letters and digits.</p>
<p>For instance, this page is: <a href="http://bit.ly/KSZ44" rel="nofollow">http://bit.ly/KSZ44</a></p>
<p>But it&#8217;s probably biased toward geek topics.</p>
]]></content:encoded>
	</item>
</channel>
</rss>

