<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
		>
<channel>
	<title>Comments on: Run-length encoding (part 3)</title>
	<atom:link href="http://lemire.me/blog/archives/2009/12/09/run-length-encoding-part-3/feed/" rel="self" type="application/rss+xml" />
	<link>http://lemire.me/blog/archives/2009/12/09/run-length-encoding-part-3/</link>
	<description>Computer Scientist and Open Scholar: Databases, Information Retrieval, Business Intelligence.</description>
	<lastBuildDate>Tue, 07 Feb 2012 23:39:16 +0000</lastBuildDate>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.1</generator>
	<item>
		<title>By: Willfred</title>
		<link>http://lemire.me/blog/archives/2009/12/09/run-length-encoding-part-3/comment-page-1/#comment-52013</link>
		<dc:creator>Willfred</dc:creator>
		<pubDate>Tue, 15 Dec 2009 00:20:02 +0000</pubDate>
		<guid isPermaLink="false">http://www.daniel-lemire.com/blog/?p=2312#comment-52013</guid>
		<description>Interesting, simple and effective.</description>
		<content:encoded><![CDATA[<p>Interesting, simple and effective.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Kevembuangga</title>
		<link>http://lemire.me/blog/archives/2009/12/09/run-length-encoding-part-3/comment-page-1/#comment-52011</link>
		<dc:creator>Kevembuangga</dc:creator>
		<pubDate>Sat, 12 Dec 2009 17:25:52 +0000</pubDate>
		<guid isPermaLink="false">http://www.daniel-lemire.com/blog/?p=2312#comment-52011</guid>
		<description>&lt;i&gt;I think we are talking about the same thing. &lt;/i&gt;

Then &lt;b&gt;you&lt;/b&gt; missed my point!
I am not proposing a scheme that will resolve the multi-column sorting conundrum any better than your (or anyone else&#039;s) RLE compression.
What I am saying is that, when working within any one column during a complex search, it is possible to reduce the computational load so that it is proportional only to the compressed size of the intermediate result (the set of row IDs) for this column, &lt;b&gt;NOT&lt;/b&gt; to the compressed size of the whole column.
Though that is of practical import only if there isn&#039;t a constant multiplicative factor hidden somewhere in the algorithm I envision that would spoil the better asymptotic performance; many a &quot;theoretically best&quot; algorithm is crippled by this.
All this of course assuming that we are still speaking about columns for which &lt;a href=&quot;http://www.daniel-lemire.com/blog/archives/2009/11/12/which-should-you-pick-a-bitmap-index-or-a-b-tree/&quot; rel=&quot;nofollow&quot;&gt;a bitmap is to be preferred&lt;/a&gt;.</description>
		<content:encoded><![CDATA[<p><i>I think we are talking about the same thing. </i></p>
<p>Then <b>you</b> missed my point!<br />
I am not proposing a scheme that will resolve the multi-column sorting conundrum any better than your (or anyone else&#8217;s) RLE compression.<br />
What I am saying is that, when working within any one column during a complex search, it is possible to reduce the computational load so that it is proportional only to the compressed size of the intermediate result (the set of row IDs) for this column, <b>NOT</b> to the compressed size of the whole column.<br />
Though that is of practical import only if there isn&#8217;t a constant multiplicative factor hidden somewhere in the algorithm I envision that would spoil the better asymptotic performance; many a &#8220;theoretically best&#8221; algorithm is crippled by this.<br />
All this of course assuming that we are still speaking about columns for which <a href="http://www.daniel-lemire.com/blog/archives/2009/11/12/which-should-you-pick-a-bitmap-index-or-a-b-tree/" rel="nofollow">a bitmap is to be preferred</a>.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Daniel Lemire</title>
		<link>http://lemire.me/blog/archives/2009/12/09/run-length-encoding-part-3/comment-page-1/#comment-52010</link>
		<dc:creator>Daniel Lemire</dc:creator>
		<pubDate>Sat, 12 Dec 2009 15:28:49 +0000</pubDate>
		<guid isPermaLink="false">http://www.daniel-lemire.com/blog/?p=2312#comment-52010</guid>
		<description>@Kevembuangga Yes, I think we are talking about the same thing. If you only focus on one column at a time, you can indeed index things up very nicely using a (compressed) B-tree or hash table. But we are dealing with a multidimensional list... a table with several columns... that&#039;s much harder to index.</description>
		<content:encoded><![CDATA[<p>@Kevembuangga Yes, I think we are talking about the same thing. If you only focus on one column at a time, you can indeed index things up very nicely using a (compressed) B-tree or hash table. But we are dealing with a multidimensional list&#8230; a table with several columns&#8230; that&#8217;s much harder to index.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Kevembuangga</title>
		<link>http://lemire.me/blog/archives/2009/12/09/run-length-encoding-part-3/comment-page-1/#comment-52009</link>
		<dc:creator>Kevembuangga</dc:creator>
		<pubDate>Sat, 12 Dec 2009 05:35:41 +0000</pubDate>
		<guid isPermaLink="false">http://www.daniel-lemire.com/blog/?p=2312#comment-52009</guid>
		<description>&lt;i&gt;The curse of dimensionality ...&lt;/i&gt;

Looks like we are not talking about the same topic, never mind.</description>
		<content:encoded><![CDATA[<p><i>The curse of dimensionality &#8230;</i></p>
<p>Looks like we are not talking about the same topic, never mind.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Daniel Lemire</title>
		<link>http://lemire.me/blog/archives/2009/12/09/run-length-encoding-part-3/comment-page-1/#comment-51990</link>
		<dc:creator>Daniel Lemire</dc:creator>
		<pubDate>Fri, 11 Dec 2009 19:09:18 +0000</pubDate>
		<guid isPermaLink="false">http://www.daniel-lemire.com/blog/?p=2312#comment-51990</guid>
		<description>@Kevembuangga

&lt;em&gt;if you were really clever you could have the running time a linear function of ONLY the size of the compressed output&lt;/em&gt;

The curse of dimensionality makes this cleverness extremely difficult unless you expect specific queries.

Consider these queries for example:

&lt;code&gt;
select X*Y*Z where (X*Y-Z&lt;W);
select * where (X*X+Y*Y&lt;Z*Z+W*W+V*V);
&lt;/code&gt;

Certainly, you can find ways to index them. But if I am free to come up with others, you will eventually get in serious trouble.

The lesson here is that there is not one indexing strategy that will always work. 


You may enjoy this earlier blog post:


Understanding what makes database indexes work
http://www.daniel-lemire.com/blog/archives/2008/11/07/understanding-what-makes-database-indexes-really-work/</description>
		<content:encoded><![CDATA[<p>@Kevembuangga</p>
<p><em>if you were really clever you could have the running time a linear function of ONLY the size of the compressed output</em></p>
<p>The curse of dimensionality makes this cleverness extremely difficult unless you expect specific queries.</p>
<p>Consider these queries for example:</p>
<p><code><br />
select X*Y*Z where (X*Y-Z&lt;W);<br />
select * where (X*X+Y*Y&lt;Z*Z+W*W+V*V);<br />
</code></p>
<p>Certainly, you can find ways to index them. But if I am free to come up with others, you will eventually get in serious trouble.</p>
<p>The lesson here is that there is not one indexing strategy that will always work. </p>
<p>You may enjoy this earlier blog post:</p>
<p>Understanding what makes database indexes work<br />
<a href="http://www.daniel-lemire.com/blog/archives/2008/11/07/understanding-what-makes-database-indexes-really-work/" rel="nofollow">http://www.daniel-lemire.com/blog/archives/2008/11/07/understanding-what-makes-database-indexes-really-work/</a></p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Kevembuangga</title>
		<link>http://lemire.me/blog/archives/2009/12/09/run-length-encoding-part-3/comment-page-1/#comment-51989</link>
		<dc:creator>Kevembuangga</dc:creator>
		<pubDate>Fri, 11 Dec 2009 18:33:14 +0000</pubDate>
		<guid isPermaLink="false">http://www.daniel-lemire.com/blog/?p=2312#comment-51989</guid>
		<description>Typo: &quot;each possible &lt;i&gt;column&lt;/i&gt; value&quot; instead of &quot;each possible row value&quot;</description>
		<content:encoded><![CDATA[<p>Typo: &#8220;each possible <i>column</i> value&#8221; instead of &#8220;each possible row value&#8221;</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Kevembuangga</title>
		<link>http://lemire.me/blog/archives/2009/12/09/run-length-encoding-part-3/comment-page-1/#comment-51988</link>
		<dc:creator>Kevembuangga</dc:creator>
		<pubDate>Fri, 11 Dec 2009 18:26:11 +0000</pubDate>
		<guid isPermaLink="false">http://www.daniel-lemire.com/blog/?p=2312#comment-51988</guid>
		<description>&lt;i&gt;I can scan RLE-compressed data looking for the row IDs I need without ever decompressing them.  So, the running time is a linear function of the compressed size of my input. &lt;/i&gt;

Mmmmmm... I see, but you still do run through the whole column. If you were &lt;b&gt;really&lt;/b&gt; clever you could have the running time a linear function of &lt;b&gt;ONLY the size of the compressed output&lt;/b&gt; (the number of rows matching the sought-after value(s)):
Instead of storing some kind of map of the row values, store &lt;i&gt;for each possible row value (cardinality)&lt;/i&gt; a compressed bit map of the corresponding row IDs.
So whenever you know which values you are looking for, you only pick and process the corresponding bit maps.
Compressed sparse bit maps can be cheap...</description>
		<content:encoded><![CDATA[<p><i>I can scan RLE-compressed data looking for the row IDs I need without ever decompressing them.  So, the running time is a linear function of the compressed size of my input. </i></p>
<p>Mmmmmm&#8230; I see, but you still do run through the whole column. If you were <b>really</b> clever you could have the running time a linear function of <b>ONLY the size of the compressed output</b> (the number of rows matching the sought-after value(s)):<br />
Instead of storing some kind of map of the row values, store <i>for each possible row value (cardinality)</i> a compressed bit map of the corresponding row IDs.<br />
So whenever you know which values you are looking for, you only pick and process the corresponding bit maps.<br />
Compressed sparse bit maps can be cheap&#8230;</p>
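<p>A minimal sketch of the per-value idea above, not a definitive implementation. It assumes the compressed bit map for each value is stored as a sorted list of half-open row-ID ranges; the names <code>build_index</code> and <code>Span</code> are hypothetical and chosen only for illustration.</p>

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Span = Tuple[int, int]  # half-open range of row IDs [start, end), assumed layout


def build_index(column: List[str]) -> Dict[str, List[Span]]:
    """Map each distinct column value to its row IDs, stored as sorted spans."""
    index: Dict[str, List[Span]] = defaultdict(list)
    for row, value in enumerate(column):
        spans = index[value]
        if spans and spans[-1][1] == row:
            # The previous row held the same value: extend the current run.
            spans[-1] = (spans[-1][0], row + 1)
        else:
            # Otherwise start a new run for this value.
            spans.append((row, row + 1))
    return dict(index)


index = build_index(["A", "A", "B", "A", "C", "C"])
print(index["A"])  # a query on "A" reads only this entry
```

<p>Building the index still reads the whole column once; the saving comes at query time, when only the entries for the requested values are touched.</p>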
]]></content:encoded>
	</item>
	<item>
		<title>By: Daniel Lemire</title>
		<link>http://lemire.me/blog/archives/2009/12/09/run-length-encoding-part-3/comment-page-1/#comment-51987</link>
		<dc:creator>Daniel Lemire</dc:creator>
		<pubDate>Fri, 11 Dec 2009 17:09:00 +0000</pubDate>
		<guid isPermaLink="false">http://www.daniel-lemire.com/blog/?p=2312#comment-51987</guid>
		<description>@Kevembuangga  

Thank you for the pointers.

&lt;i&gt;You would have to run the inverse BWT on each whole column, this is just another stage of decompression (...)&lt;/i&gt; 

Who said anything about decompressing the data?

I can scan RLE-compressed data looking for the row IDs I need without ever decompressing them. So, the running time is a linear function of the compressed size of my input. (Times some factor which depends on the number of columns.)

If I first decompress the data, then my processing time will be a function of the real (uncompressed) data size. I save on storage but not on processing time. In fact, the processing time will be longer than if I had worked with uncompressed data (not accounting for possible IO savings).

And I am not even discussing the processing overhead of computing several BWT.</description>
		<content:encoded><![CDATA[<p>@Kevembuangga  </p>
<p>Thank you for the pointers.</p>
<p><i>You would have to run the inverse BWT on each whole column, this is just another stage of decompression (&#8230;)</i> </p>
<p>Who said anything about decompressing the data?</p>
<p>I can scan RLE-compressed data looking for the row IDs I need without ever decompressing them. So, the running time is a linear function of the compressed size of my input. (Times some factor which depends on the number of columns.)</p>
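<p>A minimal sketch of such a scan, assuming a single column stored as a list of (value, run length) pairs; the layout and the name <code>matching_spans</code> are hypothetical. The loop takes one step per run rather than one per row, so the work grows with the compressed size.</p>

```python
from typing import List, Tuple

Run = Tuple[str, int]   # (value, run length), assumed storage layout
Span = Tuple[int, int]  # half-open range of row IDs [start, end)


def matching_spans(runs: List[Run], wanted: str) -> List[Span]:
    """Return the row-ID ranges where the column equals `wanted`."""
    spans = []
    row = 0  # row ID of the first row covered by the current run
    for value, length in runs:  # one iteration per run, never per row
        if value == wanted:
            spans.append((row, row + length))
        row += length
    return spans


column = [("A", 1000), ("B", 3), ("A", 2), ("C", 500)]
print(matching_spans(column, "A"))  # prints [(0, 1000), (1003, 1005)]
```

<p>Here 1505 rows are covered in four steps, and the result itself stays in compressed form.</p>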
<p>If I first decompress the data, then my processing time will be a function of the real (uncompressed) data size. I save on storage but not on processing time. In fact, the processing time will be longer than if I had worked with uncompressed data (not accounting for possible IO savings).</p>
<p>And I am not even discussing the processing overhead of computing several BWT.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Kevembuangga</title>
		<link>http://lemire.me/blog/archives/2009/12/09/run-length-encoding-part-3/comment-page-1/#comment-51986</link>
		<dc:creator>Kevembuangga</dc:creator>
		<pubDate>Fri, 11 Dec 2009 16:36:40 +0000</pubDate>
		<guid isPermaLink="false">http://www.daniel-lemire.com/blog/?p=2312#comment-51986</guid>
		<description>&lt;i&gt;How do you extract these row IDs quickly&lt;/i&gt;

You would have to run the inverse BWT on each whole column, this is just another stage of decompression, unless there is a very clever hack (on a level of this &lt;a href=&quot;http://www.codemaestro.com/reviews/9&quot; rel=&quot;nofollow&quot;&gt;kind&lt;/a&gt;!) to peek at the ID values from the encoded BWT.
Do you mean that with simple RLE you can pick the row IDs corresponding to the Xs and Ys somehow by &quot;direct access&quot; from the compressed columns without actually running through the whole columns?
And don&#039;t you use boolean operations on bit maps to compute such sets of IDs?

Anyway, maybe you should forget about Burrows-Wheeler, which is overrated because of its good performance on text but is bad for Markov sources, which is what column data most probably are.
See &lt;a href=&quot;http://www.math.tau.ac.il/~haimk/pubs.html&quot; rel=&quot;nofollow&quot;&gt;Most Burrows-Wheeler based compressors are not optimal&lt;/a&gt;
H. Kaplan and E. Verbin, CPM&#039;07</description>
		<content:encoded><![CDATA[<p><i>How do you extract these row IDs quickly</i></p>
<p>You would have to run the inverse BWT on each whole column, this is just another stage of decompression, unless there is a very clever hack (on a level of this <a href="http://www.codemaestro.com/reviews/9" rel="nofollow">kind</a>!) to peek at the ID values from the encoded BWT.<br />
Do you mean that with simple RLE you can pick the row IDs corresponding to the Xs and Ys somehow by &#8220;direct access&#8221; from the compressed columns without actually running through the whole columns?<br />
And don&#8217;t you use boolean operations on bit maps to compute such sets of IDs?</p>
<p>Anyway, maybe you should forget about Burrows-Wheeler, which is overrated because of its good performance on text but is bad for Markov sources, which is what column data most probably are.<br />
See <a href="http://www.math.tau.ac.il/~haimk/pubs.html" rel="nofollow">Most Burrows-Wheeler based compressors are not optimal</a><br />
H. Kaplan and E. Verbin, CPM&#8217;07</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Daniel Lemire</title>
		<link>http://lemire.me/blog/archives/2009/12/09/run-length-encoding-part-3/comment-page-1/#comment-51985</link>
		<dc:creator>Daniel Lemire</dc:creator>
		<pubDate>Fri, 11 Dec 2009 12:39:52 +0000</pubDate>
		<guid isPermaLink="false">http://www.daniel-lemire.com/blog/?p=2312#comment-51985</guid>
		<description>@Kevembuangga Suppose you want to get all row IDs where the first column has value X and the second column has value Y. How do you extract these row IDs quickly, if the columns were reordered independently?</description>
		<content:encoded><![CDATA[<p>@Kevembuangga Suppose you want to get all row IDs where the first column has value X and the second column has value Y. How do you extract these row IDs quickly, if the columns were reordered independently?</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Kevembuangga</title>
		<link>http://lemire.me/blog/archives/2009/12/09/run-length-encoding-part-3/comment-page-1/#comment-51984</link>
		<dc:creator>Kevembuangga</dc:creator>
		<pubDate>Fri, 11 Dec 2009 07:38:09 +0000</pubDate>
		<guid isPermaLink="false">http://www.daniel-lemire.com/blog/?p=2312#comment-51984</guid>
		<description>&lt;i&gt;makes reconstructing single rows a messy adventure&lt;/i&gt;

Besides having to &quot;guess&quot; the values of the given row&#039;s items from BW-reordered columns, how is that more &lt;i&gt;messy&lt;/i&gt; than having to unfold the RLE-compressed columns up to the row rank you are interested in?</description>
		<content:encoded><![CDATA[<p><i>makes reconstructing single rows a messy adventure</i></p>
<p>Besides having to &#8220;guess&#8221; the values of the given row&#8217;s items from BW-reordered columns, how is that more <i>messy</i> than having to unfold the RLE-compressed columns up to the row rank you are interested in?</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Daniel Lemire</title>
		<link>http://lemire.me/blog/archives/2009/12/09/run-length-encoding-part-3/comment-page-1/#comment-51975</link>
		<dc:creator>Daniel Lemire</dc:creator>
		<pubDate>Thu, 10 Dec 2009 12:57:22 +0000</pubDate>
		<guid isPermaLink="false">http://www.daniel-lemire.com/blog/?p=2312#comment-51975</guid>
		<description>@Kevembuangga
Yes, I thought about it... But reordering the columns independently makes reconstructing single rows a messy adventure.</description>
		<content:encoded><![CDATA[<p>@Kevembuangga<br />
Yes, I thought about it&#8230; But reordering the columns independently makes reconstructing single rows a messy adventure.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Kevembuangga</title>
		<link>http://lemire.me/blog/archives/2009/12/09/run-length-encoding-part-3/comment-page-1/#comment-51973</link>
		<dc:creator>Kevembuangga</dc:creator>
		<pubDate>Thu, 10 Dec 2009 06:56:35 +0000</pubDate>
		<guid isPermaLink="false">http://www.daniel-lemire.com/blog/?p=2312#comment-51973</guid>
		<description>But did you try Burrows–Wheeler within a single column irrespective of other columns/rows shuffling?</description>
		<content:encoded><![CDATA[<p>But did you try Burrows–Wheeler within a single column irrespective of other columns/rows shuffling?</p>
]]></content:encoded>
	</item>
</channel>
</rss>

