<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
		>
<channel>
	<title>Comments on: Parsing text files is CPU bound</title>
	<atom:link href="http://lemire.me/blog/archives/2008/12/08/parsing-text-files-is-cpu-bound/feed/" rel="self" type="application/rss+xml" />
	<link>http://lemire.me/blog/archives/2008/12/08/parsing-text-files-is-cpu-bound/</link>
	<description>Computer Scientist and Open Scholar: Databases, Information Retrieval, Business Intelligence.</description>
	<lastBuildDate>Tue, 07 Feb 2012 23:39:16 +0000</lastBuildDate>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.1</generator>
	<item>
		<title>By: Steven</title>
		<link>http://lemire.me/blog/archives/2008/12/08/parsing-text-files-is-cpu-bound/comment-page-1/#comment-50353</link>
		<dc:creator>Steven</dc:creator>
		<pubDate>Thu, 18 Dec 2008 09:54:50 +0000</pubDate>
		<guid isPermaLink="false">http://www.daniel-lemire.com/blog/?p=1619#comment-50353</guid>
		<description>If I had to wager a guess, I&#039;d say that all the memory management for storing the results is to blame - malloc()/free() are notoriously slow, so calling them the hundreds of millions of times a multi-gigabyte file requires is very likely the problem.  Try using a static structure for output, or else just discard the output entirely (still computing it, of course), and see what happens.</description>
		<content:encoded><![CDATA[<p>If I had to wager a guess, I&#8217;d say that all the memory management for storing the results is to blame &#8211; malloc()/free() are notoriously slow, so calling them the hundreds of millions of times a multi-gigabyte file requires is very likely the problem.  Try using a static structure for output, or else just discard the output entirely (still computing it, of course), and see what happens.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Preston L. Bannister</title>
		<link>http://lemire.me/blog/archives/2008/12/08/parsing-text-files-is-cpu-bound/comment-page-1/#comment-50347</link>
		<dc:creator>Preston L. Bannister</dc:creator>
		<pubDate>Fri, 12 Dec 2008 20:10:33 +0000</pubDate>
		<guid isPermaLink="false">http://www.daniel-lemire.com/blog/?p=1619#comment-50347</guid>
		<description>Matt, you will find I am very intentional in my use of language. :)

What is the definition of sanity? Sanity is defined relative to what most people do most of the time. (Which might seem a bit odd - but that is another discussion.) 

When coming up with a solution, you have to make &quot;sane&quot; assumptions about the design space. You cannot assume unlikely resources (at least not usually). You have to assume what most of the target hardware will have, most of the time.

If we are interested in the performance of parsing large files, that would be because we *have* large files. Most likely we have a series of very large files, generated as a part of some repeating process. (Otherwise the problem becomes uninteresting.)

Combine the above two considerations with the cost of SSD storage, and you have an awkward fit. Also, after poking around a bit on the web, it looks as though the *sustained* read rates for SSDs are currently around 40-100 MB/s. (We are not interested in burst rates.)

Yes, there have always been high-end (and expensive!) storage systems that deliver much higher than average performance. If you are designing a solution for a general population of machines, then it would be insane to assume performance present in only a very small minority.

So the use of &quot;insanely fast&quot; can carry exactly the right freight when talking about a design. :)</description>
		<content:encoded><![CDATA[<p>Matt, you will find I am very intentional in my use of language. <img src='http://lemire.me/blog/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' /> </p>
<p>What is the definition of sanity? Sanity is defined relative to what most people do most of the time. (Which might seem a bit odd &#8211; but that is another discussion.) </p>
<p>When coming up with a solution, you have to make &#8220;sane&#8221; assumptions about the design space. You cannot assume unlikely resources (at least not usually). You have to assume what most of the target hardware will have, most of the time.</p>
<p>If we are interested in the performance of parsing large files, that would be because we *have* large files. Most likely we have a series of very large files, generated as a part of some repeating process. (Otherwise the problem becomes uninteresting.)</p>
<p>Combine the above two considerations with the cost of SSD storage, and you have an awkward fit. Also, after poking around a bit on the web, it looks as though the *sustained* read rates for SSDs are currently around 40-100 MB/s. (We are not interested in burst rates.)</p>
<p>Yes, there have always been high-end (and expensive!) storage systems that deliver much higher than average performance. If you are designing a solution for a general population of machines, then it would be insane to assume performance present in only a very small minority.</p>
<p>So the use of &#8220;insanely fast&#8221; can carry exactly the right freight when talking about a design. <img src='http://lemire.me/blog/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' /> </p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Matt</title>
		<link>http://lemire.me/blog/archives/2008/12/08/parsing-text-files-is-cpu-bound/comment-page-1/#comment-50346</link>
		<dc:creator>Matt</dc:creator>
		<pubDate>Wed, 10 Dec 2008 21:48:22 +0000</pubDate>
		<guid isPermaLink="false">http://www.daniel-lemire.com/blog/?p=1619#comment-50346</guid>
		<description>&lt;i&gt;Unless you have an insanely fast disk subsystem (not something you find in usual desktops), reading anything as simple as CSV files should be very I/O-bound, and not CPU-bound.&lt;/i&gt;

Please refrain from using words such as &quot;insanely&quot; :-)  Let&#039;s put a number on it, shall we?  Suppose we had one of these new SSDs doing 200 MB/s.
On a 2 GHz CPU you would have 10 cycles to process a single byte of data (2 billion cycles per second divided by 200 million bytes per second).  That is not very much, considering that within those 10 cycles you need to handle string encoding (UTF-8, etc.) and integer, number, date and Boolean encoding.

Looking at these numbers you see WHY reading CSV files is nearly always CPU-bound for high end storage sub-systems.</description>
		<content:encoded><![CDATA[<p><i>Unless you have an insanely fast disk subsystem (not something you find in usual desktops), reading anything as simple as CSV files should be very I/O-bound, and not CPU-bound.</i></p>
<p>Please refrain from using words such as &#8220;insanely&#8221; <img src='http://lemire.me/blog/wp-includes/images/smilies/icon_smile.gif' alt=':-)' class='wp-smiley' />   Let&#8217;s put a number on it, shall we?  Suppose we had one of these new SSDs doing 200 MB/s.<br />
On a 2 GHz CPU you would have 10 cycles to process a single byte of data (2 billion cycles per second divided by 200 million bytes per second).  That is not very much, considering that within those 10 cycles you need to handle string encoding (UTF-8, etc.) and integer, number, date and Boolean encoding.</p>
<p>Looking at these numbers you see WHY reading CSV files is nearly always CPU-bound for high end storage sub-systems.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: KWillets</title>
		<link>http://lemire.me/blog/archives/2008/12/08/parsing-text-files-is-cpu-bound/comment-page-1/#comment-50344</link>
		<dc:creator>KWillets</dc:creator>
		<pubDate>Wed, 10 Dec 2008 20:31:04 +0000</pubDate>
		<guid isPermaLink="false">http://www.daniel-lemire.com/blog/?p=1619#comment-50344</guid>
		<description>I can&#039;t find the code.</description>
		<content:encoded><![CDATA[<p>I can&#8217;t find the code.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: John Schneider</title>
		<link>http://lemire.me/blog/archives/2008/12/08/parsing-text-files-is-cpu-bound/comment-page-1/#comment-50343</link>
		<dc:creator>John Schneider</dc:creator>
		<pubDate>Tue, 09 Dec 2008 18:18:56 +0000</pubDate>
		<guid isPermaLink="false">http://www.daniel-lemire.com/blog/?p=1619#comment-50343</guid>
		<description>Actually, the binary XML people stress both compactness and processing efficiency. Some of the binary XML use cases are CPU bound because they have fast networks that are not congested. Others are I/O bound because they have slow, wireless, or congested networks.

The EXI evaluation document you referenced is an early draft that does not yet include processing efficiency measurements. The next draft will include these measurements. In the meantime, see &lt;a href=&quot;http://www.w3.org/TR/exi-measurements/&quot; rel=&quot;nofollow&quot;&gt;http://www.w3.org/TR/exi-measurements/&lt;/a&gt; for a complete collection of binary XML compactness and processing efficiency measurements, including some taken over high-speed and wireless networks. 

Also, see &lt;a href=&quot;http://www.agiledelta.com/efx_perffeatures.html&quot; rel=&quot;nofollow&quot;&gt;http://www.agiledelta.com/efx_perffeatures.html&lt;/a&gt; for compactness and processing speed measurements from a commercial EXI implementation.

  All the best,

  John</description>
		<content:encoded><![CDATA[<p>Actually, the binary XML people stress both compactness and processing efficiency. Some of the binary XML use cases are CPU bound because they have fast networks that are not congested. Others are I/O bound because they have slow, wireless, or congested networks.</p>
<p>The EXI evaluation document you referenced is an early draft that does not yet include processing efficiency measurements. The next draft will include these measurements. In the meantime, see <a href="http://www.w3.org/TR/exi-measurements/" rel="nofollow">http://www.w3.org/TR/exi-measurements/</a> for a complete collection of binary XML compactness and processing efficiency measurements, including some taken over high-speed and wireless networks. </p>
<p>Also, see <a href="http://www.agiledelta.com/efx_perffeatures.html" rel="nofollow">http://www.agiledelta.com/efx_perffeatures.html</a> for compactness and processing speed measurements from a commercial EXI implementation.</p>
<p>  All the best,</p>
<p>  John</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Daniel Lemire</title>
		<link>http://lemire.me/blog/archives/2008/12/08/parsing-text-files-is-cpu-bound/comment-page-1/#comment-50340</link>
		<dc:creator>Daniel Lemire</dc:creator>
		<pubDate>Tue, 09 Dec 2008 03:15:28 +0000</pubDate>
		<guid isPermaLink="false">http://www.daniel-lemire.com/blog/?p=1619#comment-50340</guid>
		<description>@Bannister 

&lt;i&gt;How big was your input file?&lt;/i&gt;

Several GiBs. I work with large files.

&lt;i&gt; You should be running any benchmark more than once to get consistent times. The input file must be bigger than memory so that it is not cached in memory by the operating system.&lt;/i&gt;

I observed this by running &quot;top -o cpu&quot; in another process while the program was running.

My observations are not meant to be scientific, but I can tell you that optimizing my CSV parsing code sped it up quite a bit. Had the process been &quot;very I/O bound&quot;, optimizing the code would not have mattered. My code is somewhere on Google Code (it is open source).

&lt;i&gt; Unless you have an insanely fast disk subsystem (not something you find in usual desktops), reading anything as simple as CSV files should be very I/O-bound, and not CPU-bound.&lt;/i&gt;

It should. Shouldn&#039;t it?

I made a blog post out of it because I find all of this puzzling. I&#039;d be glad to see someone do a follow-up analysis. Maybe someone could prove me wrong? I&#039;d like that.</description>
		<content:encoded><![CDATA[<p>@Bannister </p>
<p><i>How big was your input file?</i></p>
<p>Several GiBs. I work with large files.</p>
<p><i> You should be running any benchmark more than once to get consistent times. The input file must be bigger than memory so that it is not cached in memory by the operating system.</i></p>
<p>I observed this by running &#8220;top -o cpu&#8221; in another process while the program was running.</p>
<p>My observations are not meant to be scientific, but I can tell you that optimizing my CSV parsing code sped it up quite a bit. Had the process been &#8220;very I/O bound&#8221;, optimizing the code would not have mattered. My code is somewhere on Google Code (it is open source).</p>
<p><i> Unless you have an insanely fast disk subsystem (not something you find in usual desktops), reading anything as simple as CSV files should be very I/O-bound, and not CPU-bound.</i></p>
<p>It should. Shouldn&#8217;t it?</p>
<p>I made a blog post out of it because I find all of this puzzling. I&#8217;d be glad to see someone do a follow-up analysis. Maybe someone could prove me wrong? I&#8217;d like that.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Preston L. Bannister</title>
		<link>http://lemire.me/blog/archives/2008/12/08/parsing-text-files-is-cpu-bound/comment-page-1/#comment-50339</link>
		<dc:creator>Preston L. Bannister</dc:creator>
		<pubDate>Tue, 09 Dec 2008 02:21:36 +0000</pubDate>
		<guid isPermaLink="false">http://www.daniel-lemire.com/blog/?p=1619#comment-50339</guid>
		<description>How big was your input file?

You should be running any benchmark more than once to get consistent times. The input file must be bigger than memory so that it is not cached in memory by the operating system. 

Unless you have an insanely fast disk subsystem (not something you find in usual desktops), reading anything as simple as CSV files should be very I/O-bound, and not CPU-bound.</description>
		<content:encoded><![CDATA[<p>How big was your input file?</p>
<p>You should be running any benchmark more than once to get consistent times. The input file must be bigger than memory so that it is not cached in memory by the operating system. </p>
<p>Unless you have an insanely fast disk subsystem (not something you find in usual desktops), reading anything as simple as CSV files should be very I/O-bound, and not CPU-bound.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Daniel Lemire</title>
		<link>http://lemire.me/blog/archives/2008/12/08/parsing-text-files-is-cpu-bound/comment-page-1/#comment-50338</link>
		<dc:creator>Daniel Lemire</dc:creator>
		<pubDate>Mon, 08 Dec 2008 23:21:05 +0000</pubDate>
		<guid isPermaLink="false">http://www.daniel-lemire.com/blog/?p=1619#comment-50338</guid>
		<description>@KWillets

&lt;i&gt; Compression isn&#039;t just for I/O bandwidth. &lt;/i&gt;

I agree.

&lt;i&gt; Getting optimally-sized operands is also a benefit, although I haven&#039;t seen it researched much.&lt;/i&gt;


Compression can be used to reduce the number of CPU cycles used. Definitely.

&lt;i&gt; If you change the question from &quot;what is the minimum number of bits needed to store this data&quot; to &quot;what is the minimum number of bits needed to construct the output&quot;, e.g., a join or an intersection, it becomes more interesting.&lt;/i&gt;

The minimum size of the output is used as a lower bound on the complexity of the problem in algorithmic design. So, it can be used to show that you have an optimal algorithm.</description>
		<content:encoded><![CDATA[<p>@KWillets</p>
<p><i> Compression isn&#8217;t just for I/O bandwidth. </i></p>
<p>I agree.</p>
<p><i> Getting optimally-sized operands is also a benefit, although I haven&#8217;t seen it researched much.</i></p>
<p>Compression can be used to reduce the number of CPU cycles used. Definitely.</p>
<p><i> If you change the question from &#8220;what is the minimum number of bits needed to store this data&#8221; to &#8220;what is the minimum number of bits needed to construct the output&#8221;, e.g., a join or an intersection, it becomes more interesting.</i></p>
<p>The minimum size of the output is used as a lower bound on the complexity of the problem in algorithmic design. So, it can be used to show that you have an optimal algorithm.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: KWillets</title>
		<link>http://lemire.me/blog/archives/2008/12/08/parsing-text-files-is-cpu-bound/comment-page-1/#comment-50337</link>
		<dc:creator>KWillets</dc:creator>
		<pubDate>Mon, 08 Dec 2008 22:16:53 +0000</pubDate>
		<guid isPermaLink="false">http://www.daniel-lemire.com/blog/?p=1619#comment-50337</guid>
		<description>Compression isn&#039;t just for I/O bandwidth.  Getting optimally-sized operands is also a benefit, although I haven&#039;t seen it researched much.  

If you change the question from &quot;what is the minimum number of bits needed to store this data&quot; to &quot;what is the minimum number of bits needed to construct the output&quot;, e.g., a join or an intersection, it becomes more interesting.</description>
		<content:encoded><![CDATA[<p>Compression isn&#8217;t just for I/O bandwidth.  Getting optimally-sized operands is also a benefit, although I haven&#8217;t seen it researched much.  </p>
<p>If you change the question from &#8220;what is the minimum number of bits needed to store this data&#8221; to &#8220;what is the minimum number of bits needed to construct the output&#8221;, e.g., a join or an intersection, it becomes more interesting.</p>
]]></content:encoded>
	</item>
</channel>
</rss>

