I carry a pocketbook and a pen everywhere. At night, my pocketbook is by my bed. All creative workers should carry notebooks.

Organizing and collecting ideas are different tasks. My pocketbook is strictly for collection. Every few days, I start a new page: a list of reminders on one side, and diagrams on the other side. Important ideas get processed and stored on my laptop. I throw away used pocketbooks.

It is difficult to find quality pocketbooks. Here are my recommendations:

  • My pocketbook must last a few months. Paper must be thick and of good quality. I prefer unlined paper. I need a ribbon marker to quickly find the current page. Paperblanks makes good, inexpensive Pocket Companions that fit the bill. Alas, Paperblanks does not sell directly to customers. The retailer information on the Paperblanks website is helpful.
  • I prefer black gel pens. Refillable pens create less garbage and they are often of better quality. For about a year, I have had good luck with my Zebra gel pens.

Note: I do not profit in any way if you buy a Paperblanks pocketbook or Zebra pen.

In his most recent essay, After the credentials, Paul Graham tells us that in South Korea, “college entrance exams determine 70 to 80 percent of a person’s future.” Fortunately, the Americans know better: “Where you go to college still matters, but not like it used to.”

Paul writes good essays, but they are thin on research. How good a predictor of your success is your alma mater? The research is available. For example, in Regression and Matching Estimates of the Effects of Elite College Attendance on Career Outcomes, Brand and Halaby write:

Our results suggest that in terms of college quality, there is not only no direct effect on mid- and late-career attainment, but no significant effect at all. This study questions the consequential belief that an elite college education necessarily translates into privileged socioeconomic status throughout the life course.

To sum it up: If you are a privileged kid, you will do well even if you go to a local college.

Because my research budget for this blog is $0, I will do my own survey about a special job: the presidency in the USA and the office of prime minister in Canada. Do state leaders attend a small set of colleges?

Let us review where the American presidents got their first degree:

What about Canadian prime ministers?

Based on this evidence alone, if I were to coach a kid for a political career, I would ignore where he gets his degree. This makes sense. You become president or prime minister several years after earning your degree. By the time you have the experience required for the job, any college premium is gone.

See also my post The 2 myths getting students into ivy-league schools.

Disclaimer: I am a graduate of the University of Toronto, maybe the most prestigious university in Canada.

I am continuing my fun saga to determine whether parsing CSV files is CPU bound or I/O bound. Recall that I posted some C++ code and reported that it took 96 seconds of process time to parse a given 2GB CSV file and just 27 seconds to read the lines without parsing. Preston L. Bannister correctly pointed out that using the clock() function is wrong. So I updated my code using his ZTimer class instead. The new numbers are 103 seconds for the full parsing and 57 seconds to just read the lines without parsing the fields.
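To illustrate why clock() can mislead here, this is a minimal sketch that times the same piece of work two ways. It is not the ZTimer class, which I do not reproduce; the names Timing, time_both and simulate_waiting are made up for this example. It assumes a POSIX-like system, where std::clock reports processor time used by the process, so time spent blocked (for example, waiting on the disk) is largely left out.

```cpp
#include <chrono>
#include <ctime>
#include <thread>

// Elapsed times, in seconds, measured two ways around the same piece of work.
struct Timing {
    double cpu_seconds;   // processor time charged to the process (std::clock)
    double wall_seconds;  // real elapsed time (std::chrono::steady_clock)
};

// Runs `work` once and reports both CPU time and wall-clock time.
// (Hypothetical helper for illustration only.)
template <typename Work>
Timing time_both(Work work) {
    const std::clock_t cpu_start = std::clock();
    const auto wall_start = std::chrono::steady_clock::now();
    work();
    const auto wall_end = std::chrono::steady_clock::now();
    const std::clock_t cpu_end = std::clock();

    Timing t;
    t.cpu_seconds = static_cast<double>(cpu_end - cpu_start) / CLOCKS_PER_SEC;
    t.wall_seconds =
        std::chrono::duration<double>(wall_end - wall_start).count();
    return t;
}

// Stand-in for a process blocked on I/O: it waits without using the CPU.
inline void simulate_waiting() {
    std::this_thread::sleep_for(std::chrono::milliseconds(200));
}
```

Calling time_both(simulate_waiting) should report a wall-clock time of at least 0.2 seconds and, on a POSIX-like system, a much smaller CPU time. A timer based on clock() alone can therefore understate the cost of a step that mostly waits for the disk.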

Some anonymous reader claimed that my code was still grossly inefficient. I do not like arguing without evidence.

Ah! But Unix utilities can also parse CSV files. They are usually efficient. Let us use the cut command:


$ time cut -f 1,2,3,4 -d , ./netflix.csv > /dev/null
real 1m59.596s
user 1m53.163s
sys 0m3.775s

So, 120 seconds?

What about sorting the CSV file? Of course, it is a lot more expensive: 504 seconds.

$ time sort -t, ./netflix.csv > /dev/null
real 8m23.985s
user 2m28.855s
sys 1m1.467s

Finally, for a basis of comparison, let us just dump the file to /dev/null:

$ time cat ./netflix.csv > /dev/null
real 0m29.337s
user 0m0.029s
sys 0m2.541s

The final story:

parsing method          time elapsed
cat Unix command        29 s
Daniel’s line parser    57 s
Daniel’s CSV parser     103 s
cut Unix command        120 s
sort Unix command       504 s

Analysis: My C++ code is not grossly inefficient. If the I/O cost of reading the file is about 30 seconds, parsing it takes about 100 seconds. My preliminary conclusion is that parsing CSV files is more CPU bound than I/O bound.

(See update 2.)

In a recent blog post, I said that parsing simple CSV files could be CPU bound. By parsing, I mean reading the data on disk and copying it into an array. I also strip the field values of spurious white space.

You can find my C++ code on my server.

A reader criticized my implementation as follows:

  • I use the C++ getline function to read the lines. The reader commented that “getline does one heap allocation and copy for every line.” I doubt that getline performs a heap allocation each time it is called: I reuse the same string object for every call.
  • For each field value, I did two heap allocations and two copies. I now reuse the same string objects for fields, thus limiting the number of heap allocations.
  • The reader commented that I should use a custom allocator to avoid heap allocations. Currently, if the CSV file has x fields, I use x+1 string objects (a tiny number) and a small constant number of heap allocations.
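The reuse strategy described in the list above can be sketched as follows. This is not the code on my server, only a minimal illustration under stated assumptions: the names split_csv_line and count_fields are made up, fields are separated by a single character, and quoted fields are not supported. With typical standard library implementations, assigning into an existing string reuses its buffer when the capacity is sufficient, so allocations should become rare once the strings have grown.

```cpp
#include <cstddef>
#include <istream>
#include <sstream>  // only needed for std::istringstream when trying it out
#include <string>
#include <vector>

// True for the white space characters stripped around field values.
inline bool is_blank(char c) { return c == ' ' || c == '\t' || c == '\r'; }

// Splits `line` on `delimiter`, strips surrounding white space from each
// value, and stores the values in `fields`. Returns the number of fields.
// `fields` never shrinks: existing strings are overwritten with assign().
// Assumption: no quoted fields, no escaped delimiters.
inline std::size_t split_csv_line(const std::string& line,
                                  std::vector<std::string>& fields,
                                  char delimiter = ',') {
    std::size_t count = 0;
    std::size_t start = 0;
    while (true) {
        std::size_t end = line.find(delimiter, start);
        if (end == std::string::npos) end = line.size();

        // Shrink [first, last) so that it excludes surrounding white space.
        std::size_t first = start;
        while (first < end && is_blank(line[first])) ++first;
        std::size_t last = end;
        while (last > first && is_blank(line[last - 1])) --last;

        if (count == fields.size()) fields.emplace_back();
        fields[count].assign(line, first, last - first);
        ++count;

        if (end == line.size()) break;
        start = end + 1;
    }
    return count;
}

// Reads every line of `in`, reusing one line string and one set of field
// strings (x+1 string objects for x fields). Returns the total field count.
inline std::size_t count_fields(std::istream& in) {
    std::string line;
    std::vector<std::string> fields;
    std::size_t total = 0;
    while (std::getline(in, line)) total += split_csv_line(line, fields);
    return total;
}
```

For example, splitting the line " a , b,c " yields three fields, "a", "b" and "c", and feeding count_fields a std::istringstream with the two lines "1,2" and "3,4,5" returns 5.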

Despite these changes, I still get that parsing CSV files is strongly CPU bound:


$ ./parsecsv ./netflix.csv
without parsing: 26.55
with parsing: 95.99

However, doing away with the heap allocations at every line did reduce the parsing running time by a factor of two. It is not difficult to believe I could close the gap. But I still see no evidence that parsing CSV files is strongly I/O bound as some of my readers have stated. Consider that in real applications, I would need to convert field values to dates or to numerical values. I might also need to filter values, or support fancier CSV formats.

My experiments are motivated by a post by Matt Casters. Some said that Java was guilty. I use C++ and I get a similar result. So far at least. Can you tell me where I went wrong?

Disclaimer: Yet again, I do not claim that my code is nearly optimal. My exact claim is that reading CSV files may be CPU bound using reasonable code. I find this very surprising.

In my post Computing argmax fast in Python, I reported that Python has no built-in function to compute argmax, the position of a maximal value. I provided one such function and asked people to improve my solution. Here are the results:

argmax function                             running time
array.index(max(array))                     0.1 s
max(izip(array, xrange(len(array))))[1]     0.2 s

Conclusion: array.index(max(array)) is simpler and faster.

Update: Please see The language interpreters are the new machines.
