Do NULL markers in SQL cause any harm?

The relational model, and by extension the SQL language, supports the notion of a NULL marker. It is commonly used to indicate that some attribute is unknown or not applicable. NULL markers are a bit strange because they are not values per se. Hence, the predicate 1 = NULL is neither true nor false. Indeed, the inventor of the relational model, E. F. Codd, proposed a three-valued logic: predicates are true, false or unknown. This lives on even today. Our entire civilization runs on database systems using an unintuitive three-valued logic. Isn't that something!

Unfortunately, in real life, predicates either evaluate to true, or they don’t. C. J. Date showed that NULL markers end up giving you inconsistent semantics. So our civilization runs on database systems that can be inconsistent!

Yet NULL markers were introduced for a reason: some things do remain unknown or not applicable. We could handle these cases with more complicated schemas, but it is rarely practical. So database designers do allow NULL markers.

How did Codd react when it was pointed out to him that NULL markers make his model inconsistent? He essentially told us that NULL markers are in limbo:

(…) the normalization concepts do NOT apply, and should NOT be applied, globally to those combinations of attributes and tuples containing marks. (…) The proper time for the system to make this determination is when an attempt is made to replace the pertinent mark by an actual db-value.

So the mathematical rigor does not apply to NULL markers. Period.

This sounds pretty bad. I am rather amazed that Codd could get away with this.

But how bad is it in real life?

Let us consider WordPress, the blog engine I am using. As part of the core database schema, only the tables wp_postmeta, wp_usermeta and wp_commentmeta allow NULL markers. These tables are exclusively used to store metadata describing blog posts, users and comments. If this metadata is somehow inconsistent, the blog engine will not fall apart. It may hurt secondary features, such as advanced navigation, but the core data (posts, users and comments) will remain unaffected.

Date was repeatedly asked to prove that NULL markers were indeed a problem. I do not think that he ever conclusively showed that they were a real problem. Anyhow, our civilization has not collapsed yet.

Does anyone have any evidence that NULL markers are a bona fide problem in practice? Oh! Sure! Incompetent people will always find a way to create problems. So let us assume we are dealing with reasonably smart people doing reasonable work.

Credit: This post is motivated by an exchange with A. Badia from Louisville University.

Example of SQL’s inconsistency:

We are given two tables: Suppliers(sno, city) and Parts(pno, city). Each table has a single row: (S1, 'London') and (P1, NULL) respectively. That is, we have one supplier in London as well as one part for which the location is left unspecified (hence the NULL marker).

We have the following query:

SELECT sno, pno
FROM Suppliers, Parts
WHERE Parts.city <> Suppliers.city
   OR Parts.city <> 'Paris';

In SQL, this query returns nothing due to Codd's three-valued logic: the WHERE clause only selects rows for which the predicate evaluates to true.

Yet we know that if a physical part is actually located somewhere, it is either not in London or not in Paris. So the answer is wrong.

Let us consider another interpretation: maybe the part P1 is fictitious. It is not physically available anywhere. In such a case, the SQL query still fails to return the correct answer as the part P1 is not in London.

Maybe we could assume instead that the part P1 is available everywhere: this latter interpretation is also incorrect because the query

SELECT * FROM Parts WHERE Parts.city = 'Paris'

will return nothing.
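For the record, you can check this behaviour yourself; here is a minimal sketch using Python's built-in sqlite3 module (any SQL engine with standard NULL semantics should behave the same way):

import sqlite3
con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE Suppliers (sno TEXT, city TEXT)")
cur.execute("CREATE TABLE Parts (pno TEXT, city TEXT)")
cur.execute("INSERT INTO Suppliers VALUES ('S1', 'London')")
cur.execute("INSERT INTO Parts VALUES ('P1', NULL)")
# The predicate is unknown (not true) for the only candidate row, so no row is selected.
print(cur.execute("""
    SELECT sno, pno FROM Suppliers, Parts
    WHERE Parts.city <> Suppliers.city OR Parts.city <> 'Paris'
""").fetchall())  # prints []
# And yet the part is not reported as being in Paris either.
print(cur.execute("SELECT * FROM Parts WHERE Parts.city = 'Paris'").fetchall())  # prints []

Both queries come back empty, which is exactly the inconsistency described above.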

Where are the “big problem” jobs?

Several authors, scientists and entrepreneurs have lamented our poor ability to innovate. It seems that industry is recruiting few people to work on hard problems, except maybe when they are supported by the government:

Private businesses seem remarkably uninterested in tackling serious problems such as energy despite soaring prices and evident problems. In the rare cases where someone may be attempting to solve these problems, one often finds the heavy hand of government funding, for better or worse.

Many years ago, I pursued a Ph.D., not to become a professor or even a government researcher, but rather to pursue industrial R&D. I had vague dreams about working in some corporate research laboratory. I imagined I would be working on cool new technology to solve important problems. I never even questioned that such jobs existed… until I went looking for them!

I was disappointed. Instead, I ended up starting my own business. Amazingly, it worked! We got several contracts to do leading-edge work that still determines my research agenda to this day. I acquired a taste for problems that matter. However, much of our funding (maybe 40%) came indirectly from the government.

“Where are the big problem jobs?” I would say that they are either supported by the government or, better yet, driven by entrepreneurship.1

The problem with government R&D is that it has a bad track record at helping the economy, outside of a few areas such as agriculture. Baumol made a similar point in the Free-Market Innovation Machine: Imperial China, the Roman Empire and even the USSR had great scholars and no shortage of technological innovation. However, they lacked the means to reap the benefits of this research. Government R&D often comes about as a mix of bureaucratic campuses, bureaucratic government laboratories and bureaucratic corporations. That’s only exciting if you have a perverse sense of humor.

Want to kill a good idea? Assign it to a committee.

What Imperial China, the Roman Empire and the USSR lacked, was entrepreneurship. I conjecture that the number of science graduates who become entrepreneurs is a better predictor of economic progress than the number of new Ph.D.s. And if you are a young woman or man who wants to work on big problems, you would be better off preparing yourself for some form of entrepreneurship… if you want your work to matter.

Too many scientists think that science is created in a laboratory and then turned into a product by puny engineers. The reverse process is at least as likely to happen. It is by trying to solve important problems that significant science comes about.

Yet we are not used to thinking of scientists as entrepreneurs. We are not used to thinking of science as a process where you get money, hire people and make a profit. In some ways, the idea of patenting a new refrigerator seems contrary to science. Yet Einstein held such a patent, which he sold for profit.

Our idea of the scientist has too much to do with the Mandarins of Imperial China and not enough to do with Sergey Brin.

1– There are corporate exceptions. For example, Google seems to be working on a few big problems, such as the self-driving car and Google Glass.

Does academic research cause economic growth?

In most developed countries, the government funds research massively through academic grants, government laboratories, tax credits and research contracts: government R&D alone can often reach 1% of GDP. In Canada, the government loves tax credits. In the US, the government spends about 60% of all its R&D funding on the military.

Is this government funding a good thing?

What economists tell us is that R&D funding is probably the most important factor determining economic growth. If you want economic growth, then you should favor more research funding. And therefore, it follows that government funding for research ought to be a good idea.

The problem is that it is private R&D that contributes to economic growth, not government R&D:

The overall rate of return to R&D is very large, perhaps 25 percent as a private return and a total of 65 percent for social returns. However, these returns apply only to privately financed R&D in industry. Returns to many forms of publicly financed R&D are near zero. (Sveikauskas, Bureau of Labor Statistics, Washington, 2007)

The ubiquitous and fairly pessimistic finding which emerges from the literature is that privately funded R&D contributes significantly to output growth, whereas publicly financed R&D has little or no direct effect. (Capron and van Pottelsberghe, 1998)

(…) regressions including separate variables for business-performed R&D and that performed by other institutions (mainly public research institutes) suggest that it is the former that drives the positive association between total R&D intensity and output growth. (…)
(The Sources of Economic Growth in OECD Countries, 2003).

So the argument that economists have to make to justify government spending on R&D is that when the government pays for research, it entices companies to invest more. Thankfully, the correlation between public spending on R&D and private spending is good, except maybe for government-specific programs (such as the military R&D that the US is so fond of). As a policy, it seems that it makes sense to entice individuals and companies to do more research. Nevertheless, some economists remain cautious:

(…) the government should be careful in stimulating higher research expenditures. Recent rates of return on R&D are estimated to have reached an all-time low spanning the last 45 years, (…) Firms are also not convinced any more that their R&D investments will yield high returns (…) (Lang, 1999)

Enticing private companies to do more R&D is fine, but are there strong mechanisms turning academic research into more private-sector R&D? There is a correlation, but is there a causal relationship? Most academic research is far removed from economic activities and few professors turn their research into viable products.

Do countries that publish more also see more growth? Yes they do (but wait before concluding):

The correlation between GDP and publications is (…) high (…) for all countries in the sample analyzed here (77.7-99.6%) (Lee et al., 2011)

This correlation can mean one of two things (or a mix of both):

  1. Richer countries can afford more academic research.
  2. More academic research makes countries richer.

Which is it?

We have some historical evidence in this respect.

In the 17th and 18th centuries, France had the best-funded intellectuals in the world. Meanwhile, the British government did not subsidize science and scholarship much. Yet it is in Great Britain that we saw the rise of the industrial revolution and of modern science.

Similarly, the US saw massive economic growth from its early days all the way to WWII… without any public funding for R&D. While other countries like France kept on subsidizing scholarship, it did not put them ahead economically. The US became the dominant world power by 1950 without having invested much at all, as a government, in science and engineering. The US only started investing seriously in research at the end of the 1950s, because it feared the USSR. The USSR was investing in engineering through the state apparatus and the US felt that it had to match it. Remarkably, these massive investments were not followed by an increase in growth. And we know what happened to the USSR.

Of course, today, the US subsidizes academic research generously: according to some sources, the US is twice as generous as Europe with respect to per capita government R&D. But is the US a wealthy country because of these subsidies, or can it afford these subsidies because of its wealth?

Did China get richer because it grew its universities, or did it grow its universities because it was richer?

Economists can answer these questions using Granger causality tests. Granger is the economist who showed that we could use the analysis of time series data to establish causality between variables.
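To give a flavor of what such a test looks like in practice, here is a minimal sketch using synthetic data and the statsmodels library (my choice of library, purely illustrative; these are not the data sets from the studies cited below):

import numpy as np
from statsmodels.tsa.stattools import grangercausalitytests
# Two synthetic, stationary series where 'papers' helps predict 'gdp'.
rng = np.random.default_rng(0)
n = 200
papers = np.zeros(n)
gdp = np.zeros(n)
for t in range(1, n):
    papers[t] = 0.5 * papers[t - 1] + rng.normal(0, 1)
    gdp[t] = 0.3 * gdp[t - 1] + 0.8 * papers[t - 1] + rng.normal(0, 1)
# Column order matters: the test asks whether the second column helps predict the first.
data = np.column_stack([gdp, papers])
results = grangercausalitytests(data, maxlag=2)
# Small p-values on the reported F-tests suggest that 'papers' Granger-causes 'gdp' here.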

And the evidence indicates that, for wealthy countries, government funding for academic research does not cause economic growth:

For European Community member states, the US and Japan correlation between the GDP and number of publications of a given year proved to be non-significant. (…) Studying data referring to consecutive time periods revealed that there is no direct relationship between the GDP and information production of countries. (Vinkler, 2008)

For [rich nations], there is no significant causal relationship between research production and the economy. (Lee et al., 2011)

Conclusion: There are good reasons to pursue academic research. However, academic research is more of a gift economy than an economic growth policy, at least for rich countries. Richer countries can afford to do more academic research, but academic research is not what makes you rich (I should know!).

Further reading: For supporting evidence, see Sex, Science And Profits by Terence Kealey and Faith Based Science Policy by Roger Pielke, Jr. For contrarian evidence, see Federal Support for Research and Development and The iPad wouldn’t be here without federal research dollars (via Greg Linden). Moreover, you may want to have a look at Where Good Ideas Come From by Steve Johnson for an anecdote-based analysis.

Disclosure: I have worked in industry R&D, in government R&D and as an academic researcher. I have held a federal research grant for over 10 years. My job would not exist without government funding.

Peer review without journals or conferences

Almost all scientists ask their peers to review their work outside of the formal process offered by journals and conferences. Young scientists are routinely told to take their manuscripts to a senior scientist and ask for a review.

With some success, I have even used my blog as a peer review device: e.g., we received many comments on our manuscript on fast integer decoding through a blog post I wrote. If you read some of my recent research articles, you will find that I acknowledge people who commented on this blog. (I have even ended up writing papers with some of them, but that is a different story.) In effect, my blog has helped me get extra reviews for some of my work! I couldn’t be more grateful!

I am not really interested in making journals and conferences go away, but I am interested in going beyond them. I fear that they are often limiting us:

  • In fields such as computer science, limiting the review to journals and conferences effectively cuts most non-academics out. They are also limiting the review to a narrow band of experts. If you are trying to solve problems that matter, this might be entirely wrong.
  • Traditional peer review is anonymous. In principle, this makes it fair and transparent. In practice, it can be needlessly alienating. For example, I would much rather have an open exchange with the authors I criticize than just send an anonymous note. We might both benefit from the exchange. One pattern I have noticed is that some of my (well-meaning) comments get ignored, even when they require little work and can only benefit the authors. We have put walls in the peer review process, and there are good reasons for these walls, but we could do without them if we reinvented the process.

Hence, I have launched an open invitation to the world: send me your drafts, and if I find them interesting, I will review them and then tell the world about them (through social networks) if you revise them to my satisfaction.

My goal is not to replace journals and conferences all by myself, but I do see a growing trend whereby people point to papers that have not yet been peer reviewed and say: I read this, it looks good to me! I'd like more people to participate in this new emerging model.

Anyhow, so far, only one courageous fellow has agreed to my terms: Nathanaël Schaeffer sent me his paper Efficient Spherical Harmonic Transforms aimed at pseudo-spectral numerical simulations. I wrote a review, the same way I would for a journal, and I sent it to him. He produced a revised version that took into account my criticism. I am now telling you that if you care for spherical harmonic transforms, you should definitely check out his paper. (Update: Schaeffer's paper has now appeared in a good journal.)

So where does that leave us?

  • If you are a researcher, and you would be willing to review manuscripts openly the way I did, please let the world know!
  • If you think your work could interest me and you want to try a different type of peer review, please send me your paper!
  • If one of my papers interests you and you want to write a review and share it with me, please do! I also have software that needs reviewing.

I stress that you do not need to be affiliated with a college or have researcher as your title for this model to work. Anyone can write or review a research paper. (Admittedly, few people can write good papers or produce deep reviews but that is another story.)

I am not sure exactly how far we can go with such open peer review processes. But I think we can improve the current system significantly. To make my point stronger, I plan to write a blog post describing how I benefited from reviews and criticisms I have received through my blog and social networks. For now, please believe me: I received insights that I would never have received through the traditional peer review process.

What kind of researcher are you?

  • The politician. He will get people to collaborate on joint projects.

    How to recognize: He knows everyone!

    Pro: He makes things happen irrespective of the available funding!

    Con: Sends you emails on a Saturday.

  • The manager. His job is primarily to seek funding and recruit students. He supervises the work, setting directions and reviewing results.

    How to recognize: Will often publish with many student authors. Will receive (or seek) generous funding.

    Pro: A good manager can scale well. If you increase his funding, he will produce more results without necessarily sacrificing quality.

    Con: When the manager runs out of funding, his productivity might collapse. He might not stay close to the research: a manager is rarely passionate about the product, as he focuses on the process.

  • The clueless. He does not know what he is doing. He does not understand the business.

    How to recognize: Will religiously follow trends without any kind of critical thinking. Might stay close to a manager.

    Pro: Can be convinced to render useful services.

    Con: Might inflate fads. Will typically produce work of little significance.

  • The dreamer. He has a vision. He might be a perfectionist.

    How to recognize: Might not publish very much.

    Pro: Might just be the person to crack the hardest problem of your generation…

    Con: … but might also produce nothing. Good at wasting funding.

  • The artisan. He builds research projects from the ground up. Routinely, he will have complete ownership over a research project. Extra funding might not increase his productivity.

    How to recognize: Will publish significant work alone.

    Pro: He really cares about his research.

    Con: He might have a narrow focus.

  • The prince. He thinks very highly of himself.

    How to recognize: Though he might appear to be productive, he does nothing that requires actual work.

    Pro: He is often dressed nicely.

    Con: Can have a bad temper.

  • The entrepreneur. He wants to change the world.

    How to recognize: Long-lasting focus on hard problems!

    Pro: Can have great results.

    Con: Might become a manager.

We are publishing a lot more! How will we cope?

The number of research articles published each year grows exponentially. We often estimate the rate of growth at between 4% and 6% a year.

We are publishing a lot more in Computer Science. Editors must work a lot harder than they used to.

According to Sakr and Alomari, the size of the database research community doubled in the last ten years, and so did the number of papers they publish. The next table shows the number of papers accepted by VLDB (a major venue in database research) each year.

  year     articles in the VLDB proceedings
  2012     220
  2002     110
  1992     58
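These figures imply a compound annual growth rate slightly above the overall 4% to 6% estimate; here is the back-of-the-envelope arithmetic (my own, using only the table above):

# implied compound annual growth rates from the VLDB article counts above
for start_year, start, end_year, end in [(2002, 110, 2012, 220), (1992, 58, 2002, 110)]:
    rate = (end / start) ** (1 / (end_year - start_year)) - 1
    print(f"{start_year}-{end_year}: {rate:.1%} per year")
# prints roughly 7.2% and 6.6% per year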

Journals also grow in popularity. For example, the Computer Journal has had to double the number of issues it publishes in less than 5 years.

  year     issues of the Computer Journal
  2011     12
  2010     10
  2009     8
  2008     6

Given the rising costs of conferences and the cutbacks in research funding, we might even expect journals to grow faster than conferences.

Growth has superlinear effects: a system with twice as many variables isn’t merely twice as complex. Reviewers and editors have to work harder year after year even if their numbers increase. There is simply a higher coordination cost. And community sizes are limited by our cognitive abilities. This bound is called the Dunbar number. Despite what Facebook would have you believe, people cannot have many more than 150 acquaintances. So, if in the last 10 years, the number of researchers in your area has doubled, the number of researchers you can trust has probably remained the same.

In the past, this growth has led to fragmentation. Scholars have become more narrow. But there is also a cost to this greater specialization: many of the most important problems require a broad expertise. Specialization often leads to irrelevance.

So far, we have kept disciplines from fragmenting by automating more and more of the peer review process. This trend is likely to continue. I believe that, one day soon, we will replace journal editors by robots. I think that Google is leading the way in this respect. For example, from my Google Scholar profile, Google recommends recent research papers to me. It is irrelevant where these papers appeared, as long as they are likely to be useful for my work. In some sense, we are bypassing human beings and scaling up with the help of computers.

The research enterprise of tomorrow will look a lot more like YouTube. You have millions of people crafting their content and hoping to attract some shred of attention. Much of the filtering and recommendation process is automated.

I am not claiming that relationships between researchers will become less important. In some sense, they will become even more important. In a world where you mostly interact with strangers you cannot trust, your trusted friends are key to preserving your sanity.

Further reading: The Future of Peer Review (via Venkat Rao)

The big-O notation is a teaching tool

One of my clients once got upset. Indeed, the running time of our image processing algorithm grew by a factor of four when he doubled the resolution of an image. No programmer would be surprised by this observation. (Hint: doubling the resolution multiplies the number of pixels by four.)

Indeed, all programmers eventually notice that some programs run much slower as more data is added. They know that if you try to sort data naively, doubling the size of the data multiplies the running time by four. It is a fundamental realization: scale matters when programming. Processing twice as much data is at least twice as hard, but often much harder.

We could wait for kids to learn this lesson by themselves, but it is more efficient to get them started right away on the right foot. Thus, one of the first things any computer science student learns is the big-O notation. They learn that printing out all values in an array takes O(n) time whereas sorting the same array with the bubble sort algorithm can take O(n²) time. The latter result is a fancy way of saying that doubling the data size will multiply the running time by a factor of four, in the worst case.
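You can observe this scaling behaviour directly; here is a small sketch of my own (timings will vary from machine to machine):

import random
import time

def bubble_sort(values):
    # classic bubble sort: roughly n*n/2 comparisons
    a = list(values)
    n = len(a)
    for i in range(n):
        for j in range(n - 1 - i):
            if a[j] > a[j + 1]:
                a[j], a[j + 1] = a[j + 1], a[j]
    return a

for n in (2000, 4000):  # doubling the input size
    data = [random.random() for _ in range(n)]
    start = time.perf_counter()
    bubble_sort(data)
    print(n, "elements:", round(time.perf_counter() - start, 2), "s")
# the second timing is typically about four times the first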

Simple models are immensely useful as teaching tools and communication devices. But don’t confuse teaching tools with reality! For example, I know exactly how a gas engine works in the sense that I once computed the power of an engine from the equations of thermodynamics. But General Motors is simply not going to hire me to design their new engines. In the same way, even if you master the big-O notation, you are unlikely to get a call from Google to design their next search engine.

Unfortunately, some people idealize the big-O notation. They view it as a goal in itself. In academia, it comes about because the big-O notation is mathematically convenient in the same way it is convenient to search for your keys near a lamp even if you lost them in a dark alley nearby.

The problem with the big-O notation is that it is only meant to help you think about algorithmic problems. It is not meant to serve as the basis by which you select an algorithm!
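To make the point concrete, here is a small sketch of my own (not a benchmark from this post) where the algorithm with the worse big-O running time wins on small inputs because of its lower constant factors:

import random
import timeit

def insertion_sort(values):  # O(n^2) in the worst case
    a = list(values)
    for i in range(1, len(a)):
        key, j = a[i], i - 1
        while j >= 0 and a[j] > key:
            a[j + 1] = a[j]
            j -= 1
        a[j + 1] = key
    return a

def merge_sort(values):  # O(n log n)
    if len(values) <= 1:
        return list(values)
    mid = len(values) // 2
    left, right = merge_sort(values[:mid]), merge_sort(values[mid:])
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            out.append(left[i])
            i += 1
        else:
            out.append(right[j])
            j += 1
    return out + left[i:] + right[j:]

small = [random.random() for _ in range(16)]
print("insertion sort:", timeit.timeit(lambda: insertion_sort(small), number=20000))
print("merge sort    :", timeit.timeit(lambda: merge_sort(small), number=20000))
# on tiny inputs, the quadratic algorithm is usually faster: constant factors dominate

This is one reason why industrial sorting routines (Timsort, for example) switch to insertion sort for short runs.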

When asked why the algorithm with better big-O running time fails to be faster, people often give the wrong answers:

  • Our current computer architecture favours the other algorithm. What they often imply is that future computer architectures will prove them right. Why think that future computer architectures will become more like the simple theoretical models of the past? When pressed, are they able to come up with many examples where this has happened before?
  • With the faster algorithm having worse big-O running time, you are exposed to denial-of-service attacks. A good engineer will avoid switching to a slower algorithm for all processing, just so that he can avoid a dangerous special case. Often, a fast algorithm that has a few bad corner cases can be modified so that the bad cases are either promptly detected or made better. You can also use different algorithms depending on the size of the data.
  • If you had more data, the algorithm with better big-O running time would win out. Though, in theory, moving from a 10KB data set to a 10TB data set is the equivalent of turning a knob… in practice, it often means switching to a whole other architecture, often requiring different algorithms. For example, who can compare QuickSort against MergeSort over 10TB of data? In practice, the size of the data set (n) is bounded. It makes absolutely no practical sense to let n grow to infinity. It is a thought experiment, not something you can actually realize.
  • You don’t understand computational complexity. This is probably the most annoying comment any theoretician can make. The purpose of the big-O notation is to codify succinctly the experience of any good programmer so that we can quickly teach it. If you are discussing a problem with an experienced programmer, don’t assume he can’t understand his problems.
  • Using algorithms with high big-O running times is bad engineering. This statement amounts to saying that construction workers should not use power tools because they could cut their fingers off. Case in point: the evaluation of regular expressions commonly used in Perl or Java is NP-hard. A short regular expression can be used to crash a server. Yet advanced regular expressions are used everywhere, from the Oracle database engine hosting your bank account to your browser.

    On the general question of what is good engineering, then my view is that it is not about guaranteeing that nothing bad will happen because it will. Our software architecture is built on C and C++. Our hardware is overwhelmingly built without redundancies. Bad things always happen. I would argue that good engineering is being aware of the pitfalls, mitigating possible problems as much as possible and planning for failure.

Further reading: O-notation considered harmful, “In the long run…”

How fast should your dynamic arrays grow?

When programming in Java or C++, your arrays have fixed sizes. So if you have an array of 32 integers and you need an array with 33 integers, you may need to create a whole new array. It is inconvenient. Thus, both Java and C++ provide dynamic arrays. In C++, people most commonly use the STL vector whereas, in Java, the ArrayList class is popular.

Dynamic arrays are simple. They use an underlying array that might be larger than needed. As the dynamic array grows in size, the underlying array might become too small. At this point, we increase the underlying array. However, each time we do so, we have to allocate a new array, copy all of the data over and free the old array. That is a relatively expensive operation. To minimize the running time, we often grow the underlying array by a factor x (e.g., if x is 2, then the underlying array always doubles in size).
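In case the mechanism is unclear, here is a toy sketch (purely illustrative: Python lists are already dynamic arrays) that grows its storage by a configurable factor x and counts how many elements get copied along the way:

class ToyDynamicArray:
    # Illustrative dynamic array with a configurable growth factor x.
    def __init__(self, x=2.0):
        self.x = x
        self.size = 0
        self.capacity = 1
        self.storage = [None] * self.capacity
        self.copies = 0  # elements copied during reallocations

    def append(self, value):
        if self.size == self.capacity:
            # grow the underlying storage by a factor x (always by at least one slot)
            self.capacity = max(self.capacity + 1, int(self.capacity * self.x))
            new_storage = [None] * self.capacity
            for i in range(self.size):
                new_storage[i] = self.storage[i]
            self.copies += self.size
            self.storage = new_storage
        self.storage[self.size] = value
        self.size += 1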

A nice result from computer science is that even if we grow the dynamic array one element at a time, N times, the total running time remains linear because of the particular way we grow the underlying array. Indeed, roughly speaking, we have to copy about N + N/x + N/x² + …, that is, about N x / (x – 1) elements to construct a dynamic array of N elements (for N large). Hence, the running time is linear in N.

However, the complexity still depends on x. Clearly, the larger x is, the fewer elements you need to copy.

  • When x is 3/2, we need to copy about 3 N elements to create a dynamic array of size N.
  • When x is 2, we need to copy only about 2 N elements.
  • When x is 4, we need to copy only about 1.3 N elements.
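These figures are just N x / (x – 1) for the various growth factors. Reusing the toy class sketched above, you can check them empirically (the exact numbers depend on how much the final capacity overshoots N):

N = 1_000_000
for x in (1.5, 2.0, 4.0):
    arr = ToyDynamicArray(x)
    for i in range(N):
        arr.append(i)
    total_writes = arr.copies + N  # reallocation copies plus the initial write of each element
    print(f"x = {x}: about {total_writes / N:.2f} N element copies")
# prints values close to 3, 2 and 1.3 respectively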

It would seem best to pick x as large as possible. However, larger values of x might also grow the underlying array faster than needed. This wastes memory and might slow you down.

So how fast do people grow their arrays?

  • In Java, ArrayList uses x = 3/2.
  • In C++, the GCC STL implementation uses x = 2.

The Java engineers are more conservative than the C++ hackers. But who is right? And does it matter?

To investigate the problem, I wrote a small benchmark in C++. First, I create a large static array and set its integer values to 0, 1, 2, … Then I do the same thing with dynamic arrays using various growth factors x. I report the speeds in millions of integers per second (mis) on an Intel Core i7 with GCC 4.7.

  growth factor x     speed (mis)
  static array        2500
  1.5                 240
  2                   340
  4                   480

Of course, this test is only an anecdote, but it does suggest that

  • dynamic arrays can add significant overhead
  • and that a small growth factor might be particularly slow if you end up constructing a large array.

To alleviate these problems, both the C++ STL vector and the Java ArrayList allow you to set a large capacity for the underlying array.

Of course, people writing high-performance code know to avoid dynamic arrays. Still, I was surprised at how large the overhead of a dynamic array was in my tests.

My source code is available.

Note: Yes, if you run your own benchmarks, the results will differ. Also, I am deliberately keeping the mathematical details to a minimum. Please do not nitpick my theoretical analysis.

Update: Elazar Leibovich pointed me to an alternative to the STL vector template created by Facebook engineers. The documentation is interesting. Gregory Pakosz pointed me to another page with a related discussion about Java.

Is learning useless stuff good for you?

We often require all students to learn things they may never need, like Latin, calculus, advanced trigonometry and classical literature. The implicit assumption is that learning difficult things is intrinsically good. It trains your brain. It makes you smarter.

True? Or false?

I worked on this assumption for the longest time. As an undergraduate, I took 6 courses per term instead of the required 5. I also took an extra year to graduate, doing the equivalent of two majors. I probably took more college courses than 99.9% of college graduates.

Why did I take all these courses? Because I was convinced that learning about all sorts of things would make me smarter. Many people think it works this way. That’s why we taught people Latin for a long time. In education, that is called transfer: learning something will help you learn something else, even if it is barely related. Does it work? We have reasons to doubt it:

Transfer has been studied since the turn of the XXth century. Still, there is very little empirical evidence showing meaningful transfer to occur and much less evidence showing it under experimental control. (…) significant transfer is probably rare and accounts for very little human behavior. (Detterman)

Caplan is even more categorical:

Teachers like to think that no matter how useless their lessons appear, they are teaching their students how to think. Under the heading of Transfer of Learning, educational psychologists have spent over a century looking for evidence that this sort of learning actually occurs. The results are decidedly negative.

These authors are not saying that learning French won’t help you learn Spanish. They are not saying that learning C++ won’t help you learn Java. Transfer does work, trivially, when there are similarities. Rather, they are saying that learning projective geometry won’t make you a better Java programmer. They are saying that learning fractal theory won’t help you be a better manager.

This has troubling consequences because, for many people, whatever they learned in college or in high school has very little to do with what they do for a living. Does a degree in journalism make you a better program manager today? You can legitimately ask the question. Yet employers are happy to assume that a degree, any degree, will help people do a better job, irrespective of the similarities between the job and the degree. For example, Tom Chi explains how his training in astrophysics made him a better business manager. From astrophysics to management? Really?

Can we at least hope that college students improve their critical thinking with all these literature, mathematics and philosophy classes? Roksa and Arum looked at the scores of students on critical thinking tests as they progressed through their studies:

A high proportion of students are progressing through higher education today without measurable gains in critical thinking.

The students have learned skills. It is difficult to go through years of studies without learning something. But this knowledge and these skills do not necessarily transfer to something as basic as critical thinking.

My point is that students might be onto something when they refuse to learn for the sake of learning. We look down at people who refuse to learn mathematics because it appears useless to them. We think that learning some mathematics would be good for them, the same way we used to think that learning Latin was good for the minds of little boys. We might be wrong.

But this also has a practical consequence for all of us: don't bother learning skills "just in case" unless you do it for fun. If you want to be a better software programmer, just practice programming. This also means that if you want to acquire practical skills, a school might not be the best place to go: a degree in English might not turn you into a better novelist.

Another consequence is that you should not assume transfer of expertise: if someone succeeded at one thing, you should not assume they will succeed at something else. If a famous baseball player starts a software company, wait before investing.

Credit: This blog post was inspired by an online exchange with Seb Paquet.

XML for databases: a dead idea

One of my colleagues is teaching an artificial intelligence class. In his class, he uses old videos where experts from the early eighties make predictions about where AI is going. These experts come from the best schools such as Stanford.

These videos were not meant as a joke. When you watch them today, however, they are properly hilarious. One of the predictions, by a famous AI researcher, was that the software industry would be dominated by expert systems by the year 2000. This was a reasonable prediction: Wikipedia says that in the early 1980s, two thirds of the Fortune 1000 companies used expert systems in their daily business activities.

I believe that the majority of software programmers today would describe the importance of expert systems in their work to be… negligible. Of course, the researchers have not given up: the Semantic Web initiative can be viewed as a direct descendant of expert systems. And there are still some specific applications where an expert system is the right tool, I am sure. However, to put it bluntly, expert systems were a failure, by the standards set forth by their proponents.

Did you ever notice how much energy people put into promoting (their) new idea, and how little you hear about failures? That’s because there is little profit in calling something a failure and much risk: there are always people in denial who will fight you to the death.

I think it is unfortunate that we never dare look at our mistakes. What did Burke say? "Those who don't know history are destined to repeat it."

When XML was originally conceived, it was meant for document formats. And by that standard… boy! did it succeed! Virtually all word processing and e-book formats are in XML today. The only notable failure is HTML. They tried to make HTML and XML work together, but it was never a good fit (except maybe within e-books). In a sense, the inventors of XML could not have succeeded more thoroughly.

Then, unfortunately, the data people took XML and decided that it solved their problems. So we got configuration files in XML, databases in XML, and so on. Some of these applications did ok. Storing data in XML for long-term interoperability is an acceptable use of XML. Indeed, XML is supported by virtually all programming languages and that is unlikely to change.

However, XML as a technology for databases was supposed to solve new problems. All major database vendors added support for XML. DBAs were told to learn XML or else… We also got a handful of serious XML databases. More critically, the major database research conferences were flooded with XML research papers.

And then it stopped. For fun I estimated the number of research papers focused on XML in a major conference like VLDB: 2003: 27; 2008: 14; 2012: 3. That is, it went from a very popular topic for researchers to a niche topic. Meanwhile, the International XML Database Symposium ran from 2003 to 2010, missing only year 2008. It now appears dead.

That is not to say that there is no valid research focused on XML today. The few papers on XML accepted in major database journals and conferences are solid. In fact, the papers from 2003 were probably mostly excellent. Just last week, I reviewed a paper on XML for a major journal and I recommended acceptance. I have been teaching a course on XML every year since 2005 and I plan to continue to teach it. Still, it is undeniable that XML databases have failed as anything but a niche topic.

I initially wanted to write an actual research article examining why XML for databases failed. I was strongly discouraged: it would be unpublishable because too many people would want to argue against the failure itself. This is probably a great defect of modern science: we are obsessed with success and we work to forget failure.

How I learned mathematics (as a kid)

As I reported elsewhere, I technically failed kindergarten. For example, one of the tests we had to pass was the memorization of our home phone number. I refused to learn it. My mother, a teacher, was embarrassed. We also had to learn to count up to 10. I was 5, so I decided it was more reasonable to learn to count up to 5. My mother was again embarrassed.

So it was decided that I must have had a learning disability. I was obviously bad at mathematics. (This last sentence is ironic: mathematicians will tell you that memorizing numbers is not mathematics. But I digress.)

For those who don't know me… I have three degrees in mathematics from some of the best schools in the world. I have also published some novel mathematical results. I am not a star mathematician or a mathematical genius, but I have credentials. Yet if my teachers had to make predictions based on my early schooling, they would have predicted nothing good for me. At least, nothing good in mathematics.

In retrospect, I am quite certain that I have never had a learning disability… except for the fact that I am an incorrigible contrarian. Still, my parents are not obviously good at mathematics. I see no evidence that I liked numbers. So how did I get good enough to outdo my peers?

Because I did not record my childhood, I can only speculate. Here is what I remember.

As a kid, I learned to read with Tintin. And my favorite character was professor Calculus (known in French as Tournesol). I also loved scifi. At the time, Star Wars was very popular. I remember dreaming of the year 2000 when I would get to fly in a starship.

As an aside, I distinctly remember learning how to read for the purpose of reading Tintin. And Tintin was not part of the curriculum. Rather, my mother got me one album, and it was the most exciting thing I had in my room! I remember painfully deciphering Tintin, page by page.

In any case, I did not know much about Physics or Chemistry, but I knew that whatever Calculus did had to do with mathematics. I also knew that flying starships and building robots involved advanced mathematics.

So I was motivated to learn mathematics. That is probably the single most important factor. I simply wanted to be good at mathematics. When I got something wrong, I did not get discouraged, I tried to understand it better.

I also think that my contrarian nature helped me. It made me immune to the poor teaching of mathematics so prevalent in schools. For example, while my peers were memorizing multiplication tables, I tried to find algorithms to figure out the answer. I simply could not imagine professor Calculus memorizing tables. After all, professor Calculus is known to be forgetful!

Still, where did I learn mathematics? We did get some decent mathematics in the classroom from time to time, but on the whole I think it was mediocre. The manuals were simply not very inspired.

However, I discovered a magazine called Jeux et Strategie (Games and strategies) as a kid. It was an amazing magazine. Each month, it had pages and pages of fun mathematical puzzles. An ongoing theme was that of a race of aliens where some of them always told the truth, some of them lied all the time and some of them would just say anything. You could not tell them apart, except by analyzing what they said. This was my introduction to logic. Initially, the puzzles were way too hard for me. By the time the magazine stopped printing, I could do these logic puzzles in my head.

The magazine would discuss games like poker and monopoly. However, it would do so in a sophisticated manner. For example, I remember this article about monopoly from a top-rated player. He showed how the good players used probabilities to win. That is, you are not just supposed to buy any lot! Some are better than others, and you can easily figure out which ones are better.

I don’t play games a lot, but I really liked the idea that I could learn mathematics to beat people at games. It turns out that I never did become a better monopoly player, but I learned that if I used the right mathematics, I could!

As an aside, my grandmother was a gambler. She would hold poker games at her place every weekend. And they played for real money! She also brought me all the time to horse races (she had racing horses of her own). If you have never been to horse races, you should know that you get lots of statistics about the horses. They tell you exactly how often a given horse has won, and in what conditions. One of my early hobbies, as a kid, was to read these statistics and try to predict the winners. After all, I had nothing better to do (horse races are otherwise quite boring for kids). I even devised some algorithms that were fairly reliable. This taught me that you could actually use mathematics to get money!

The final step in my early mathematical education came when I got a computer. My parents gave me a TRS-80 color computer. I simply did not have much money to buy games. So I had to program it to stay entertained. Obviously, as a kid I decided to design my own games. I did not get nearly as far as I thought I would. I guess I was never very motivated in building a really good game since I had no way to share it. But I did build a few and this taught me a lot about discrete mathematics. I remember having to work out my own collision detection algorithms (how do you figure out whether a point has crossed a line?). I also got a lot out of magazines. At the time, magazines would regularly post the source code of simple games. This was just great! You could take an existing game and try to improve it, to see what would happen.

All along, what helped was that I had a friend who was a nerd too. He also ended up becoming a software programmer. I am sure that if all my friends had been into sports, it would have been much harder for me to stick with my mathematical interests.

To sum it up, here are the factors that helped me become good at mathematics:

  • Early on, I self-identified with scientists. I had a role model (professor Calculus).
  • I have always been a contrarian: I refuse to accept things on faith. I am not sure where this came from. I doubt it is an innate trait, but I also do not know how to cultivate it in others. In any case, this plays an important role because I always refused to accept recipes. I think recipes are a terrible way to teach mathematics.
  • I had access to decent and entertaining mathematical content, even if it wasn’t from the school I attended.
  • I got my own (programmable) computer as a kid!
  • I hung around with nerds.

I am not claiming that this is some sort of recipe to turn kids into mathematicians. My real point is that I believe that mathematics is not innate. I also do not think that schools can teach mathematics. Not the kind of schools I attended.

Government regulations… as software

Socialists accuse me of being a libertarian. Libertarians accuse me of being a socialist. I am actually a pragmatist: I believe that we should set things up to maximize our collective well being.

Government regulations are complicated. In Canada, our tax code is so complex that I doubt anyone ever fully knows it.

Regulators and lawmakers often have no experience building complex software. Yet what they often do is basically software programming. The difference is that, for now, the software runs on brains.

Laws are often buggy. Lawmakers can contradict themselves. For example, Chinese law gives the father the right to decide whether his wife can have an abortion. Yet Chinese law also ensures that the mother is free to decide whether to have a child or not.

I think we should treat laws and regulations as software:

  • Laws and regulations should be considered in constant beta testing, just like Google’s software. We should accept up front that there are bugs, possibly major bugs.
  • There should be a cheap and transparent way to report bugs. Anyone should be able to report them and reports should be discussed publicly.
  • Laws and regulations should be systematically tested. For example, regulations could be introduced only locally, and at first only tentatively. You think you have a good idea to fight global warming? Try out your new regulations on a test basis and see whether they work. We should make it easy, not hard, to pull back a new law or regulation. We should assume that the first version of anything is buggy.
  • When a contradiction or other mistake is found, we should not just pretend that it does not exist. Instead, the law or the regulation should be patched. Anyone should be able to propose a patch. (You do realize that anyone can submit a patch to the Linux kernel right now?)
  • We should be able to trace back exactly who revised what and when. It should be trivial for anyone to browse the history of a regulation. Laws should be treated the same way: we should publish not only the law, but all its revisions with a detailed account of who proposed what.

Governments are full of bugs

The famous scifi author (and self-described libertarian) David Brin tells a fascinating tale about how he contributed to getting the lead out of gasoline. He explains that an evil corporation (Ethyl Corporation) promoted leaded fuel. He explains that he helped organize a clean-car race where unleaded fuel did well. He suggests that this changed everything: by 1972, the government finally regulated the use of leaded fuel. It is a good thing too because lead is a terrible poison.

We have the evil corporation, the good activist and the benevolent government. We should be grateful for government regulations!

Of course, the tale is not complete.

The very high toxicity of lead was known well before the 1970s. The use of lead in paint was prohibited in many countries at the beginning of the XXth century.

As early as 1920, we knew how to avoid lead with alcohol-gasoline blends. They were soon commercially available too.

Yet, by 1939, almost all gas sold in the US was lead-based, except for Sunoco. So how could lead-based gasoline take over? How could a toxic fuel win over safer alternatives?

The story is complicated, but we can identify two important factors in the US:

  • The US Federal Trade Commission declared lead-based gasoline safe and it prevented competitors from denouncing it. Effectively, the government prevented competitors from telling consumers that they could buy a safer product. It also made lawsuits against the proponents of lead-based gasoline improbable.
  • The main alternative to lead-based fuels was ethanol-based… at a time when the US had enacted the prohibition of alcohol and pushed underground its distilleries. Alcohol-based cars must have sounded a bit like cannabis-based cars today.

But Brin's story is still historically accurate: by 1972, the US government did the right thing and banned the dangerous lead-based gas. Finally! Some courageous and altruistic politicians rose up! Evil companies like General Motors were finally told to do the right thing!

Wait? You really think that’s how I am going to finish the tale?

In 1970, General Motors had asked for the elimination of lead from gasoline to make its new catalytic converters workable.

So, what do you think got the government to eliminate lead from gasoline? David Brin and his forward-thinking clean-car race… or General Motors lobbying for its new technology?

But surely, Daniel, you don’t mean to tell us that politicians never stand up for what is right? Didn’t we solve the ozone layer crisis with the Montreal protocol? Didn’t the US government stand up to evil corporations like Dupont?

In fact, the regulations of the Montreal protocol were described by Dupont as having tremendous market potential. Indeed, Dupont became favorable to the Montreal protocol once it acquired the patents (in 1986) that would allow it to market its replacement technology monopolistically. The Montreal protocol was signed a year later (1987).

This is a common pattern. People who hate corporations tend to be favorable to government regulations because they think that one opposes the other. They think regulations keep corporations in check. They sometimes do, but they often make things worse.

Regulating a society is like writing software. It sounds easy to amateurs, but it can be amazingly difficult. Software programmers typically cannot prove that their software will work. At best, in many practical cases, they can test it for typical cases. Unfortunately, government regulators don’t follow any of the good practices software programmers use. Worse: they never accept bug reports. Worse: they are often in an adversarial setting.

Any programmer will tell you what to do when testing is hard and failure is not an option: keep the software as simple as possible.

That's not at all what governments do. And, unsurprisingly, they are full of bugs.


Dedication: This post is dedicated to my family where, when something fails, we say that it has a bug.

Experience is everything

We learned recently that one of the leading opponents of genetically modified organisms (GMOs), Mark Lynas, decided that he had it all wrong. GMOs save the Earth by reducing the need for pesticides, getting poor farmers out of misery, freeing land for wildlife and generally keeping human beings from starving. The evidence is overwhelming. Yet I have had some long-winded arguments with GMO opponents. For example, many of them would require labelling on the basis that they have the right to know what they eat. My reply is always the same: if we are going to label GMOs, then let us also add to the labels statistics about pesticide use, land use and farmer well-being. At this point they almost invariably tell me that I am secretly funded by a big corporation. (I wish this were true: if you are a big company that wants to fund me, please get in touch.)

Environmentalists have strong beliefs. Part of their dogma is that industry is bad, and industrial progress is evil. That is false. The industrial revolution pulled most of the population out of abject poverty. The switch to an oil industry saved the whales from extinction. When computers and the Internet came about, their users were accused of being asocial. Today, if you aren't on Facebook, it is because you hate other people (I barely exaggerate).

So, how did Mark Lynas pull out of the dogma? How could you have changed his mind?

It turns out that Mark had to do a lot of hard work to write books about global warming. He wanted to write science books:

I had to learn how to read scientific papers, understand basic statistics and become literate in very different fields from oceanography to paleoclimate, none of which my degree in politics and modern history helped me with a great deal.

This work rewired his brain. He wrote science and, through this experience, became a scientist. Once he had done this hard work, and only then, was he finally receptive to counter-arguments regarding GMOs.

How do you change someone’s mind? You get them to experience something new and significant. Lazy people do not change their mind.

And this is essentially my approach as a teacher. I set students up with hard problems. I ask them to solve a problem that might be beyond their ability. Some hate me and my class for a time. But they are stuck with it and must work. I could try to teach them the material until I am blue in the face, but I find that getting them to struggle with a problem is a far more efficient way to get them to learn what matters.

So, maybe in 2013, you want to become a different person. Maybe you think that you will read some books about it. Or maybe take a class. Instead, I urge you to focus on experience. You want to become a writer? Write a book. You want to become a programmer? Write a program.

I am always amazed at these students who want to enter the software industry, but they seem to program only when absolutely necessary. You know how you become a great programmer? It starts with programming a lot.

You want to convince other people? Think about what they would need to experience to change their minds. Want to convince me that we should use state regulations to stop global warming or to achieve a fairer society? You have no chance in hell of arguing your point with me. I went to what was once East Berlin. I walked in the streets reflecting on what a government-run industry leads to: massive pollution and misery. I have spent more than half my life in government organizations: my experience taught me to distrust their initiatives.

Experience is what anchors beliefs. You are, more or less, what you experienced. To change people you have to get them to experience new things.

So, how do you convince a stranger who doubts you? You invite him on an adventure. You offer him a new recipe. We are very tempted to build up sophisticated arguments. It does work in the sense that our ability to put forth long winded arguments might improve our social status. Leaders tend to be people who talk well. People tend to follow people who speak their mind. But if you must change someone’s mind, if you must fight a dogma, then arguments are almost always useless. If the individual is not prepared for your arguments, they will just bounce back.

Are CAPTCHAs a good idea?

A CAPTCHA is a small test used to distinguish human users from robots. They are popular as an anti-spam tool.

Until a few months ago, I had an annoying CAPTCHA on this blog. I have since removed it and I will not go back.

What happened?

  1. The long-term problem with CAPTCHAs is that computers are getting so good at passing these Turing tests that we must stretch the cognitive abilities of human beings to distinguish machines from human beings. Thus, we end up requiring users to make a greater and greater effort. It is simply unsustainable. It is a race that can only end up as a victory for the spammers.
  2. I thought, naively, that I could get around this problem with a home-made CAPTCHA. After all, I am certainly not important enough for spammers to write code specifically to pass my CAPTCHA. Unfortunately, spammers appear to be recruiting human beings. There is a large pool of people on Earth who will gladly get paid just to post spammy comments on minor blogs. Thus, no matter how good you are at distinguishing human beings from bots, you still cannot win with CAPTCHAs.
  3. Though not perfect, automated spam detection has gotten quite good. For my blog, I use the free service Akismet. It can stop most naive attempts to spam bloggers. I also have some fixed rules that will send a comment directly to the spam box. There is a small fraction of legitimate comments that I will never get to see, but this is already true with email. I have come to grips with the fact that messages online sometimes get lost.

So the default on this blog is that comments go to a moderation queue and I have to approve them, one by one. About half of the comments that pass my filters are still spam. If I were hosting a more popular service, I would probably still find a way to prevent abuse without using CAPTCHAs.

Credit: Thanks to John Regehr for inspiring this post.

Update: Sathappan Muthu pointed out to me a very cool CAPTCHA service: http://areyouahuman.com/.

Reflecting on 2012

The new year (2013) is here. So, it is time to reflect on what I have done and seen in 2012.

As a researcher, one of the most interesting innovations in 2012 has been the emergence of the Google Scholar profiles. They are pages where Google aggregates the work of a given researcher. I have long advocated that we should pursue an author-centric model and I think we are finally getting there: Google Scholar allows you to subscribe to the new publications of an author. Unfortunately, Google has focused the profile pages on citations: this encourages people to game the system by artificially boosting their citation counts. We will need better metrics if we want to openly assess researchers. To improve matters, I have started a small project described in my post From counting citations to measuring usage.

What is rather remarkable is that Google is actually disrupting the academic publication business… even though it is almost certainly not profitable for Google to do so. They are definitely creating value: it was far easier to find scientific references in 2012 than it was before, thanks to the progress Google is making with Google Scholar.

Similarly, Google has kept pushing the state of the art in database research with papers such as Processing a Trillion Cells per Mouse Click. This is especially remarkable if you consider that in academic circles, the physical design of databases has long been considered a solved problem. Clearly, academic wisdom was wrong.

Academics like myself often like to pretend that innovation starts with academic research, which then migrates to industry where the ideas are implemented. Google puts a dent in this theory, as they are clearly driving research in Computer Science. Over the Christmas break I finally got around to reading Kealey’s Sex, Science and Profits, in which he argues convincingly that this is true in general: the magical knowledge transfer from academic research to industry would account for only about 10% of all industrial innovations, and not the most valuable 10%.

I think that 2012 was a year when many academics like myself thought about the future. We have seen the emergence of massive open online courses from prestigious American universities and a few start-ups. It is becoming clear that governments are under increasing financial stress while universities have failed to become more efficient. It is hard to imagine that universities will remain undisrupted for another 20 years.

Though blogging was supposed to be dying in 2012, I am still writing and reading blog posts almost every day. One of the interesting innovations on my blog has been the use of the social coding platform GitHub to host the code related to my software posts. I really like it because it allows people to not only comment on my posts, but also easily review and change my code. This has led to more interesting discussions.

I have also started using GitHub for some of my research projects. And I have been lucky enough to get feedback from people who were interested in my code for their own reasons. The fact that GitHub makes it easy to contribute to a stranger’s code is a blessing for me as a researcher. In fact, GitHub has encouraged me to focus even more of my research on programming by making it easier to have an impact through software.

Though I want to resist making predictions, I think that in 2013, more of my time will be spent programming openly. I will probably move closer to an ideal of open scholarship where research papers are only one of the visible outputs of my research. I also hope to move closer to an ideal where much of my research is actually useful. It is maybe worth repeating here that most of what researchers write is never read by anyone (even when it is cited). We have focused on publishing so much that we ended up believing that it is an end in itself (it is not!).

I no longer measure the popularity of my blog using statistics. I feel that it is rather pointless given how many bots and dead subscriptions there are. However, I still assess the value of the blog based on the interactions it generates. Subjectively, 2012 was a good year: I got a lot of very interesting feedback.

Several of my blog posts were tied to my ongoing research. In an ideal world, I’d be able to decompose much of my research into short blog posts: I think I am getting closer to this model. In this respect, I am inspired by the famous Edsger Dijkstra who published little in journals and conferences: he thought that the formal peer review system was counterproductive. I am no Dijkstra, of course, but I find his model very compelling. I still value the feedback I get from a good journal review, and it certainly helps me improve my work (and the work of my graduate students), but I now see it as just one tool among many.

My blog publication rate has come down. This is part of a long-term trend caused by the increased usage of social platforms such as Google+ and Twitter and the collapse of RSS as a platform. That is, much of what I would have published on this blog back in 2005, I now publish on Google+ or Twitter. I tend to stick with my blog for the more substantial pieces.

Here are some popular blog posts in 2012:

  1. Do we need patents?
  2. What happens when you get more Ph.D.s?
  3. I’m an introvert. And that’s ok.
  4. Computer scientists need to learn about significant digits
  5. Data alignment for speed: myth or reality?
  6. On the quality of academic software
  7. Is C++ worth it?
  8. To improve your intellectual productivity
  9. Fast integer compression: decoding billions of integers per second
  10. When is a bitmap faster than an integer list?
  11. Should you follow the experts?
  12. A simple trick to get things done even when you are busy

Of course, my life is not just about writing software, grading papers and blogging. In 2012, I kept making things by hand:

  • I made 5 tables that are nice enough that my wife put them in the living room. There is now a piece of furniture I made myself in almost every room of our house. Two years ago I would have been unable to build a bird house (if only because I lacked the tools and material).
  • I started square foot gardening and had generally good luck.
  • I got much better at making bread. I now use a couple of variations on Lahey’s recipe. It is ridiculously easy and it works every time! I get beautiful results and everyone likes my bread. I have also much improved my pizza bread. So I am now making my own yogourt, wine, port, beer and bread, all from scratch.

Maybe I should conclude with some of the mistakes I made in 2012:

  • I designed an RC crawler almost from scratch. It probably cost 5 times what it should have because I did not know what I was doing. It was fun in the end, but I am happy my wife does not know how much it cost.
  • I tried fixing an iPad with broken glass by myself. I ordered the parts from Hong Kong and did almost everything correctly, but when I put the iPad back together, half of the screen was darker, so the iPad became unusable. I also started fixing a broken glass on an Asus tablet, but I got discouraged and stopped. I wasted a lot of money on parts.
  • I overcommitted professionally. A friend told me that what he likes about my academic record is that I do few things, but I do them well. But 2012 was a bad year for me in this respect. I just agreed to do too many things. Maybe it is my introverted personality, but I do poorly under pressure: I tend to get more and more exhausted and I achieve less and less as the pressure builds up. I find that I need to focus on 3 or 4 important things at most. Now that the Christmas break is ending, I feel a bit anxious about everything I need to complete. I hope that in 2013, I will refuse to overcommit and just focus on doing things well. It is harder than it sounds.

Why do students pay for the research professors do?

Universities require their professors to publish research papers. Yet publishing your research has little to do with most of the teaching that goes on in universities. And with online teaching, we can almost completely separate teaching from research. Yet we are typically happy to dismiss these concerns by pointing out that universities also have a research purpose. But this answer is not entirely satisfying: who gets to decide what universities should do besides providing degrees and teaching?

There was a long student boycott in Quebec in 2012 that attracted worldwide attention. Students asked for free higher education. One of their core arguments for cheap tuition was that much of the university budget goes to support research. Apparently, many students would rather not pay for research. That is, if universities have to do research, then it is up to the government (or to industry) to fund it.

I think that the Quebec students are misguided when it comes to research. So let us ask why universities do research.

How did we get started? The first universities were entirely funded by the students. In fact, they were often run by the students themselves. Yet, even back then, the professors published and communicated their research eagerly. Why?

There would be great savings if we could just get rid of research. It is easy to find people willing to lecture for a fourth of the salary of a professor.

But professors are not lecturers even though they sometimes lecture. Students seek to be close to people who are perceived as leaders in their respective areas. They do so because recognition from such a leader is highly valued socially. And to recruit and retain leaders, you need to pay a decent salary.

The principle is general. If you are a well-known porn star, it is likely that there are people who will pay just to get some coaching from you, precisely because you are known as a porn star. So, a computer science professor should try to be known as a computer scientist. Then students who want to become computer scientists will want to have access to this professor.

Publishing is a very natural process if you want to build up your reputation. In fact, many people who write non-fiction books do so because it will attract indirect sources of income such as consulting and training. Professors are not different: they write books and research articles because it increases their social status. In turn, this social status can be monetized.

Thus, if you want to know whether a professor is doing a decent job, you should ask whether people would be willing to pay to interact with him if we did not have universities. A good professor should be able to fund much of his research with consulting and training contracts, should he ever lose his academic position. Hence, his employer gets a fair deal even if it has to allow him to spend a great deal of time on self-directed research.

Students who are merely interested in some cheap teaching can find affordable people willing to tutor them. But that is not what motivates students to attend universities. They seek prestige. This is why professors have to act as role models. That is why we need to pay them to publish.

My argument is falsifiable. The Quebec government could create new universities that are focused entirely on teaching. They would do no research, but they would have lower tuition fees. If students really do not care for research, they should all flock to these new universities. Of course, teaching-only universities do exist and though they attract their fair share of students, they have never disrupted conventional universities.

A simple trick to get things done even when you are busy

I have been overwhelmed with busy work for the last couple of weeks. Unavoidably, this has meant shorter emails, fewer follow-ups with collaborators and not as much blogging as I would like. I probably missed a few commitments as well.

However, despite all the turmoil of responsibilities, committees and grading, I still get a little bit of core research done every week. I do have a little secret weapon. It will likely only work with one aspect of your life, but it will work well. It is called “pay yourself first”.

Imagine a typical workday. You look at your schedule and it is crazy. At some point, you have half an hour that you can spend how you want. What do you do? Probably you have 50 unanswered emails, a few forms to fill out, some advice to give out, a grant proposal to review… What these tasks have in common is that they don’t require a lot of energy but they consume time. They also tend to be important for someone else. They are tempting because with relatively little energy, you can reduce the length of your to-do list and immediately feel good about yourself. And someone might notice right away that you did them! These tasks are like candy. You feel good and energetic while doing them, but eventually they drain you.

Instead, look at what you figured was important in your life. For example, working out, writing a novel or a research paper. Do spend time doing this important work first. I do not mean that you should sacrifice the other aspects of your life, only that in your schedule, you should complete some tasks before the others.

The key insight is to recognize that there are different types of work, just like there are different types of food, and that the order in which you do your work matters. If you eat dessert first, you will never be hungry for the meat and the vegetables.

Rationally, it does not seem to make sense. Why would it matter when you work on your novel? But consider that busy work has a way of expanding to take up all your time. That is because it is so tempting to work on short-term problems, and there is never a shortage of those.

I suspect that this issue has deep roots. Our ancestors were probably most successful when they worried most about day-to-day issues. We are tempted to do the same.

Why I like the new C++

I was reading Vivek Haldar’s post on the new C++ (C++11) and I was reminded that I need to write such a post myself.

C++ is a standardized language. And they came up with a new version of the standard called C++11. Usually, for complex languages like C++, new versions tend to make things worse. The reason is simple: every vendor is eager to see its features standardized. So the whole standardization process becomes highly political as mutually contradictory features must be glued on. But, for once, the design-by-committee process worked really well: C++11 is definitively superior to previous versions of C++.

In one of my recent research projects, we produced a C++ library leveraging several of the new features of C++11. The use of C++11 is still a bit problematic. For example, our library only compiles under GCC 4.7 or better, or LLVM 3.2.

So, why did we bother with C++11?

1. The move semantics

In conventional C++, values are either copied or passed by reference. If you are adding large objects (e.g., the data of an image) to a container such as an STL vector, then they are necessarily copied unless you do tricks involving pointers and other magical incantations. Performance-wise, this is absolutely terrible!

The new C++ introduces move semantics: you can move data to a container (such as an STL vector) without copying it! For example, the following code took over 4s to run using regular C++, and only 0.6s using C++11: a performance boost by a factor of more than 5, without changing a line of code.

vector<vector<int> > V;
for(int k = 0; k < 100000; ++k) {
    vector<int> x(1000);
    V.push_back(x); // when V grows, C++11 moves the stored vectors instead of copying them
}
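
You can go further and move explicitly. Here is a minimal sketch of my own (not code from the original library or benchmark) where std::move transfers the buffer of the temporary vector into the container, so that not even the push_back makes a copy:

#include <utility>
#include <vector>
using namespace std;

int main() {
    vector<vector<int> > V;
    for (int k = 0; k < 100000; ++k) {
        vector<int> x(1000);
        // std::move turns x into an rvalue: its internal buffer is
        // transferred to the new element instead of being copied.
        V.push_back(std::move(x));
    }
    return 0;
}

After the call, x is left in a valid but unspecified state, which is fine here because the loop discards it immediately.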

2. No hassle initialization

Initializing containers used to be a major pain in C++. It has now become ridiculously easy:

const map<string,int> m = {{"Joe",2},{"Jack",3}};

There are still a few things that I could not get to work, such as initializing static members within the class itself. But, at least, you no longer waste minutes initializing a trivial data structure.
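
As a further sketch of my own (the names Config, retries and hosts are purely illustrative), the same brace syntax nests into more elaborate containers, and C++11 also lets you give default values to non-static data members directly in the class definition:

#include <map>
#include <string>
#include <vector>
using namespace std;

struct Config {
    int retries = 3;                     // non-static data member initializer (C++11)
    vector<string> hosts = {"a", "b"};   // braces work here too
};

int main() {
    // nested brace initialization of a map of vectors
    const map<string, vector<int> > scores = {
        {"Joe", {1, 2, 3}},
        {"Jack", {4, 5}}
    };
    Config c;
    return (scores.size() == 2 && c.retries == 3) ? 0 : 1;
}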

3. Auto

STL uses iterators. And iterators are cool. But old C++ forces you to fully qualify the type each time (e.g., vector<int>::iterator) even when the compiler could deduce the type. It gets annoying and it makes the code hard to parse. C++11 is much better:

vector<int> x = {1, 2, 3, 4};
for (auto i : x)
  cout << i << endl;

These lines of code initialize a container and print it out. I never want to go back to the old ways!
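
For a case where auto pays off even more, compare spelling out the iterator type against letting the compiler deduce it (my own sketch, not code from the post):

#include <iostream>
#include <map>
#include <string>
using namespace std;

int main() {
    map<string, int> m = {{"Joe", 2}, {"Jack", 3}};
    // Old style: the full iterator type must be written out.
    for (map<string, int>::const_iterator it = m.begin(); it != m.end(); ++it)
        cout << it->first << " " << it->second << endl;
    // C++11: the compiler deduces the same type.
    for (auto it = m.begin(); it != m.end(); ++it)
        cout << it->first << " " << it->second << endl;
    return 0;
}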

4. Constexpr

Suppose that you have a function that can safely be called by the compiler because it has no side effects: most mathematical functions are of this form. Think about a function that computes the square root of a number, or the greatest common divisor of two numbers.

In old-style C++, you often have to hard-code the result of these functions. For example, you cannot write enum{x=sqrt(9)}. Now you can!

For example, let us define a simple constexpr function:

// returns the greatest common divisor
constexpr int gcd(int x, int y) {
    return (x % y) == 0 ? y : gcd(y, x % y);
}

If gcd were just any function, then it might be called multiple times in the following loop. But thanks to C++11, it is never called while the program is running: the compiler evaluates it just once:

for(int k = 0; k < 100000; ++k) {
    vector<int> x(gcd(1000,100000)); // gcd(1000,100000) is evaluated at compile time
    V.push_back(x);
}

(Naturally, a buggy C++ compiler might fail to optimize away the constexpr function call, but you can then file a bug report with your compiler vendor.)
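
One way to check that the call really is evaluated at compile time (again a sketch of my own, not from the post) is to use the result where only a compile-time constant is accepted, such as a static_assert or an array bound:

constexpr int gcd(int x, int y) {
    return (x % y) == 0 ? y : gcd(y, x % y);
}

// Both lines below only compile if gcd is evaluated by the compiler.
static_assert(gcd(1000, 100000) == 1000, "gcd should be computed at compile time");
int buffer[gcd(12, 18)]; // an array of 6 integers

int main() {
    return buffer[0]; // global arrays are zero-initialized, so this returns 0
}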

Conclusion

C++11 is not yet supported by all compilers, but if you are serious about C++, you should switch to C++11 as fast as possible.

As usual, my code is available from GitHub.

What I do with my time

I am not a very productive person. I also do not work long hours. However, I sometimes give the impression that I am productive. I have been asked to explain how I do it. Maybe the best answer is to explain what I do on a daily basis.

It is Wednesday. Here is what kept me busy this week:

  • Every night I spend an hour playing video games with my two sons. We finished Xenoblade a few days ago. It took 95 hours to finish the game, so about 95 days. We are playing Oblivion right now.
  • This week I spent an afternoon grading papers.
  • I spent probably an afternoon preparing a new homework question for my database course.
  • I spent almost an entire day reading a Ph.D. thesis and writing up a report.
  • I spent a morning running some benchmarks on an AMD processor. It involved some minor C programming.
  • I am writing this blog post and I wrote another one earlier this week. I often spend 3 hours blogging a week. Sometimes less, sometimes more. I also read comments, here and on social networks and I try to react.
  • I am part of some federal committee offering equipment grants. It took me a few hours to fill out a form this week.
  • I have been asked to be on the tenure review committee for the business school as an external. I have reviewed 4 or 5 files, looked up research papers, read the corresponding c.v.s and written up some notes.
  • I made bread twice this week. I make all the bread our family eats.
  • I spent a few hours working on furniture. I am building my own furniture for our living room as well as a few custom things for the house.
  • I spent a few hours on Google+ arguing with people like Greg Linden.
  • I spent a few hours exchanging emails with various people including graduate students.
  • Because I chair a certificate program, I had to answer a few questions from students. This afternoon, I wrote a long email to help the program coordinators. We are going to build some kind of FAQ for students.
  • I was asked whether my graduate data warehouse course would be offered next term. I explained to the chair of the IT M.Sc. program that it would be offered but that students can’t register right now.
  • Because there is much demand for this upcoming graduate data warehousing course, I prepared a long email with supporting documents to help move this along. I will be offering three graduate courses next term. And yes, I do all the grading myself. I am currently offering two though most of my teaching time is used up by the undergraduate database course that I am offering for the first time this term.
  • I spent a few hours this week arguing with a database student that, well, it is not ok if he is weak in mathematics. He needs to build up his expertise.
  • Someone submitted a problem to me about transcoding UTF-8 to UTF-16. We exchanged a few emails and I proposed a data structure along with a reference paper. This may eventually become a blog post.
  • I spent some time worrying that I am still without news about a paper I submitted 9 months ago to a journal.
  • I sometimes supervise Ph.D. students in the cognitive computer science program. The program is being reviewed right now. I spent my morning on Monday at a meeting with external experts.
  • Tomorrow, I have two administrative meetings occupying the full day.
  • I reviewed a report and some code from a Ph.D. student I co-supervise.
  • I picked up my kids from school yesterday and today. My wife did it Monday and she will do it again tomorrow.
  • I watched a dozen videos on YouTube. I have this amazing (but cheap) TV that can display YouTube videos. I really liked Using Social Media To Cover For Lack Of Original Thought.

I would qualify the current week as busy, but not extraordinarily so. It would be a much more relaxed week if I did not have a full day of meetings tomorrow.

Some things that I have not done:

  • I have not checked that this blog post has good spelling and grammar.
  • Before going to sleep, I watch a TV show or read books on my iPad. These days, I am watching Once upon a time. However, I do not watch broadcast TV.
  • I owe a couple M.Sc. students a meeting. I promised to email them last week, but I have not yet done so.
  • I have delayed a few meetings that were supposed to happen this week.
  • I was planning to do some follow-up work this week on a couple of research projects, but it looks doubtful that I will have any time at all. I constantly worry that I am just keeping busy instead of doing the important work.
  • I am trying to read The Art of Procrastination: A Guide to Effective Dawdling, Lollygagging and Postponing, but I am not making much progress. (This is not a joke.)
  • I used to spend a lot of time following blogs. I do not have much time for this anymore.
  • I am also not attending strip clubs or doing threesomes. I do not have a mistress.
  • Unlike one of my friends, I do not run a farm.
  • I do not travel.
  • I am not active in politics.
  • Other than swearing, I have no religious activity.
  • I do not have a side business. I do not consult.
  • I do not workout. (I compensate by drinking coffee.)
  • I do not clean up my office.
  • I do not shop for clothes.
  • Though I shower every day, I do not spend any time at all trying to look nice. I pick my clothes randomly in the morning. I am sometimes astonished how business-like some of my colleagues look.

Conclusion: I do not know what to conclude, but I am interested in how what I do differs from what other people do.