The big-O notation is a teaching tool

One of my clients once got upset: the running time of our image processing algorithm grew by a factor of four when he doubled the resolution of an image. No programmer would be surprised by this observation. (Hint: doubling the resolution multiplies the number of pixels by four.)

All programmers eventually notice that some programs run much slower as more data is added. They know that if you sort data naively, doubling the size of the data multiplies the running time by four. It is a fundamental realization: scale matters when programming. Processing twice as much data is at least twice as hard, and often much harder.

We could wait for kids to learn this lesson by themselves, but it is more efficient to get them started on the right foot. Thus, one of the first things any computer science student learns is the big-O notation. They learn that printing out all values in an array takes O(n) time, whereas sorting the same array with the bubble sort algorithm can take O(n²) time. The latter result is a fancy way of saying that doubling the data size will multiply the running time by a factor of four, in the worst case.
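To make this concrete, here is a minimal timing sketch in Python (the function name and the input sizes are my own illustrative choices). On most machines, each doubling of n multiplies the running time of bubble sort by roughly four:

    import random
    import time

    def bubble_sort(arr):
        # Repeatedly sweep the array, swapping adjacent out-of-order pairs.
        n = len(arr)
        for i in range(n):
            for j in range(n - 1 - i):
                if arr[j] > arr[j + 1]:
                    arr[j], arr[j + 1] = arr[j + 1], arr[j]

    for n in (1000, 2000, 4000):
        data = [random.random() for _ in range(n)]
        start = time.perf_counter()
        bubble_sort(data)
        print(f"n={n}: {time.perf_counter() - start:.3f}s")
    # Expect each doubling of n to multiply the time by roughly four.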

Simple models are immensely useful as teaching tools and communication devices. But don’t confuse teaching tools with reality! For example, I know exactly how a gas engine works in the sense that I once computed the power of an engine from the equations of thermodynamics. But General Motors is simply not going to hire me to design their new engines. In the same way, even if you master the big-O notation, you are unlikely to get a call from Google to design their next search engine.

Unfortunately, some people idealize the big-O notation. They view it as a goal in itself. In academia, it comes about because the big-O notation is mathematically convenient in the same way it is convenient to search for your keys near a lamp even if you lost them in a dark alley nearby.

The problem with the big-O notation is that it is only meant to help you think about algorithmic problems. It is not meant to serve as the basis by which you select an algorithm!

When asked why an algorithm with a better big-O running time fails to be faster in practice, people often give the wrong answers:

  • Our current computer architecture favours the other algorithm. What they often imply is that future computer architectures will prove them right. But why would future computer architectures become more like the simple theoretical models of the past? When pressed, can they come up with examples where this has happened?
  • Because the faster algorithm has a worse big-O running time, you are exposed to denial-of-service attacks. A good engineer does not switch to a slower algorithm for all processing just to avoid a dangerous special case. Often, a fast algorithm with a few bad corner cases can be modified so that the bad cases are either promptly detected or handled better. You can also use different algorithms depending on the size of the data (see the first sketch after this list).
  • If you had more data, the algorithm with the better big-O running time would win out. In theory, moving from a 10KB data set to a 10TB data set is the equivalent of turning a knob… in practice, it often means switching to a whole other architecture, often requiring different algorithms. For example, who can compare QuickSort against MergeSort over 10TB of data? In practice, the size of the data set (n) is bounded. It makes no practical sense to let n grow to infinity: that is a thought experiment, not something you can actually realize.
  • You don’t understand computational complexity. This is probably the most annoying comment any theoretician can make. The purpose of the big-O notation is to codify succinctly the experience of any good programmer so that we can teach it quickly. If you are discussing a problem with an experienced programmer, don’t assume that he does not understand his own problem.
  • Using algorithms with high big-O running times is bad engineering. This statement amounts to saying that construction workers should not use power tools because they could cut their fingers off. Case in point: the evaluation of the regular expressions commonly used in Perl or Java is NP-hard. A short regular expression can be used to crash a server (see the second sketch after this list). Yet advanced regular expressions are used everywhere, from the Oracle database engine hosting your bank account to your browser.

    On the general question of what good engineering is, my view is that it is not about guaranteeing that nothing bad will happen, because something bad always will. Our software architecture is built on C and C++. Our hardware is overwhelmingly built without redundancies. Bad things always happen. I would argue that good engineering is being aware of the pitfalls, mitigating possible problems as much as possible and planning for failure.
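Here is a first sketch of what “different algorithms depending on the size of the data” can look like, in Python. It is a minimal illustration rather than production code, and the threshold of 16 is an arbitrary, tunable choice: a quicksort that hands small subarrays to insertion sort, the algorithm with the worse big-O but the smaller constant.

    def insertion_sort(arr, lo, hi):
        # O(n^2) in the worst case, but fast on tiny ranges
        # because of its very low overhead.
        for i in range(lo + 1, hi + 1):
            key = arr[i]
            j = i - 1
            while j >= lo and arr[j] > key:
                arr[j + 1] = arr[j]
                j -= 1
            arr[j + 1] = key

    def hybrid_quicksort(arr, lo=0, hi=None):
        if hi is None:
            hi = len(arr) - 1
        while lo < hi:
            # Small subarray: switch to the algorithm with the worse
            # big-O but the smaller constant.
            if hi - lo < 16:
                insertion_sort(arr, lo, hi)
                return
            pivot = arr[(lo + hi) // 2]
            i, j = lo, hi
            while i <= j:
                while arr[i] < pivot:
                    i += 1
                while arr[j] > pivot:
                    j -= 1
                if i <= j:
                    arr[i], arr[j] = arr[j], arr[i]
                    i += 1
                    j -= 1
            # Recurse on the smaller half and iterate on the larger one,
            # which keeps the stack depth logarithmic.
            if j - lo < hi - i:
                hybrid_quicksort(arr, lo, j)
                lo = i
            else:
                hybrid_quicksort(arr, i, hi)
                hi = j

Standard library sorts do this kind of thing routinely: the algorithm is picked based on the size and shape of the input, not on the big-O alone.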
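And here is a second sketch showing how a short regular expression can tie up a backtracking engine such as Python’s re module (Perl’s and Java’s engines behave similarly). The nested quantifiers force exponential backtracking on inputs that almost match:

    import re
    import time

    pattern = re.compile(r"^(a+)+$")

    for n in (18, 20, 22, 24):
        text = "a" * n + "b"  # almost matches: forces exhaustive backtracking
        start = time.perf_counter()
        pattern.match(text)
        print(f"n={n}: {time.perf_counter() - start:.3f}s")
    # The running time grows exponentially: each extra character roughly
    # doubles it. A few dozen characters are enough to tie up a server.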

Further reading: O-notation considered harmful, “In the long run…”

14 thoughts on “The big-O notation is a teaching tool”

  1. Well, you didn’t mention the right answer to such a question. Clearly, with a big enough input, the algorithm with the smaller big-O will run faster, and this seems to be the right answer whether we like it or not. Still, it could be the case that we always, or most of the time, have an input size much smaller than “big enough”, in which case it makes perfect sense to switch to an algorithm that runs slower on big inputs and faster on small ones. Thanks for yet another good entry.

  2. @Abdullah

    (…) with a big enough input, the algorithm with the smaller big-O will run faster (…)

    It will run faster in some cases. And that is assuming that the computational model matches your architecture.

    Very often, the big-O notation says nothing about how fast the algorithm will be in practice.

    I am not saying you should use bubble sort on large arrays… I’m saying that there is a lot more to high-performance computing than the big-O notation, in the same way that there is a lot more to engine design than thermodynamics.

  3. I get it now and this part says it all:

    The problem with the big-O notation is that it is only meant to help you think about algorithmic problems. It is not meant to serve as the basis by which you select an algorithm!

    Thanks for the clarification.

  4. The hyperbole surrounding worst-case complexity analysis these days is really distasteful. Headlines like “The big-O notation is a teaching tool” or “O-notation considered harmful” seem to advocate that people be ignorant, even if the article text ends up amounting to: the average case is often more important.

    Any good developer should understand and consider both average-case and worst-case performance. Both have real-world implications, related to things like cache and memory structure.

    Big-O notation is just an efficient way for people to express ideas like “doubling the size of the data multiplies the running time by four.” It has nothing to do with teaching, nor with “programmer experience.”

  5. @Alex

    My title is indeed hyperbolic. However, I am not advocating that people be ignorant. I believe that the big-O notation is one of the first things anyone serious about software should learn as it will save them much pain later on.

    I am, however, also saying that it is far from enough. If that’s all you know about how to assess the speed of an algorithm, you do not know enough.

  6. Hi Lemire. I follow your blog. Being on my way to a PhD degree, I learn a lot from your writings, especially the ones about academic writing.

    I have a comment about the following section.

    “When asked why an algorithm with better computational complexity fails to be faster, people often give the wrong answers:”

    I think you mean an algorithm with a better running time. Computational complexity is defined for problems. The performance of an algorithm is measured by its running time. The complexity of a problem is determined by the best algorithm that solves the problem.

  7. @Aykut Bulut

    “Computational complexity is defined for problems.”

    Complexity can also be defined for an algorithm or a program. More precisely, the information content of its data structures can be measured using algorithmic information theory, i.e., Kolmogorov complexity.

    “Complexity of a problem is determined by the best algorithm that solves the problem.”

    Instead of the best, I think you mean the shortest algorithm. Measures of complexity vary. Seth Lloyd has a list of measures:

    http://web.mit.edu/esd.83/www/notebook/Complexity.PDF

    “…doubling the data size will multiply the running time by a factor of four, in the worst case.”

    I think our ‘misconception’ comes from this: big-O notation measures the asymptotic behaviour of the running time as a function of the input size N, i.e., the behaviour for very large N. So the above statement may not be true for “small inputs”. For example, if the running time is measured as 4*N^2, we still describe the algorithm’s complexity as N^2. However, for small N, doubling the data size MAY multiply the running time by a factor of 16!

  9. @Daniel Lemire

    It is an elementary concept. Sorry, 4*N^2 was not a good example and my numbers were wrong. The idea was to illustrate that the asymptotic behaviour shown by the big-O notation can be “misleading” for “smaller” inputs. Let’s take the number of operations to be T(N) = 4*N^2 – 2N. T'(4) must be 4 times T'(2) if we consider T'(N) = N^2. But if we use T(N) instead, T(4) is about 4.67 times T(2). Actually, you have given this in your post, but I don’t know where the 10TB example comes from; usually, adding more data fulfills the asymptotic scaling. Of course, within reason: 10TB is too big for a single machine (for now). In my experience with N-body force calculations, the O(N^2) algorithm performs better than the O(N log N) ones up to a break-even point of even 50K particles. This is a good paper explaining this:

    Pringle, Gavin J. “Comparison of an O(N) and an O(N log N) N-body solver.” PPSC (1995): 337-342.

  10. Well, there are two problems with big-Os. One is that the architecture does matter. A second one is that the constant under the big-O matters as well (in certain cases, it is a bit more complicated than a constant, but a constant is a good approximation anyway). In fact, these considerations are interconnected.

    The architecture, e.g., whether you can use a cache-friendly algorithm or SIMD instructions, does affect the constant. The effect is typically not huge, but it is of practical importance, so one should not rely on asymptotic estimates alone.

    However, it really helps to think in terms of Big-O plus a constant. This is not premature optimization, and it often helps a lot. Regular expressions, as you may guess, are a bad example in my opinion, because it is possible to make most of them efficient. We could even have avoided this problem with developers who (1) respect the theory and (2) understand not only the Big-O, but also the hidden constant.
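    To make the hidden constant tangible, here is a tiny Python sketch (my own illustrative example): two ways of summing a list, both linear in the size of the input, with very different constants because one loop is interpreted and the other runs in C.

        import time

        data = [1.0] * 10_000_000

        start = time.perf_counter()
        total = 0.0
        for x in data:  # interpreted loop: large per-element overhead
            total += x
        loop_time = time.perf_counter() - start

        start = time.perf_counter()
        total = sum(data)  # C-implemented loop: small per-element overhead
        builtin_time = time.perf_counter() - start

        print(f"manual loop: {loop_time:.3f}s, built-in sum: {builtin_time:.3f}s")
        # Both are Theta(n). On CPython, the built-in is typically several
        # times faster: same big-O, very different constant.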

  11. I’ve encountered this effect firsthand. I was aware of the ideas here (at least to some degree); for instance, I know that standard implementations of quicksort often switch to insertion sort on small arrays.

    I recently solved a problem in Matlab in three different ways. Two were Theta(N) solutions that involved making several passes over the data. The third involved sorting the array first, followed by a comparatively simple pass. To my surprise, the algorithm involving sorting was faster for all the array sizes that I tried (I think on the order of N = 1 million).

    Built-in functions like sorting are likely to be so tweaked and optimized (especially in high-level languages) compared to the functions you can write yourself that they may beat out Theta(N) algorithms.
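    This is easy to reproduce in Python, too. Here is a minimal sketch (my own example, not the Matlab code above): a pure-Python quickselect runs in expected Theta(N), yet it can lose to the Theta(N log N) built-in sort because the latter runs at C speed.

        import random
        import time

        def quickselect(arr, k):
            # Expected Theta(n) selection of the k-th smallest element.
            while True:
                pivot = arr[random.randrange(len(arr))]
                lows = [x for x in arr if x < pivot]
                highs = [x for x in arr if x > pivot]
                pivots = len(arr) - len(lows) - len(highs)
                if k < len(lows):
                    arr = lows
                elif k < len(lows) + pivots:
                    return pivot
                else:
                    k -= len(lows) + pivots
                    arr = highs

        data = [random.random() for _ in range(1_000_000)]
        k = len(data) // 2

        start = time.perf_counter()
        median_qs = quickselect(data, k)
        print(f"pure-Python quickselect: {time.perf_counter() - start:.3f}s")

        start = time.perf_counter()
        median_sorted = sorted(data)[k]  # Theta(n log n), but at C speed
        print(f"built-in sort: {time.perf_counter() - start:.3f}s")

        assert median_qs == median_sorted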
