Most Computer Science textbooks assume that algorithms are written directly into machine language for an idealized machine under a Von Neumann architecture. Alas, at best, these idealized models provide “ballpark” guidance to writing high performance software. They get you to avoid quadratic-time algorithms when linear-time algorithms are available. Yet there can be orders of magnitude of difference between two linear-time implementations.

To make matters worse, increasingly, programmers work with high level languages like JavaScript, Python or Ruby. The programmers are further away from the idealized machine. So much so that the intuition one might build from traditional algorithmics can be detrimental.

Consider the following example. Sometimes you want to find the location of a maximum in an array. This operation is often called arg max. The Python language lacks a builtin arg max operation. In one of the embarrassing moments of this blog, I proposed a fancy way to compute arg max, using a single pass through the data and a constant amount of memory. It is what Computer Scientists would consider a good solution. The inelegant and inefficient alternative is to first compute the maximum, and then scan the array again to find a matching location. To my surprise, this simplistic solution was much more efficient than what I proposed. Today, I repeated the benchmark under Python 3.01:

argmax function time
array.index(max(array))
2.2 s
def maxarg(arr):
     counter = 0
     arg = 0
     m = arr[0]
     for x in arr:
          if x > m:
               m = x
               arg = counter
          counter += 1
    return arg
4.2 s
max(zip(array, range(len(array))))[1]
3.4 s
max([array[i],i] for i in range(len(array)))[1]
8.1 s
max((array[i],i) for i in range(len(array)))[1]
6.2 s
max(range(len(array)), key=array.__getitem__)
3.4 s

(Numbers and benchmark updated as of June 15, 2011.)

All of these solutions except for the first one go through the data only once. Yet the first solution is nearly twice as fast as any other alternative.

Work blending high performance computing and  high level languages is likely to become increasingly important. Last week, I met with a local start-up that does real-time Web ad auctions. Effectively, each time you visit a Web page, some of your data is sent to algorithms which must decide very quickly what an ad to you is worth. Most of their real-time architecture is written in JavaScript and runs  under the Google V8 engine. They assure me that JavaScript is not the bottleneck.

It used to be that to write high performance software, you needed to know a lot about how the hardware worked. This era is coming to an end. The language interpreters are the new machines.

Further reading: High Performance JavaScript by Zakas and Python for high performance computing by Cook.

Challenge: Can you beat the silly arg max implementation (array.index(max(array))) using pure Python?

Appendix: In C++, the one-pass arg max is about twice as fast.

Code: Source code posted on my blog is available from a github repository.

32 Comments

  1. I actually ran into this over 15 years ago with C++ compilers. Optimizer’s were relatively new and improving rapidly from version to version. My really cool, highly efficient roll you own algorithms often ‘broke’ the optimizer in that the optimizer could not figure out what I were trying to do and could not help me. Over one 2 year process I found my code going from very fast compared to ‘dumb’ implementations to slower than ‘dumb’ implementations.

    The key was that the optimizers were taking advantage of the full instruction set of what was then state of the art processors where as my algorithms assumed a much simpler architecture.

    Damn you Knuth!

    Comment by Jib — 14/6/2011 @ 15:15

  2. An aside not relevant to your main point: when using numpy, arrays have an argmax() method, which is much faster than any of these solutions. Actually I think numpy reinforces your point: I’m often doing multiple passes over arrays when not strictly necessary because it vectorizes better in numpy or Matlab.

    Comment by Iain — 14/6/2011 @ 15:46

  3. @Jib

    I suspect your experience is common. I also get into this sort of trouble in C++: the optimizer beats my hard work routinely.

    My point is that this is even more true in Python or JavaScript.

    Comment by Daniel Lemire — 14/6/2011 @ 15:48

  4. @Jib, if a compiler can use SSE operations, then it can beat you. However, if you use SSE operations explicitly, you are likely to beat the compiler.

    @Daniel, this is a rather common issue nowdays. Loops are often prohibitive in script languages, because a built-in function may be orders of magnitude faster.

    Comment by Itman — 14/6/2011 @ 16:07

  5. PS: However, this is not a new issue. Consider, e.g., a SQL with procedural extensions.

    Comment by Itman — 14/6/2011 @ 16:11

  6. Two reasons why this could be the case:

    If array.index() and max() are builtin functions, rather than interpreted, in a non-JIT language (like the Python 3 interpreter), then they will likely run much faster than the interpreted equivalent, even though the “algorithm” used for the others is more efficient. Presumably, this gap would close as the language interpreter gets faster or JITed.

    Second is that even in lower level languages like C, this “may” be the case. If the array.index() and max() functions run in a way that is very predictable for CPU branch prediction and makes efficient use of the processor cache (cache is properly prefetched) and the alternative has lots of sporadic branches and/or accesses data in unpredictable ways (so that there is little or no prefetching), as I suspect is the case with list comprehension and traversal (compared to simply scanning an array whose elements are beside each other; lists are probably linked lists – a notoriously cache-unfriendly data structure), then the code which technically does more work would still be faster.

    tl;dr: modern computers and languages are complex!

    Comment by Dan — 14/6/2011 @ 16:21

  7. @Dan I have updated my blog post with a link to a C++ test which shows that a single pass through the data is, as expected, about twice as fast.

    Comment by Daniel Lemire — 14/6/2011 @ 17:39

  8. You only ran each pass once!? or even if you did it multiple times you only posted averages!?
    In order for the data to be useful we also need you to post the individual samples so we can run it under statistical tests.

    Comment by mr foo it — 14/6/2011 @ 18:17

  9. func ArgMax(arr []int) (index int, max int) {
    for i,n := range arr {
    if i==0 || n > max {
    index = i
    max = n
    }
    }
    return
    }

    func main() {
    array := []int{1,2,3,4,5,6,7,6,5,4,11113,2,1}
    i,max := ArgMax(array)
    println(i)
    println(max)
    }

    Comment by Will Fitzgerald — 14/6/2011 @ 18:31

  10. Accumulating the maxarg as the array is populated and transformed gives you a O(1) query later. Of course it doesn’t work if you remove items, but if maxarg is your bottleneck, that might be a valuable constraint to impose.

    Alternately, if maxarg is truly your bottleneck and you do remove elements from the array, then consider spending more memory and maintaining a heap of indices to array elements, sorted by their array element value, for constant-time maxarg query at the cost of O(logn) add/remove.

    Holistic algorithm enhancement like that can often trump all other solutions- the ideal number of times to scan the array is 0, right? :)

    Put another way, if you’re optimizing the wrong aspect of the problem, who cares which approach is faster?

    Comment by Charles Nicholson — 14/6/2011 @ 19:09

  11. @mr foo it

    It is much better if you try to reproduce my results which you can do in seconds if you have a Mac or a Linux box. If not, you just need to install Python which takes minutes.

    Comment by Daniel Lemire — 14/6/2011 @ 19:13

  12. @Will I need to learn Go.

    Comment by Daniel Lemire — 14/6/2011 @ 19:14

  13. Python has some easy-to-use objects, but the overhead of dealing with them–dictionary accesses, function dispatch overhead, object creation–will often overwhelm any potential algorithmic advantages. To speed things up, usually you want to avoid these as much as possible. That’s why the array.index(max(array)) solution works so well. I don’t call my lists “array,” however because there is also an array type in Python (in the array module). If you make “array” an array.array instead of a list, this is even faster on my benchmark.

    In the real world, NumPy is often the best way to eliminate this problem. If you convert array to a NumPy array and use its argmax() function, it is quite fast indeed.

    Comment by Michael Hoffman — 14/6/2011 @ 19:38

  14. I think you left out the most obvious definition of it:

    def maxarg(arr):
    counter = 0
    arg = 0
    m = arr[0]
    for x in arr:
    if x > m:
    m = x
    arg = counter
    counter += 1
    return arg

    which run about as fast as array.index(max(array)) (on python 2.6, at least). Sometimes it’s a bit slower, sometimes a bit faster. Which is indeed surprising, although understandable. It’s the language overhead. Your solutions have a huge amount of overhead (some of they may even be running at O(2n)), that’s why they are that much slower.

    BTW, that code that runs on javascript, does it run in the client? If so, it’s pretty obvious why it isn’t the overhead. Otherwise, it seems to me like a silly choice and it really impress me if the javascript is not the bottleneck if they do anything other than read and write files (very low on processing) with it.

    Comment by Rafael — 14/6/2011 @ 20:25

  15. I don’t get that result at all, Rafael. On Python 2.6.5, using the script at https://gist.github.com/1026313:

    The index/max version takes 1.62 s, but the maxarg() version takes 5.09 s.

    Comment by Michael Hoffman — 14/6/2011 @ 20:33

  16. @Rafael

    Thanks but your function, according to my tests, is several times slower than array.index(max(array)). In fact, it is slower than any of the alternative, on my machine. Can you publish your numbers and your code?

    Comment by Daniel Lemire — 14/6/2011 @ 20:34

  17. @Hoffman

    I did not know about the array package: thanks! Unfortunately, I don’t get any speed gain by using a Python array instead of a list.

    Comment by Daniel Lemire — 14/6/2011 @ 20:39

  18. Daniel, here’s a better formatted sketch of ArgMax in Go — both a ‘generic’ version and one specific for ints, that also returns the max value

    https://gist.github.com/8856a9ce3b18c7550db7

    Comment by Will Fitzgerald — 15/6/2011 @ 1:08

  19. @Daniel: Of course that code will be much faster – its not doing the same thing as what I suspect the Python version (even if Python were generating equally optimal code) is doing.

    Your code is running once through an array (whose elements are beside each other), which can be easily prefetched. Your index(max()) version runs through the array twice, which means that the memory (since N elements won’t all fit into the cache at once) must be loaded at least a second time. Of course the single traversal will be about twice as fast.

    The Python version however is performing a list comprehension, which presumably (though I could be wrong) is doing more work than simply traversing the list (ie building a second list as it traverses the first) – much more work than your C++ implementation!

    The real reason theres such a speed difference in Python is builtins vs not builtins though.

    Comment by Dan — 15/6/2011 @ 2:11

  20. I think that the new, higher level languages still require good knowledge of how the machine works at lower level. The difference is that you should always know your language and runtime as well – something cheerfully omitted in CS courses and language-learning books.

    Because of that I’ve been suggesting to my professors to make assembly part of required courses. Not because we will use it to program, but because it opens a way to understand just what happens inside the runtime, instead of treating it like magic in hands of a new apprentice/novice.

    Comment by Paweł Lasek — 15/6/2011 @ 2:50

  21. @Daniel here at work my times were a little different than at home. However, the difference was never as big as yours. Perhaps doing a shuffle and creating a string like you did makes the difference, I don’t know. That wouldn’t make much sense.

    $ python –version
    Python 2.6.5
    $ python test.py
    1.22238516808
    3877433
    1.50368499756
    3877433
    $ python test.py
    1.1917090416
    3333202
    1.50043797493
    3333202
    $ python test.py
    1.34272694588
    5674904
    1.49921607971
    5674904
    $ python test.py
    1.08860707283
    1109969
    1.53280496597
    1109969

    Code is at: http://rafael.kontesti.me/test.py

    Comment by Rafael — 15/6/2011 @ 13:01

  22. Actually, I forgot I had set .py extension to be executed in that directory. Please refer to it at http://rafael.kontesti.me/test.txt

    Comment by Rafael — 15/6/2011 @ 13:04

  23. @All, who started posting the code and compare times,

    In my opinion, you miss the main point. In this or another case, your mileage may vary. However, a much simpler algorithm that seems to be inefficient may easily beat a more sophisticated once due to hardware and software specifics.

    Examples are numerous! To be more conclusive, IMHO, Daniel should have posted an example of a hash vs unfolded search loop. In C++!!! It is very instructive to see that loops, especially unfolded ones, do beat hashes on small data sets. Same thing for binary searching.

    High level script languages are often even worse with this respect. However, the general intuition is that built-in function often work faster than loops. Therefore, for mission-critical program pieces, one should measure performance to select the best variant.

    Comment by Itman — 15/6/2011 @ 13:17

  24. @Rafael

    Thanks for posting your code and timings. I now realize I was wrong to compare a random shuffle (C++) with actual random bits (Python). Therefore, I have updated my blog post with a random shuffle like yours.

    As you can see from my updated numbers, your function is still slower. However, if I revert back to Python 2.6, the gap is not so large. But it still faster (on average) to do array.index(max(array).

    Comment by Daniel Lemire — 15/6/2011 @ 13:48

  25. @Itman

    I will come back to this topic. Yes, I think that comparing hash tables with linear scan is interesting.

    Comment by Daniel Lemire — 15/6/2011 @ 13:51

  26. @Daniel,
    It is not just hash tables. A lot of fancy tree structures (R-trees, KD-trees, BTKs, even regular binary trees), perform best when elements close to leaves are stored in buckets and are sought sequentially.

    Comment by Itman — 15/6/2011 @ 13:55

  27. So, running Rafael’s benchmark, but setting the last value to the max:

    arr = range(10000000)
    random.shuffle(arr)
    arr[10000000-1]=10000000+1

    argmax consistently beats out index(max)

    I also see it consistently winning on his benchmark. BUT NOT IF THE ARRAY IS UNSHUFFLED.

    This implies it’s not (just) the builtins, but other, perhaps CPU related, things going on.

    One of the disadvantages here is *not* being able to peek at the assembly.

    Comment by Will Fitzgerald — 15/6/2011 @ 14:27

  28. @Dan: the cost of traversing an array depends on where it is in the memory hierarchy, which is not accounted for in typical algorithmic analysis.

    For example, you could tile the search for max and its index by iterating over a segment of the array that fits in cache, finding the max and its index of that segment, replacing the old values if necessary.

    This approach does go over the array twice, but only in cache. The array is gone over only once in main memory.

    It would be very interesting to figure out at which point in the problem size does this optimization ceases to outweigh the O() behaviour of a theoretically better algorithm.

    Comment by Eric LaForest — 15/6/2011 @ 16:50

  29. With pypy, which packs a JIT, your version is way faster than array.index(max(array)) (on my machine, by a factor 4).

    It’s quite well known by Python programmers that CPython has much faster “implicit” loops (e.g., max(l), l.index()) than “explicit” ones (for x in l), because implicit ones manage to avoid many of the interpreter overhead. All these things change, though, when you have a JIT.

    Comment by Matteo Dell'Amico — 22/6/2011 @ 12:03

  30. Nice point.
    It’d be interesting to compare what is faster — “native” C or C++ STL implementation like this:

    std::max_element(vec.begin(), vec.end()) – vec.begin();

    Comment by Roman Shapovalov — 8/7/2011 @ 2:28

  31. Also, how does it scale with changing array size?
    PS. Your CAPTCHA is really annoying. :)

    Comment by Roman Shapovalov — 8/7/2011 @ 2:30

  32. This APL programmer suggests that using a higher level language brings one _closer_ to the idealized machine.

    Comment by Randy A MacDonald — 8/1/2012 @ 22:31

Sorry, the comment form is closed at this time.

« Blog's main page

Powered by WordPress