If you are reading a random textbook on computer science, it is probably going to tell you all about how good sorting algorithms take linearithmic time. To arrive at this result, they count the number of operations. That’s a good model to teach computer science, but working programmers need more sophisticated models of software performance.
On modern superscalar processors, we expect in-memory sorting to limited by how far ahead the processor can predict where the data will go. Though moving the data in memory is not free, it is a small cost if it can be done predictably.
We know that sorting “already sorted data” can be done in an easy-to-predict manner (just do nothing). So it should be fast. But how much faster is it that sorting randomly shuffled data?
I decided to run an experiment.
I use arrays containing one million distinct 32-bit integers, and I report the time in CPU cycles per value on a Haswell processor. I wrote my code in C++.
function | sorted data | shuffled data | sorted in reverse |
---|---|---|---|
std::sort | 38 | 200 | 30 |
For comparison, it takes roughly n log(n) comparisons to sort an array of size n in the worst case with a good algorithm. In my experiment, log(n) is about 20.
The numbers bear out our analysis. Sorting an already-sorted array takes a fraction of the time needed to sort a shuffled array. One could object that the reason sorting already-sorted arrays is fast is because we do not have to move the data so much. So I also included initial arrays that were sorted in reverse. Interestingly, std::sort is even faster with reversed arrays! This is clear evidence for our thesis.
(The C++ source code is available. My software includes timsort results if you are interested.)
I’ve tried to find a difference between Dutch National Flag algorithms based on the number of swaps they do, but nothing seemed to show up in the timings.
String sorting is also particularly interesting on current architectures, for both caching and branch prediction reasons.
One typo to note: you put an n in front of “log n is 20”.
Variable-length or very long strings can make it hard for the processor to look ahead. That’s why it is a good idea to get rid of strings when you can.
Well, I could go on about that, but for long strings (total distinguishing prefixes D >> n log n) the most common character comparison is equality, so branch prediction is pretty easy. It’s the cache complexity that sucks — many classic algorithms are O(D) in cache misses.
so branch prediction is pretty easy
I did not use the expression “branch prediction” anywhere because I think that counting branch mispredictions is too simplistic.
Variable-length strings can seriously hurt superscalar execution even if there are few branch mispredictions… because the processor can’t tell when the string ends… and so it does not start working on the next string.
You are right that memory accesses are going to be expensive, but if they can be predicted ahead of time, they can be free… because the data has been prefetched.
The problem with variable-length strings is that they can blind the processors to what is coming next. It just sees the string and does not look at the work that needs to be done after looping over the string.
Of course, if the strings do not fit in CPU cache, then you are going to start having performance trouble… at some point, you cannot sort faster than you can read and write to RAM… but there are cache-friendly algorithms that can help…
I see, yes I did mistake branch prediction for cache prediction. The latter indeed suffers with anything using indirect references, which the memory predictor can’t predict, and the pointers are by definition permuted vs. the original data (or at least they become that way after a few rounds). That’s what makes it so exciting.
The good news is that it’s possible to sort with only O(n log n) real cache misses, with the other O(D) character accesses being contiguous and prefetchable.
Hi Daniel, so what is the result of that timsort on your machine?
@Michel
Timsort is a pretty good idea. It is 50% slower on shuffled arrays, but drastically (10x) faster on sorted ones.
You should compare different implementations of the standard library. If I recall correctly, clang’s `libc++` is optimized for the common cases, whereas gcc’s `libstdc++` is not.
Do you expect that my analysis will depend on the standard library I use?
I sure hope so, otherwise their optimizations would be pointless 🙂