A case study in the performance cost of abstraction (C++’s std::shuffle)

Statisticians and machine-learning experts sometimes need to shuffle data quickly and repeatedly. There is one standard and simple algorithm to shuffle an array, the so-called Fisher-Yates shuffle:

for (i=size; i>1; i--) {
  nextpos = random_numbers_in_range(0,i);
  swap(storage[i-1], storage[nextpos]);
}

Not very difficult, is it? The C++ programming language, like many others, has a standard function to solve this problem (std::shuffle).

How does it fare against a very basic Fisher-Yates shuffle without any optimization whatsoever? To make sure that the comparison is fair, let us work with the same data (an array of strings) and use the same random number generator (I chose PCG). To avoid caching issues, let us use a small array that fits in cache.

Here is my unoptimized C++ code:

template <class T>
void  shuffle(T *storage, uint32_t size) {
    for (uint32_t i=size; i>1; i--) {
        uint32_t nextpos = pcg32_random_bounded(i);
        std::swap(storage[i-1],storage[nextpos]);
    }
}

I claim that this is the “textbook” implementation… meaning that it is the closest thing you can get to taking a textbook and coding up the pseudo-code in C++. (Yes I know that people copy code from Wikipedia or StackOverflow, not textbooks, but humor me.)

The pcg32_random_bounded function I call is implemented in a standard (but suboptimal) way: it produces a random number in a range using two divisions. You can do it with a single multiplication instead, but let us ignore optimizations.
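The post does not inline pcg32_random_bounded; a plausible sketch of the standard two-division (rejection) approach looks like the following. The PCG generator here is a minimal stand-in so the snippet is self-contained; the real implementation is in the linked code.

```cpp
#include <cstdint>

// Stand-in 32-bit PCG generator (minimal, for illustration only).
static uint64_t pcg_state = 0x853c49e6748fea9bULL;

static uint32_t pcg32_random() {
    uint64_t oldstate = pcg_state;
    pcg_state = oldstate * 6364136223846793005ULL + 1442695040888963407ULL;
    uint32_t xorshifted = (uint32_t)(((oldstate >> 18u) ^ oldstate) >> 27u);
    uint32_t rot = (uint32_t)(oldstate >> 59u);
    return (xorshifted >> rot) | (xorshifted << ((-rot) & 31));
}

// Unbiased value in [0, bound) using two divisions (both are modulo
// operations): one to compute the rejection threshold, one to reduce
// the accepted value.
static uint32_t pcg32_random_bounded(uint32_t bound) {
    uint32_t threshold = (-bound) % bound;   // first division: 2^32 mod bound
    for (;;) {
        uint32_t r = pcg32_random();
        if (r >= threshold)
            return r % bound;                // second division
    }
}
```

The threshold rejects the few values that would make the modulo reduction biased; for small bounds almost no values are rejected.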

Here are my results, expressed in CPU clock cycles per value… (Skylake processor, clang++ 4.0)

technique       clock cycles per value
std::shuffle    73
textbook code   29

So the textbook code is more than twice as fast as the standard C++ function.

Why is that?

It is often the case that default random number generators are slow due to concurrency issues. But we provide our own random number generators, so it should not be an issue.

A cursory analysis reveals that the most likely reason for the slowdown is that the standard C++ library tends to use 64-bit arithmetic throughout (on 64-bit systems). I implicitly assume, in my textbook implementation, that you are not going to randomly shuffle arrays containing more than 4 billion elements. I don’t think I am being unreasonable: an array of 4 billion std::string values would use at least 128 GB of RAM. If you need to shuffle that much data, you probably want to parallelize the problem. But, from the point of view of the engineers working on the standard library, they have to work with the requirements set forth by the specification. So 64-bit arithmetic it is! And that’s how I can beat them without any effort.

The absolute numbers might also surprise you. The Fisher-Yates shuffle is very efficient. We do a single pass over the data. We do not really need to look at the data, just move it. We use a fast random number generator (PCG). How can we end up with 30 or 70 cycles per array element?

Part of the problem is our use of std::string. On my system, a single (empty) std::string uses 32 bytes whereas a pointer (char *) uses only 8 bytes. If we fall back on C strings (char *), we can accelerate the processing simply because there is less data to move. Without going overboard with optimizations, I can bring the computational cost to about 7 cycles per element by avoiding divisions and using C strings instead of std::string objects. That’s an order of magnitude faster than the standard shuffle function.
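The single-multiplication mapping alluded to above can be sketched as follows. The generator next32 is a stand-in; note that this mapping is very slightly biased unless a rejection step is added, which is negligible for small bounds.

```cpp
#include <cstdint>

// Stand-in 32-bit generator; any uniform uint32_t source works here.
static uint64_t lcg_state = 0x9e3779b97f4a7c15ULL;
static uint32_t next32() {
    lcg_state = lcg_state * 6364136223846793005ULL + 1442695040888963407ULL;
    return (uint32_t)(lcg_state >> 32);
}

// Map a 32-bit random value into [0, bound) with one multiplication
// and a shift, no division: (x * bound) / 2^32 is in [0, bound).
static uint32_t random_bounded_nodiv(uint32_t bound) {
    return (uint32_t)(((uint64_t)next32() * (uint64_t)bound) >> 32);
}
```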

So while std::shuffle is very general, being able to shuffle just about any container using just about any random-number generator… this generality has a cost in terms of performance.

My code is available.

15 thoughts on “A case study in the performance cost of abstraction (C++’s std::shuffle)”

  1. Oddly, the difference persists when using 64-bit integer types throughout, and when using a different (64-bit) random number generator. But only on clang/libc++. Using g++ 6.0 and libstdc++, the difference vanishes completely.

    This suggests a rather serious bug in the libc++ implementation of std::shuffle: std::shuffle was specifically specified to allow implementations that avoid runtime penalties due to abstraction, and it replaces the now-deprecated std::random_shuffle, whose interface required inefficient implementations.

    For what it’s worth, I don’t think an implementation can actually do better than your “naïve” textbook implementation, and implementors should use it (libstdc++ does, libc++ doesn’t).

  2. And there’s a reason he doesn’t actually show input and produce a total running time output and instead cheats and just shows “clock cycles per value”. Because for any small shuffle, it makes no difference, and for any very large value, you need to consider whether you’re shuffling billions of values.
    Hence, std::shuffle.
    Unless you’re shuffling many collections constantly, this optimization isn’t worthy of any note. BTW, just for edification, I wrote a shuffler that runs as a micro-service on the network. All it does is return a shuffled deck of cards for the next hand of poker for thousands and thousands of tables of poker running concurrently (for a real live poker app. that is in production). I would never have bothered with this “optimization”.
    EDIT: If I’m doing the math right (and I often don’t), his example code runs in about 25ns using std::shuffle. And roughly half that using his “textbook” version. So, how many items do we need to be shuffling for any of this to make any real difference? Roughly billions (see above). Talk about useless/premature optimization.

    1. “Talk about useless/premature optimization.”

      This is neither useless nor premature. I surmise you are from the school that misinterprets Hoare and Knuth, and uses it as an excuse to not bother thinking about what you’re coding.

  3. I don’t think I can get on board with the conclusion/title.

    First off, std::string vs const char * seems to me to be entirely orthogonal to the implementation of std::shuffle. shuffle will shuffle any iterable container, so you can have an array of string or an array of const char *. Maybe you are adding string vs const char * to the “cost of abstraction” argument (the title makes it seem like it’s only shuffle)? But in any case, string is not slower than const char * in this particular case because of abstraction. It’s slower here because it has different performance characteristics (as part of a deliberate choice); knowing its own size, having extra space, and avoiding heap allocations in many cases, all allow it to be faster than const char * in many other cases, just not in this one.

    Second, the thing that actually makes shuffle slow is not abstraction, it’s an implementation detail, which is motivated by a need for correctness (arrays with at least 4 billion elements, which with const char * would only require 32 gigs of RAM, btw). The only abstraction in shuffle’s interface is the use of iterators instead of raw pointers, but this hasn’t been shown to have any cost.

    Ironically, shuffle’s implementation could be trivially altered to have optimal performance/correctness characteristics by adding more abstraction. Namely, you could add another template parameter controlling the integer type used. It would default to size_t for correctness, but users could switch it to whatever was fastest on their platform if they had foreknowledge of their array size.
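    The commenter's idea could be sketched like this (the names and the extra IndexType parameter are illustrative, not a proposed standard API):

```cpp
#include <cstddef>
#include <cstdint>
#include <utility>

// Fisher-Yates with the index type exposed as a template parameter.
// It defaults to size_t for correctness on huge arrays, but a caller
// who knows the array is small can pass a 32-bit size and get 32-bit
// arithmetic throughout.
template <class IndexType = size_t, class T, class RandomBounded>
void shuffle_with_index(T *storage, IndexType size, RandomBounded random_bounded) {
    for (IndexType i = size; i > 1; i--) {
        IndexType nextpos = (IndexType)random_bounded(i); // in [0, i)
        std::swap(storage[i - 1], storage[nextpos]);
    }
}
```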

    Of course, this comes at a cost in interface complexity. Anyhow, interesting information, but I don’t agree with the conclusions.

    1. Anyhow, interesting information, but I don’t agree with the conclusions

      That’s fair. Note however that I always post my code and that my experimental results should be easily reproducible. So you can disagree with my interpretation, but, hopefully, you should not disagree with the experimental observations.

      Maybe you are adding string vs const char * to the “cost of abstraction” argument (title makes it seems like it’s only shuffe)? But in any case, string is not slower than const char * in this particular case because of abstraction. It’s slower here because it has different performance characteristics (as part of a deliberate choice); knowing its own size, having extra space, and avoiding heap allocations in many cases, all allow it to be faster than const char * in many other cases, just not in this one.

      std::string has a higher abstraction level than char *. For performance-sensitive code, I avoid std::string. For the type of work I do, it has historically always been slower than a C string. I hear that in C++11, some of the performance issues with std::string have been addressed, but prior to this, there is ample documentation on the problems caused by std::string. I should add that I am not a fan of the char * paradigm on the principle that it encourages branching… but experimental evidence shows that it is very good for performance in many applications.

      Second, the thing that actually makes shuffle slow is not abstraction, it’s an implementation detail, (…)

      We are comparing calling a library that handles all the implementation details for us, with coding our own so that the implementation is transparent. The library call has a higher level of abstraction. A higher level of abstraction can be beneficial, and it can have a cost… usually, you have a bit of both… benefits and costs…

      The only abstraction in shuffle’s interface is the use of iterators instead of raw pointers, but this hasn’t been shown to have any cost.

      I haven’t checked but I do not expect the iterators to be an issue as far as performance goes (based on past experience).

      Iterators in C++ are great performance-wise. (In Java, they can be a problem.)

      Ironically, shuffle’s implementation could be trivially altered to have optimal performance/correctness characteristics by adding more abstraction. Namely, you could add another template parameter controlling the integer type used. It would default to size_t for correctness, but users could switch it to whatever was fastest on their platform if they had foreknowledge of their array size.

      Or you could branch on the number of elements that there are in the container. I think that the library could be engineered so that the difference I indicate goes away.
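      A runtime branch of that kind might look like the following sketch (the function names and generator signatures are assumptions, not library code):

```cpp
#include <cstddef>
#include <cstdint>
#include <utility>

// Dispatch once on the container size: use 32-bit indices when the
// container is small enough, falling back to 64-bit arithmetic
// otherwise. The two generators return uniform values in [0, bound).
template <class T, class Rng32, class Rng64>
void shuffle_dispatch(T *storage, size_t size, Rng32 bounded32, Rng64 bounded64) {
    if (size <= UINT32_MAX) {
        for (uint32_t i = (uint32_t)size; i > 1; i--)
            std::swap(storage[i - 1], storage[bounded32(i)]);
    } else {
        for (size_t i = size; i > 1; i--)
            std::swap(storage[i - 1], storage[bounded64(i)]);
    }
}
```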

  4. Unfortunately you do not conclusively answer the question ‘why?’ Please investigate further what the reason is, for example by analyzing the generated code.

        1. I did work it out and it is explained in my blog post. Divisions are the bottleneck: I explain that by removing them I can make the code roughly ten times faster. The standard library uses 64-bit arithmetic. (Hint: a 64-bit division is much slower than a 32-bit division…) So that is all in the blog post.

          It is entirely reasonable to want a more thorough analysis, but I keep my blog posts short by design, preferring instead to refer the readers to the code and encouraging experiments.

          In any case, whatever deep analysis I come up with can be reasonably questioned. Why did you use this compiler and not this other compiler? Why did you use this standard library? What about the processor, what happens if you try another… it is a rabbit hole… one has to define the scope somehow.

  5. Here’s one more argument supporting custom (as opposed to std::) shuffling algorithm:

    How often is it that you need a whole array shuffled anyway? Most use cases for shuffling only require a handful of items (say, two hands of five cards from a deck of 52+2) where items are only required to be distinct from each other, and from other hands. Hence a more efficient algorithm would amortize the shuffling cost over each draw of a next item, in order to never need to shuffle the entire array.
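    The amortized approach described above can be sketched as a partial Fisher-Yates that only runs the first k steps; random_bounded is an assumed uniform generator over [0, bound):

```cpp
#include <cstdint>
#include <utility>

// Run only the first k steps of Fisher-Yates: drawing a hand of k
// cards costs O(k) instead of O(n) for a full shuffle.
template <class T, class RandomBounded>
void partial_shuffle(T *storage, uint32_t size, uint32_t k, RandomBounded random_bounded) {
    for (uint32_t i = 0; i < k; i++) {
        // pick uniformly from the not-yet-drawn suffix [i, size)
        uint32_t nextpos = i + random_bounded(size - i);
        std::swap(storage[i], storage[nextpos]);
    }
    // storage[0..k) now holds k distinct items, uniformly drawn
}
```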

    1. Pseudo-random number generation can behave in surprising ways: a random-pick approach may not produce enough variation to approximate the real behavior of cards, as compared to an actual shuffle.

      There may be scenarios where this possible weakness does not matter, but when modelling cards, an actual shuffle is typically a much closer model than a random-pick.
