Filling large arrays with zeroes quickly in C++

Travis Downs reports that some C++ compilers have trouble filling up arrays with values at high speed. Typically, to fill an array with some value, C++ programmers invoke the std::fill. We might assume that for a task as simple as filling an array with zeroes, the C++ standard library would provide the absolute best performance. However, with GNU GCC compilers, that is not the case.

The following line of C++ code gets compiled as a loop that fills each individual byte with zeroes when applying the conventional -O2 optimization flag.

std::fill(p, p + n, 0);

When the array is large, it can become inefficient compared to a highly efficient implementation like the C function like memset.

memset(p, 0, n);

Travis reports that the difference is a factor of thirty. He also explains how to fix the C++ so that it is as fast as the memset function.

What about very large arrays? What if you need to write zeroes all over a 2GB array?

memset 30 GB/s
std::fill 1.7 GB/s

Roughly speaking, the memset function is 15 times faster than std::fill in my test. This demonstrates that even if you are working with large volumes of data, you can be easily limited by your inefficient implementations. You are not automatically limited by your memory bandwidth. In fact, std::fill in this instance cannot even keep up with a good network adaptor or a fast disk. We routinely parse JSON data, and write out DOM trees, at speeds far higher than 1.7 GB/s.

Another lesson I would derive from this example is that writing your code in idiomatic C++ is not sufficient to guarantee good performance. There is a lot of engineering involved in making C++ fast, and though it often works out magically, it can fail too.

My source code is available.

Daniel Lemire, "Filling large arrays with zeroes quickly in C++," in Daniel Lemire's blog, January 20, 2020.

Published by

Daniel Lemire

A computer science professor at the University of Quebec (TELUQ).

25 thoughts on “Filling large arrays with zeroes quickly in C++”

        1. I think that the rationale is part of the GNU GCC documentation, it states that -O2 “performs nearly all supported optimizations that do not involve a space-speed tradeoff”. That is, the -O3 flag may grow the binary size to achieve better speed. And, of course, -O3 may fail to deliver better speed because of the trade-offs involved.

          I think that if -O3 delivered consistently better performance, it would be the new default at release time, but I don’t think it works this way.

  1. Disclaimer: this is on clang, so maybe it doesn’t matter.

    Anyway, looking at the code I noticed that the memory is set to 1 only once and then re-zeroed in all subsequent benchmarks… so perhaps only the first benchmark (std::fill) is doing something.
    So I swapped around the memset bench and the std::fill bench and this came out:

    page count: 32, volume:0.125 MB
    memset(p,0,n) : 9.2219 GB/s
    std::fill(p, p + n, '\0') : 39.1376 GB/s
    std::fill(p, p + n, 0) : 39.3015 GB/s

    memset(p,0,n) : 39.3268 GB/s
    std::fill(p, p + n, '\0') : 39.5562 GB/s
    std::fill(p, p + n, 0) : 39.6719 GB/s

    memset(p,0,n) : 39.4922 GB/s
    std::fill(p, p + n, '\0') : 39.8012 GB/s
    std::fill(p, p + n, 0) : 39.8272 GB/s

    and suddenly, memset is the bad one!
    In fact the first bench is the slowest almost every time on my machine.

    This is on MacOS 16 + clang… maybe this happens because once a page is zeroed, it’s remapped to the read-only zero-page and subsequent 0-writes don’t do anything?

    I changed the code to give each benchmark its own memory area, and that way I get about the same speed for all 3 methods, +/- some (pretty big) random fluctuations on all 3:

    char* src = new char[i*3];
    memset(src, 1, i * 3);

    for (int z = 0; z < 3; z++) {
    char* p = src;
    bench2(p, i);

    p += i;
    bench1(p, i);

    p += i;
    bench(p, i);

    std::cout << std::endl;

    delete[] src;

    … although the very first bench*() call (no matter which) is still at 9 gb/s. No idea why 🙂

      1. This is a bug in GNU C++ library std::fill/std::fill_n. Using the argument of the exact correct type for the fill value fixes the bug and makes it use memset.

        On one other hand you have memset, which you need to specify the correct size in bytes, despite it taking an int fill value (specifying wrong size for memset is a common bug in stackoverflow questions).

        1. Yes, you can get around the performance issue, as Travis explains. Whether it is a bug is debatable: it has been around forever and it is standard compliant.

          But the larger point is that nothing at all in the C++ standard guarantees that you will get the best possible performance with std::fill. This is just one example of many where using the idiomatic approach will fail you.

          Please check my previous posts where I show that just about the worse possible way to allocate and initialize memory is via “new()”, at least in GNU GCC.

          My argument is not that you should forgo the idiomatic way, or drop C++ in favor of C. My argument is laid out in my post. (Last paragraph.)

          1. Well, you are intentionally measuring performance of code hitting the library bug and yet you are not using -O3 optimization, despite targeting for top performance. It is hard to call such an approach scientifically sound, IMHO.

            1. The first link in the blog post is to Travis’ blog post, where he describes the issue in details. I have the following line “He also explains how to fix the C++ so that it is as fast as the memset function.”

              Do you disagree with the point I am making? The point is the following: “writing your code in idiomatic C++ is not sufficient to guarantee good performance”.

              If you disagree, I have many more arguments. In fact, I have literally hundreds of blog posts on this topic over 15 years.

              This happens all the time. It is actually kind of difficult to get bare metal performance out of C++ systematically. It is comparatively quite easy to write idiomatic C++.

  2. You are trying to write idiomatic code, but you made an intentional mistake in it by using a wrong type, which resulted in the sub-optimal code path. And to add insult to injury you pessimize it with -O2 instead of -O3.

    So, you have idiomatically looking code with a subtle bug in it and that’s what you benchmark.

    1. Please be clear, Maxim, about your disagreement. Here is my statement: “writing your code in idiomatic C++ is not sufficient to guarantee good performance”.

      Do you agree or do you disagree?

      1. “Writing your code in idiomatic C++ is not sufficient to guarantee good performance”. – I agree, just as well as I agree with “writing your code in non-idiomatic C++ is not sufficient to guarantee good performance either”.

        In both cases the code must be bug free, however subtle the bugs are.

        May be you are trying to say that passing literal 0 into memset doesn’t create a subtle performance bug, unlike with std::fill, and that can conditionally be true. People compiling with -O3 (like me) wouldn’t experience that bug either.

        I value information from most of your posts, nevertheless.

    1. Note compilers would become much better quickly if people take the
      time to report optimization issues rather than just complaining about
      them in various blogs…

      I am not exactly sure how it works regarding GNU GCC. For example, I know how to double the speed of the random shuffle and I sent a patch. I’m looking forward to see whether it makes it.

        1. Call it what you like! It is interesting to point out that various C++ constructs can have very different performance characteristics across compilers and even different versions of the same compiler.

          However these are examples of optimization inefficiencies which should not happen and are easily fixable in compilers – if only we hear about them!

          GCC has a bugzilla where you can report bugs or optimization issues. Contributing patches to GCC requires a copyright assignment, and to get your patches accepted you may need to post them repeatedly if people are busy and they don’t get any attention…

          1. My experience across a few compiler and similar projects is that most reported issues are either ignored, or initially discussed then subsequently ignored.

            Even compiler developers admit that the best way to get something fixed is to complain about it in a visible way (not saying Daniel is doing this), at which point it gets fixed.

            1. Well the best way to get things done in an open source project is to be extremely proactive, so ask questions, file bugs, post patches etc. There are always reports about trivial things which don’t matter much for performance.

              A common example is emitting a redundant move in a small function (usually due to argument passing, register allocation/scheduling interactions etc). 100% optimal code is as impossible as perfect register allocation or scheduling.

              Compiler experts are busy, and prefer to spend their time on things that are achievable and make a measurable improvement!

  3. I think it is at least arguable that using

    std::fill(p, p + n, 0);

    where p is a char pointer is not really idiomatic C++, because std::fill is defined in terms of assignments and thus is pretty clear that (formally) a type conversion is going to take place. Still, it’s clearly desirable for an implementation to detect and optimise this usage as well.

    (This example reminds me a bit of how people regularly misuse std::iota, std::accumulate, etc. by using the wrong type for the initializer)

Leave a Reply

Your email address will not be published.

You may subscribe to this blog by email.