Can your C compiler vectorize a scalar product?

If you have spent any time at all on college-level mathematics, you have probably heard of the scalar product:

float scalarproduct(float * array1, float * array2, size_t length) {
  float sum = 0.0f;
  for (size_t i = 0; i < length; ++i) {
    sum += array1[i] * array2[i];
  return sum;

Most people who need to do fancy things like dot products or matrix multiplications use hand-tuned libraries… but there are very reasonable reasons for the average programmer to come up with code that looks like a scalar product.

Vendors like Intel know that scalar products are important, so they have dedicated instructions to speed-up the computation of the scalar product. The approach taken by Intel involves SIMD instructions. SIMD instructions work over vectors of several words (such as 8 floats) at once. Using intrinsics, you can write your own hand-tuned scalar product…

float avx_scalarproduct(float * array1, float * array2, 
          size_t length) {
  __m256 vsum = _mm256_setzero_ps();
  size_t i = 0;
  for (; i + 7 < length; i += 8) { // could unroll further
    vsum = _mm256_fmadd_ps(_mm256_loadu_ps(array1 + i),
                       _mm256_loadu_ps(array2 + i),vsum);
  float buffer[8];
  float sum = buffer[0] + buffer[1] + 
                      buffer[2] + buffer[3] + 
                      buffer[4] + buffer[5] + 
                      buffer[6] + buffer[7];
  for (; i < length; ++i) {
      sum += array1[i] * array2[i];
  return sum;

Wow. Who wants to deal with code that looks so messy? With more effort, I can improve the performance further, but simply unrolling, but the code looks even more bizarre.

So is it worth the effort?

No. The GNU GCC compiler is able to compile my simple pure-C function to code that is as good as my hand-tuned function. (One caveat: you have to compile the code with the -ffast-math flag.) And the code produced by LLVM’s clang is more than twice as fast as my “hand-tuned” code (it corresponds to a fully unrolled version).

The source code of my full benchmark is available.

We can examine the assembly directly and verify that clang makes generous use of the vfmadd231ps instruction.

The fully unrolled scalar product code (see my GitHub repo) runs at a speed of about 0.2 cycles per pair of input floats. That’s what the clang compiler can produce and it is an order of magnitude faster than the regular code.

Why does any of this matters?

The larger question is: how do we scale up our software so that it can benefit from advanced parallelism? We know that multicore processing only helps so much: adding new cores to a processor adds a lot of complexity. It is really hard to test multicore software. Vectorization (through SIMD instructions) is conceptually much simpler. There are some minor downsides to wide vector instructions, such as the fact that we end up more register space and more expensive context switching, but it is simply not comparable to the difficulties inherent to multicore parallelism.

The fact that optimizing compilers are pretty good with SIMD instructions makes these instructions very compelling. We have ease of programming and performance in one nice bundle. In 2017, Intel should ship commodity processors with AVX-512 support, meaning that single instructions can process 512-bit registers. Soon enough, your compiler will take your good old code and multiply its speed, without any input on your part.

Note: This blog post was motivated by a Twitter exchange with Maxime Chevalier. My original claim was that the Swift compiler could probably automatically vectorize matrix multiplications, if I write it carefully. It is certainly true when I use a C compiler, but I don’t know how to get the same result in the Swift language given that I don’t know whether there is an equivalent to the -ffast-math flag in Swift. Thankfully, Swift code can call C without overhead, so you can probably simply rewrite your performance-sensitive floating-point code in C and all it from Swift. Or maybe the Swift engineers could give us -ffast-math in Swift?

Daniel Lemire, "Can your C compiler vectorize a scalar product?," in Daniel Lemire's blog, December 30, 2016.

Published by

Daniel Lemire

A computer science professor at the University of Quebec (TELUQ).

11 thoughts on “Can your C compiler vectorize a scalar product?”

  1. Ohhh wow. This is interesting from several points of view. We have been trying for a while to speed up scalar product with AVX. The gains were only marginal (say 10-20%). However, this version is, indeed, faster. So, this is the first piece of code useful to me, where AVX is much faster.

    Second, auto-vectorization is actually quite unreliable. For one thing, only very simple things can be vectorized. For another, what I noticed that it varies quite a bit accross compilers. Your example is great, because the new Clang does use a new command, the gcc (4.x and 5.x alike) does not! What is worse, older versions of Clang sometimes refused to vectorize even scalar products and similar simple things. So, basically, you shouldn’t rely on the compiler.

    Third, it is obvious that when things became so messy, the Intel architecture is up for a huge redesign. They can’t pile one set of crutches over another one. SIMD should be implemented in a more generic fashion.

    1. @Leonid:
      I’m not a compiler expert, but in general, a requirement for autovectorization is the absence of loop-carried dependencies across adjacent loops (i.e. one iteration of the loop should be independent of the other). In this case, the compiler will unroll the loop and group instructions together in a SIMD fashion. One has to do the same thing to get one’s code accelerated on a GPGPU (that, and more).

      Intel’s icc compiler generally does a far better job of autovectorization than clang or gcc. Perhaps it identifies vectorization opportunities better than clang/gcc. It does vary greatly between versions of compilers presumably because these capabilities are getting added over time.

      Finally, stating the obvious, but vectorization will only really help you if your code is bottlenecked by compute, and not waiting on memory.

      1. Intel’s icc compiler generally does a far better job of autovectorization than clang or gcc.

        Though I do not report my results with Intel’s icc, I tested it and it did no better than GCC. They are both bested by clang.

        1. Ah, yes, I should’ve compared the disassembly in the compiler explorer itself. Since sum is a loop-carried dependence, the common loop-unrolling trick doesn’t work. From glancing at the disassembly, clang’s output does a vectorized reduction until the iteration count is less than 32, and then switches to scalar for the final reduction.

          As a very basic experiment, if you replace “sum” in your code with another array that is indexed by variable “k” in an outer for loop, or pass sum as a pointer into the function, the vectorization goes away. It may be due to the conservatism around pointer aliasing, or that clang is looking for a very specific pattern that looks like a reduction. Even so, very impressive.

          In general, however, my experience has been that icc does (or at least did a year ago) vectorize much better than gcc and clang in the general case.

          1. PS: If you don’t pass the outer array (indexed on k, as noted above) as a parameter (malloc internally and return a pointer), it vectorizes fine. So, I think it’s the passing a pointer part that trips it up.

            Thanks, I learned something today 🙂

  2. Daniel: Here’s a link to ffast-math. You may get a different result from the unvectorized version of this code due to rounding issues in floating point (a*b+a*c != a*(b+c) in FP). Vectorization may result in arithmetic order being changed (mathematically correct if you had infinite precision FP). It also treats denormals (very small fractions) as zero. Most scientific code is OK with this, but it’s something to be aware of.

Leave a Reply

Your email address will not be published.

You may subscribe to this blog by email.