By how much does AVX-512 slow down your CPU? A first experiment.

Intel is finally making available processors that support the fancy AVX-512 instruction sets and that can fit nicely in a common server rack. So I went to Dell and ordered a server with a Skylake-X microarchitecture: an Intel Xeon W-2104 CPU @ 3.20GHz.

This processor supports several interesting AVX-512 instruction sets. They are made of very powerful instructions that can manipulate 512-bit vectors.

On the Internet, the word is that using AVX-512 in your application is going to slow down your whole server, so you should just give up and never use AVX-512 instructions.

Vlad Krasnov from Cloudflare wrote:

If you do not require AVX-512 for some specific high-performance tasks, I suggest you disable AVX-512 execution on your server or desktop, (…)

In Vlad’s case, they found that disabling AVX-512 in a particular multiplication-heavy routine was beneficial overall, even though that routine runs 30% faster on a per-cycle basis when using AVX-512: the expensive multiplication instructions it relies on trigger a frequency reduction that penalizes the rest of the workload.

Table 15-16 in Intel’s optimization manual describes the impact of the various instructions you use on “Turbo Boost” (one of Intel’s frequency scaling technologies). The type of instructions you use determines the “license” you are in. If you avoid AVX-512 and heavy AVX2 instructions (floating-point instructions and multiplications), you get the best boost. If you use light AVX-512 instructions or heavy AVX2 instructions, you get less of a boost… and you get the worst results with heavy AVX-512 instructions.

Intel sends us to a sheet of frequencies. Unfortunately, a quick look did not give me anything on my particular processor (Intel Xeon W-2104).

Intel is not being very clear:

Workloads that execute Intel AVX-512 instructions as a large proportion of their whole instruction count can gain performance compared to Intel AVX2 instructions, even though they may operate at a lower frequency. It is not always easy to predict whether a program’s performance will improve from building it to target Intel AVX-512 instructions.

What I am most interested in is the theory that people seem to have: that if you use AVX-512 sparingly, it is going to bring down the performance of your whole program. How could I check this theory?

I picked up a benchmark program that computes the Mandelbrot set. Then, using AVX-512 intrinsics, I added AVX-512 instructions to the program at select places. These instructions do nothing to contribute to the solution, but they cannot be trivially optimized away by the compiler. I used both light and heavy AVX-512 instructions. There are few enough of them so that the overhead is negligible… but if they slowed down the processor in a significant manner, we should be able to measure a difference.
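To make the setup concrete, here is a minimal sketch of what I mean by spurious “light” and “heavy” instructions. The helper names and constants are mine, for illustration only, and the actual benchmark differs in its details; the volatile sinks are there so the compiler cannot remove the instructions (compile with -mavx512f).

#include <immintrin.h>

// Illustrative only: one light (integer logic) and one heavy (floating-point
// multiply) AVX-512 operation, written to volatile sinks so the compiler
// cannot optimize them away.
static volatile __m512i sink_i;
static volatile __m512d sink_d;

static inline void spurious_light_avx512(void) {
  __m512i x = _mm512_set1_epi64(42);
  sink_i = _mm512_or_si512(x, x);   // light: 512-bit integer "or"
}

static inline void spurious_heavy_avx512(void) {
  __m512d x = _mm512_set1_pd(1.0001);
  sink_d = _mm512_mul_pd(x, x);     // heavy: 512-bit floating-point multiply
}

int main(void) {
  for (long i = 0; i < 1000; i++) {
    spurious_light_avx512();        // or spurious_heavy_avx512()
  }
  return 0;
}

Calling one of these helpers once every few thousand iterations of the Mandelbrot loop keeps the direct overhead negligible while still exercising the corresponding instruction class.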

The results?

mode             running time (average over 10)
no AVX-512       1.048 s
light AVX-512    1.048 s
heavy AVX-512    1.048 s

Using spurious AVX-512 instructions made no difference to the running time in my tests. I don’t doubt that the frequency throttling is real, as it is described by Intel and widely reported, but I could not measure it.

This suggests that, maybe, it is less likely to be an issue than is often reported, at least on the type of processors I have. Or else I made a mistake in my tests.

In any case, we need reproducible simple tests. Do you have one?

My code and scripts are available.

Update: An anonymous source reports:

I believe that the reason you do not see any slowdown is that the Xeon W-2104 part does not enable the port 5 VPU. Furthermore, AVX-512 load-store instructions aren’t going to trigger the frequency drop. The worst instructions for that are float64 FMAs (or equivalent). I modified your code so that it runs on all the cores (via MPI) and does float64 FMAs or ADDs on the heavy and non-heavy AVX-512 paths, respectively. The performance effect I see looks like noise, but if it’s real, it’s less than 0.5% slowdown from ZMM usage.

Major Update:

Intel’s latest processors have advanced instructions (AVX-512) that may cause the core, or maybe the rest of the CPU, to run slower because of how much power they use. I have been struggling to measure this effect, and it might be because it is far more complicated than some imagined. We know that Intel’s own compiler is shy about using AVX-512 instructions, and it seems that Java might be disabling AVX-512 instructions by default.

In another post, I reported a finding by Travis Downs to the effect that it is hard to get a full throttle even if you use very expensive AVX-512 instructions.

Now I want to report another of Travis’s findings: it might be ridiculously easy to incur a frequency penalty even if you do not make use of AVX-512 instructions.

So let us consider this C program:

#include <x86intrin.h>
#include <stdlib.h>
int main(int argc, char **argv) {
  if(argc>1) _mm256_zeroupper();  // at least one argument: zeroupper at the start
  float a = 3;
  float b = rand();               // dynamically resolved library call
  if(argc>2) _mm256_zeroupper();  // at least two arguments: zeroupper after rand
  for(int k = 0; k < 5000000; k++) {
    b = a + b * b;
    a = b - a * a;
  }
  return (b == 0)? 1 : 0;         // keep b live so the loop is not optimized away
}

Except for the calls to the _mm256_zeroupper function, this program should be straightforward to C-literate programmers. The program does some crazy floating-point computation after initializing one of the variables with a call to the rand function. The _mm256_zeroupper function simply zeroes the most significant bits of the first sixteen vector registers.

My scenario is not sensitive to how fanciful the computation is, but it matters that I am using floating-point numbers.

I compile this function while disabling 512-bit registers:

gcc -O2 -o fun fun.c -march=native -mno-avx512f

I can verify that the resulting binary does not explicitly rely on any 256-bit or 512-bit register (I am under Linux); the following command finds no match:

objdump -d ./fun |egrep "(ymm|zmm)"

My program calls _mm256_zeroupper either never, or just at the start of the program, or both at the start and after the call to the rand function. I run it on an AVX-512 capable Intel Xeon W-2104 CPU @ 3.20GHz. Here are my results:

mode                              total cycles    running time
no _mm256_zeroupper               40M             14.5 ms
_mm256_zeroupper at the start     40M             14.5 ms
_mm256_zeroupper after rand       40M             12.9 ms
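For reference, numbers of this kind can be collected with a standard perf invocation under Linux; this is a generic way of measuring, not necessarily the exact script I used:

perf stat -e cycles -r 10 ./fun        # no _mm256_zeroupper
perf stat -e cycles -r 10 ./fun 1      # _mm256_zeroupper at the start
perf stat -e cycles -r 10 ./fun 1 1    # _mm256_zeroupper at the start and after rand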

Thus, to get the best performance, I need to call _mm256_zeroupper after the rand function. Is something evil going on within the rand function? No. I can reproduce the same problem with other standard library calls such as printf and atoi.

Why is _mm256_zeroupper necessary?

Many operations on modern Intel hardware run in vector registers even when they are not “vectorized”. Many floating-point operations, even if they only involve 32 bits, are actually executed in 128-bit registers.
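For example, a scalar expression like the one below is typically compiled, on x86-64 with GCC or clang, into the scalar SSE instructions mulss and addss, which operate on 128-bit xmm registers even though only 32 bits are used (the function name is mine, for illustration):

float update(float a, float b) {
  return a + b * b;   /* typically compiles to mulss/addss on an xmm register */
}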

The latest Intel processors have 128-bit, 256-bit and 512-bit vector registers, but the first sixteen registers of each size overlap: the same vector register can be viewed as having either 128 bits, 256 bits or 512 bits.

Intel processors can always execute operations on 128-bit registers at full speed because they are not expensive operations.

My code above should be cheap because it only “uses” 128-bit registers. But does it?

Travis explained to me that the processor is keeping track of whether the most significant bits of the vector registers are “clean” (initialized to zero) or dirty (potentially containing data). When the registers are “clean”, then the processor can just treat the 128-bit registers as genuine 128-bit registers. However, if the most significant bits potentially contain data, then the processor actually has to treat them as 512-bit registers. And thus we get a significant (10%) slowdown of the program even if we are not, ourselves, using AVX-512 instructions or registers.

It turns out that the GNU standard library (prior to version 2.25) had a bug in its _dl_runtime_resolve function, which gets called the first time a dynamically linked function is invoked. This function would needlessly restore the content of the 512-bit registers, thus making them dirty.

A simple fix is to patch your code with calls to _mm256_zeroupper. It is reasonably cheap (about 4 cycles). A better fix is to update your standard library or choose a recent Linux distribution.
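As a sketch of that first workaround (the placement here is mine, for illustration): clear the upper bits right after the dynamically resolved call and before the hot loop.

#include <x86intrin.h>
#include <stdio.h>

int main(void) {
  printf("starting up\n");   /* a dynamically resolved call may leave the upper bits dirty */
  _mm256_zeroupper();        /* mark the upper bits of the vector registers as clean */
  /* ... hot scalar floating-point loop goes here ... */
  return 0;
}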

Further reading: AVX-512: when and how to use these new instructions

Published by

Daniel Lemire

A computer science professor at the Université du Québec (TELUQ).

15 thoughts on “By how much does AVX-512 slow down your CPU? A first experiment.”

        1. Disabling frequency scaling would likely hide the effect you’re trying to measure. My supposition from the Cloudflare blog was that cores get frequency-scaled down when AVX-512 is in use. You could re-run the experiment and use pcm-power: https://github.com/opcm/pcm

          This would tell you if there were any P-state transitions or thermal throttling events affecting the runs. However, with only 1.48s of running time, I wouldn’t expect any of those to be firing.

  1. Does the code reach 100% CPU use? If it’s single-threaded, it probably won’t be enough of a load to make a difference.

    1. It is a CPU-bound test (all computations). I did not use a multithreaded version.

      Is your point that the slow down would only occur in heavily multithreaded code?

      That’s an interesting theory.

      1. I think the claim is that you can have scalar code executing happily on one core, until a single AVX-512 instruction is issued on another core. At that point the scalar code will slow down because the frequency will be reduced. I would be interested in seeing someone else demonstrating independently that this definitely happens, or banish this to the annals of benchmark mythology.

  2. Wow, this is kind of scandalous. A sketchy engineering decision that backstabs poor programmers trying to make sense of why their optimizations do not work as intended.

  3. Part of the contention from Cloudflare is that performance will be wildly variable depending on the family of chip you’ve got, silver vs gold vs platinum, as their throttling behaviour is different.

    Unfortunately Intel have successfully made an absolutely confusing mess of processor classifications and documentation.
    The documentation for the Xeon W processors indicates they’re based on the same chipsets as the Xeon Scalable family, but it fails to provide sufficient information to figure out how they perform when AVX is enabled.

    To get a realistic sense of things we’d need to be able to measure that frequency, I’d imagine.

    I can likely get access to a Xeon Platinum for a quick test, but the Platinum is least likely to experience the problems Cloudflare ran in to. The frequencies even when all cores are being used aren’t much of a drop from normal.

    1. Part of the contention from Cloudflare is that performance will be wildly variable depending on the family of chip you’ve got

      Ah.

      To get a realistic sense of things we’d need to be able to measure that frequency, I’d imagine.

      That’s an important diagnostic step, but it only makes sense once you can measure some slowdown. If your program runs at the same speed, then there is nothing to investigate. No story.

  4. Hello,
    I did a few AVX-512 vector benchmarks, mostly arithmetic vector operations. I found that the peak rate of floating-point multiplications is doubled compared to AVX2 (I configured the BIOS not to throttle down). That is the case when the sample vectors are smaller than the cache; otherwise there is a memory bottleneck and both AVX-512 and AVX2 achieve the same flops. I guess that for intensive computations which require little memory and lots of operations, AVX-512 provides better performance. Also, I have seen that different Intel architectures have different performance in shuffling operations, which can be a really big bottleneck.

    In any case, benchmarking instruction sets is pretty complex and highly application dependent.

    1. That is the case when the sample vectors are smaller than the cache; otherwise there is a memory bottleneck and both AVX-512 and AVX2 achieve the same flops.

      Can you publish a reproducible test case to illustrate your finding?

      1. I ran the test now for AVX-512, AVX2 and SSE. It runs very basic vector operations. In the case of the first graph, it multiplies a vector of floats with a scalar float. The second is the result of multiplying two vectors. The measure is in “Megasamples” per second, computed as the array length divided by the execution time.

        So, I would say that CPU frequency does not really make a difference when computing large amounts of data.

        You can find the test results for an i7 here:
        https://pastebin.com/H7h7jhyr

        It would be great if you run your tests on a Xeon and compare the results.

        I used vector_test from the srsLTE project. The purpose of this test is to make sure that the SIMD abstraction is OK for the target platform. I would say it’s quite a good test for benchmarking. https://github.com/srsLTE/srsLTE/blob/master/lib/src/phy/utils/test/vector_test.c
        You might find the compiling instructions in the project readme.

        1. You have too many numbers in this table for me to make sense of it.

          But I think what you are saying is that if the inputs are out-of-cache, then acceleration (SIMD-based) is useless. That is likely true for something as cheap as a dot product.
