By how much does AVX-512 slow down your CPU? A first experiment.

Intel is finally making available processors that support the fancy AVX-512 instruction sets and that can fit nicely in a common server rack. So I went to Dell and ordered a server with a Skylake-X microarchitecture: an Intel Xeon W-2104 CPU @ 3.20GHz.

This processor supports several interesting AVX-512 instruction sets, made up of very powerful instructions that can manipulate 512-bit vectors.

On the Internet, the word is out that using AVX-512 in your application is going to slow down your whole server, so you should just give up and never use AVX-512 instructions.

Vlad Krasnov from Cloudflare wrote:

If you do not require AVX-512 for some specific high-performance tasks, I suggest you disable AVX-512 execution on your server or desktop, (…)

Table 15-16 in Intel’s optimization manual describes the impact of the various instructions you use on “Turbo Boost” (one of Intel’s frequency scaling technologies). The type of instructions you use determines which “license” you are in. If you avoid AVX-512 and heavy AVX2 instructions (floating-point instructions and multiplications), you get the best boost. If you use light AVX-512 instructions or heavy AVX2 instructions, you get less of a boost… and you get the worst results with heavy AVX-512 instructions.

Intel sends us to a sheet of frequencies. Unfortunately, a quick look did not give me anything on my particular processor (Intel Xeon W-2104).

Intel is not being very clear:

Workloads that execute Intel AVX-512 instructions as a large proportion of their whole instruction count can gain performance compared to Intel AVX2 instructions, even though they may operate at a lower frequency. It is not always easy to predict whether a program’s performance will improve from building it to target Intel AVX-512 instructions.

What interests me most is the theory that people seem to have: if you use AVX-512 sparingly, it is going to bring down the performance of your whole program. How could I check this theory?

I picked up a benchmark program that computes the Mandelbrot set. Then, using AVX-512 intrinsics, I added AVX-512 instructions to the program at select places. These instructions do nothing to contribute to the solution, but they cannot be trivially optimized away by the compiler. I used both light and heavy AVX-512 instructions. There are few enough of them so that the overhead is negligible… but if they slowed down the processor in a significant manner, we should be able to measure a difference.
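For illustration, here is a minimal sketch of what such spurious instructions could look like (the function names are made up and this is not the exact code from my benchmark; compile with something like gcc -O2 -mavx512f):

    #include <immintrin.h>
    #include <stdint.h>

    // Results go into global buffers so the compiler cannot remove the instructions.
    int64_t light_sink[8];
    double heavy_sink[8];

    // "light" AVX-512: a 512-bit integer addition
    static inline void touch_avx512_light(void) {
        __m512i x = _mm512_loadu_si512((const void *)light_sink);
        x = _mm512_add_epi64(x, x);
        _mm512_storeu_si512((void *)light_sink, x);
    }

    // "heavy" AVX-512: a 512-bit double-precision fused multiply-add
    static inline void touch_avx512_heavy(void) {
        __m512d x = _mm512_loadu_pd(heavy_sink);
        x = _mm512_fmadd_pd(x, x, x);
        _mm512_storeu_pd(heavy_sink, x);
    }

Calling one of these in a few select places is enough to execute AVX-512 instructions without contributing meaningfully to the running time.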

The results?

mode             running time (average over 10)
no AVX-512       1.48 s
light AVX-512    1.48 s
heavy AVX-512    1.48 s

Using spurious AVX-512 instructions made no difference to the running time in my tests. I don’t doubt that the frequency throttling is real, as it is described by Intel and widely reported, but I could not measure it.

This suggests that, maybe, it is less likely to be an issue than is often reported, at least on the type of processors I have. Or else I made a mistake in my tests.

In any case, we need reproducible simple tests. Do you have one?

My code and scripts are available.

Update: An anonymous source reports:

I believe that the reason you do not see any slowdown is that the Xeon W-2104 part does not enable the port 5 VPU. Furthermore, AVX-512 load-store instructions aren’t going to trigger the frequency drop. The worst instructions for that are float64 FMAs (or equivalent). I modified your code so that it runs on all the cores (via MPI) and does float64 FMAs or ADDs on the heavy and non-heavy AVX-512 paths, respectively. The performance effect I see looks like noise, but if it’s real, it’s less than 0.5% slowdown from ZMM usage.

15 thoughts on “By how much does AVX-512 slow down your CPU? A first experiment.”

        1. Disabling frequency scaling would likely hide the effect you’re trying to measure. The supposition I got from the Cloudflare blog was that cores get frequency-scaled down when AVX-512 is in use. You could re-run the experiment and use pcm-power: https://github.com/opcm/pcm

          This would tell you whether any P-state transitions or thermal throttling events were affecting the runs. However, with only 1.48 s of running time, I wouldn’t expect any of those to fire.
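          (As a simpler complementary check, assuming a Linux machine that exposes the cpufreq sysfs interface, one could sample the reported core frequency while the benchmark runs, roughly like this:)

              #include <stdio.h>

              // Sketch: read the current frequency of core 0 in kHz from sysfs
              // (assumes Linux with the cpufreq interface, e.g. intel_pstate).
              // Sampling this periodically during the run would reveal a drop.
              long read_core0_freq_khz(void) {
                  FILE *f = fopen("/sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq", "r");
                  if (f == NULL) return -1;
                  long khz = -1;
                  if (fscanf(f, "%ld", &khz) != 1) khz = -1;
                  fclose(f);
                  return khz;
              }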

  1. Does the code reach 100% CPU use? If it is single-threaded, it probably won’t be enough of a load to make a difference.

    1. It is a CPU-bound test (all computations). I did not use a multithreaded version.

      Is your point that the slowdown would only occur in heavily multithreaded code?

      That’s an interesting theory.

      1. I think the claim is that you can have scalar code executing happily on one core until a single AVX-512 instruction is issued on another core. At that point, the scalar code will slow down because the frequency will be reduced. I would be interested in seeing someone else independently demonstrate that this definitely happens, or else banish it to the annals of benchmark mythology.
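        (One way to test this claim directly, sketched below with made-up names; this is not anyone’s actual benchmark. Thread B spins on heavy AVX-512 FMAs while thread A times purely scalar work, once before and once after the AVX-512 work is switched on:)

            // Compile with: gcc -O2 -mavx512f -pthread cross_core.c
            #include <immintrin.h>
            #include <pthread.h>
            #include <stdint.h>
            #include <stdio.h>
            #include <time.h>

            static volatile int avx_on = 0, done = 0;
            double heavy_sink[8];

            static void *avx512_thread(void *arg) {          // thread B
                (void)arg;
                __m512d x = _mm512_set1_pd(1.0000001);
                while (!done) {
                    if (avx_on) {
                        for (int i = 0; i < 1000000; i++)
                            x = _mm512_fmadd_pd(x, x, x);    // heavy float64 FMA
                        _mm512_storeu_pd(heavy_sink, x);     // keep the work observable
                    }
                }
                return NULL;
            }

            static double scalar_seconds(void) {             // thread A's scalar work
                struct timespec t0, t1;
                clock_gettime(CLOCK_MONOTONIC, &t0);
                uint64_t acc = 1;
                for (uint64_t i = 0; i < 400000000ULL; i++)
                    acc = acc * 6364136223846793005ULL + 1442695040888963407ULL;
                clock_gettime(CLOCK_MONOTONIC, &t1);
                if (acc == 42) printf("unlikely\n");          // keep acc live
                return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
            }

            int main(void) {
                pthread_t t;
                pthread_create(&t, NULL, avx512_thread, NULL);
                printf("scalar alone:     %.3f s\n", scalar_seconds());
                avx_on = 1;
                printf("scalar + AVX-512: %.3f s\n", scalar_seconds());
                done = 1;
                pthread_join(t, NULL);
                return 0;
            }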

  2. Wow, this is kind of scandalous. A sketchy engineering decision that backstabs poor programmers trying to make sense of why their optimizations do not work as intended.

  3. Part of the contention from Cloudflare is that performance will be wildly variable depending on the family of chip you’ve got, silver vs gold vs platinum, as their throttling behaviour is different.

    Unfortunately, Intel has successfully made an absolutely confusing mess of processor classifications and documentation.
    The documentation for the Xeon W processors indicates they’re based on the same chipsets as the Xeon Scalable family, but it fails to provide enough information to figure out how they perform when AVX is in use.

    To get a realistic sense of things we’d need to be able to measure that frequency, I’d imagine.

    I can likely get access to a Xeon Platinum for a quick test, but the Platinum is the least likely to experience the problems Cloudflare ran into: even when all cores are in use, its frequencies aren’t much of a drop from normal.

    1. Part of the contention from Cloudflare is that performance will be wildly variable depending on the family of chip you’ve got

      Ah.

      To get a realistic sense of things we’d need to be able to measure that frequency, I’d imagine.

      That’s an important diagnostic step, but it only makes sense once you can measure some slowdown. If your program runs at the same speed, then there is nothing to investigate. No story.

  4. Hello,
    I did a few AVX-512 vector benchmarks, mostly arithmetic vector operations. I found that the peak throughput of floating-point multiplications is double that of AVX2 (I configured the BIOS not to throttle down). That’s the case when the sample vectors are smaller than the cache; otherwise there is a memory bottleneck and both AVX-512 and AVX2 deliver the same flops. I guess that for intensive computations which require little memory and lots of operations, AVX-512 provides better performance. Also, I’ve seen that different Intel architectures have different performance for shuffle operations, which can be a really big bottleneck.

    In any case, benchmarking instruction sets is pretty complex and highly application-dependent.

    1. That’s the case when the sample vectors are smaller than the cache; otherwise there is a memory bottleneck and both AVX-512 and AVX2 deliver the same flops.

      Can you publish a reproducible test case to illustrate your finding?

        1. I ran the test now for AVX-512, AVX2 and SSE. It runs very basic vector operations. The first graph shows a vector of floats multiplied by a scalar float; the second shows two vectors multiplied together. The measure is in “megasamples” per second, computed as the array length divided by the execution time.

        So, I would say that CPU frequency does not really make a difference when computing large amounts of data.

        You can find the test results for an i7 here:
        https://pastebin.com/H7h7jhyr

        It would be great if you could run your tests on a Xeon and compare the results.

        I used vector_test from the srsLTE project. The purpose of this test is to make sure that the SIMD abstraction is OK for the target platform; I would say it’s quite a good test for benchmarking. https://github.com/srsLTE/srsLTE/blob/master/lib/src/phy/utils/test/vector_test.c
        You can find the compiling instructions in the project readme.
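        (For concreteness, the first kind of kernel measured above, a vector of floats multiplied by a scalar, might look roughly like this with AVX-512 intrinsics; this is an illustration, not the srsLTE code:)

            #include <immintrin.h>
            #include <stddef.h>

            // Multiply n floats by a scalar, 16 at a time with 512-bit registers.
            // Assumes n is a multiple of 16; real code would also handle the tail.
            void scale_f32_avx512(float *dst, const float *src, float scalar, size_t n) {
                __m512 s = _mm512_set1_ps(scalar);
                for (size_t i = 0; i < n; i += 16) {
                    __m512 v = _mm512_loadu_ps(src + i);
                    _mm512_storeu_ps(dst + i, _mm512_mul_ps(v, s));
                }
            }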

        1. You have too many numbers in this table for me to make sense of it.

          But I think what you are saying is that if the inputs are out-of-cache, then acceleration (SIMD-based) is useless. That is likely true for something as cheap as a dot product.
