By how much does AVX-512 slow down your CPU? A first experiment.

Intel is finally shipping processors that support the fancy AVX-512 instruction sets and that fit nicely in a common server rack. So I went to Dell and ordered a server with a Skylake-X processor: an Intel Xeon W-2104 CPU @ 3.20GHz.

This processor supports several interesting AVX-512 instruction sets. They are made of very powerful instructions that can manipulate 512-bit vectors.

On the Internet, the word is that using AVX-512 in your application is going to slow down your whole server, so you should just give up and never use AVX-512 instructions.

Vlad Krasnov from Cloudflare wrote:

If you do not require AVX-512 for some specific high-performance tasks, I suggest you disable AVX-512 execution on your server or desktop, (…)

Table 15-16 in Intel’s optimization manual describes the impact of the various instructions you use on “Turbo Boost” (one of Intel’s frequency scaling technologies). The type of instructions you use determines the “license” you are in. If you avoid AVX-512 and heavy AVX2 instructions (floating-point instructions and multiplications), you get the best boost. If you use light AVX-512 instructions or heavy AVX2 instructions, you get less of a boost… and you get the worst results with heavy AVX-512 instructions.
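The rule in the previous paragraph can be written down as a small lookup. This is only a sketch of the prose above: the enum names are made up for illustration and are not Intel's terminology, and the real behavior depends on details (instruction density, timing) that this ignores.

```c
/* Instruction classes as described in the text (names are hypothetical). */
typedef enum {
  INSTR_NO_WIDE_VECTOR, /* no AVX-512 and no heavy AVX2 */
  INSTR_AVX2_HEAVY,     /* AVX2 floating-point or multiplication */
  INSTR_AVX512_LIGHT,
  INSTR_AVX512_HEAVY
} instr_class;

/* Turbo "license" tiers, ordered from most boost to least boost. */
typedef enum {
  LICENSE_BEST_BOOST = 0,
  LICENSE_REDUCED_BOOST = 1,
  LICENSE_LOWEST_BOOST = 2
} turbo_license;

/* Map an instruction class to the license tier the text associates with it. */
static turbo_license license_for(instr_class c) {
  switch (c) {
    case INSTR_AVX512_HEAVY:
      return LICENSE_LOWEST_BOOST;
    case INSTR_AVX512_LIGHT:
    case INSTR_AVX2_HEAVY:
      return LICENSE_REDUCED_BOOST;
    default:
      return LICENSE_BEST_BOOST;
  }
}
```

The point of spelling it out is that heavy AVX2 and light AVX-512 land in the same tier, so the vector width alone does not decide the license.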

Intel sends us to a sheet of frequencies. Unfortunately, a quick look did not give me anything on my particular processor (Intel Xeon W-2104).

Intel is not being very clear:

Workloads that execute Intel AVX-512 instructions as a large proportion of their whole instruction count can gain performance compared to Intel AVX2 instructions, even though they may operate at a lower frequency. It is not always easy to predict whether a program’s performance will improve from building it to target Intel AVX-512 instructions.

What interests me most is the theory that people seem to hold: if you use AVX-512 sparingly, it is going to bring down the performance of your whole program. How could I check this theory?

I picked up a benchmark program that computes the Mandelbrot set. Then, using AVX-512 intrinsics, I added AVX-512 instructions to the program at select places. These instructions do nothing to contribute to the solution, but they cannot be trivially optimized away by the compiler. I used both light and heavy AVX-512 instructions. There are few enough of them so that the overhead is negligible… but if they slowed down the processor in a significant manner, we should be able to measure a difference.
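To make the idea concrete, here is a minimal sketch of what such "do nothing" instructions could look like. It is not the code used for the measurements. The function and variable names are hypothetical, the choice of an integer addition as "light" and a fused multiply-add as "heavy" is an assumption based on the description of heavy instructions above, and the empty `__asm__` statements are a GCC/Clang-specific way to keep the optimizer from folding the vector work into scalar code. Without AVX-512 enabled at compile time (for example without `-mavx512f`), the sketch falls back to scalar stand-ins so that it still builds.

```c
#include <stdint.h>

#if defined(__AVX512F__)
#include <immintrin.h>
#endif

/* Volatile sinks: every call ends in a store that the compiler must keep. */
static volatile int64_t int_sink;
static volatile double fp_sink;

/* "Light" flavour (assumed): one 512-bit integer addition. */
static void spurious_light_avx512(int64_t seed) {
#if defined(__AVX512F__)
  __m512i v = _mm512_set1_epi64(seed);
  __asm__ volatile("" : "+x"(v)); /* hide the input from the optimizer */
  v = _mm512_add_epi64(v, v);
  __asm__ volatile("" : "+x"(v)); /* force the 512-bit result to exist */
  int_sink = _mm_cvtsi128_si64(_mm512_castsi512_si128(v));
#else
  int_sink = seed + seed; /* scalar stand-in when AVX-512 is not enabled */
#endif
}

/* "Heavy" flavour (assumed): one 512-bit floating-point fused multiply-add. */
static void spurious_heavy_avx512(double x) {
#if defined(__AVX512F__)
  __m512d v = _mm512_set1_pd(x);
  __asm__ volatile("" : "+x"(v));
  v = _mm512_fmadd_pd(v, v, v); /* x * x + x in each lane */
  __asm__ volatile("" : "+x"(v));
  fp_sink = _mm_cvtsd_f64(_mm512_castpd512_pd128(v));
#else
  fp_sink = x * x + x; /* scalar stand-in when AVX-512 is not enabled */
#endif
}
```

A call such as `spurious_light_avx512(i)` placed once per row of the Mandelbrot loop would add very little work. It is still worth checking the disassembly to confirm that a `zmm` instruction was really emitted.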

The results?

| mode | running time (average over 10) |
|---|---|
| no AVX-512 | 1.48 s |
| light AVX-512 | 1.48 s |
| heavy AVX-512 | 1.48 s |

Using spurious AVX-512 instructions made no difference to the running time in my tests. I don’t doubt that the frequency throttling is real, as it is described by Intel and widely reported, but I could not measure it.

This suggests that, maybe, it is less likely to be an issue than is often reported, at least on the type of processors I have. Or else I made a mistake in my tests.

In any case, we need reproducible simple tests. Do you have one?
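As a starting point, a simple test can time the same CPU-bound workload several times and report the average, in the spirit of the "average over 10" figures above. The sketch below is an assumption about what such a harness might look like, not the code behind those numbers: `mandelbrot_escape`, `render_grid` and `average_seconds` are hypothetical names, the grid size and iteration cap are arbitrary, and `clock()` measures processor time rather than wall-clock time, which is a reasonable proxy only for a single-threaded, CPU-bound run.

```c
#include <assert.h>
#include <time.h>

/* Iterations before z = z*z + c leaves the disk |z| <= 2, capped at max_iter. */
static int mandelbrot_escape(double cr, double ci, int max_iter) {
  double zr = 0.0, zi = 0.0;
  int i = 0;
  while (i < max_iter && zr * zr + zi * zi <= 4.0) {
    double t = zr * zr - zi * zi + cr;
    zi = 2.0 * zr * zi + ci;
    zr = t;
    i++;
  }
  return i;
}

/* The volatile store keeps the compiler from discarding the workload. */
static volatile long grid_sink;

/* Hypothetical workload: a 64x64 grid over [-2, 1) x [-1.5, 1.5). */
static void render_grid(void) {
  long total = 0;
  for (int y = 0; y < 64; y++) {
    for (int x = 0; x < 64; x++) {
      total += mandelbrot_escape(-2.0 + 3.0 * x / 64, -1.5 + 3.0 * y / 64, 200);
    }
  }
  grid_sink = total;
}

/* Average processor time, in seconds, of `runs` calls to `work`. */
static double average_seconds(void (*work)(void), int runs) {
  assert(runs > 0);
  double total = 0.0;
  for (int r = 0; r < runs; r++) {
    clock_t start = clock();
    work();
    total += (double)(clock() - start) / CLOCKS_PER_SEC;
  }
  return total / runs;
}
```

Calling `average_seconds(render_grid, 10)` for a plain build and for builds with spurious AVX-512 instructions gives numbers that can be compared directly. A much larger workload than this toy grid is needed for the timings to mean anything.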

My code and scripts are available.

11 thoughts on “By how much does AVX-512 slow down your CPU? A first experiment.”

        1. Disabling frequency scaling would likely hide the effect you’re trying to measure. The supposition I had from the Cloudflare blog was cores get frequency scaled down when AVX-512 was in use. You could re-run the experiment and use pcm-power: https://github.com/opcm/pcm

          This would tell you if there were any P-state transitions or thermal throttling events affecting runs. However, with only 1.48s of running time I wouldn’t expect any of those are firing.

  1. Does the code reach 100% CPU use? If it is single-threaded, it probably won’t be enough of a load to make a difference.

    1. It is a CPU-bound test (all computations). I did not use a multithreaded version.

      Is your point that the slowdown would only occur in heavily multithreaded code?

      That’s an interesting theory.

      1. I think the claim is that you can have scalar code executing happily on one core, until a single AVX-512 instruction is issued on another core. At that point the scalar code will slow down because the frequency will be reduced. I would be interested in seeing someone else demonstrating independently that this definitely happens, or banish this to the annals of benchmark mythology.

  2. Wow, this is kind of scandalous. A sketchy engineering decision that backstabs poor programmers trying to make sense of why their optimizations do not work as intended.

  3. Part of the contention from Cloudflare is that performance will be wildly variable depending on the family of chip you’ve got, silver vs gold vs platinum, as their throttling behaviour is different.

    Unfortunately Intel have successfully made an absolutely confusing mess of processor classifications and documentation.
    The documentation for the Xeon W processors indicates they’re based on the same chipsets as the Xeon Scalable family, but fails to provide sufficient information to be able to figure out how they perform when AVX is enabled.

    To get a realistic sense of things we’d need to be able to measure that frequency, I’d imagine.

    I can likely get access to a Xeon Platinum for a quick test, but the Platinum is least likely to experience the problems Cloudflare ran into. The frequencies even when all cores are being used aren’t much of a drop from normal.

    1. Part of the contention from Cloudflare is that performance will be wildly variable depending on the family of chip you’ve got

      Ah.

      To get a realistic sense of things we’d need to be able to measure that frequency, I’d imagine.

      That’s an important diagnostic step, but it only makes sense once you can measure some slowdown. If your program runs at the same speed, then there is nothing to investigate. No story.
