Intel is finally making available processors that support the fancy AVX-512 instruction sets and that can fit nicely in a common server rack. So I went to Dell and ordered a server with a Skylake-X microarchitecture: an Intel Xeon W-2104 CPU @ 3.20GHz.
This processor supports several interesting AVX-512 instruction sets. These sets contain very powerful instructions that operate on 512-bit vectors.
On the Internet, the word is that using AVX-512 in your application is going to slow down your whole server, so you should just give up and never use AVX-512 instructions.
If you do not require AVX-512 for some specific high-performance tasks, I suggest you disable AVX-512 execution on your server or desktop, (…)
Table 15-16 in Intel’s optimization manual describes the impact of the various instructions you use on “Turbo Boost” (one of Intel’s frequency scaling technologies). The type of instructions you use determines the “license” you are in. If you avoid AVX-512 and heavy AVX2 instructions (floating-point instructions and multiplications), you get the best boost. If you use light AVX-512 instructions or heavy AVX2 instructions, you get less of a boost… and you get the worst results with heavy AVX-512 instructions.
Intel sends us to a sheet of frequencies. Unfortunately, a quick look did not give me anything on my particular processor (Intel Xeon W-2104).
Intel is not being very clear:
Workloads that execute Intel AVX-512 instructions as a large proportion of their whole instruction count can gain performance compared to Intel AVX2 instructions, even though they may operate at a lower frequency. It is not always easy to predict whether a program’s performance will improve from building it to target Intel AVX-512 instructions.
What I am most interested in is the theory that people seem to have that if you use AVX-512 sparingly, it is going to bring down the performance of your whole program. How could I check this theory?
I picked up a benchmark program that computes the Mandelbrot set. Then, using AVX-512 intrinsics, I added AVX-512 instructions to the program at select places. These instructions do nothing to contribute to the solution, but they cannot be trivially optimized away by the compiler. I used both light and heavy AVX-512 instructions. There are few enough of them so that the overhead is negligible… but if they slowed down the processor in a significant manner, we should be able to measure a difference.
|mode|running time (average over 10)|
|---|---|
|no AVX-512|1.48 s|
|light AVX-512|1.48 s|
|heavy AVX-512|1.48 s|
Using spurious AVX-512 instructions made no difference to the running time in my tests. I don’t doubt that the frequency throttling is real, as it is described by Intel and widely reported, but I could not measure it.
This suggests that, maybe, it is less likely to be an issue than is often reported, at least on the type of processors I have. Or else I made a mistake in my tests.
In any case, we need reproducible simple tests. Do you have one?
Update: An anonymous source reports:
I believe that the reason you do not see any slowdown is that the Xeon W-2104 part does not enable the port 5 VPU. Furthermore, AVX-512 load-store instructions aren’t going to trigger the frequency drop. The worst instructions for that are float64 FMAs (or equivalent). I modified your code so that it runs on all the cores (via MPI) and does float64 FMAs or ADDs on the heavy and non-heavy AVX-512 paths, respectively. The performance effect I see looks like noise, but if it’s real, it’s less than 0.5% slowdown from ZMM usage.