Recent Intel processors have fancy instructions operating over 512-bit registers (AVX-512). They are reported to throttle the frequency of the core on which they run, and possibly of other cores in some cases. Thus, it has been recommended to avoid AVX-512 instructions. I have written a series of blog posts on the topic, trying to reproduce the effect. Though I can measure some level of performance degradation if I work hard, I simply cannot find the “obvious” performance degradations (50%) that are often advertised. I tested on two distinct processors. I tried single-threaded and multi-threaded code.
There is more to the story than appears at first.
Travis Downs wrote a fancy tool to investigate the issue. Let me reproduce some of his findings in my own words. According to Intel’s documentation, there are two types of AVX-512 instructions: light instructions (e.g., integer additions) and heavy instructions (e.g., multiplications). Heavy instructions reportedly cause a much greater frequency throttle. None of my tests showed that. Travis found that the heavier throttling is quite hard to trigger:
Even a stream of 1 FMAD [fused multiply–add] every 4 or even 2 cycles doesn’t set the frequency down lower. The lowest speed is only reached if FMAs [fused multiply–add] come at a rate of more than 1 every 2 cycles.
As far as I can tell, this is absent from Intel’s documentation. If Travis is right, and I have no reason to doubt him, this means that the reported massive frequency throttling (slowest license) that we find everywhere online (including on Intel’s site) requires substantial qualification. Few people will ever achieve the rate of sustained heavy instructions that Travis documents.
For example, if you use AVX-512 for pattern matching (Intel Hyperscan), to encode and decode base64, or to compress and uncompress integers, you are probably never going to trigger massive throttling. If you do a lot of cryptography, machine learning or number crunching, the story might be different.
It is important to take into account how much you gain in the first place by going to AVX-512. For example, OpenSSL found that a particular cryptographic routine involving many multiplications ran 30% faster on a per-cycle basis with AVX-512. Once you factor in some throttling, a lower clock frequency can easily cancel out a gain of that size, and the AVX-512 version ends up wasteful. So maybe a sensible approach is to ensure that you make substantial gains when using AVX-512 if it involves many heavy instructions.
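To make the trade-off concrete, here is a minimal back-of-the-envelope sketch in Python. It assumes a simple model in which the wall-clock speedup is the per-cycle speedup multiplied by the ratio of the throttled frequency to the baseline frequency. The function names and the frequency ratios in the loop are hypothetical, chosen for illustration; only the 30% per-cycle figure comes from the example above.

```python
def net_speedup(per_cycle_speedup: float, frequency_ratio: float) -> float:
    """Wall-clock speedup under a simple model (an assumption, not a measurement):
    per-cycle speedup times (throttled frequency / baseline frequency)."""
    return per_cycle_speedup * frequency_ratio


def break_even_frequency_ratio(per_cycle_speedup: float) -> float:
    """Frequency ratio below which the AVX-512 version becomes a net loss."""
    return 1.0 / per_cycle_speedup


if __name__ == "__main__":
    gain = 1.30  # 30% faster per cycle, the figure quoted in the text
    print(f"break-even frequency ratio: {break_even_frequency_ratio(gain):.3f}")
    # Hypothetical throttling levels, for illustration only.
    for ratio in (1.00, 0.90, 0.80, 0.70):
        print(f"frequency ratio {ratio:.2f} -> net speedup {net_speedup(gain, ratio):.3f}")
```

Under this model, a 30% per-cycle gain breaks even at a frequency ratio of about 0.77: if the clock drops by more than roughly 23%, the AVX-512 routine is slower in wall-clock time than the code it replaced. The model ignores effects on other cores and on surrounding non-AVX-512 code, so treat it as a rough guide only.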
Update: The same holds true for AVX (256-bit) instructions. For AVX instructions to lead to any throttle at all, you have to sustain expensive instructions repeatedly every 1 or 2 cycles.
Further reading: AVX-512: when and how to use these new instructions