The dangers of AVX-512 throttling: myth or reality?

Modern processors use many tricks to go faster. They are superscalar, which means that they can execute many instructions at once. They are multicore, which means that each CPU is made of several baby processors that are partially independent. And they are vectorized, which means that they have instructions that can operate over wide registers (spanning 128, 256 or even 512 bits).

Regarding vectorization, Intel is currently ahead of the curve with its AVX-512 instruction sets. They have the only commodity-level processors able to work over 512-bit registers. AMD is barely doing 256 bits and your phone is limited to 128 bits.

The more you use your CPU, the more heat it produces and the more energy it uses. Intel does not want your CPU to burn out or to run out of power, so it throttles the CPU (makes it run slower). That way, your CPU stays warm but not too hot, and it does not draw too much power. When the processor does not have to execute AVX-512 instructions, some of the silicon remains dark, so less heat is generated and less power is consumed.

The cores that are executing SIMD instructions may run at a lower frequency. Thankfully, Intel manages voltage and frequency on a per-core basis, and Intel processors can switch frequencies quickly, within milliseconds.

Vlad Krasnov from Cloudflare wrote a blog post last year warning us against AVX-512 throttling:

If you do not require AVX-512 for some specific high performance tasks, I suggest you disable AVX-512 execution on your server or desktop, to avoid accidental AVX-512 throttling.

I am sure that it is the case that AVX-512 can cause problems for some use cases. It is also the case that some people die if you give them aspirin; yet we don’t retire aspirin.

Should we really disable AVX-512 as a precautionary stance?

In an earlier blog post, I tried to measure this throttling on a server I own but initially found no effect whatsoever. (Update: The fuller story is that I was the victim of a GNU LIBC bug.)

Vlad offered me a test case in C. His test case involves AVX-512 multiplications, while much of the running time is spent on some bubble-sort routine. It can run both in AVX-512 mode and in the regular (non-AVX-512) mode. To be clear, it is not meant to be a demonstration in favour of AVX-512: it is meant to show that AVX-512 can be detrimental.
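
To give a sense of the structure, here is a simplified sketch of my own (not Vlad's actual code): a multiplication loop that the compiler can vectorize with AVX-512, plus a bubble sort where most of the time is spent.

    #include <stddef.h>

    /* Sketch only: the real test is multithreaded and tracks progress
       with a shared counter (see the comments below). */
    void scale(float *arr, size_t n, float factor) {
        /* With -O3 -march=native on an AVX-512 machine, GCC vectorizes
           this loop using 512-bit (zmm) registers. */
        for (size_t i = 0; i < n; i++)
            arr[i] = arr[i] * factor;
    }

    void bubble_sort(float *arr, size_t n) {
        /* The mostly scalar portion that dominates the running time. */
        for (size_t i = 0; i + 1 < n; i++)
            for (size_t j = 0; j + 1 < n - i; j++)
                if (arr[j] > arr[j + 1]) {
                    float t = arr[j];
                    arr[j] = arr[j + 1];
                    arr[j + 1] = t;
                }
    }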

I did not want to run my tests using my own server this time. So I went to Packet and launched a powerful two-CPU Xeon Gold server (Intel Xeon Gold 5120 CPU @ 2.20GHz). Each of these processors has 14 cores, so we have 28 cores in total. With hyperthreading, the machine supports up to 56 hardware threads (two per core).

| threads | AVX-512 disabled | with AVX-512 |
|--------:|-----------------:|-------------:|
| 20      | 8.4 s            | 7.2 s        |
| 40      | 6 s              | 5.0 s        |
| 80      | 5.7 s            | 4.7 s        |

In an earlier test, I was using an older compiler (GCC 5), and when exceeding the number of hardware-supported threads (56), I was getting poorer results, especially with AVX-512. I am not sure why, but I suspect it is not related to throttling; it might have to do with context switching and register initialization (though that is speculation on my part).

In my latest test, I use a more recent compiler (GCC 7). As you can see, the AVX-512 version is always faster. Otherwise, I see no negative effect from the application of AVX-512. If there is throttling, it appears that the benefits of AVX-512 offset it.
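
For reference, the two binaries are built along these lines (the binary names here are illustrative; the exact commands are in my scripts):

    gcc -O3 -o bubbleavx512   bubble.c -march=native -lpthread
    gcc -O3 -o bubblenoavx512 bubble.c -march=native -mno-avx512f -lpthread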

My code is available along with all the scripts and the outputs for your inspection. You should be able to reproduce my results. It is not like Xeon Gold processors are magical faeries: anyone can grab an instance. For the record, the bill I got from Packet was $2.

Update: Later, Travis Downs reviewed this benchmark and found that it was almost designed to make AVX-512 look good. If one changes the parameters, it is easy to measure the throttling effect.

Note: I have no conflict of interest to disclose. I do not own Intel stock.

Instructions: On Packet, hit “deploy servers”. Choose Sunnyvale CA and choose m2.xlarge.x86. I went with the latest Ubuntu (18.04). You can then access it through ssh. Annoyingly, GCC is not preinstalled on the deployed server, and sudo apt-get install gcc fails at first. I fixed the problem by prepending “us.” to the repository URLs in /etc/apt/sources.list and running sudo apt-get update. (You may achieve this result with sudo sed -i 's|http://|http://us.|g' /etc/apt/sources.list.)

Further reading: AVX-512: when and how to use these new instructions

Published by

Daniel Lemire

A computer science professor at the Université du Québec (TELUQ).

22 thoughts on “The dangers of AVX-512 throttling: myth or reality?”

  1. Is your test program spending too much time in AVX-512 instructions (arr[i] = arr[i]*factor) and benefitting too much from AVX-512? Using AVX-512 instructions only occasionally seems to cause the problem because the core frequency stays low and the overall program is slower than not using AVX-512. I haven’t benched this myself, but IIRC, that’s what the Cloudflare team reported.

  2. I noticed that your compiler options for the no-AVX512 build don’t disable AVX512F, which seems to cause GCC 7.3 to still generate AVX512 code (though GCC 8.1 doesn’t). I see that you’re using GCC 5.4, which, from what I can tell, does the same thing and still gives you AVX512 code: https://godbolt.org/g/5pdUM4

    Ideally, you should be adding -mno-avx512f (and probably disable other AVX512 options, just in case) to the non-AVX512 build, or use something like -march=skylake instead of native.

    For AVX512 frequency scaling, here’s some info regarding frequencies:
    https://twitter.com/InstLatX64/status/903980178321395716
    https://pbs.twimg.com/media/DSeNw4_XkAAcSvh.jpg:large
    Generally Skylake-SP throttles the most, whereas Core-X/Xeon-W don’t suffer as much.

    1. Ideally, you should be adding -mno-avx512f (and probably disable other AVX512 options, just in case) to the non-AVX512 build, or use something like -march=skylake instead of native.

      I did this and re-ran my tests, and updated the results. Thanks.

  3. Those timings look very close for the low thread counts. Did you verify that “bubblenoavx512” is in fact not using AVX512? I just tried with gcc-5.4 on Skylake-X, and it appears that “gcc -O3 -o bubblenoavx512 bubble.c -march=native -mno-avx512dq -lpthread” creates an executable that uses a lot of %zmm registers and instructions. I’m not sure exactly which instructions are expected to cause a slowdown, and perhaps it’s working the way you intend, but it would be good to confirm that the compiler is creating the code that you expect.
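
    For example, a quick (if crude) check is to count zmm references in the disassembly:

        objdump -d bubblenoavx512 | grep -c '%zmm'

    A non-zero count means 512-bit registers are still in use somewhere.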

          1. How much of the time is spent in the AVX-512 portion and how much in the “scalar” portion?

            There is going to be a tradeoff point with “enough” 512-bit work where AVX-512 makes sense: the frequency slowdown is outweighed by doubling the FMA width. On the other side of that tradeoff point lies much real-world code which doesn’t have a big block of FMAs somewhere.

            So this result would be interesting if the vectorized FMA work is a relatively small part of the total. If it’s a big part, not as much.

            It also depends on the AVX and AVX-512 turbo frequencies, but I can’t find reliable figures published by Intel.
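
            (Roughly: if a fraction f of the scalar runtime is in the vectorizable part and that part speeds up by a factor s, while the AVX-512 frequency is a fraction r < 1 of the normal frequency, the predicted ratio of runtimes is about ((1 − f) + f/s) / r, and AVX-512 pays off only when that falls below 1.)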

    1. The AVX slowdown stuff is confusing in that there are actually three speed tiers, let’s call them fast tier, mid tier, slow tier. Intel calls them something like turbo, AVX2 turbo, AVX-512 turbo – but the latter two names are confusing because not all AVX2 instructions lead to AVX2 turbo, and same for AVX-512.

      AFAIK, the instructions you can execute while still staying in each tier are roughly like this [note 1]:

      Fast tier: all scalar instructions, all 128-bit SIMD, cheap AVX/AVX2
      Mid tier: all AVX2, cheap AVX-512
      Slow tier: (remaining AVX-512)

      So both AVX2 and AVX-512 are logically divided into cheap and expensive instructions, with the “cheap” instructions only costing as much as the expensive “smaller width” instructions (e.g., 512-bit cheap instructions cost as much as expensive 256-bit instructions, etc).

      This is true on recent server chips, but not on some older chips.

      I don’t have a handy list of what’s cheap and what’s expensive, but things like reg-reg moves, memory loads and stores, and basic integer math and bit operations are “cheap”, while things like floating-point math are “expensive”.

      [note 1]
      Note that my instruction naming isn’t really correct, but seems to follow how most people explain it: I’m referring to AVX/AVX2/AVX-512 but the actual distinguishing feature is the width of the access: so you should read AVX/AVX2 as “256-bit instructions/ymm regs” and AVX-512 as “512-bit instructions/zmm regs”. If you use 128-bit instructions introduced in AVX2 or 256-bit (ymm) versions of AVX-512 instructions, they act the same as the smaller width instruction sets (SSE or AVX/AVX2 respectively).

      1. For what it’s worth, I was able to run some tests on a Skylake W-2104 machine (a 4-core CPU), and the downclocking isn’t as bad as the above. In particular, the frequencies associated with the expensive/heavy instructions like FP and integer multiplication only kick in if you execute a lot of them (like one every couple of cycles).

        So just running the occasional FMA or whatever else you have in the heavy category won’t put you in the penalty box associated with those instructions. The light instructions did behave as expected however: AVX2 light was just as fast as scalar, and AVX-512 light went to the middle speed tier.

        I measured the speed tiers on the W-2104 as 3.2 GHz (published nominal frequency), 2.8 GHz (measured middle/AVX2 tier), and 2.4 GHz (measured slow/AVX-512 tier). The frequencies were the same no matter how many active cores.
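
        (Relative to nominal, that works out to roughly a 12.5% frequency reduction for the middle tier and a 25% reduction for the slow tier.)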

        I put more details here and the code I used to test this is here.

  4. I don’t think you can draw a broad conclusion from this one test.

    There is little doubt that AVX512 causes throttling that can make some (many?) uses slower than not using it – but that this test happens not to show it.

    Here are two “heuristic arguments” in favor of AVX-512 throttling being a problem in practice:

    1) Intel actually changed their icc compiler so that new versions are much more reluctant to use AVX-512 even for code that would “locally” benefit from it. If you want to use the old “AVX-512 happy” behavior, you now have to specify a qopt-zmm-usage argument at compile time. Details here (note it still defaults to “high” use of AVX-512 for Phi/KNL/KNB type chips that exist primarily to execute AVX-512).
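
    (Concretely, restoring the old behaviour means compiling with something like icc -xCORE-AVX512 -qopt-zmm-usage=high; the exact flag spelling may vary across icc versions.)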

    We should assume Intel is not stupid and changed the default AVX-512 use due to real-world slowdowns reported by real customers. Given that AVX-512 is a primary competitive advantage versus both x86 rivals (AMD) and non-x86 rivals, one would reasonably assume Intel would neuter AVX-512 use in their compiler only very reluctantly.

    2) Even if this test does not show a slowdown: one can be sure that it exists if we agree on two simple facts: (A) there is an actual slower AVX-512 speed (in fact, there are 3 speed tiers, but a single heavy AVX-512 instruction puts you in the slowest tier, AFAIK) (B) the benefit from AVX-512 can be arbitrarily small (for example, a single instruction).

    Combining those, the worst-case slowdown could be as bad as the full difference between the AVX-512 frequency and the frequency you would otherwise be running at (as mentioned, this might be one of the other two speed tiers out of the total of three, depending on what instructions you use). Actually the penalty could be slightly worse than that for carefully crafted code that triggers frequency transitions at a high rate on purpose (although Intel has a type of watchdog that prevents that from getting too extreme).
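
    (Concretely, taking the W-2104 numbers reported above as an example: code that gains essentially nothing from a stray heavy AVX-512 instruction could end up running at 2.4 GHz instead of 3.2 GHz, i.e., taking up to 3.2/2.4 ≈ 1.33 times as long.)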

    1. I don’t think you can draw a broad conclusion from this one test.

      I do not draw broad conclusions.

      I am asking to see the evidence. Why are we speculating? Is it the $2 it costs to rent a Packet server for an hour? Is that what is preventing people from coming up with their own measurements?

      This hardware is now easily accessible. We should just go out and measure.

      Intel certainly does not go out of its way to market AVX-512. It is some sort of evidence against AVX-512, I will grant you that… but there may be all sorts of explanations for that, including internal politics.

      If the hardware were inaccessible, we could go on with this kind of evidence. But here and now, we can measure!

      1. I do not draw broad conclusions.

        Yes, the post is perhaps carefully written not to explicitly draw a broad conclusion – but if we are not meant to extrapolate from this artificial example to something broader, why is it interesting at all?

        Only to point out that the original Cloudflare blog result could not be reproduced?

        The whole post is framed in terms of the question of whether AVX-512 is harmful in general, then a test is performed and results presented – and we aren’t supposed to draw any link between the results and a possible answer to the general question?

        I am asking to see the evidence. Why are we speculating?

        I don’t think speculation is required. We know AVX-512 throttles down, Intel describes and documents this (although they like to keep it hidden and not on ARK). We already knew of this effect with some Haswell and Broadwell AVX2 chips. The implication is “obvious”: your chip will run X% slower and as compensation you’ll gain access to 512-bit instructions.

        The Cloudflare blog only puts numbers to what was already known, and is interesting both in a “cold, hard performance numbers” sense and as a no-nonsense introduction for people who didn’t know about reduced AVX-512 speeds before. For those aware of the reduced frequency, it is just additional confirmation of a hypothesis already well modeled even without real hardware. Anyways, even without hardware, a performance model combined with known/published facts about CPU frequency is still “evidence”, right?

        Is it the $2 it costs to rent a Packet server for an hour? Is that what is preventing people from coming up with their own measurements?

        Don’t doubt that many people for whom this matters a lot have already tested this well, as they are paid directly or indirectly to do. Other people probably haven’t gotten around to AVX-512, but will test it in the future. Some people will never test and may end up running sub-optimally. Most of those won’t make it into a blog or any public forum though.

        The $2 isn’t much of a blocker, but that is only an accurate figure if you value your time at zero or are inherently interested enough that you get some positive value for investigating this.

        Intel certainly does not go out of its way to market AVX-512. It is some sort of evidence against AVX-512, I will grant you that… but there may be all sorts of explanations for that, including internal politics.

        I didn’t say anything about marketing. Actually I think Intel markets AVX-512 heavily, at least as much as any other technical feature is marketed (i.e., you aren’t going to hear about it on TV). It features prominently in most materials for the few chips that have it, and you can find a lot of materials extolling its virtues and putting forward use-cases and performance numbers.

        What I said is that despite AVX-512 being the most prominent new feature in Skylake-SP/Skylake-X, and despite Intel being highly incentivized to make icc compile code that runs as fast as possible on Intel and not on AMD[1], they had to change their compiler to generate much less AVX-512 code, and this happened quietly in a dot-upgrade (e.g., something like 18.2 to 18.3). That’s not internal politics.

        Do you think Intel is somehow sabotaging their most prominent and important ISA enhancement since probably x86-64 by reducing the use of AVX-512 and being dishonest about why they are doing it?

        Maybe it’s better to be direct and ask what your interpretation of your results is?

        Do you think that Intel CPUs actually do not have lower AVX-512 speeds as widely understood and as Intel themselves describes?

        Do you think that they do have such lower speeds, but don’t actually throttle down to them in certain cases?

        Do you think that there is some unknown interaction where large blocks of mixed AVX-512 and non-AVX-512 code can somehow be faster than a simple model of a linear combination of the two parts, scaled by the lower frequency would predict?

        All of those are, IMO, extraordinary claims and require extraordinary evidence.

        Given this one counter-example, without details such as (a) the observed frequency for the various tests and (b) the fraction of runtime in the AVX-512 vs scalar bubble-sort parts, my gut reaction is to guess that it is none of the above but rather one of:

        1. The test performs a lot of work in the AVX-512 part, which gets a 2x speedup, more than making up for the slower frequency (an expected, unsurprising result). Why the test you received from Cloudflare would be configured that way is still interesting.
        2. The hardware is configured unexpectedly, e.g., with turbo disabled.
        3. There is a flaw in the test, e.g., in the measurement routines, too few iterations, or something else.
        4. The scalar and vector portions are interleaved in a very fine-grained manner, making the linear-combination model inapplicable and meaning one part can dominate the runtime (this is just a special case of (1)).
        5. Some other unexpected “non-general” interaction occurs, e.g., due to a compiler flaw/limitation, it emits bad code for the non-AVX512 path when it doesn’t have to.

        I’m not in any way arguing that AVX-512 is useless. Quite the opposite: I’m excited about it and it provides a lot of power. The frequency/ISA tradeoff though is completely real and makes isolated use of AVX-512 risky from a performance PoV.

        [1] If their code takes a different, slow path on AMD, even better – and Intel’s icc does this to this day (purposely using the vendor string to exclude competitor chips from the fast-path code even if such a chip supports all the required features). You’ll find a disclaimer to this effect at the bottom of practically every Intel webpage (the result of a settlement for an investigation into whether this was anti-competitive).

    2. There is little doubt that AVX512 causes throttling that can make some
      (many?) uses slower than not using it – but that this test happens not
      to show it.

      I’m not following the issue very closely, but I guess I still have some doubt. One part I don’t understand yet is which machines are affected. For the AVX2 throttling, I think it was just the High Core Count (HCC) processors. Is AVX512 throttling similarly restricted to certain processors? Do you have an example that shows the effect on Daniel’s Skylake-X machine?

      1. Yeah the situation in AVX/2 was less severe (I’ll use that notation since it’s really both AVX and AVX2 that are affected – nothing special about AVX2). As you say, only a small slice of some server chips had the lower AVX/2 frequency. Also, the server chips rolled out (as usual) way after the consumer chips so at least for all the initial use and benchmarks there was no such difference in speeds. Only later did this appear and it was novel at the time.

        Skylake-SP/X is different. Here, the server chips came out first with AVX-512, and AFAIK the vast majority have lower AVX-512 speeds. There are also now three speed bins, and the slowest is pretty slow, relatively speaking. Add to that, some chips have only one 512-bit FMA unit, so on those chips AVX-512 was really unappealing.

        One problem is that the frequencies for any single chip now form a large table with 3 rows for the three speeds and the number of active cores, so good luck finding a table showing all chips and speeds at a glance!

        Wikichip seems to have it for the Xeon Gold 5120 Daniel used though (here: https://en.wikichip.org/wiki/intel/xeon_gold/5120 ). Note that the ratio between fast and slow is small with 1 or 2 cores active, but it is really bad with 9 cores or more. So now you have this additional dimension of active cores, which Daniel covered a bit above (though with no single-core numbers, which naively would seem likely to make his case stronger).

  5. I would be interested in seeing benchmarks for workloads which are unlikely to be CPU bound with CPU frequency varied at a fixed interval. Such benchmarks would demonstrate that, for some problems, small perturbations in size are better predictors of relative performance than small perturbations in frequency, i.e. that the workload in question is effectively immune to throttling.

  6. Given this one counter-example, without details such as the (a)
    observed frequency for the various tests and (b) the fraction of
    runtime in the AVX-512 vs scalar bubble-sort parts, my gut reaction is
    to guess that it is none of the above but rather one of: …

    My guess (untested) is that the test as written may not actually be CPU limited, and thus changes in core frequency might not have the adverse impact we expect. The C code increments a shared atomic variable on each iteration, which on a multi-socket machine would be 1) not cheap, and 2) not dependent on core frequency. Presumably, the vectorized code increments less frequently (once per vector?).

    I wonder if doing half as many increments (AVX2 to AVX512) at a constant uncore frequency might more than offset the less than 50% drop in per core frequency. I don’t blame Daniel for this (he’s using someone else’s test) but it seems like an unnecessary confounder. At the least, it would be nice to know that the test is directly impacted by CPU frequency.
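
    The pattern I have in mind is essentially this (a sketch, not the actual test code):

        #include <stdatomic.h>

        /* Shared across all threads, and across both sockets. */
        atomic_long progress;

        void iteration_done(void) {
            /* Each increment implies a cache-line transfer between cores
               (or sockets); that cost is governed by the uncore and the
               interconnect, not by the core frequency. */
            atomic_fetch_add(&progress, 1);
        }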

    1. Indeed, the most likely case is “testing error”, or not measuring what you think you’re measuring.

      An incidental effect like that could absolutely cause the observed behavior. Also, things like contention are strongly non-linear, so even if the measured runtime of the atomic part is fairly small, it could have an outsized impact; e.g., it could be 10x slower to do 2x as many increments.

      I guess I’m slowly getting auto-roped into actually looking at this for real :).

  7. For possible related causes of slowdown when using AVX, this bug is a really good read: https://bugs.launchpad.net/ubuntu/+source/glibc/+bug/1663280

    The super-short version was that mixing SSE (128-bit) and 256-bit AVX instructions, which use the same registers but “extended” to 256 bits, requires the CPU, on some models, to save the register state between transitions, a “transition penalty” – this penalty only exists on some modern CPUs and not others. glibc and the application were interchangeably using SSE and AVX, which ruined performance by 4x or more in some cases (seen, for example, with LINPACK that had AVX512 disabled). In this case, glibc was using some SSE instructions, so the application code suffered transition penalties when using AVX instructions.

    In this bug, the triggering code was in _dl_runtime_resolve which was frequently called to dynamically resolve symbols from external libraries.

    While this involves only 256-bit AVX, my general point is that perhaps these issues more often come from some kind of “side effect”, such as this transition penalty, rather than purely from AVX-512 itself. However, I am not an architecture guru and I have no idea how you might run into this situation, so I’ll leave that up to the reader!

    1. This bug description is somewhat misleading. Mixing 128-bit and 256-bit registers came with a penalty on older Intel processors. However, there is no such penalty on recent Intel processors. AMD is unaffected.

      1. Yes, there is a penalty on recent processors, and the way it manifests is arguably much worse than on the old ones.

        On the old processors, there was a “one time” transition penalty when you transitioned between the various states, e.g., when you executed a (non-VEX) SSE instruction after an AVX instruction, you’d suffer a penalty while the processor stashed away the high bits, but then you’d run at full speed, at least until you suffered the next transition penalty.

        So unless you actually have fine-grained VEX and non-VEX mixing in your application code, you’d probably suffer only a limited number of penalties due to bad behavior by, say, the runtime loading code (as in the bug).

        On new CPUs, however, the transition penalties are gone, but they are replaced with a new, ongoing penalty for all SSE instructions while you are in the “dirty upper” state: every ALU operation must now merge the result of the SSE instruction with the unaffected upper bits. This gives every destination register a “false” dependency on the prior value of the register (“false” in quotes because the dependency is, in a sense, real) and adds many merging uops (basically one per instruction).

        The problem is that the penalty lasts pretty much forever. If you have some SSE code that has always been fast, and one stray VEX instruction occurs in any random library, runtime loader, glibc implementation or interrupt handler, you enter slow mode and do not leave it (indeed, it is more or less impossible for you to leave it, since as an SSE program you don’t even know about vzeroall or vzeroupper, which are what get you out of that state).

        At least the penalty only applies to SSE code though, not AVX code.
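
        (For code that controls the AVX side, the standard mitigation is to clear the upper state before any legacy-SSE code can run; here is a sketch:)

            #include <immintrin.h>

            void avx_work(float *dst, const float *src) {
                __m256 v = _mm256_loadu_ps(src);  /* 256-bit (VEX) work */
                _mm256_storeu_ps(dst, _mm256_add_ps(v, v));
                /* Emits vzeroupper: exits the dirty-upper state so that
                   subsequent non-VEX SSE code is not penalized. */
                _mm256_zeroupper();
            }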

        This stackoverflow answer has more details including diagrams.

        BTW, the glibc bug is badly confused. It mixes up the two penalty behaviors: it implies that the problem is transition penalties, but later says the penalty is ongoing. The real problem was with new chips, not old ones, since _dl_runtime_resolve (a function which is called a limited number of times, mostly near the start of the program) was dirtying the upper halves, which causes a permanent slowdown for SSE-using code. It may have also affected earlier chips, but only in a very, very limited way, by adding some transition penalties: only one per _dl_runtime_resolve call, which itself generally takes longer than the penalty cost, so there would be no “6x slowdown” type scenarios like in the SO answer.
