AVX-512: when and how to use these new instructions

Note: This blog post from 2018 predates the Intel Ice Lake (Sunny Cove) and AMD Zen 4 processors. These newer processors are not affected by the issues described in this blog post.

Our processors typically do computations using small data stores called registers. On 64-bit processors, 64-bit registers are frequently used. Most modern processors also have vector instructions and these instructions operate on larger registers (128-bit, 256-bit, or even 512-bit). Intel’s new processors have AVX-512 instructions. These instructions are capable of operating on large 512-bit registers. They have the potential of speeding up some applications because they can “crunch” more data per instruction.

However, some of these instructions use a lot of power and generate a lot of heat. To keep power usage within bounds, Intel reduces the frequency of the cores dynamically. This frequency reduction (throttling) happens in any case when the processor uses too much power or becomes too hot. However, there are also deterministic frequency reductions based specifically on which instructions you use and on how many cores are active (downclocking). Indeed, when any 512-bit instruction is used, there is a moderate reduction in speed, and if a core uses the heaviest of these instructions in a sustained way, the core may run much slower. Furthermore, the slowdown is usually worse when more cores use these new instructions. In the worst case, you might be running at half the advertised frequency and thus your whole application could run slower. On this basis, some engineers have recommended that we disable AVX-512 instructions by default on our servers.

So what do we know about the matter?

  1. The term “AVX-512” can describe instructions operating on various register lengths (128-bit, 256-bit and 512-bit). When discussing AVX-512 downclocking, we mean to refer only to the instructions acting on 512-bit registers. Thus you can “safely” benefit from many new AVX-512 instructions and features such as mask registers and new memory addressing modes without ever worrying about AVX-512 downclocking, as long as you operate on shorter 128-bit or 256-bit registers. You should never get any downclocking when working on 128-bit registers.
  2. Downclocking, when it happens, is per core and for a short time after you have used particular instructions (e.g., ~2ms).
  3. There are heavy and light instructions. Heavy instructions are those involving floating point operations or integer multiplications (since these execute on the floating point unit). It seems like the leading-zero-bits AVX-512 instructions are also considered heavy. Light instructions include integer operations other than multiplication, logical operations, data shuffling (such as vpermw and vpermd) and so forth. Heavy instructions are common in deep learning, numerical analysis, high performance computing, and some cryptography (i.e., multiplication-based hashing). Light instructions tend to dominate in text processing, fast compression routines, vectorized implementations of library routines such as memcpy in C or System.arrayCopy in Java, and so forth.
  4. Intel cores can run in one of three modes: license 0 (L0) is the fastest (and is associated with the turbo frequencies written on the box), license 1 (L1) is slower and license 2 (L2) is the slowest. To get into license 2, you need sustained use of heavy 512-bit instructions, where sustained means approximately one such instruction every cycle. Similarly, if you are using 256-bit heavy instructions in a sustained manner, you will move to L1. The processor does not immediately move to a higher license when encountering heavy instructions: it will first execute these instructions with reduced performance (say 4x slower) and only when there are many of them will the processor change its frequency. Otherwise, any other 512-bit instructions will move the core to L1: the processor stops and changes its frequency as soon as an instruction is encountered.

    On server-class processors, the downclocking is determined on a per-core basis based on the license and the total number of active cores on the same CPU socket, irrespective of the license of the other cores. That is, to determine the frequency of a core under downclocking, you need only to know its license (determined by the type of instructions it runs) and count the number of cores where code is running. Thus you cannot downclock other cores on the same socket, other than the sibling logical core when hyperthreading is used, merely by running heavy and sustained AVX-512 instructions on one core. If you can isolate your heavy numerical work on a few cores (or just one), then the downclocking is limited to these cores. On Linux, you can control which cores are running your processes using tools such as taskset or numactl.

    You will find tables online like this one, for the Intel Xeon Gold 5120 processor…
    mode       1 active core   9 active cores
    Normal     3.2 GHz         2.7 GHz
    AVX2       3.1 GHz         2.3 GHz
    AVX-512    2.9 GHz         1.6 GHz

    We have chosen to only include two columns. The frequency behavior is the same for 9 to 12 cores and is the worst case for the L2 license. When you have more than 9 active cores, there is no further downclocking documented for the L2 license.

    These tables are somewhat misleading. The row “AVX-512” (the L2 license) really means “sustained use of heavy AVX-512 instructions”. The row “AVX2” (L1 license) includes all other use of AVX-512 instructions and heavy AVX2 instructions. That is, it is wrong to assume that the use of any AVX-512 instruction puts the cores into the frequency indicated by the AVX-512 row.

    These tables do give us some useful information, however:

    • a. These tables indicate that frequency downclocking is not specific to AVX-512. If you have many active cores, you will get downclocking in any case, even if you are not using any AVX-512 instructions.
    • b. If you just use light AVX-512, even if it is across all cores, then the downclocking is modest (15%).
    • c. If you are doing sustained heavy numerical work while many cores are active, then the downclocking becomes significant on these cores (~40%).
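
The percentages quoted in (b) and (c) can be recomputed from the table. The following is a minimal sketch, assuming the rows "Normal", "AVX2" and "AVX-512" map to licenses L0, L1 and L2 as described above; the dictionary and function names are made up for illustration.

```python
# Turbo frequencies (GHz) for the Intel Xeon Gold 5120, copied from the table
# in the post. Keys are (license, active core count). Only the two published
# columns (1 and 9 active cores) are included.
TURBO_GHZ = {
    ("L0", 1): 3.2, ("L0", 9): 2.7,  # "Normal" row
    ("L1", 1): 3.1, ("L1", 9): 2.3,  # "AVX2" row
    ("L2", 1): 2.9, ("L2", 9): 1.6,  # "AVX-512" row
}


def downclock_percent(license_name, active_cores):
    """Frequency loss relative to L0 at the same active core count, in percent."""
    baseline = TURBO_GHZ[("L0", active_cores)]
    actual = TURBO_GHZ[(license_name, active_cores)]
    return 100.0 * (1.0 - actual / baseline)


if __name__ == "__main__":
    for lic in ("L1", "L2"):
        for cores in (1, 9):
            loss = downclock_percent(lic, cores)
            print(f"{lic}, {cores} active core(s): {loss:.1f}% below L0")
```

With nine active cores, this gives about 15% for L1 and about 41% for L2, which is where the "modest" and "significant" figures come from. With a single active core the losses are much smaller (about 3% and 9%).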

Should you be using AVX-512 512-bit instructions? The goal is never to maximize the CPU frequency; if that were the case, people would use a 14-core Xeon Gold processor with a single active core. These AVX-512 instructions do useful work. They are powerful: having registers 8 times larger can allow you to do much more work and greatly reduce the total number of instructions being issued. We typically want to maximize the amount of work done per unit of time. So we need to make engineering decisions. Evidently, a downclocking of 10% does not mean that you are going 10% slower.
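
To make the trade-off concrete, here is a small back-of-the-envelope model. It is a simplification, not a measurement: it assumes the whole core runs at the reduced frequency, that a fixed share of the original cycles is spent in code that gets vectorized, and that everything else costs the same number of cycles as before. The function name and parameters are hypothetical.

```python
def relative_throughput(per_cycle_speedup, frequency_ratio, vector_fraction=1.0):
    """Work done per second with wide instructions, relative to without them.

    per_cycle_speedup: how many times fewer cycles the vectorized part needs.
    frequency_ratio:   new core frequency divided by the old one
                       (0.9 means a 10% downclock).
    vector_fraction:   share of the original cycles spent in the code that
                       gets vectorized (1.0 means all of it).
    """
    if per_cycle_speedup <= 0 or frequency_ratio <= 0:
        raise ValueError("speedup and frequency ratio must be positive")
    if not 0.0 <= vector_fraction <= 1.0:
        raise ValueError("vector_fraction must be between 0 and 1")
    # Cycles needed after vectorization, as a fraction of the original cycles.
    cycles_after = (1.0 - vector_fraction) + vector_fraction / per_cycle_speedup
    # Fewer cycles help, a lower frequency hurts.
    return frequency_ratio / cycles_after


if __name__ == "__main__":
    # All of the work is vectorized and runs 2x faster per cycle, 10% downclock.
    print(f"{relative_throughput(2.0, 0.9):.2f}x")
    # Only 10% of the cycles benefit, but the whole core is downclocked by 10%.
    print(f"{relative_throughput(2.0, 0.9, vector_fraction=0.1):.2f}x")
```

Under these assumptions the first case ends up 1.8 times faster despite the 10% downclock, while the second case ends up about 5% slower: the same frequency penalty is paid, but too little of the work benefits from the wider registers.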

Here are some pointers:

  1. Engineers should probably use tools to monitor the frequency of their cores to ensure they are running in the expected license. Massive downclocking is then easily identified. For example, the perf stat command on Linux can be used to determine the average frequency of any process, and finer grained details are available using the CORE_POWER.LVL0_TURBO_LICENSE event (and the corresponding events for LVL1 and LVL2).
  2. On machines with few cores (e.g., standard PC), you may never get the kind of massive downclocking that we can see on a huge chip like the Xeon Gold processor. For example, on an Intel Xeon W-2104 processor, the worst downclocking for a single core is 2.4 GHz compared to 3.2 GHz. A 25% reduction in frequency is maybe not an important risk.
  3. If your code at least partly involves sustained use of heavy numerical instructions you might consider isolating this work to specific threads (and hence cores), to limit the downclocking to cores that are taking full advantage of AVX-512. If this is not practical or possible, then you should mix this code with other (non-AVX-512) code with care. You need to ensure that the benefits of AVX-512 are substantial (e.g., more than 2x faster on a per cycle basis). If you have AVX-512 code with heavy instructions that runs 30% faster than non-AVX-512 on a per-cycle basis, it seems possible that once it is made to run on all cores, you will not be doing well.

    For example, the openssl project used heavy AVX-512 instructions to bring down the cost of a particular hashing algorithm (poly1305) from 0.51 cycles per byte (when using 256-bit AVX instructions) to 0.35 cycles per byte, a 30% gain on a per-cycle basis. They have since disabled this optimization.
  4. The bar for light AVX-512 is lower. Even if the work is spread on all cores, you may only get a 15% frequency reduction on some chips like a Xeon Gold. So you only have to check that AVX-512 gives you a greater than 15% gain for your overall application on a per-cycle basis.
  5. Library providers should probably leave it up to the library user to determine whether AVX-512 is worth it. For example, one may provide compile-time options to enable or disable AVX-512 features, or even offer a runtime choice. Performance sensitive libraries should document the approach they have taken along with the likely speedups from the wider instructions.
  6. A significant problem is compiler inserted AVX-512 instructions. Even if you are not using any explicit AVX-512 instructions or intrinsics, compilers may decide to use them as a result of loop vectorization, within library functions and other optimization. Even something as simple as copying a structure may cause AVX-512 instructions to appear in your program. Current compiler behavior here varies greatly, and we can expect it to change in the future. In fact, it has already changed: Intel made more aggressive use of AVX-512 instructions in earlier versions of the icc compiler, but has since removed most use unless the user asks for it with a special command line option.

    Based on some not very comprehensive tests of LLVM’s clang (the default compiler on macOS), GNU gcc, Intel’s compiler (icc) and MSVC (part of Microsoft Visual Studio), only clang makes aggressive use of 512-bit instructions for simple constructs today: it used such instructions while copying structures, inlining memcpy, and vectorizing loops. The Intel icc compiler and gcc only seem to generate AVX-512 instructions for this test with non-default arguments: -qopt-zmm-usage=high for icc, and -mprefer-vector-width=512 for gcc. In fact, for most code, such as the generated copies, gcc seems to prefer to use 128-bit registers over 256-bit ones. MSVC currently (up to version 2017) doesn’t support compiler generated use of AVX-512 at all, although it does support use of AVX-512 through the standard intrinsic functions.

    From the compiler’s perspective, deciding to use AVX-512 instructions is difficult: they often provide a reasonable local speedup, but at the possible cost of slowing down the entire core. If such instructions are frequent enough to keep the core running in the L1 license, but not frequent enough to produce enough of a speedup to counteract the slowdown, the program may run slower overall after you recompile to support AVX-512.

    It is hard to give general recommendations here beyond compiling your program both with and without AVX-512 and benchmarking in a realistic environment to determine which is faster. Because of the large variation in AVX-512 behavior across active core counts and Intel hardware, one should ensure they match these factors as closely as possible when testing performance.
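
As a rough illustration of pointer 3, the poly1305 cycles-per-byte figures can be converted into bytes per second. This sketch pairs the openssl numbers with the Xeon Gold 5120 frequencies from the table earlier in the post, which is purely an assumption made for illustration: the openssl figures were not necessarily measured on that chip. It also assumes the 256-bit code runs in the L1 license and the 512-bit code in the L2 license.

```python
def throughput_gb_per_s(frequency_ghz, cycles_per_byte):
    """Bytes hashed per second (in GB/s) at a given frequency and cost per byte."""
    if frequency_ghz <= 0 or cycles_per_byte <= 0:
        raise ValueError("frequency and cycles per byte must be positive")
    # (1e9 cycles per second) / (cycles per byte) = 1e9 bytes per second.
    return frequency_ghz / cycles_per_byte


# Cost of poly1305 per byte, as quoted in the post.
CYCLES_PER_BYTE = {"256-bit": 0.51, "512-bit": 0.35}

# Assumed frequencies (GHz): Xeon Gold 5120 table, L1 row for the 256-bit
# code and L2 row for the 512-bit code, keyed by active core count.
ASSUMED_GHZ = {
    1: {"256-bit": 3.1, "512-bit": 2.9},
    9: {"256-bit": 2.3, "512-bit": 1.6},
}

if __name__ == "__main__":
    for cores, frequencies in ASSUMED_GHZ.items():
        for width, ghz in frequencies.items():
            rate = throughput_gb_per_s(ghz, CYCLES_PER_BYTE[width])
            print(f"{cores} active core(s), {width}: {rate:.2f} GB/s per core")
```

Under these assumptions, the 512-bit version is clearly ahead with one active core (roughly 8.3 GB/s against 6.1 GB/s), but with nine active cores the two are nearly tied (roughly 4.6 GB/s against 4.5 GB/s per core), while any other code sharing those cores now runs at the lower frequency.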

Future work:

  1. It seems that there is a market for a tool that would monitor the workload of a server and identify when and why downclocking occurs.
  2. Operating systems or application frameworks could assign threads to specific cores according to the type of instructions they are using and the anticipated license.
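
Until such support exists, an application can restrict its own heavy work to chosen cores, much like taskset does from the command line. The following is a minimal sketch using Python's os.sched_setaffinity, which is only available on some platforms (Linux in particular); the helper name pin_to_cores is hypothetical.

```python
import os


def pin_to_cores(cores):
    """Restrict the caller to the given cores and return the resulting core set.

    Cores that the caller is not currently allowed to use are ignored.
    """
    if not hasattr(os, "sched_setaffinity"):
        raise OSError("CPU affinity control is not available on this platform")
    allowed = os.sched_getaffinity(0)  # 0 means "the caller"
    target = set(cores) & allowed
    if not target:
        raise ValueError("none of the requested cores are available")
    os.sched_setaffinity(0, target)
    return os.sched_getaffinity(0)


if __name__ == "__main__" and hasattr(os, "sched_getaffinity"):
    available = sorted(os.sched_getaffinity(0))
    # Keep the heavy AVX-512 work on a single core so that only this core
    # (and its sibling logical core, if any) is downclocked.
    heavy_cores = available[:1]
    print("heavy work restricted to cores:", sorted(pin_to_cores(heavy_cores)))
```

A process that does heavy numerical work could call this once at startup, while latency-sensitive processes are pinned to the remaining cores.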

Final words: AVX-512 instructions are powerful. With great power comes great responsibility. It seems unwarranted to disable AVX-512 by default at this time. Instead, the usual engineering evaluations should proceed.

Credit: This post was co-authored with Travis Downs.

Daniel Lemire, "AVX-512: when and how to use these new instructions," in Daniel Lemire's blog, September 7, 2018.

Published by

Daniel Lemire

A computer science professor at the University of Quebec (TELUQ).

15 thoughts on “AVX-512: when and how to use these new instructions”

  1. Your instruction stream density is what decides the frequency decrease you will have when you use AVX512. Your frequency decrease can go from a few 100 MHz to 1 GHz on the Xeon Gold. To understand how much you will lose in frequency, the Power Control Unit (PCU) will count the “unit of power” (it has a look up table for almost every operation in the SoC, including fabric and Cores). Some of the heavy instructions, like Fused Multiply-Add (FMA), get a really high count of unit of energy, so it is very likely that if you end up at 2.9 GHz on the Xeon, you are using A LOT of that. You are usually rewarded by the SIMD speed up of the 512bits if you have a high count of FMAs. Optimizing for AVX512 requires a good understanding of the instruction stream, and ASM optimization, AND a good understanding of how the PCU works. Here is Efi explaining how it was working in SandyB, the mechanisms have changed a little, but not enough to be not useful. https://www.hotchips.org/wp-content/uploads/hc_archives/hc23/HC23.19.9-Desktop-CPUs/HC23.19.921.SandyBridge_Power_10-Rotem-Intel.pdf
    Globally, if you are using instructions son of MMX, using int SIMD, if your stream is optimized properly you will end up around 3.5Ghz. IF you have a dense IS of SIMD FP (Instruction stream), you will end up at 2.9Ghz to 3.2Gz, Those numbers are only good on the Xeon Gold, if you using SKX, get to the UEFI settings, and equalize frequency of AVX2 and AVX512, and this will solve all of your problems. I never find a SKX (ExtremeEdition) that can not sustain the POR frequency doing AVX512, if you put it into a motherboard with a strong VR (Voltage regulation) and a 1000 Watt power supply.
    Good day !

    1. Are you saying there are more than three levels? Based on my tests and all the documentation I’ve seen there are only three levels, and when you enter them is more or less deterministic. The main remaining question mark is exactly how “dense” the wide FP instructions have to be, and how many you have to run, in order to trigger the L0 -> L1 and L1 -> L2 transitions.

      This puts aside actual chip-wide temperature, power or current throttling which is a different thing and doesn’t seem to kick in on the server chip we tested.

      1. Actually, they are not different thing , they are all linked inside the PCU, there are more than 3 levels if you experiment properly, when you get into the transitional phase where your code is dense, the PCU will adjust to try to shave the TDPs of your socket. This is why all dense workload using AVX512 do not all end up at 2.9GHz when doing so. You really need to get to understand the PCU if you want to understand the behaviors of the Xeons

        1. You might run into other types of throttling which cause the chips to deviate from the published turbo levels of the three licenses, especially if your cooling is inadequate or your chip has a low configured TDP, but in general our tests and the documentation seem to indicate that the three levels are largely what matters, especially for well-cooled, high TDP server chips.

          Note that there are three licenses, but the frequency levels depend also on the active core count, as described above, so the frequency is a two-dimensional matrix: on a 14-core chip like the 5120, there are 3 * 14 = 42 possible frequency levels. Note that these values are published by Intel!

          “This is why all dense workload using AVX512 do not all end up at 2.9GHz when doing so.”

          Based on our experiments, dense workloads ran at the expected frequency on all cores: which is only 2.9 GHz for 1 or 2 cores. For 9 or more cores, the frequency is 1.6 GHz, for example.

          Finally, there is also transition behavior, not described in this article, when entering and leaving the various licenses and also when changing core counts: but as far as I can tell this involves only throttling instruction dispatch and/or executing wide instructions on narrower units, and periods where the chip is halted, but not any new frequency levels.

          Do we agree yet?

          Note that you can run the same tests we did using avx-turbo.

  2. Interesting article, I recently did some energy experiments on an Intel Skylake Gold 6154 based machine and also had to tackle these issues.

    I found the Microway database insightful on AVX-based frequency reductions.


    They imply that the chip is free to do whatever it wishes so long as it does not violate the TDP constraints. I’m not sure if this contradicts your findings with 3 discrete license states or not. Maybe that is how they choose to implement this feature. Although this wouldn’t take into account processor binning. Therefore I assumed it would be finer-grained than this.

    On our heavily vectorised (AVX-512 flops) code we actually found the frequency drop was not as large as initially feared. This could have had something to do with effective water-cooling which keeps the temperatures down.

    Would be so much easier if Intel gave some hints…rather than leaving it to educated guesswork!

    1. “They imply that the chip is free to do whatever it wishes so long as it does not violate the TDP constraints.”

      Can you point me at the exact quote where they imply that the chip is free to do anything?

      What is certainly true is that above and beyond downclocking, there is TDP-related frequency throttling.

      That is, you cannot be sure that the chip will run at the specified frequency. For example, if it gets too hot, it might run slower, certainly.

      I would not describe that as the chip being free to do whatever it wishes. That would be quite a painful design for software engineers to cope with.

      1. Perhaps Ben is referring to the charts in the section “AVX-512, AVX, and Non-AVX Turbo Boost”, which imply that each of the licenses cover a range of frequencies (with the range being very large in the case of the single-core frequencies).

        What these charts are showing is the turbo speed for the license and core count at the top of each interval and the published “base frequency” at the bottom of each interval. Since there is only a single published base frequency (which doesn’t depend on core count – hence should work even for max cores), the bottom limit of each range is the same for both graphs.

        Essentially this is a claim that you’ll get performance somewhere between the base frequency and the turbo frequency, which is correct in principle! Intel only really guarantees operation at the base frequency and speeds above that are “opportunistic” so you may not always get them depending on various factors. In practice, the turbo speeds are very reliable, unless you are doing something weird or you are using a very low TDP chip: you usually get exactly the max turbo you are allowed to get, consistently. People purchase chips based on that behavior too: you’d be pretty annoyed if for some reason you didn’t hit the published turbo frequencies.

        If someone has any evidence that Skylake Xeon chips consistently run at somewhere other than the published max turbo for the license and core count, with the standard TDP configuration (i.e., not setting a lower than expected TDP in the BIOS/firmware), I’d like to see it!

        1. @Travis that was exactly the point I was trying to make, although I agree I was a bit ambiguous.

          Intel publishes essentially a list of guarantees for the maximum and minimum frequencies on the various possible workloads: number of cores executing different type of instructions. The microway link I posted publishes these in a series of box plots. So when you buy a 3GHz chip this is the minimum for non vectorised operations on a single core I believe. Even though it will usually be able to run at near enough the maximum “TurboBoost” frequency.

          These guarantees are made to ensure that even the worst quality chips they push out can make those frequencies while staying within the TDP. In our testing we found that it often did much better than advertised. It never had to go down to the 1.6GHz AVX-512 minimum frequency.

          With this new knowledge of the licenses, I will try to find some time to go back over the data and see if there are any “clustering points” around these license frequencies.

          Maybe they are used as a hint to the core as to a rough frequency and then the PCU does the rest?

  3. “Intel publishes essentially a list of guarantees for the maximum and minimum frequencies on the various possible workloads: number of cores executing different type of instructions.”

    I have never seen published minimum values on a per-core basis. The microway link just uses the same minimum “base frequency” for the single-core and all-cores case, which is especially unrealistic (i.e., will never happen unless you put your CPU in an oven) for the single-core case. If you have a link to minimum per-core frequencies published by Intel, I’d like to see it!

    As far as I can tell, they publish only max turbo frequencies, and these are also the ones you care about because you’ll usually be running at that speed.

    “Maybe they are used as a hint to the core as to a rough frequency and then the PCU does the rest?”

    Here’s my rough model of how this works: first you have the deterministic published behavior described in this post, and also by Intel. This puts a hard cap on the max speed in a given configuration, and generally can’t be adjusted (outside of chips with “unlocked multipliers”). This works “deterministically” in a fairly simple way based on the core count + license charts, and is the same for every CPU of a given model. The only relevant numbers here are the “turbo” frequencies: I don’t think the base frequency ever comes into play. In general, the CPU will “try” to run at the speed looked up from the tables.

    These tables mean that CPUs will generally run “slower” if they are executing heavier instructions, or are running with more cores, but I wouldn’t call this throttling: there are just different design speeds for different points in this matrix. So to some extent they are the modern equivalent of the marketed CPU speeds, but for marketing and sanity reasons Intel is of course not printing this matrix “on the box”.

    Then, behind that, you have a complex PCU layer which has several feed-forward and feed-back mechanisms to monitor the predicted and/or actual power, current and temperature levels and to potentially apply additional throttling based on various thresholds. This can only slow down your chip compared to the design speeds, never speed it up.

    For example, it may measure the instantaneous current and use that to calculate instantaneous power, and then ensure that the power over some interval doesn’t exceed some threshold. It may use different thresholds for different time periods too: you may be allowed to run with a TDP of 130W for 20 seconds, but only 100W longer term. This type of speed adjustment by the PCU I would label as “throttling”. It may be implemented by changing the frequency/voltage (p-state change), or perhaps by clock gating to change the duty cycle (more instantaneous and fine-grained, but less efficient longer term).

    In addition to power the PCU will monitor temperature as kind of a last resort: the TDP throttling should prevent the temperature from rising too high in normal circumstances, but it’s not guaranteed, e.g., if the ambient temperature is high, the cooling system isn’t working properly (vent blocked, etc) or the values are just too optimistic: if the temperature reaches some threshold, usually right around 100C you’ll again get throttling.

    Many of these “throttling” behaviors are somewhat configurable: the manufacturer (might be the system integrator, or the motherboard manufacturer or the cloud provider or whatever in various scenarios), can actually set some of these values, rather than use the default. For example a system with better than average cooling can set higher TDP values: this doesn’t allow it to run faster than the max set by the matrix, but it might delay or eliminate TDP related throttling by the PCU, since that’s using the configurable thresholds. Similarly a cool & quiet system might want to set lower TDPs.

    These kinds of configurable thresholds are one reason you see CPU performance differences between different motherboards/systems even with the identical CPU (not the only reason: they may have slightly different base clocks too).

    Unlike the matrix lookup, this behavior isn’t really going to be deterministic from an end user point of view as it depends on many fine-grained internal and external factors. However, it is at least very observable: the PCU sets various bits in MSRs indicating what it did: throttling due to TDP limits, current throttling, temperature-throttling, etc – you can determine pretty much if any of them happened over any interval. Intel XTU on Windows shows some of this, but if you dig into the MSR you can get even more info.

    Since you have these two quite different methods to determine the actual frequency, the question naturally arises: which is more important? Are you usually running at the “max turbo” that is described by the license + active core matrix? Or are you usually running in some kind of throttled state described in the second half of my description above? It depends on the chip. For server chips, like the ones we’ve tested for this article with AVX-512, it is my impression and observation that they can usually run indefinitely at their “max turbo” without throttling, at least with reasonable cooling (which cloud providers probably have). This was true also on the Skylake W-2104 that we tested, a “workstation” chip (but it’s a small, cheap one). So in those scenarios I think you should consider the max turbo the dominant factor.

    The situation is different for chips and devices with restricted TDP. A 4-core 15W laptop chip is probably not going to be able to run with all cores at max speed indefinitely: you’ll exceed the TDP. It probably will run that way for a short burst, since Intel has these thresholds where you can exceed your TDP for 10 seconds or something like that, but then it will slow down to keep the TDP at the configured value.

    This kind of behavior is common for the chips that go in small and light devices like thin laptops and tablets. You can see it in the “throttling” tests that some reviews provide, showing performance over time for sustained loads. It’s super prevalent in phones (which of course are not x86) and usually the dominant factor for sustained CPU loads over a minute or so.

    Not all laptop chips fall into this category though: I have a 45W i7-6700HQ (Skylake) and as far as I can tell it is OK to run pretty much any load on all cores at the max turbo (which on this chip varies only by core count, not by “license” – so that includes AVX2 FMA operations), although the fans do spin up and the CPU cores sit at about 90C.

    Well that’s my mental model anyways!

  4. After all that, I forgot to add the part about chip to chip variation that I was building towards…

    So the matrix will be the same for every chip of a given model, and the PCU algorithms and thresholds will likely also be the same, but two chips of the same model may run at slightly different frequencies and slightly different power draws depending on characteristics of that particular wafer, where the chip appeared in the wafer and minor process changes over time.

    So one chip might draw more power running the same load than another chip with the same model, which in the case that PCU throttling occurs, could mean that one chip runs faster. Of course this can also happen due to external factors, such as a server’s position within the rack, the socket’s position relative to the airflow, whatever.

    The key is that this should only apply to “throttling” type scenarios, and not to the usual operating mode that normally applies to servers. So you often won’t see any differences since you often don’t see throttling.

    For thin laptops and so on, you might be “always throttling” so it’s important there – but the external factors on laptops are huge and probably overwhelm any chip-to-chip variation: if you have it on your lap, the outside temperature, whether the GPU is being used, if the bottom vents are blocked, etc.

    About binning in general, remember that Intel is producing something like 20+ Xeon models from only 3 different dies, so there is already a huge opportunity for binning between the models since one die can make many different chips depending on its frequency response characteristics and any faulty cores.

    1. A comprehensive reply, thanks! I agree, and I think your mental model matches mine!

      “I have never seen published minimum values on a per-core basis. The microway link just uses the same minimum “base frequency” for the single-core and all-cores case, which is especially unrealistic (i.e., will never happen unless you put your CPU in an oven) for the single-core case. If you have a link to minimum per-core frequencies published by Intel, I’d like to see it!”

      Sorry I realised Microway didn’t publish it for the Skylake line (not sure why this is the case). In my experiments I was comparing it to a Broadwell chip for which they did:

      Under: “Top Clock Speeds for Specific Core Counts”

      I see no reason why the situation is not similar for Skylake and newer models.

      The key point from your post that resonated with me is that Intel has to be flexible. They do not know where this chip will end up. It might end up living the life of luxury in a watercooled rack on its own or crammed in to an overheating server next to 20 others! That’s why I find it hard to believe this “license” idea works. It’s impossible to prescribe a frequency for a certain instruction unless you know the operating conditions.

      1. “That’s why I find it hard to believe this “license” idea works. It’s impossible to prescribe a frequency for a certain instruction unless you know the operating conditions.”

        To be clear, I think the license idea works, and while it won’t always be the thing that determines the actual speed, in many situations it will often or nearly always be the thing that determines the speed, to the extent that for your analysis you can use the simplified model where the license model is the only thing that exists (this is particularly true for the server chips we are discussing).

        Most of the buyers who shell out thousands for a Skylake Xeon are likely to be professionals who will install it in a properly cooled rack, I think. In fact, the primary consumers to date are no doubt the big 3 (ish) cloud providers (Amazon, Google, Microsoft).

        Note that the license model isn’t something that we made up: Intel documents it themselves and these terms come straight from their documents. They even have performance counters in SKX that will tell you exactly what fraction of the time you spent in license 0, 1 or 2 as mentioned above (the CORE_POWER.LVL0_TURBO_LICENSE and related events).

    2. You understand that I was one of the performance Architect of Core and that I was known to be a very good software optimizer, just to make sure before we go forward.

      so, yes, there are hard limits, and it is not up for discussion, that are fact you can find here: https://en.wikichip.org/wiki/intel/xeon_gold/6154 (For the hard limits)

      The PCU does try to get you to your maximum TDP to maximize performance, so, if your instruction stream does not include any particular instruction set, you are likely to end up higher than the supposed base frequency. For example, when you run cinebench, you do run one or two bins higher than the 3.7 GHz that Xeon is supposed to be limited to, at the beginning; later it may drop, depending on whether your cooling can keep up with the heat produced.
      Then, when you get to AVX2 or AVX512, you have the same mechanism in place: the PCU knows the floor of the respective instruction set, based on how many cores are active and how dense your instruction stream is.
      Then the PCU will regulate: if you are using 512-bit SIMD adds, for example, you will not drop to the minimum 2.8 GHz, you will operate around 3.2 GHz (just tested it).
      If you add FMAs with no dependencies between the two FMAs, in 512 bits, you will go down to 2.8 GHz (just tested it too).

      Those mechanisms are working this way, any other way to look at this is voodoo stuff.
      For the fun of it, here is a video of me speaking about a top-end Intel config when I was working there: https://www.youtube.com/watch?v=2W_79ZUyYWw

      1. I think we disagree, but it’s hard for me to be certain, because I’m not totally clear on what you are claiming – so it’s possible we don’t disagree at all.

        When you say “so, if your instruction stream does not include any particular instruction set, you are likely to end up higher than the supposed base frequency, for example, when you run cinebench, you do run one or 2 bins higher than the 3.7 GHz that Xeon is supposed to be limited to”, in the reference to “supposed base frequency” are you talking about the turbo frequencies (which are 3.7, 3.6, 3.5 GHz for 1 active core for L0, L1, L2 respectively for the Gold 6154 you linked to), or are you talking about the “base” frequency of 3.0, 2.6, 2.1 GHz given in the base column?

        If you are talking about the latter (“base frequency”) then I think we agree on at least that part: the CPU will generally be running at one of the turbo speeds, which are greater than the base frequency. Something would have to be quite unusual (configuration-wise or hardware-wise) for the chip to run in a sustained manner only at the base frequency, which is much lower than even the worst-case turbo frequencies.

        If you are talking about the former (turbo frequencies), then your claim is that the chip can run for some period above the turbo frequency for its license, right? That seems remarkable (and in direct contradiction to your previous paragraph where you say these are “hard limits”) – and I’d like to see a reproducible test that shows it! You can start with avx-turbo as a base or do it from scratch, or use existing tools – just make it open and reproducible.

        The only time-based effect I’m aware of that kind of aligns with your description is the ability to exceed the long-term TDP threshold for various time periods, e.g., run up to 140W on a 100W TDP chip for 1 second and up to 125W for 14 seconds or something like that. That’s a separate mechanism and it doesn’t let you go above the turbo frequency matrix, it just lets you exceed the TDP for a short period, effectively using the thermal mass of the chip and cooling solution as a buffer to absorb heat above the long-term cooling capability. These values are configurable in the BIOS/firmware. This feature is especially useful and commonly triggered in low-TDP chips like < 40W thin-and-light laptop chips.
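The time-limited TDP excursion described above can be sketched as a running-average power budget. This is only an illustrative model: it assumes the long-term limit is enforced on an exponentially weighted moving average of power and the short-term limit on instantaneous power, which is a simplification chosen for this sketch and not a description of Intel's actual algorithm. The function name, parameters and numbers are all hypothetical, loosely echoing the 100W and 125W figures in the comment.

```python
def simulate_power_budget(requested_w, long_term_w, short_term_w, tau_s,
                          dt_s=1.0, start_avg_w=0.0):
    """Return the power granted at each time step, in watts.

    Assumed model (not Intel's documented algorithm):
      * instantaneous power may never exceed short_term_w;
      * a moving average of power, with time constant tau_s,
        may never exceed long_term_w.
    """
    if not 0 < dt_s <= tau_s:
        raise ValueError("need 0 < dt_s <= tau_s")
    alpha = dt_s / tau_s          # weight of the newest sample
    avg_w = start_avg_w           # running average, assumed <= long_term_w
    granted = []
    for request in requested_w:
        power = min(request, short_term_w)
        if avg_w + alpha * (power - avg_w) > long_term_w:
            # Largest power that leaves the average exactly at the limit.
            power = max(0.0, avg_w + (long_term_w - avg_w) / alpha)
        avg_w += alpha * (power - avg_w)
        granted.append(power)
    return granted


# Hypothetical scenario: a load asking for 140 W on a chip with a
# 100 W long-term limit and a 125 W short-term limit.
granted = simulate_power_budget([140.0] * 60, long_term_w=100.0,
                                short_term_w=125.0, tau_s=14.0,
                                start_avg_w=50.0)
print([round(p, 1) for p in granted[:3]], "...", round(granted[-1], 1))
```

In this toy model the load runs at the short-term limit for a while and then settles at the long-term limit. Nothing in it raises the frequency ceiling; it only governs how long power may stay above the long-term TDP.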

        Then the PCU will regulate: if you are using 512-bit SIMD adds, for
        example, you will not drop to the minimum 2.8 GHz, you will operate
        around 3.2 GHz (just tested it)

        Can you share your test? Or at least describe the inner loop in terms of what type of adds and how they were dependent. This wouldn’t be surprising – as described above, most AVX-512 kernels will run in the L1 license, which is 3.3 GHz: only if you have a “dense enough” sequence of heavy AVX-512 instructions will you drop to L2. I wouldn’t be surprised to find out that the measurement of “dense enough” depends in a fine-grained way on the actual instructions. I would be surprised if the CPU uses a frequency higher than the turbo frequency for the current license and active core count. I would also be surprised if the CPU dynamically selects a frequency lower than the max-turbo frequency, except where max TDP throttling or another type of throttling is occurring. Perhaps if you run the 6154 at 18 active cores with a heavy AVX-512 load you get TDP throttling, and in that case I’d completely agree that you can see lower frequencies: but I haven’t run into TDP throttling on the server chips I’ve tested (which don’t include the 6154) or that other people have tested with avx-turbo and provided the results.

        Those mechanisms are working this way, any other way to look at this
        is voodoo stuff.

        You should be more specific about what you disagree with (and try to use precise language in order to reduce misunderstandings) – but I’m quite convinced it’s not voodoo. This post is just a restatement of how Intel themselves describe this working – both in marketing materials and technical documents. They themselves publish the frequency matrices! They invented terms we use here like “license” and they have performance counters which use these terms and show you exactly how many cycles you spend running in each license.

        More importantly, our testing code is completely open and our results can be reproduced by anyone. Our results line up exactly with how Intel describes the system working, and are generally precisely reproducible. I welcome you to provide reproducible evidence to the contrary (and first to be specific about what you disagree about), rather than vague appeals to authority or YouTube links.

  5. I spent a week collecting all the public info available on this topic online, to make sure I was not pushing Intel confidential info, and I did not:

    I think you do not understand Turbo Max 3.0 at all … so I recommend that you go and read the attached link (see here that Xeon Gold supports Turbo 3.0: https://ark.intel.com/products/120492/Intel-Xeon-Gold-6130-Processor-22M-Cache-2_10-GHz )

    Turbo 3.0 Max in more detail:

    Then, please read about Speedshift:
    https://www.anandtech.com/show/11550/the-intel-skylakex-review-core-i9-7900x-i7-7820x-and-i7-7800x-tested/7 (supported by Xeon Gold too)

    Then, read this:
    https://www.anandtech.com/show/11550/the-intel-skylakex-review-core-i9-7900x-i7-7820x-and-i7-7800x-tested/7 (the code in this article was written by a fellow Principal Engineer in my group at the time)

    Now, important point to notice in the definition of turbo boost:

    “Availability and frequency upside of Intel® Turbo Boost Technology 2.0 state depends upon a number of factors including, but not limited to, the following:
    Type of workload
    Number of active cores
    Estimated current consumption
    Estimated power consumption
    Processor temperature”

    AND PLEASE NOTICE “but not limited to” part of it …

    Then, when you are done with this, you have to go and carefully read how Turbo 2.0 works, here: https://www.hotchips.org/wp-content/uploads/hc_archives/hc23/HC23.19.9-Desktop-CPUs/HC23.19.921.SandyBridge_Power_10-Rotem-Intel.pdf

    Then, understand that Turbo 3.0, SpeedShift/SpeedStep and Turbo 2.0 are stacked on top of each other.

    So, AVX512 has the transitional mode that you keep denying exists. You can run a medium-density AVX512 instruction stream and not end up at the lowest frequency attributed to AVX512 by the frequency table. This is how it works, and if you do not agree with all of that, well, I can’t help you.
