Instructions per cycle: AMD Zen 2 versus Intel

The performance of a processor is determined by several factors. For example, processors with a higher frequency tend to do more work per unit of time. However, physics makes it difficult to produce processors with ever higher frequencies.

Modern processors can execute several instructions per cycle. Thus a 3.4GHz processor goes through 3.4 billion cycles per second, but it might easily execute 7 billion instructions per second on a single core, that is, about two instructions per cycle.

Up until recently, Intel produced the best commodity processors: its processors had the highest frequencies, the most instructions per cycle, the most powerful instructions and so forth. However, Intel is increasingly being challenged. One smaller company that is making Intel nervous is AMD.

It has been reported that the most recent AMD processors surpass Intel in terms of instructions per cycle. However, it is not clear whether these reports are genuinely based on measures of instructions per cycle. Rather, it appears that they are measures of the amount of work done per unit of time, normalized by processor frequency.

In theory, a processor limited to one instruction per cycle could beat a modern Intel processor on many tasks if it had powerful enough instructions and faster data access. Thus “work per unit of time normalized per CPU frequency” and “instructions per cycle” are distinct notions.
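To make the distinction concrete, here is a toy illustration of mine (not taken from any benchmark in this post): both functions below sum the same array, but the AVX2 version retires roughly eight times fewer arithmetic instructions. It can therefore do more work per cycle even if its measured number of instructions per cycle is no higher.

#include <cstddef>
#include <cstdint>
#include <immintrin.h> // AVX2 intrinsics (compile with -mavx2)

// One 32-bit addition per loop iteration: many instructions, little work each.
uint32_t sum_scalar(const uint32_t *data, size_t n) {
  uint32_t sum = 0;
  for (size_t i = 0; i < n; i++) sum += data[i];
  return sum;
}

// Eight 32-bit additions per vector instruction: fewer instructions, more work each.
// Assumes n is a multiple of 8 for simplicity.
uint32_t sum_avx2(const uint32_t *data, size_t n) {
  __m256i acc = _mm256_setzero_si256();
  for (size_t i = 0; i < n; i += 8) {
    __m256i chunk = _mm256_loadu_si256(reinterpret_cast<const __m256i *>(data + i));
    acc = _mm256_add_epi32(acc, chunk);
  }
  uint32_t lanes[8];
  _mm256_storeu_si256(reinterpret_cast<__m256i *>(lanes), acc);
  uint32_t sum = 0;
  for (int j = 0; j < 8; j++) sum += lanes[j];
  return sum;
}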

I have only had access to a recent AMD processor (Zen 2) for a short time, but in this short time, I have routinely found that it has a lower number of instructions per cycle than even older Intel processors.

Let us consider a piece of software that has a high number of instructions per cycle: the fast JSON parser simdjson. I use GNU GCC 8 under Linux and process a test file called twitter.json with the benchmark command-line tool parse. I record the number of instructions per cycle, as measured by CPU counters, in the two stages of processing; together, the two stages parse a JSON document. This is an instruction-heavy benchmark: the numbers of mispredicted branches and cache misses are small. Among the processors tested, the Skylake processor has the highest frequency. The AMD processor is a Rome (server) part.
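For readers who want to check such numbers themselves, here is a minimal sketch of how one might read the relevant hardware counters on Linux with perf_event_open. It illustrates the general approach, not the actual harness used for these measurements, and the loop is just a placeholder workload.

#include <asm/unistd.h>        // __NR_perf_event_open
#include <linux/perf_event.h>  // perf_event_attr, PERF_COUNT_HW_*
#include <sys/ioctl.h>
#include <unistd.h>
#include <cstdint>
#include <cstdio>

// Open one hardware counter (instructions or cycles) for the current thread.
static int open_counter(uint64_t config) {
  perf_event_attr attr{};
  attr.type = PERF_TYPE_HARDWARE;
  attr.size = sizeof(attr);
  attr.config = config;
  attr.disabled = 1;
  attr.exclude_kernel = 1;
  attr.exclude_hv = 1;
  return static_cast<int>(syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0));
}

int main() {
  int instr_fd = open_counter(PERF_COUNT_HW_INSTRUCTIONS);
  int cycle_fd = open_counter(PERF_COUNT_HW_CPU_CYCLES);
  if (instr_fd < 0 || cycle_fd < 0) { perror("perf_event_open"); return 1; }

  ioctl(instr_fd, PERF_EVENT_IOC_RESET, 0);
  ioctl(cycle_fd, PERF_EVENT_IOC_RESET, 0);
  ioctl(instr_fd, PERF_EVENT_IOC_ENABLE, 0);
  ioctl(cycle_fd, PERF_EVENT_IOC_ENABLE, 0);

  volatile uint64_t sum = 0;  // placeholder workload: replace with the code under test
  for (uint64_t i = 0; i < 100000000; i++) sum += i;

  ioctl(instr_fd, PERF_EVENT_IOC_DISABLE, 0);
  ioctl(cycle_fd, PERF_EVENT_IOC_DISABLE, 0);

  uint64_t instructions = 0, cycles = 0;
  read(instr_fd, &instructions, sizeof(instructions));
  read(cycle_fd, &cycles, sizeof(cycles));
  printf("instructions per cycle: %.2f\n", double(instructions) / double(cycles));
  return 0;
}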

I find that AMD is about 10% to 15% behind Intel.

processor                  IPC (stage 1)   IPC (stage 2)
Intel Skylake (2015)       3.5             3.0
Intel Cannon Lake (2018)   3.6             3.1
Zen 2 (2019)               3.2             2.8

Another problem that I like is bitset decoding: given an array of bits (0s and 1s), I want to find the locations of the ones. See my blog post Really fast bitset decoding for “average” densities. I benchmark just the “basic” decoder.

#include <cstdint>
#include <immintrin.h> // for _tzcnt_u64 (requires BMI)

void basic_decoder(uint32_t *base_ptr, uint32_t &base,
                   uint32_t idx, uint64_t bits) {
  while (bits != 0) {
    // index of the lowest set bit, offset by the word's position
    base_ptr[base] = idx + _tzcnt_u64(bits);
    bits = bits & (bits - 1); // clear the lowest set bit
    base++;
  }
}

My source code is available.
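For illustration, here is a small driver of my own (not the benchmark itself) showing how basic_decoder is meant to be called: once per 64-bit word, with idx giving the bit offset of that word and base accumulating the number of indexes written so far.

#include <cstdint>
#include <cstdio>

// The decoder shown above; link against its definition.
void basic_decoder(uint32_t *base_ptr, uint32_t &base, uint32_t idx, uint64_t bits);

int main() {
  uint64_t words[2] = {0b101101ULL, 0x8000000000000001ULL};  // bits set at 0,2,3,5 and 64,127
  uint32_t indexes[128];
  uint32_t count = 0;
  for (uint32_t w = 0; w < 2; w++) {
    basic_decoder(indexes, count, 64 * w, words[w]);
  }
  for (uint32_t i = 0; i < count; i++) printf("%u ", indexes[i]);  // prints: 0 2 3 5 64 127
  printf("\n");
  return 0;
}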

processor                  IPC
Intel Skylake (2015)       2.1
Zen 2 (2019)               1.4

So AMD runs at 2/3 the IPC of an old Intel processor. That is quite poor!

Of course, your results will vary. And I am quite willing to believe that in many, maybe even most, real-world cases, AMD Zen 2 can do more work per unit of time than the best Intel processors. However, I feel that we should qualify these claims. I do not think it is entirely reasonable for AMD customers to expect better numbers of instructions per cycle on the tasks that they care about; they may even find lower numbers. AMD Zen 2 does not dominate Intel Skylake; it is more reasonable to expect rough parity.

Further reading: AMD Zen 2 and branch mispredictions

Published by

Daniel Lemire

A computer science professor at the University of Quebec (TELUQ).

62 thoughts on “Instructions per cycle: AMD Zen 2 versus Intel”

  1. The last table shows twice the IPC for the Zen 2, which contradicts your conclusion. Did you swap the two values by any chance?

  2. What is really making Intel nervous is that at each price point, the AMD processor has many more cores than the corresponding Intel processor AND consumes less power (an important issue for server farms).

  3. The second one seems to be a frustrating example of “implementational divergence due to instruction set bloat” on the AMD side IMHO. Looking up instruction throughput on Zen 2 one finds

    bsf / bsr – 3 / 4 cycles on r64
    tzcnt / lzcnt – 0.5 / 1 cycles on r64

    I’d assume that the compiler generates bsf in your benchmark – if it is the one you presented some time ago. So I am surprised that this is “only” 1/2 of Intel IPC for AMD in this case. Replacing bsf with tzcnt might reverse the situation.

  4. What model CPUs did you use for your comparison? Cache levels, clock speed, and many other factors play into CPU performance.

  5. Couple of open questions:
    – were the Spectre and following mitigations applied on both rigs? That can go a long way explaining differences in the ~10/15% range, but not a x2 factor of course

    – if the build is CPU specific, counting instructions seems like a weird way to measure performance, since as you mentioned some instructions are a lot wider than others. By this metric, an AVX512 build of a given benchmark could give pretty bad results when compared to an SSE build (which is not true with any metric that actually counts in that case, like throughput or perf/watt)

    – if the build is not CPU specific, counting instruction throughput is only interesting if the build is close enough to optimal for both, IMO. One could imagine a CPU which is very good at extracting ILP from low-performance builds, which would be a nice skill but could be useless in an HPC context, for instance.

    – you keep mentioning an “old Intel CPU”, but Skylake is basically the only available architecture for anything but some thin laptops. So it’s both “old” and “current”, which contributes to making AMD competitive

    this being said I agree with your initial point that “better IPC” claims are not really qualified. I guess that the implicit meaning is “getting more work done per clock cycle”.

    1. another point is that you discard benchmarks which are memory bound, but that goes against some other tests that you did concerning memory request parallelism, for instance. Extracting good IPC in memory-starved contexts is also meaningful, right?

      1. In memory starved contexts, the number of instructions being retired is probably not the measure you care about. Instead, you might want to report the effective bandwidth or something that has to do with the actual bottleneck.

        1. I disagree on that; it is my understanding that the IPC that manages to go through would be a good proxy for the job being done, despite the bottleneck. This whole “job being done” is, I think, the logic behind most of the “IPC” claims around.

          1. We can reason about IPC for instruction-dense code. We know what 4.0 instructions per cycle means: it is great. For instruction-dense code, 1.0 is going to be mediocre. Basically, we have a measure of how superscalar (wide) the processor is. Achieving 6 instructions per cycle in real code would be fantastic.

            For memory-bound problems, what would be a good IPC?… is 0.1 instructions per cycle good or bad? I can’t reason about it. I have some idea of what a bandwidth of 10 GB/s in random access means (it is very good).

    2. In the simdjson benchmarks above, the builds are not CPU specific. All CPUs run almost entirely the same instructions. So yes, the number of instructions retired per cycle follows closely the performance per cycle. On a per-cycle basis, in this AVX2-intensive benchmark, AMD comes in under Intel in every way.

      you keep mentioning an “old Intel CPU”, but Skylake is basically the only available architecture for anything but some thin laptops. So it’s both “old” and “current”, which contributes to making AMD competitive

      That is true.

      1. In the simdjson benchmarks above, the builds are not CPU specific. All CPUs run almost entirely the same instructions. So yes, the number of instructions retired per cycle follows closely the performance per cycle. Changing the builds (for instance -O3 vs vanilla) would change the instruction mix and throughput, all other things (task and hardware) being equal. So the correct quote is “for a given build, the number of instructions retired per cycle follows closely the performance per cycle.”, which may or may not be a good proxy for absolute performance (see AVX512 for instance)

        1. I’m just a lowly programmer, but the speed of a single gcc compile seems irrelevant to me. Both processors perform small tasks in the blink of an eye, so whether that blink is 10ms or 12ms isn’t usually very significant to me. Small compiles are pretty equivalent in terms of time required for gcc, and trying to extrapolate those small single-core compile results to a large number of cores is missing the point.

          What is relevant is that if you use a parallel make (make -j) with a large number of source files and cores like 24, 32, or 64, the AMD processor will usually beat the pants off a processor with a lower number of cores. Same with rendering and many other long tasks. That’s significant to me in my wall-clock development time.

          Sure, sometimes the AMD may be a bit slower on small tasks, but small tasks don’t take long, so a single TR core is usually only fractionally slower and works fine for small tasks anyway. TR is fast enough to game on my PC with maximum settings on almost every title at 1440p and above. Not that I game much, but it’s fine. Same for analytical graphics, they just don’t take that long on today’s Nvidia 3000-series GPUs. I’ll admit that there are some tasks where bleeding-edge core speed is important, but I’ve noticed I don’t do those things as often as I compile, for example.

      2. Keep in mind that this project (simdjson) was extensively tuned on Intel machines and then just incidentally run on AMD as a comparison. Many choices made based on benchmark results might have gone a different way on the AMD machine, so Intel-specific quirks get built in this way.

        I’m not saying it would reverse the conclusion in this case, but it’s something to remember when testing something that has been carefully tuned.

        1. Keep in mind that this project (simdjson) was extensively tuned on Intel machines and then just incidentally run on AMD as a comparison.

          That’s true, so it is a bias but I submit to you that the same bias exists on highly tuned software out there.

          Furthermore, when people say that AMD Zen 2 has superior IPC, they rarely qualify this statement by saying that it requires tuning or recompiling the software. If that’s a requirement, it should be stated.

          1. Agreed, it is a bias that applies to other software, although I suspect SIMDJson is more highly tuned than the average, so I suggest it applies more in this case.

            I don’t know about higher IPC, but when I say something like “Zen 2 has comparable IPC to Skylake” I don’t mean after recompiling. I just draw that conclusion from broad-based tests performed by others, on existing binaries without recompiling.

            The IPC relationship between two different uarches isn’t constant across benchmarks so “comparable IPC on average across a range of benchmarks” doesn’t translate to “comparable IPC on every benchmark”. Quite the opposite, I’d expect any given benchmark to show an advantage for one platform or the other since they are not the same.

              1. I can’t speak for everyone, but for my part the hype isn’t that Zen 2 has higher IPC than Intel, or that AMD has released a better uarch than Intel, but that AMD has something at least roughly comparable on average, and is making it available at prices and core counts that undercut Intel by 50% or more.

                After years of releasing the Skylake chip under a new name, and increasing the price each time, Intel has slashed the prices of many of their new chips by half compared to the old lines, and core counts on all parts are suddenly shooting up.

                That’s what’s deserving of hype, not big microarchitectural improvements. From a microarchitectural point of view, Zen and Zen 2 are in many ways Skylake (client) clones!

  6. How can I reproduce your results for the second table?

    I went into the 2019/05/03 folder, ran make, ran the resulting ./bitmapdecoding binary, and looked at the reported “instructions per cycle” value.

    I consistently get IPC 1.76 or 1.77 for Intel and 1.43 or 1.44 for AMD Zen 2 (on skylake-x and rome servers, respectively).

    1. I tried on Skylake server (rather than Skylake-X) and got an IPC of 2.00, which is closer but still not 2.8.

      It’s weird there is such a difference between SKL and SKX here.

      1. Some time ago, I revised the post to 2.1 from 2.8. You have access to my skylake-x box.

        $ ./dockerscript.sh
        x86_64
        $uname_p is [x86_64]
        rm -r -f bitmapdecoding bitmapdecoding.s bitmapdecoding_countbranch sanibitmapdecoding
        Sending build context to Docker daemon  4.946MB
        Step 1/6 : FROM gcc:9.1
         ---> c7637321bf71
        Step 2/6 : COPY . /usr/src/myapp
         ---> Using cache
         ---> e09e94f2e750
        Step 3/6 : WORKDIR /usr/src/myapp
         ---> Using cache
         ---> c41c4c3dcc24
        Step 4/6 : RUN make clean
         ---> Using cache
         ---> a9a9b7ea79ba
        Step 5/6 : RUN make
         ---> Using cache
         ---> 4fbe6b324968
        Step 6/6 : CMD ["./bitmapdecoding", "test"]
         ---> Using cache
         ---> 969acefc46b8
        Successfully built 969acefc46b8
        Successfully tagged my-gcc-app:latest
        just scanning:
        bogus
        .matches = 129996 words = 21322 1-bit density 9.526 % 1-bit per word 6.097
        bytes per index = 10.497
        instructions per cycle 3.85, cycles per value set:  0.341, instructions per value set: 1.312, cycles per word: 2.078, instructions per word: 8.001
         cycles per input byte 0.03 instructions per input byte 0.13
        min:    44301 cycles,   170595 instructions,           2 branch mis.,        4 cache ref.,        2 cache mis.
        avg:  57490.1 cycles, 170607.6 instructions,         3.0 branch mis.,    960.9 cache ref.,     75.0 cache mis.
        
        simdjson_decoder:
        Tests passed.
        matches = 129996 words = 21322 1-bit density 9.526 % 1-bit per word 6.097
        bytes per index = 10.497
        instructions per cycle 2.48, cycles per value set:  4.018, instructions per value set: 9.974, cycles per word: 24.499, instructions per word: 60.812
         cycles per input byte 0.38 instructions per input byte 0.95
        min:   522358 cycles,  1296638 instructions,        8171 branch mis.,      254 cache ref.,        0 cache mis.
        avg: 536409.7 cycles, 1296650.8 instructions,     8351.8 branch mis.,   1111.6 cache ref.,      5.7 cache mis.
        
        Tests passed.
        basic_decoder:
        matches = 129996 words = 21322 1-bit density 9.526 % 1-bit per word 6.097
        bytes per index = 10.497
        instructions per cycle 2.06, cycles per value set:  4.499, instructions per value set: 9.283, cycles per word: 27.432, instructions per word: 56.599
         cycles per input byte 0.43 instructions per input byte 0.88
        min:   584913 cycles,  1206795 instructions,       14050 branch mis.,      251 cache ref.,        0 cache mis.
        avg: 592380.1 cycles, 1206795.0 instructions,    14329.2 branch mis.,   1269.4 cache ref.,     79.9 cache mis.
        
  7. Tests seem quite vague to be honest: nothing about the specific processors being used, clock speeds, cache sizes, etc.

    Specific instruction sets matter. If using AVX2 workloads, Intel would come out on top every time as it supports 256-bit or 512-bit AVX while Ryzen only supports 128-bit AVX, so in that case, yes, Intel would come out on top and by quite a bit. Intel designed it, and although AMD can use it due to their licence agreement, incorporating it takes a long time and typically an architecture change; it is not something that can just be added. Clock speed boosts vary quite a lot with load: if the load is short, Intel will boost to maximum clock speed and never stabilise at the lower frequency you would normally see after 30 seconds, which can invalidate the results, along with motherboards that may enable MCE as always on.

    AMD doesn’t do its own instruction sets anymore, as applications are typically geared towards the most common ones: MMX was preferred over AMD 3DNow! even if 3DNow! was more efficient, because Intel, then as now, had the bigger market, so it is not worth investing in a specific instruction set when the devices supporting it are limited.

    Ryzen boost clocks vary quite a bit depending on the background tasks running: it could be 4.6GHz or it could be 4GHz. Due to the nature of the architecture and the way it boosts, a core can only maintain that frequency for a short time, less than a second, before the load is moved over to another core; this involves flushing the data from the L1 and L2 caches, moving over to the other core, then boosting the new core at a higher frequency to continue the task. This can add latency, but it is quite small as the move would typically be within the same CCX.

    When people go on about Zen 2 having a higher IPC, they mean it apples to apples, i.e. all CPUs running at the same clock speed, to see what architectural differences distinguish them at a given clock frequency, without random boost clock speeds, and using more generalised instruction sets like SSE4 so the results are not skewed. Boost behaviour varies with CPU temperature, voltage, current ripple, VRM temperatures, even ambient temperature. If you can’t set a baseline and have so many variables in your results, then the end result is also useless.

    1. Tests seem quite vague to be honest: nothing about the specific processors being used, clock speeds, cache sizes, etc.

      These are not memory-bound tests. The processor with the highest frequency in these tests is Skylake. Given that memory access is not a significant burden here, and that we report “per cycle” instructions, it is ok not to mention frequency. But if we do, then the Intel Skylake processor is maybe at a disadvantage.

      Specific instruction sets matter. If using AVX2 workloads, Intel would come out on top every time as it supports 256-bit or 512-bit AVX while Ryzen only supports 128-bit AVX, so in that case, yes, Intel would come out on top and by quite a bit.

      No, none of this code uses AVX-512.

      If you can’t set a baseline and have so many variables in your results, then the end result is also useless

      I disagree. I can reliably measure how many cycles a computationally intensive task takes. Yes, if there are expensive cache misses, then we have an issue, but it is not the case here.

  8. Those influencers online post stuff without specifying a lot of things. Short boost clocks are extremely important in your test and that is what Intel CPUs are good at. If you think that only two tests are important you are wrong again. If there was the huge gap you describe between these CPUs, everybody would notice it and Intel wouldn’t lower their prices to keep selling.

    1. Short boost clocks are extremely important in your test and that is what Intel CPUs are good at.

      Short boost in the clock frequency would not be relevant. If anything, as Yoav pointed out in another comment, higher frequencies in Intel would mean lower IPC whenever memory latency is at issue.

      If you think that only two tests are important you are wrong again. If there was the huge gap you describe between these CPUs, everybody would notice it and Intel wouldn’t lower their prices to keep selling.

      Please read my post again. I am explicit in stating that I believe AMD probably has better processors than Intel at this point. All I am saying is that we should qualify these statements.

      1. I think Darien understands that it is the same binary, and is asking a different question. Since you used “-march=native”, you might get a different compilation depending on whether you generated the binary on AMD or Intel. In theory, this binary might always be faster on the machine it is compiled on than on the opposite machine. In practice, the assembly here is straightforward enough that this is unlikely the case. But it’s still a question worth asking, and worth answering.

  9. Frequency is very important when measuring IPC. This is because memory latency doesn’t scale with frequency, so a higher frequency usually means a lower IPC.
    Also, the memory speed is very important.

    1. @Yoav Memory latency is not the issue here. This being said, the Intel Skylake processor has a higher frequency so if there is any frequency-related bias, it would be favorable to AMD Rome.

  10. These results are really odd. The Zen core has much wider decoding and far more pipelines that can complete instructions. Did you optimize for both systems or just the Skylake one?

          1. Then you don’t understand microarchitecture. Nehalem has the same core layout and scheduler layout as Skylake and Cannon Lake. If it’s optimized for one, it’s optimized for all of them at the base level. Zen has a very different layout for both core and scheduler.

            Skylake has 5 decoders and Zen has 4, but Zen can pack up to 8 instructions for those 4 and Skylake only 6. And that’s just the first example.

            1. simdjson is an open project and we could use help from someone who can optimize the code for Zen architectures. Please help out.

              Even on AMD, it is the fastest JSON parser in existence as far as I know.

            2. Zen and Skylake are way more similar than either of those are to Nehalem.

              In general though compilers favor code that is faster on Intel than AMD.

  11. Who’s claiming Zen 2 has beat Intel on IPC? All the benchmarks I’ve seen put the top mainstream i9 9900 and/or 9700 on top, even against the mighty 3950X. The claim I’ve seen is that AMD has narrowed the IPC gap significantly while destroying Intel on multithreaded tasks by a very, very large margin. It is my understanding that the IPC gap between Zen 2 and 9th gen is small enough that it’s a better value to go with a Zen 2 for a more robust CPU if you are a mixed-usage user: content creation and gaming, etc.

    Zen 2 is not exactly cheap either. Platform costs (X570) and memory costs (higher-bandwidth memory) make it a bit more expensive than an i9. The good news is that you can pop a Zen 2 into an X370 mobo…

    It’s an exciting time for the PC market. Competition drives innovation!

  12. This article feels so unfinished.
    Nothing is specified about the test setup, and answers are always avoided.
    What specs (chipset, CPU, RAM, OS)? What settings for CPU and RAM?
    Why such specific software and such a small selection? Maybe include actual performance (not IPC) at the same clock speed.

    1. This blog post is specifically about IPC, not performance (please see the title).

      The microarchitectures are specified; RAM and operating systems are not relevant.

      Why such specific software and such a small selection?

      Because this is the software I care about. You will undoubtedly run different code and software. That’s fine.

  13. Came across this article on HN,

    The problem is Guru3D: they are not a technical site, so they got IPC wrong. But IPC in everybody’s (or normal people’s) terms is exactly what they / you describe, work per clock, so in that sense it is right for their target audience. (Maybe it should be called PPC, performance per clock.)

    Having said that, even the AMD-biased sites and fans don’t ever claim Zen 2 has better single-core / single-thread / IPC performance than Intel. Having better IPC / PPC is absolutely not the mainstream sentiment. As a matter of fact, this is the first time I have heard of it, having casually surveyed a dozen tech sites and social media.

    1. See, it’s not really the sites that are at fault, it’s the manufacturers that indicate what the “IPC” uplift is, and this is before clock speeds are taken into account, because they are at a stage in development where clock speeds have not yet been finalised. Thus sites like Guru3D etc. set a baseline and confirm whether a specific manufacturer is accurate in their assessment. Now maybe PPC would be a better term, and neither AMD nor Intel has tried to diverge from it.

      Intel will typically claim a 2% IPC gain before frequency, as does AMD. While Intel’s typical 2% is within the margin of error and people are like meh, no one really cares at that point, some people do indeed test it and do typically see a 2% uplift. They do it like this because both Intel and AMD are competitors, yet at the same time they want people to be interested in their product while not giving away the full performance that frequency contributes.

      People are more interested in AMD: when Zen first launched it claimed 52% over Piledriver, then Zen+ 5%, Zen 2 13%, and the upcoming Zen 3 15%, with the percentages from Zen+ onward being relative to first-gen Zen; these are sizeable gains, outside the margin of error. Then of course there is Intel and their Ice Lake’s 18% IPC uplift. Sites test the devices to determine whether such claims are true; in the past they have been over-inflated, but thus far they have been accurate, at least for the Zen microarchitecture. Thus this is how a majority of people now understand what IPC is.

  14. I’m not necessarily a mega AMD fanboi and tend to suspect that the underlying point may still hold up. But this article feels very dubious since it carefully selected two benchmarks that would specifically be MOSTLY using AVX512 instructions on Intel, which AMD doesn’t have implemented. I’d like to see the performance difference using only x86-64 instructions with no SSE/AVX.

  15. Since tech sites compare Intel/AMD CPU performance, they are talking about the same binary file (and the same list of instructions to be executed), running on different x86-64 compatible CPUs.

    In this specific but very common scenario, for any benchmark with a fixed amount of work and a fixed instruction count, “work per unit of time normalized per CPU frequency” is proportional to “instructions per cycle” (the ratio is fixed by the instruction count of the work). The two measures behave the same way when comparing architectures.

    So I suppose tech sites are not wrong. You and they just use different benchmarks and get different results.

    1. If you present a plot where on the y-axis you claim to present the number of instructions per cycle and you give some other number, you are making a mistake.

      For many benchmarks, the number of instructions is not the proper measure.

      Furthermore, even if you have the same binary, there is no reason to think that the processors will execute the same instructions. Branch predictors and differences in ISA can trigger different code paths.

  16. From where I stand, work per unit of time normalized per CPU frequency is the only consequential measure of IPC. IPC numbers offered by manufacturers only tell us whether later generations of CPUs achieve higher IPC than earlier generations. That is a relative measure only. It is nice to know that IPC is improving over successive years by this or that percentage, but putting meat on the bones of IPC involves determining what those instructions are worth. You can only discover that by running appropriate benchmarks. And that is why I say that work per unit of time normalized per CPU frequency is the more substantial way of thinking about IPC, whereas manufacturers’ numbers hold less significance.
