Memory parallelism: AMD Rome versus Intel

When thinking about “parallelism”, most programmers think about having multiple processors. However, even a single core in a modern processor has plenty of parallelism. It can execute many instructions per cycle and, importantly, it can issue multiple memory requests concurrently.

Our processors are becoming “more parallel” over time, as is evident from the recent increases in the number of cores. AMD sells 64-core processors. But each individual core is also becoming “more parallel”.

To demonstrate, let me use a memory access test.

The starting point is a shuffled array. You access one entry in the array and read its value, which points to the next location in the array. You get a random walk through the array. The test terminates when you have visited every location in the array.

Then you can “parallelize” this problem. Divide the random walk into two equal-size paths. Or divide it into 10 equal-size paths.

If you can issue only one memory request at any one time, parallelizing the problem won’t help. However, processors can issue more than one memory request, so as you parallelize the problem, your running times get smaller and your effective bandwidth gets higher.

How has this evolved over time? The very latest Intel processors (e.g., Cannon Lake) can sustain more than 20 memory requests at any one time. It is about twice what the prior generation (Skylake) could do. How do the latest AMD processors fare? About the same. They can sustain nearly 20 concurrent memory requests at any one time. AMD does not quite scale as well as Intel, but it is close. In these tests, I am hitting RAM: the array is larger than the CPU cache. I am also using huge pages.

The important lesson is that if you are thinking about your computer as a sequential machine, you can be dramatically wrong, even if you are just using one core.

And there are direct consequences. It appears that many technical interviews for engineering positions have to do with linked lists. In a linked list, you access the elements of the list one by one, as the location of the next entry is always coded in the current entry and nowhere else. Obviously, this is a potential performance problem because it makes it hard to exploit memory-level parallelism. And the larger your list and the more recent your processor, the worse it gets.

I make the raw results available. I use the testingmlp software package.

Published by

Daniel Lemire

A computer science professor at the University of Quebec (TELUQ).

17 thoughts on “Memory parallelism: AMD Rome versus Intel”

  1. Very interesting. A lot has changed since I studied algorithmics. It is not easy to keep track of technological advances and their impacts on how to write fast code, even more so when programming is not a central part of one’s job.

    Is the graph from a paper, or is it based on an ad-hoc test you made for this blog post specifically?

  2. I somewhat doubt you were testing Cannon Lake…? That’s a broken CPU shipped in very low quantities and by now EOL. Did you mean Ice Lake?

  3. Very informative and the code looks great. In my experience profiling on a SandyBridge CPU, I saw what looked like a memory performance bottleneck. Now I have a good way to compare. It looks like C++11 is a requirement, so will use on newer machines.

  4. How does memory parallelism relate to SIMD parallelism? i.e. Would a SIMD instruction only need 1 read to access a chunk of data.

    On the CPU, are multiple nearby memory requests coalesced into a single read?

    1. Reading into a SIMD register counts as issuing a single instruction and loads more data than is possible with a general-purpose register.

      But it is not clear to me how this relates to the numbers I provide here: you access cache lines (typically 64 bytes) even if you only need 1 byte.

  5. That’s a good result from Zen2. I wonder if there was an improvement here over Zen? I thought Zen topped out at around 12-16, but I could be misremembering.

    In fact we can’t even tell from that chart where Zen2 (or CNL, probably) even tops out.

    Do I read the chart correctly that all 3 systems have nearly identical single-lane throughput, or was it normalized somehow?

    1. @Travis I posted the raw results. See the end of the post. No, there was no normalization: this is the computed bandwidth.

      I think that there are differences in single-lane bandwidth, though it is not large. You can see it just with the plot (the lines don’t quite overlap).

      1. Ah, I see the raw results. There is a flaw with using clock() on some systems and it is evident here: the resolution is very low. I had it on my TODO to fix it, but never got around to it, I guess.

      2. Unless I am misreading something, the data does not correspond to the chart? E.g., the last three BW values for Rome are all identical (9752) for 23, 24, 25 but the purple line in the chart is clearly different (and seems to be > 9752) as it does not flatline for those points.

          1. I’m working on some updates to the tool that, among other things, avoid the poor resolution of clock() on CentOS, so the chart won’t be so quantized.

  6. MLP analysis has made it into the mainstream tech press; see for example this Zen review by Andrei and Gavin at AnandTech.

    I like the way the chart is made, across various sizes and normalized to the 1-mlp speed (so the y-axis is “speedup relative to 1-mlp”).

    Here are some more charts in this vein that I made just now. Andrei claims that SKX can reach much more than 10 MLP, e.g., based on this chart which shows it hitting speedups of more than 25x, but I have to think this is measurement error. I couldn’t replicate it (admittedly on different-but-still-SKX hardware).
