One of the most expensive operations in a processor and memory system is a random memory access: reading a value in memory can take tens of nanoseconds on average, or more. If you are waiting on the memory content before you can act, your processor is effectively stalled. While our processors are generally getting faster, memory latency has not improved quickly, and it can even be higher on some of the most expensive processors. For this reason, a modern processor core can issue multiple memory requests at a given time. That is, the processor tries to load one memory element, keeps going, and can issue another load (even though the previous load is not completed), and so forth. Not long ago, Intel processor cores could support about 10 independent memory requests at a time. I benchmarked some small ARM cores that could barely issue 4 memory requests.
Today, the story is much nicer. The powerful processor cores can all sustain many memory requests. They support better memory-level parallelism.
To measure this aspect of processor performance, we use a pointer-chasing scheme: a C program loads a memory address whose content is the next memory address to load, and so forth. If a processor could only sustain a single memory request at a time, such a test would use all available resources. We then modify this test so that we have two interleaved pointer-chasing schemes, then three, then four, and so forth. We call each new interleaved pointer-chasing component a ‘lane’.
As you add more lanes, you should see better performance, up to a maximum. The faster the performance goes up as you add lanes, the more memory-level parallelism your processor core has. The best Amazon (AWS) servers come with either Intel Ice Lake or Amazon’s very own Graviton 3 processors. I benchmarked both, using one core of each type. The Intel processor has the upper hand in absolute terms: we achieve a maximal bandwidth of 12 GB/s compared to 9 GB/s for the Graviton 3. The one-lane latency is 120 ns on the Graviton 3 server versus 90 ns on the Intel processor. The Graviton 3 appears to sustain about 19 simultaneous loads per core against about 25 for the Intel processor.
Thus Intel wins, but the Graviton 3 has nice memory-level parallelism… much better than the older Intel chips (e.g., Skylake) and much better than the early attempts at ARM-based servers.
The source code is available. I am using Ubuntu 22.04 and GCC 11. All machines have small page sizes (4kB). I chose not to tweak the page size for these experiments.
Prices for the Graviton 3 are US$2.32/hour (64 vCPUs) compared to US$2.448/hour for Ice Lake. So the Graviton 3 appears to be marginally cheaper than the Intel chips.
When I write these posts, comparing one product to another, there is always hate mail afterward. So let me be blunt. I love all chips equally.
If you want to know which system is best for your application: run benchmarks. Comprehensive benchmarks found that Amazon’s ARM hardware could be advantageous for storage-intensive tasks.
Further reading: I enjoyed Graviton 3: First Impressions.
5 thoughts on “Memory-level parallelism : Intel Ice Lake versus Amazon Graviton 3”
Everything has a downside, though: sometimes, the CPU does not yet know whether it needs to access certain memory (e.g., if the code is behind a conditional branch whose outcome depends on another memory access that is still in progress).
Thus, the CPU speculatively reads the memory anyway, knowing the read may be wasted, but gaining speed in the other case.
As the world is not ideal, those speculative reads have side effects (e.g., the timing of later reads changes depending on whether the data is already in the cache). This has been exploited; google the “Meltdown” and “Spectre” vulnerabilities.
(The explanation here is simplified.)
Hi Daniel, the Graviton 3 uses DDR5 RAM, and the Ice Lake uses DDR4. How do you think that shapes these results?
Interesting bit here: “Out in memory, Graviton 3 noticeably regress in latency compared to Ampere Altra and Graviton 2. That’s likely due to DDR5, which has worse latency characteristics than DDR4. Graviton 3 also places memory controllers on separate IO chiplets. That could exacerbate DDR5’s latency issues.”
I remember this paper on page sizes. I wonder what the impact would be on this kind of pointer chasing test:
P. Weisberg and Y. Wiseman, “Using 4KB page size for Virtual Memory is obsolete,” 2009 IEEE International Conference on Information Reuse & Integration, 2009, pp. 262-265, doi: 10.1109/IRI.2009.5211562.
My post has this comment: “I chose not to tweak the page size for these experiments.” We know that increasing the page size would improve matters. I chose not to play with it.
I would agree that 4kB should be obsolete, but it is not up to me to change the default.
Regarding DDR5… this might very well be a factor in the random-access bandwidth. It is likely that the Intel servers have mature DDR4 with low latency.
Yes, I was responding to your comment about 4 KB pages; I should’ve made that clear. Since it’s just a software setting, I’m not sure why we’d care about defaults. Lots of people run Linux servers with huge pages (also marketed as large, giant, or jumbo pages). I’m not sure whether exact page sizes can be set in Linux, Windows Server, and FreeBSD, but those researchers found 16 KB to be the sweet spot.
I’m not sure why we’d care about defaults.
My expectation is that most people adopt the defaults, whatever their operating system is. So, yes, I care about the default for page sizes and I rarely change them myself.