Memory-level parallelism: Intel Skylake versus Intel Cannonlake

All programmers know about multicore parallelism: your CPU is made of several nearly independent processors (called cores) that can run instructions in parallel. However, our processors are parallel in many different ways. I am interested in a particular form of parallelism called “memory-level parallelism”, where the same processor can issue several memory requests concurrently. This is an important form of parallelism because current memory subsystems have high latency: it can take dozens of nanoseconds or more between the moment the processor asks for data and the time the data comes back from RAM. The general trend has not been positive in this respect: in many cases, the more advanced and expensive the processor, the higher the latency. To compensate for the high latency, we have parallelism: you can ask for many data elements from the memory subsystem at the same time.

In earlier work, we showed that current Intel processors (Skylake microarchitecture) are limited to about ten concurrent memory requests, whereas Apple’s A12 processor scales to 40 or more.

Intel just released a more recent microarchitecture (Cannonlake) and we have been putting it to the test. Is Intel improving?

It seems so. In a benchmark where you randomly access a large array using a number of separate paths (which I call “lanes”), we find that the Cannonlake processor appears to sustain twice as many concurrent memory requests as the Skylake processor.
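For intuition, here is a minimal sketch of such a multi-lane benchmark in C. It is not our exact benchmark code: the function name, the cap of 64 lanes and the initialization are my own illustrative choices, and it assumes the array has been pre-filled so that each entry holds a random index into the array. Each lane follows its own chain of dependent loads, so the processor is free to keep one miss in flight per lane:

    #include <stddef.h>
    #include <stdint.h>

    // Follow 'lanes' independent chains of random accesses through 'array'.
    // Each entry array[j] is assumed to hold a random index in [0, size).
    // Loads within a lane are dependent, but the lanes are independent,
    // so up to 'lanes' cache misses can be outstanding at once.
    uint64_t chase(const uint64_t *array, size_t size, size_t steps,
                   size_t lanes) {
        uint64_t index[64]; // current position of each lane (lanes <= 64)
        for (size_t i = 0; i < lanes; i++) {
            index[i] = i * (size / lanes); // spread out the starting points
        }
        for (size_t s = 0; s < steps; s++) {
            for (size_t i = 0; i < lanes; i++) {
                index[i] = array[index[i]]; // depends only on lane i
            }
        }
        uint64_t sum = 0;
        for (size_t i = 0; i < lanes; i++) {
            sum += index[i]; // keep the result live so it is not optimized away
        }
        return sum;
    }

With a single lane, this loop measures memory latency; as you add lanes, the time per query drops until you hit the concurrency limit of the hardware.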

The Skylake processor has lower latency (70 ns/query) than the Cannonlake processor (110 ns/query). Nevertheless, given enough lanes, the Cannonlake processor beats the Skylake processor in bandwidth by a wide margin (12 GB/s vs. 9 GB/s).
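As a back-of-the-envelope check, these numbers are consistent under Little’s law, assuming 64-byte cache lines: concurrency ≈ bandwidth × latency ÷ line size. For Skylake, 9 GB/s × 70 ns ÷ 64 B ≈ 10 outstanding requests; for Cannonlake, 12 GB/s × 110 ns ÷ 64 B ≈ 21, which matches the observed doubling.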

The story is similar to the Apple A12 experiments.

This suggests that even though future processors may not have lower latency when accessing memory, we might be better able to hide this latency through more parallelism.

Even if you are writing single-threaded code, you ought to think more and more about parallelism.

Our code is available.

Credit: Though all the mistakes are mine, this is joint work with Travis Downs.

Further details: Processors access memory through pages. By default, many Intel systems use “small” (4 kB) pages. When doing random accesses in a large memory region, you are likely to touch many pages, so you incur expensive “page misses” that lead to “page walks”. It is possible to use larger page sizes, even “huge pages”. But since memory is allocated in pages, you may end up with many under-utilized pages if they are too large. In practice, under-utilized pages (sometimes called “memory fragmentation”) can be detrimental to performance. To get the good results above, I use huge pages. Because there is just one large memory allocation in my tests, memory fragmentation is not a concern. With small pages, the Cannonlake processor loses its edge over Skylake: both are limited to about nine concurrent requests. Thankfully, on Linux, programmers can request huge pages with a madvise call when they know it is a good idea.
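For reference, here is a minimal sketch of such an allocation on Linux; the function name is mine, error handling is reduced to a bare minimum, and MADV_HUGEPAGE is only advisory (the kernel honors it only when transparent huge pages are enabled):

    #include <stddef.h>
    #include <sys/mman.h>

    // Allocate a large anonymous buffer and ask the kernel to back it
    // with (transparent) huge pages.
    void *alloc_huge(size_t bytes) {
        void *p = mmap(NULL, bytes, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED) {
            return NULL;
        }
        madvise(p, bytes, MADV_HUGEPAGE); // advisory: the kernel may decline
        return p;
    }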


8 thoughts on “Memory-level parallelism: Intel Skylake versus Intel Cannonlake”

    1. What Skylake CPU was tested?

      Skylake has been Intel’s latest microarchitecture for a long time: all the recent Intel processors are based on Skylake.

      What RAM types were used in both systems?

      The Skylake box has DDR4 (2133 MT/s), the Cannonlake box has LPDDR4 (3200 MT/s).

  1. Is there any way you could test using the same memory type? I would imagine that the bus differences, etc., between LPDDR4 and desktop DDR4 could account for the latency difference.

  2. I just realized Skylake doesn’t support LPDDR4 and Cannonlake doesn’t support normal DDR4. Welp, I guess we’ll be waiting for Sunny Cove before my question is answered.

  3. If the BIOSes allow it, you can still set the bandwidth and latency numbers to the same values to get a more equal comparison. LP vs. non-LP doesn’t really matter.

  4. I went back to your previous stories, but it doesn’t seem this is addressed conclusively. Do you believe this is a feature of the core, or of the memory controller? I would guess mainly the core at these levels, but it would for sure be interesting to know when one would saturate the higher core-count Xeons in the pathological case of essentially random accesses into very large data structures. My empirical results on a “real” application so far indicate huge benefits from prefetching, huge pages and hyperthreading combined, but I would guess that, in practice, the code is able to maintain 2-4 actual independent lanes per core in this configuration.

    1. The short answer is “all of the above”.

      All parts of the path to memory play a part in the observed parallelism.

      For example, the core itself must have some number of “miss handling registers” to support multiple outstanding misses in L1 – otherwise, there could be no parallelism at the core level.

      Further along the path, the “uncore”, memory controller and DRAM itself support varying degrees of parallelism, all of which interact to produce the observed parallelism in this type of benchmark and also of course for real world code.

      Note that the overall parallelism isn’t simply that of the part of the path with the smallest parallel factor – it’s more complicated than that, since each part of the path has a different “occupancy time”: the shorter the time, the less parallelism is required for a given occupancy level. For example, DRAM itself has fairly low intrinsic parallelism: after all, at the physical level there is only a single set of address and data buses etched on the motherboard per memory channel, and at most one thing can be passing over those buses at any given moment. Even there, however, you have a type of parallelism inside the DRAM chips, which can have multiple open pages; accessing an open page is faster than accessing a closed one.

      Backing up to the memory controller, these generally support many parallel requests. I don’t have an exact figure, but manuals for older Xeons indicated 32 requests per controller and I don’t think that figure has gotten any smaller. At the memory controller level parallelism helps in two ways: (1) the obvious way, which is the same for other parts of the path to memory, where it allows more requests in parallel increasing the total throughput via MLP and also (2) by having many requests visible to the controller at once, they can be rearranged so the DRAM is accessed in a pattern more friendly to the underlying hardware, e.g., accessing more open pages as discussed above.

      Finally, between the core and the memory controller you have the so-called “superqueue” which covers the path approximately from L2 to L3, and is thought to have 16 or so entries (but it is likely more now on CNL since we see a higher observed MLP factor).

      As far as multiple cores go, using many cores changes things dramatically. You can basically break the path above into “core private” and “shared” components per socket. The superqueue and everything closer to the core is private, while the L3/CHA, memory controller and DRAM are shared. Usually the shared resources aren’t the bottleneck for a single core, since they are sized for multiple cores – but once you get a few cores running at maximum required bandwidth, the shared components become a bottleneck and the achievable per-core MLP drops. The detailed behavior of the core-private stuff is usually the same within a microarchitecture, even across client and server parts, but the shared stuff varies a lot, all the way down to the specific characteristics and number of DIMMs you are using.
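An aside on the prefetching benefits mentioned in the thread above: a common way to raise memory-level parallelism in a serial loop is explicit software prefetching. Here is a minimal sketch, assuming GCC/Clang builtins; the function name and the prefetch distance of 8 are my own illustrative choices:

    #include <stddef.h>
    #include <stdint.h>

    // Sum the values at n random positions, prefetching a few iterations
    // ahead so that several cache misses are in flight at once even
    // though the loop itself is serial.
    uint64_t gather(const uint64_t *array, const uint32_t *pos, size_t n) {
        const size_t distance = 8; // illustrative tuning parameter
        uint64_t sum = 0;
        for (size_t i = 0; i < n; i++) {
            if (i + distance < n) {
                __builtin_prefetch(&array[pos[i + distance]], 0, 0);
            }
            sum += array[pos[i]];
        }
        return sum;
    }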
